# Differential Expression Enrichment Tool (DEET)

## Install and load DEET

DEET relies on the following packages. Since they are all CRAN in origin, they should download and install automatically with devtools::install_github or utils::install.packages. The required dependencies are listed below.

• ggplot2 - CRAN
• ActivePathways - CRAN
• stats - CRAN
• utils - CRAN
• glmnet - CRAN
• ggrepel - CRAN
• dplyr - CRAN
• pbapply - CRAN

### Installation

1. Github (Development Version)
devtools::install_github("wilsonlabgroup/DEET")
1. CRAN (Stable Release)
# IN DEVELOPMENT

All processed DEGs, metadata, and enriched pathways in formats compatible with this package as well as other methods such as gene set enrichment analysis are stored here: http://wilsonlab.org/public/DEET_data/

No functions within DEET automatically load data for the user, so the data either needs to be downloaded directly from the ftp, or using the downloader function.

The DEET_data_download function, with possible inputs “ALL”, “metadata”, “enrich”, and “feature_extract” automatically downloads the data required to run DEET_enrich and/or DEET_feature_extract.

We reccomended using:

downloaded <- DEET_data_download("ALL")
metadata <- downloaded$metadata DEET_feature_extract_input <- downloaded$DEET_feature_extract
DEG_processed$coef <- log2fc colnames(DEG_processed) <- c("gene_symbol", "padj", "coef") The DEET R package also contains plotting functions to summarize the most significant studies based on each enrichment test and correlation within DEET_enrich(). The proccess_and_plot_DEET_enrich() function plots barplots of the most enriched studies based on gene set enrichment (ActivePathways) of studies enriched based on overlapping DEGs, pathways, and TF targets. The DEET_plot_correlation() function generates scatterplots of the most enriched studies based on Spearman’s correlation analysis. All plots are generated using ggplot2, and the functions return the ggplot2 objects, allowing researchers to further modify and/or save the plots. Lastly, the DEET R package contains a function called DEET_feature_extract(), which allows researchers to identify genes that are associated with metadata. If the response variable are continuous (e.g., number of DEGs in study, Fold-change of TP53 etc.) then features are extracted by calculating the coefficients from a Gaussian family elastic net regression using the glmnet R package, as well as Spearman’s correlation between every gene and the response variable. If the response variable is categorical (e.g., Source, Category etc.), then features are extracted by calculating the coefficients from a multinomial family elastic net regression, as well as an ANOVA between each category within the response variable. Lastly, in the response variable is ordinal (e.g., enriches for TNFa pathway, Cancer study yes/no etc.), then features are extracted using a binomial family elastic net regression, as well as a Wilcoxon’s test between the two categories within the response variable. ## Breaking down each core function within DEET ### DEET_enrich: querying your own list again the DEGs stored within DEET. #### Inputs • DEET_enrich() expects a list of human gene symbols as either a character vector or as a data frame with the columns c("gene_symbol", "padj", "coef"). If you are inputted DEGs, the log2Fold-change is the most appropriate coefficient, but I left this general for other inputted data types (e.g., effect size from GWAS). Theexample_DEET_enrich_inputhas the structure required for a gene list with p-values and coefficients. An example of the unordered list isexample_DEET_enrich_input$gene_symbol.

• DEET_dataset should either be the example file in DEET_example_data or DEET_enrich_input downloaded by the downloader function or manually from the FTP. other options will likely cause the tool to crash.

• ordered: This parameter only applies when the DEG_list variable is a character vector (not a dataframe). It determins if the inputted list is ordered or not.

• background: a character vector of genes within the universe. For example, if your input is a list of DEGs in RNA-seq, then “background” should be all of your detected genes. Like in traditional pathway enrichment, not seeting a background may bias the enriched studies by tissue and/or cell-type as more highly expressed genes have more power to be detected as DE.

#### Examples

##### Running DEET with an datafame
data("example_DEET_enrich_input")
data("DEET_example_data")
DEET_out <- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
##### Running DEET with an ordered gene list
data("example_DEET_enrich_input")
data("DEET_example_data")

geneList <- example_DEET_enrich_input$gene_symbol DEET_out <- DEET_enrich(geneList, DEET_dataset = DEET_example_data, ordered = TRUE) ##### Running DEET with an unordered gene list data("example_DEET_enrich_input") data("DEET_example_data") geneList <- example_DEET_enrich_input$gene_symbol
DEET_out <- DEET_enrich(geneList, DEET_dataset = DEET_example_data, ordered =FALSE)
##### Differences in output between the inputted gene list types

The output of these three comparisons will be comparable, however the correlation variable is of note when the input gene set is just a list of genes.

When the gene set is ordered, a Spearman’s correlation is interpretable, as it is simply the rank-order of genes, however a Pearson’s correlation is not interpretable as we do not know the relative difference in coefficient size of your inputted genes

If the gene list is unordered, correlation analysis is entirely uninterpretable and is not run. You are given this message: Input gene list is considered UNORDERED: Correlation analysis will not be run and pathway enrichment will be unordered.

Since it not run the output is No variance in coefs. Cannot proceed with correlation.

#### Outputs

Named list where each element contains 6 objects. Each object will contain the results (enrichment or correlation) and corresponding metadata.

• AP_INPUT_BP_output - Enriched BPs of input gene list. This is the output of ActivePathways. Namely, it is a data table giving the ontology ID, name, FDR-adjusted p-value, ontology size, and a character vector of overlapping genes.
• AP_INPUT_TF_output - Enriched TF of input gene list. The format is identical to AP_INPUT_BP_output but instead of enriched gene ontologies, there is the enriched TFs.
• AP_DEET_DE_output - Enriched pairwise comparisons within DEET based on overlap the researchers inputted gene list and DE genes. This object contains two elements, the first one is results, which is the same data.table as in AP_INPUT_BP_output but instead of pathways, we are enriching against DE comparisons found within DEET. The second is metadata which is a subset of the larger metadata dataframe that corresponds with the gene sets in results.
• AP_DEET_BP_output - Enriched pairwise comparisons within DEET based on the overlap of the pathways enriched by the inputted gene list and the pathways enriched from the pairwise comparisons withing DEET. The format is the same as AP_DEET_DE_output where pathway gene ontology term names are in the place of DE gene names.
• AP_DEET_TF_output - Enriched pairwise comparisons within DEET based on the overlap of the transcription factors enriched by the inputted gene list and the pathways enriched from the pairwise comparisons withing DEET. The format is the same as AP_DEET_DE_output where pathway gene transcription factor term names are in the place of DE gene names.
• DE_correlations - A list containing three objects. The first objects, results is a data frame where the rows are different pairwise comparisons and the columns are the ouput of the correlational analysis between overlapping genes. Columns are the DEET ID and DEET name, the pearson/spearman’s correlation coefficients, as well as their raw and FDR adjusted p-values. The metadata object are the subsetted rows from the metadata data frame that align with the studies containing significant correlations. Lastly, the distributions object is a list where each element is a diffierent significantly associated pairwise comparisons. Each element is populated by a dataframe where rows are genes significant in at least one study, columns are the coefficients from the input study and DEET, and colour designates whether the gene is DE in one study (grey), both studies in the same direction (purple) and both studies in the opposite direction (orange).

### DEET_feature_extract

#### Inputs

• DEET_feature_extract expects three variables: a gene-by-comparison matrix populated with a statistic related to differential expresion (e.g., p-value, fold-change), a response variable (i.e., an output dependent variable), and a category.

• mat - a gene-bycomparison matrix populated with a statistic related to differential expression. The matrix in DEET_data_download() is populated with the log2FC of genes if they were deemed as significant (padj < 0.05). Other matrices, namely those populated by p-value, fold-change, and t-statitic (including the matrices for all DEGs) can be found in http://wilsonlab.org/public/DEET_data/feature_matrices and can be downloaded separately.

• respeonse: a character vector identifying each study, this can be categorical, binomial (e.g. 0/1, T/F), or continuous

• datatype: a character indicating if the response is “binomial”, “categorical”, or “continuous”. This is case sensitive.

#### Example

data(DEET_feature_extract_example_matrix)
data(DEET_feature_extract_example_response)
single1 <- DEET_feature_extract(DEET_feature_extract_example_matrix,
DEET_feature_extract_example_response,"categorical")

#### Outputs

DEET feature extract outputs a list of three objects.

• elastic_net_coefficients - a named list where each element is a different option of the categorical or binomial variable. Each element is populated with a named vector where the name is a gene and the vector is the coefficient of an eleastic net regression. If the datatype is “continuous”, then this will just be a one-element list.
• elastic_net - The direct output of the glmnet elastic net. The parameters of glmnet are default aside from the family, which is determined by datatype and alpha which is set to 0.5 (selecting an elastic net instead of L1 or L2 regularized).
• basic_features - more basic statistical analysis to find genes associated with features. It is a list with “test”, a character vector stating “ANOVA”, “Wilcoxon”, or “correlation” depending on if the datatype input was categorical, binomial or continuous respsectively. The other object is a dataframe showing the significance of each gene, where rows are genes, and columns are stastistics (e.g., F-statistic for ANOVA), P-value, and FDR-adjusted p-value.

### Plotting the outputs of DEET_enrich.

#### Barplots of enrichement

The proccess_and_plot_DEET_enrich() function is a wrapper that generates barplots and a dot plot of enrichment for all of the individual outputs of DEET_enrich() (not the correlations). The outputs are in ggplot2 objects, allowing users to further modify the plots or print however they like.

#### Inputs

• DEET_output - The direct output of DEET_enrich() a named list with the ActivePathways enrichment from each data type. If there wasn’t enrichment of a certain datatype (e.g. AP_DEET_BP_output) the function runs properly but it is not plotted.

The remaining varables are for graphical parameters that are passed into the DEET_enrichment_plot() function.

• width - The number of inches in the barplot or dotplot.
• text_angle - The angle of the enriched studies.
• horizontal - Whether the output barplot is vertical or horizontal
• topn - the top number of studies (by p-value) to be plotted.
• ol_size - the minimum number of overlapping genes (or paths) in an enriched study.
• exclude_domain - Exclude studies enriched based on DEGs, Paths, or TF if the user happened to aggregate the results into a single DF, generally unused.
• cluster_order - Factor to group studies based on the researchers custom annotation.
• colors - Type of color pallete to input into ‘scale_fill_brewer’ of ggplot.

#### Examples

data("example_DEET_enrich_input")
data("DEET_example_data")
DEET_out <- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
plotting_example <- proccess_and_plot_DEET_enrich(DEET_out, text_angle = 45,
horizontal = TRUE, topn=4)

Another example Where AP_DEET_BP_output is not significant, to show the plotting function still works.

data("example_DEET_enrich_input")
data("DEET_example_data")
DEET_out <- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
DEET_out$AP_DEET_DE_output <- "No enrichment to be plotted" plotting_example <- proccess_and_plot_DEET_enrich(DEET_out, text_angle = 45, horizontal = TRUE, topn=4) ### Outputs There are up to four outputs assuming everything is significant, each output is a list or a ggplot object. • DEET_Dotplot - an object of class ggplot priting the dotplot of enrichment from DEET using the DE, BP, and TF information. • Pathway Barplot - An object of class ggplot printing the barplots of traditional pathway enrichment using ActivePathwys. • individual_barplot - A list of ggplot2 objects showing the enrichment of every gene set (i.e., every enrichment output from DEET_enrich()). • DEET_output_forplotting - A list of dataframes that are directly compatible with the DEET_enrichment_plot. An example of this is shown below. DE_example <- DEET_out$AP_DEET_DE_output$results # Changes for DEET_example_plot DE_example$term.name <- DEET_out$AP_DEET_DE_output$metadata$DEET.Name DE_example$domain <- "DE"
DE_example$overlap.size <- lengths(DE_example$overlap)
DE_example$p.value <- DE_example$adjusted.p.val

DE_example_plot <- DEET_enrichment_plot(list(DE_example = DE_example), "DE_example")

As shown above, from here you can also just use DEET_enrichment_plot directly to have some more control over these plots.

#### Scatterplots of correlations for DEET enrichment

This function also takes the direct output from DEET_enrich and generates scatterplots of the correlations of studies whose log2FCs are significantly correlated with the input DE list.

#### Input

correlation_input - The DE_correlations object that is the output of the DEET_enrich function. It only works if there was at least one study that was significantly correlated.

#### Examples

data("example_DEET_enrich_input")
data("DEET_example_data")
DEET_out <- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
correlation_input <- DEET_out$DE_correlations correlation_plots <- DEET_plot_correlation(correlation_input) #### Outputs • A list of ggplot objects showing the scatterplot between the coefs of the inputted DE list and the correlated study within DEET. The X axis shows the correlation coefficient, the Y axis shows the name of the enriched study, and the coloured points are genes that are DE in both studies. ### Using DEET gene lists with other studies and enriching two input lists simultaneously. As mentioned previously, the genesets within DEET are easily transferrable to other gene set enrichment datasets. #### Saving DEET gene set for GSEA, gprofiler, etc. One option is to download the *gmt files diretly from our ftp. http://wilsonlab.org/public/DEET_data/DEET_DE.gmt Is directly compatible with these tools. The other option would be to save the downloaded DEET gmt as a gmt file. This is completed using the ActivePathways R package. Instead of using the example data as shown below, please use the full dataset. Instead of saving to a temporary directory like in this vignette, save the file wherever you want the directory to be saved. DEET_gmt <- DEET_example_data$DEET_gmt_DE
message(paste0("DEET_gmt is an object of class gmt?: ",ActivePathways::is.GMT(DEET_gmt) ))

ActivePathways::write.GMT(DEET_gmt, file = paste0(tempdir(),"/DEET_DEs.gmt"))

#### Enriching two gene sets simultaneously.

If you have two dependent gene lists to input into DEET, you can use ActivePathways directly to find combind DEET-comparison enrichment of the two gene sets.

set.seed(1234) # as I sample p-values to make the toy example

# For example two, I had the same genes but I shuffled the p-value

example_DEET_enrich_input$padj2 <- sample(example_DEET_enrich_input$padj, length(example_DEET_enrich_input$padj), replace = FALSE) # Make a gene-by-input-list matrix of the adjusted p-values from your multiple gene sets AP_matrix <- as.matrix(example_DEET_enrich_input[,c("padj", "padj2")]) # Run activepathways on the combined matrix. # Get gmt file, again from the whole list: DEET_gmt <- DEET_example_data$DEET_gmt_DE

AP_example_out <- ActivePathways::ActivePathways(scores=AP_matrix, gmt=DEET_gmt, geneset.filter = c(5,10000),correction.method = "fdr")
The outputs of using ActivePathways are the same as DEET_enrich() but with a couple extra columns. * evidence: Whether the DEET comparison is enriched because of one gene list, both gene lists, or an integrated version of these gene lists * Genes_colname`: The genes that contributed to enrichment from each inputted gene list.