In this vignette, we demonstrate the NetGSA workflow using a subset of the breast cancer data example downloaded from the Cancer Genome Atlas (TCGA 2012). In particular, we illustrate how to incorporate existing network information (e.g. curated edges from KEGG) to improve the power of pathway enrichment analysis with NetGSA. Details of the method are avaialble in Ma, Shojaie, and Michailidis (2016).
netgsa
provides simple functions that automatically search known gene databases using graphite
for network information and integrate seamlessly with NetGSA
pathway enrichment andigraph
and Cytoscape plotting. In this vignette, we demonstrate the NetGSA workflow explaining proper data types and usage of the netgsa
package.
Our example data set comes from a breast cancer study (TCGA 2012), which consists of gene expression data from 520 subjects including 117 estrogen-receptor-negative (ER-) and 403 estrogen-receptor-positive (ER+). The goal is to generate diagnostic pathway signatures that could distinguish patients with different ER statuses by comparing gene expression data from the two classes. These signatures could provide additional clinical benefit in diagnosing breast cancer.
NetGSA works directly with the expression data matrix. When loading the data, it is important to check the distribution of raw sequencing reads and perform log transformation if necessary. Data in this example were already log transformed. Rows of the data matrix correspond to genes and columns to subjects. Genes in this data matrix were labeled with Entrez IDs, same as those used in KEGG pathways.
library(netgsa)
library(graphite)
library(data.table)
data("breastcancer2012_subset")
ls()
## [1] "edgelist" "group" "nonedgelist" "pathways" "pathways_mat"
## [6] "x"
The objects loaded which are necessary for this vignette are:
x
, the data matrixedgelist
, a data.frame containing user-specified edgesnonedgelist
, a data.frame containing user-specified non-edgesgroup
, a vector mapping columns of x
to conditionspathways
, a list of KEGG pathways,pathways_mat
, an indicator matrix of KEGG pathways to be used directly in NetGSA
x
The data matrix, x
must have rownames that are named as "GENE_ID:GENE_VALUE"
. Since netgsa
allows the user to search for edges in multiple databases in graphite
, each of which may use different identifiers (e.g. KEGG, Reactome, Biocarta, etc.), netgsa
needs to know the identifier for a given gene so we can convert it properly. For example, if you have two genes “ENTREZID: 7186” and “ENTREZID: 329” and want to search for potential edges between these two within Reactome, netgsa
must first convert these genes to their “UNIPROT” IDs and use those to search for edges.
netgsa
uses the AnnotationDbi
package to convert genes. The valid list of "GENE_ID"
’s is:
::keytypes(org.Hs.eg.db::org.Hs.eg.db) AnnotationDbi
##
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
An example of what the rownames for the data matrix, x
, should look like is:
head(rownames(x))
## [1] "ENTREZID:127550" "ENTREZID:53947" "ENTREZID:65985" "ENTREZID:51166"
## [5] "ENTREZID:15" "ENTREZID:60496"
The columns of the data matrix do not need to be named, but it is useful for keeping track of groups.
The group
object is a vector (numeric or character) and must be the same length as ncol(x)
. Each element of the group
tells us which group each column of x
corresponds to. For example, if group[1] = "Positive"
this says that the first column of x
corresponds to the "Positive"
condition.
We can find out the ER status by looking at the group labels.
table(group)
## group
## 1 2
## 403 117
The edgelist and non-edgelist are strings representing file locations and are read in using data.table
’s fread()
command. These are where users can specify known edges/non-edges of their own. Each observation is assumed to be a directed edge (for edgelist) or a directed non-edge (for non-edgelist). They both must have 4 columns in the following order. The columns do not necessarily need to be named properly, they simply must be in this specific order:
Note it is assumed that each edge/non-edge is directed so if you want an undirected edge/non-edge you should put in two observations as in:
## base_gene_src base_id_src base_gene_dest base_id_dest
## 1: 7534 ENTREZID 8607 ENTREZID
## 2: 8607 ENTREZID 7534 ENTREZID
Lastly, as documented in ?NetGSA
we need an indicator matrix of pathways. The rows of this matrix should correspond to pathways and the columns to the genes included in the analysis. The dimension is then npath x p. Values in this matrix should be 1 for genes that are in a given pathway and 0 otherwise.
We consider pathways from KEGG (Kanehisa and Goto 2000). KEGG pathways can be accessed in R using the graphite package.
graphite::pathways('hsapiens','kegg')
paths <-1]] paths[[
## "Glycolysis / Gluconeogenesis" pathway
## Native ID = hsa:00010
## Database = KEGG
## Species = hsapiens
## Number of nodes = 93
## Number of edges = 1146
## Retrieved on = 11-10-2023
## URL = http://www.kegg.jp/kegg-bin/show_pathway?org_name=hsa&mapno=00010
head(nodes(paths[[1]]))
## [1] "ENTREZID:10327" "ENTREZID:124" "ENTREZID:125" "ENTREZID:126"
## [5] "ENTREZID:127" "ENTREZID:128"
For this vignette, we’ll use pathways_mat
:
1:5,7, drop = FALSE] pathways_mat[
## ENTREZID:18
## Adipocytokine signaling pathway 0
## Adrenergic signaling in cardiomyocytes 0
## AGE-RAGE signaling pathway in diabetic complications 0
## Alanine, aspartate and glutamate metabolism 1
## AMPK signaling pathway 0
The 1 shown indicates that "ENTREZID: 18"
belongs to the Alanine, aspartate and glutamate metabolism pathway.
In summary, the data that we’ll need to run our pathway enrichment analysis is:
x
), with rows for genes and columns for samples,group
) for ER status,edgelist
) to be read in with fread()
,nonedgelist
) to be read in with fread()
,pathways_mat
).Now that we have our data set-up properly we can begin with the pathway enrichment analysis. netgsa
has three main functions. The first is prepareAdjMat
which searches for edges in public databases and uses this information to estimate the adjacency matrices needed for NetGSA
. The second is NetGSA
which performs network-based gene set analysis. We will go into further details about the usage of these functions below.
After having the data set-up, the first step in pathway enrichment analysis with netgsa
is to estimate the adjacency matrices. Remember, the rownames of the data matrix X
must be named as "GENE_ID:GENE_VALUE"
as in "ENTREZID:7534"
. The group
vector is defined as above.
The databases
argument can be either (1) the result of obtainEdgeList
or (2) a character vector defining the databases to search. These are essentially the same thing, the only difference is that for the character vector method, obtainEdgeList
is called inside prepareAdjMat
and cannot be saved. Since prepareAdjMat
queries the graphite
databases when it is called and graphite
databases can change overtime, this may not be desirable for reproducibility. Using the obtainEdgeList
method, one can save the edgelist to ensure the same network information is used across iterations or in the future. In both methods, one must specify the databases to search. The options are the databases for homo spaiens available in graphite
or NDEx (only for development version on Github). The graphite databases are:
## [1] "kegg" "panther" "pathbank" "pharmgkb" "reactome"
## [6] "smpdb" "wikipathways"
Note the databases
argument is case sensitive so make sure to pass "reactome"
and not "Reactome"
.
The cluster
argument controls whether or not clustering is used when estimating the adjacency matrix. The default behavior is to use clustering if p > 2,500
. However, the user can override this behavior by setting the cluster
argument. prepareAdjMat
chooses the best clustering method from 6 possible methods in the igraph
package. More details are provided in ?prepareAdjMat
.
User specified edge and non-edge files are specified with the file_e
and file_ne
arguments respectively. For more information on the assumptions and how network information is incorporated from the database edgelists, the user edges, and the user non-edges see the Details section and file_e
and file_ne
parameters in ?prepareAdjMat
.
It is also important to note that prepareAdjMat
will automatically choose the correct network estimation technique based on whether or not the graph is directed so no additional work is needed to determine undirected vs directed graphs. This is documented in ?prepareAdjMat
.
Now let’s get to some example code. Suppose we wanted to estimate the network for our example data using our known edges/non-edges and searching for edges in Reactome, KEGG, and BioCarta. Let’s also use clustering to speed up computation. We could do this simply by:
obtainEdgeList(rownames(x), c("kegg", "reactome"))
database_search <- prepareAdjMat(x, group, database_search,
network_info <-cluster = TRUE, file_e = "edgelist.txt",
file_ne = "nonedgelist.txt")
The main value of interest is network_info[["Adj"]]
, a list of lists. The top level list indicates the condition while the next level is a list of adjacency matrices (one for each cluster). If cluster = FALSE
there is an element for each connected component in the network. Also note that in this example, the object database_search
was created. If desired, this can be saved and used later to ensure the same network information was used. However, the results may not be exactly the same. For example, some of the clustering algorithms are not deterministic so while the network information may be the same the clusters may be different resulting in different adjacency matrices. The adjacency matrix for the first condition and first cluster can be accessed with:
"Adj"]][[1]][[1]][7:9,7:9] network_info[[
## ENTREZID:18 ENTREZID:27 ENTREZID:28
## ENTREZID:18 0 0 0
## ENTREZID:27 0 0 0
## ENTREZID:28 0 0 0
The total number of clusters can be identified with:
length(network_info[["Adj"]][[1]])
## [1] 16
Note we cannot control cluster size which is why we try several methods and implement rules on which to choose. For more information see ?prepareAdjMat
.
Now that the adjacency matrices are assembled we can perform pathway inference using NetGSA
. The returned value from prepareAdjMat
will always be in the correct format regardless of whether or not clustering is used, so one should always be able to pass network_info[["Adj"]]
directly to NetGSA
. The default is to use restricted Haseman-Elston regression (REHE) with sampling to estimate the variance parameters. This allows for significant computational speed up and little to no loss in power. One can also use REML for variance parameter estimation but this can be quite slow for problems with moderate or large dimension. For explanatory purposes, we explicitly write out the function:
NetGSA(network_info[["Adj"]], x, group, pathways_mat,
pathway_tests_rehe <-lklMethod = "REHE", sampling = TRUE,
sample_n = 0.25, sample_p = 0.25)
The sample_n
and sample_p
parameters control the ratio for subsampling observations (columns of data matrix x
) and variables (rows of data matrix x
) respectively. More information on REHE and sampling can be found in the ?NetGSA
Details section.
Note that inference using REML can take several hours depending on the complexity of the problem, while inference using REHE can take minutes. Roughly, REML becomes quite slow for p > 2,000. In this example, with sample_n = 0.25
, sample_p = 0.25
only took about 2 minutes on a 2 CPU personal computer with 16 GB of RAM.
The main inference results are stored in pathway_tests[["results"]]
and contain the pathway names as well as their significance (p-values and q-values are reported) and test statistics.
netgsa
provides several options for visualizing the results from NetGSA
. Cytoscape or igraph is used to display the main results depending on whether the user has the Cytoscape app open.
If the user has Cytoscape open on their computer, calling plot.NetGSA
will create several plots:
Cytoscape plots - the first place plot.NetGSA
generates plots is within Cytoscape. A nested network is created where the main network (Pathway Network) displays pathways as nodes. Edges in this network represent edges between genes contained within those pathways. In this network, the results
object from NetGSA()
is loaded as the node data and the number of edges between genes of separate pathways are loaded as edge data in a variable called “weight”. Node color is mapped to the test statistic. Negative values are orange, values around 0 are white, and positive values are blue. Node border color is mapped to q-values going from red (0) to white (1).
In addition to the main network, a network is generated for each pathway. These are the gene networks underlying each pathway and are referred to as the within pathway networks. The within pathway networks are linked to the nodes in the Pathway Network using Cytoscape’s nested network format. An example will make this more clear. Calling:
plot.NetGSA(pathway_tests_rehe)
The Pathway Network might look something like:
The most interesting part of the Cytoscape nested network framework is the ability to view the within pathway networks. Suppose you were looking at the Pathway Network and noticed the ErbB signaling network was highly significant and you wanted to investigate it by looking at the underlying gene network. To do this, one could simply right click on the ErbB signaling pathway node in the Pathway Network, and go to Nested Networks -> Go To Nested Network. Alternatively, the nested networks have the same name as the pathway they represent so one could simply navigate to the ErbB signaling pathway using the Network tab in the Control Panel on the left side of the Cytoscape GUI. To save time, the nested networks are not formatted. However, we can use the formatPathways
function to format one or multiple using the default style.
formatPathways(pathway_tests_rehe, "ErbB signaling pathway")
The formatPathways
function colors gene nodes based on FDR adjusted q-value. For more information see ?formatPathways
.
To export images from Cytoscape one can either use the GUI or the RCy3::exportImage()
function.
Cytoscape legend generated in R - Legends are difficult to incorporate on the plot within Cytoscape so we generate a legend in R. Alternatively it is also possible to generate an image directly in Cytoscape. The Cytoscape manual page has more information on that: http://manual.cytoscape.org/en/stable/Styles.html.
Igraph plot - Lastly, for users that prefer R, plot.NetGSA
will also generate a plot using igraph
with the same layout and node color/node border color mapping as in Cytoscape.