Gene Set Enrichment Meta-Analysis with GSEMA package

Introduction

GSEMA is a package designed to perform gene set enrichment meta-analysis. The gene set enrichment meta-analysis allows for obtaining the differentially regulated gene sets or pathways that are shared across various studies.
Specifically, GSEMA applies a meta-analysis based on the combination of effect sizes obtained from single sample enrichment (SSE) analysis applied to individual the different studies. The different effect sizes of each study for each of the gene sets are calculated using a difference of means based on the enrichment scores obtained from SSE analysis. GSEMA offers various methods to obtain the statistics for the difference of means.

Subsequently, GSEMA allows for applying the various steps typical of a meta-analysis, in a similar way to the gene expression meta-analysis (Toro-Domínguez et al. (2020)), although different adjustments to the calculated statistics are performed in order to correct the possible existence of false positive bias.

This vignette provides a tutorial-style introduction to all the steps necessary to properly conduct gene set enrichment meta-analysis using the GSEMA package.

Previous steps: Meta-analysis object

GSEMA uses a specific object as input, which is a list of nested lists where each nested list corresponds to a study. This object can be created directly by the users or they can use createObjectMApath() function to create it.

For the examples that are going to be shown, synthetic data will be used. We load the sample data into our R session.

library(GSEMA)
data("simulatedData")
  • study1Ex, study2Ex: two expression matrices.
  • study1Pheno, study2Pheno: two phenodata objects.
  • GeneSets: a list of gene sets with each element are the genes that belong to a pathway.
  • objectMApathSim: the meta-analysis object created from the different expression matrices and phenodatas.

Meta-analysis object creation (objectMApath)

As previously stated, the meta-analysis input in GSEMA is a list of nested lists. Each nested list contains two elements:

  • A gene set matrix with gene sets in rows and samples in columns *A vector of 0 and 1 indicating the group of each sample. 0 represents reference group (usually controls) and 1 represents experimental group (usually cases).

This object can be created directly by the user or we can make use of createObjectMApath() function, which creates the objectMApath after indicating:

  • the reference group (refGroups) and the experimental group (expGroups) in the phenodata.
  • the gene sets (geneSets) that will be consider.
  • the single sample enrichment method to applied to calculate the gene sets matrices

createObjectMApath() function allows to create the object needed to perform meta-analysis. In this case, it is necessary to indicate as input of the function the variables that contain the experimental and reference groups:

  • listEX: a list of dataframes or matrix (genes in rows and samples in columns). A list of ExpressionSets can be used too:
#List of expression matrices
listMatrices <- list(study1Ex, study2Ex)
  • listPheno: a list of phenodatas (samples in rows and covariables in columns). If the object listEX is a list of ExpressionSets this element can be null.
listPhenodata <- list(study1Pheno, study2Pheno)
  • namePheno: a list or vector of the different column names or column positions from the pheno used for perfoming the comparison between groups. Each element of namePheno correspont to its equivalent element in the listPheno. (default a vector of 1, all the first columns of each elements of listPheno are selected)
  • expGroups: a list or vector of the group names or positions from namePheno variable used as experimental group (cases) to perform the comparison (default a vector of 1, all the first groups are selected).
  • refGroups: a list or vector of the group names or positions from namePheno variable used as reference group (controls) to perform the comparison (default a vector of 2, all the second groups are selected).

It is important to note that if any element does not belong to the experimental or the reference group, that sample is not taken into account in the creation of meta-analysis object.

Here, we have included an example to show how exactly the function is used:

Since this function can be a bit complicated if there are many datasets, we recommend creating a vector to keep the column names of the phenodatas that contains the variable that identifies the groups to compare (namePheno argument). Moreover, we should create two others lits to indicate how to identify experimental (cases) and reference (controls) groups in these variables (expGroups and refGroups arguments).

If we look at the example phenodatas we can observed that the groups variable is “condition”. Experimental group is named as “Case” and reference group as “Healthy”.

study1Pheno$Condition
##  [1] "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy"
##  [8] "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy"
## [15] "Healthy" "Case"    "Case"    "Case"    "Case"    "Case"    "Case"   
## [22] "Case"    "Case"    "Case"    "Case"    "Case"    "Case"    "Case"   
## [29] "Case"    "Case"
  • geneSets: a list of gene sets with each element are the genes that belong to a pathway:
head(names(GeneSets))
## [1] "KEGG_ABC_TRANSPORTERS"                          
## [2] "KEGG_ACUTE_MYELOID_LEUKEMIA"                    
## [3] "KEGG_ADHERENS_JUNCTION"                         
## [4] "KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY"           
## [5] "KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM"
## [6] "KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION"
GeneSets[[1]]
##  [1] "ABCA1"  "ABCA10" "ABCA12" "ABCA13" "ABCA2"  "ABCA3"  "ABCA4"  "ABCA5" 
##  [9] "ABCA6"  "ABCA7"  "ABCA8"  "ABCA9"  "ABCB1"  "ABCB10" "ABCB11" "ABCB4" 
## [17] "ABCB5"  "ABCB6"  "ABCB7"  "ABCB8"  "ABCB9"  "ABCC1"  "ABCC10" "ABCC11"
## [25] "ABCC12" "ABCC2"  "ABCC3"  "ABCC4"  "ABCC5"  "ABCC6"  "ABCC8"  "ABCC9" 
## [33] "ABCD1"  "ABCD2"  "ABCD3"  "ABCD4"  "ABCG1"  "ABCG2"  "ABCG4"  "ABCG5" 
## [41] "ABCG8"  "CFTR"   "TAP1"   "TAP2"
  • pathMethod: a character string indicating the method to calculate the gene sets matrices. The available methods are:
    • “GSVA”: Gene Set Variation method (Hänzelmann, Castelo, and Guinney (2013))
    • “Zscore”: Z-score method (Lee et al. (2008))
    • “ssGSEA”: Single Sample Gene Set Enrichment Analysis method (Barbie et al. (2009))
    • “Singscore”: Single sample scoring of molecular phenotypes (Foroutan et al. (2018))
  • minSize: Minimum size of the resulting gene sets after gene identifier mapping. By default, the minimum size is 7.
  • kdcf: Only neccesary for the GSVA method. Character vector of length 1 denoting the kernel to use during the non -parametric estimation of the cumulative distribution function of expression levels across samples. By default, kcdf=“Gaussian” which is suitable when input expression values are continuous, such as microarray. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to kcdf=“Poisson”.
  • normalize: boolean specifying if the gen set matrices should be normalized. Default value “TRUE”.
  • n.cores: Number of cores to use in the parallelization of the datsets. By default, n.cores=1.
  • internal.n.cores: Number of cores to use in the parallelization of the single sample enrichment methods. By default internal.n.cores= 1.

In parallelization, several aspects must be considered. “n.cores” refers to the parallelization of studies or datasets. Therefore, if we have 3 studies, the maximum number for “n.cores” will be 3. internal.n.cores refers to the parallelization of single sample enrichment methods. This is especially recommended for the ssGSEA method. For Singscore and GSVA, it may also be advisable. The process is parallelized based on the samples in each study. Therefore, the larger the number of samples, the slower the process will be. The number of cores that the computer will use is the multiplication of both parameters n.cores * internal.n.cores = total cores.

We all this information we can create the vector for namePheno argument and the two list for expGroups and refGroups:

listMatrices <- list(study1Ex, study2Ex)
listPhenodata <- list(study1Pheno, study2Pheno)
phenoGroups <- c("Condition","Condition", "Condition")
phenoCases <- list("Case", "Case", "Case")
phenoControls <- list("Healthy", "Healthy", "Healthy")
objectMApathSim <- createObjectMApath(
    listEX = listMatrices,
    listPheno = listPhenodata, namePheno = phenoGroups,
    expGroups = phenoCases, refGroups = phenoControls,
    geneSets = GeneSets,
    pathMethod = "Zscore",
    n.cores = 1,
    internal.n.cores = 1, minSize = 7)
## Applying the Zscore method

The result obtained is the proper object to perform meta-analysis (objectMApath).

Performing Meta-analysis

GSEMA package has implemented gene set enrichment meta-analysis techniques based on the combination of effect sizes in the metaAnalysisDEpath() function. Although this function can be applied directly, it is advisable to consider some previous steps.

Filtering paths with low expression

Before performing the meta-analysis, it is important to filter out those pathways with low expression in the datasets. This is because the results of the meta-analysis could be biased by the presence of pathways with low expression, which may lead to an increase in the number of false positives. GSEMA provides a function called **filterPaths()* that allows to filter out those pathways with low expression. The inputs of this function are:

  • objectMApath: the meta-analysis object of GSEMA package.
  • threshold:A number that indicates the threshold to eliminate a gene set. For a eliminate a gene set is necessary that the median for both groups are less than the threshold. If threshold = “sd” the threshold will be the standard deviation of the gene set. The default value is 0.65.
objectMApathSim <- filteringPaths(objectMApathSim, threshold = 0.65)

Calculation of the effects sizes

GSEMA has implemented three different estimator for calculating the effect sizes in each study:

  • The “limma” method used the limma package (Smyth et al. (2020)) to calculate the effect size and the variance of the effect size. The effect size is calculated from the moderated Student’s t computed by limma. From it, the estimator of Hedges’g and its corresponding variance are obtained. In this way, some of the false positives obtained by the “SMD” method are reduced.
  • The “SMD” (Standardized mean different) method calculates the effect size using the Hedges’g estimator Hedges (1981).
  • The “MD” (raw mean different) calculates the effects size as the difference between the means of the two groups (Borenstein et al. (2021)).

Performing meta-analysis: metaAnalysisDEpath()

The metaAnalysisDEpath() function allows to perform a meta-analysis in only one step, needing only the meta-analysis object created previously.

This function has as input:

  • objectMApath: The meta-analysis object of GSEMA package.
  • effectS: A list of two elements. The first element is a dataframe with gene sets in rows and studies in columns. Each component of the dataframe is the effect of a gene set in a study. The second element of the list is also a dataframe with the same structure, but in this case each component of the dataframe represent the variance of the effect of a gene set in a study. This argument should be only used in the case that objectMApath argument is null.
  • measure: A character string that indicates the type of effect size to be calculated. The options are “limma”, “SMD” and “MD”. The default value is “limma”. See details for more information.
  • WithinVarCorrect A logical value that indicates if the within variance correction should be applied. The default value is TRUE. See details for more information (cite)
  • typeMethod: a character that indicates the method to be performed:
    • “FEM”: Fixed Effects model.
    • “REM”: Random Effects model.
  • missAllow: a number between 0 and 1 that indicates the maximum proportion of missing values allows in a sample. If the sample has more proportion of missing values, the sample will be eliminated. In the other case, the missing values will be imputed by using the K-NN algorithm included in impute package (Hastie et al. (2019)). In case the objectMA has been previously imputed, this element is not necessary.
  • numData: a number between 0 and 1 that indicates the minimum number of datasets in which a pathway must be contained to be included.

In the following example, we have applied a Random Effect model to the GSEMA object (“objectMApathSim”) we have been working with so far. In addition we have applied the “limma” method to calculate the effect size. Finally, we have allowed a 0.3 proportion of missing values in a sample and a gene set must have been contained in all studies:

results <- metaAnalysisESpath(objectMApath = objectMApathSim,
    measure = "limma", typeMethod = "REM", missAllow = 0.3, 
    numData = length(objectMApathSim))
## Performing Random Effects Model

The output of this function is a dataframe with the results of the meta-analysis where rows are the genes and columns are the different variables provided by the meta-analysis:

results[1,]
##                                                                                       Pathway
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION
##                                                  Com.ES    ES.var      Qval
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION -2.07387 0.1217558 0.7608425
##                                                tau2      Zval         Pval
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION    0 -5.943425 2.791278e-09
##                                                         FDR numDatasets
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION 1.544507e-07           2
results[nrow(results),]
##                             Pathway   Com.ES   ES.var     Qval     tau2
## Simulated_Pathway Simulated_Pathway 14.17285 3.270126 1.595479 2.479981
##                       Zval         Pval          FDR numDatasets
## Simulated_Pathway 7.837454 4.662937e-15 7.740475e-13           2

Meta analysis results

The results are provided in a dataframe with the variables:

  • Com.ES: combined effect of the gene set.
  • ES.var: variance of the combined effect of the gene set.
  • Qval: total variance of the gene set.
  • tau2: between-study variance of the gene set.
  • zval: combined effect value for a standard normal. I can be use in order to find out if the gene set is overexpressed (positive value) or underexpressed (negative value).
  • Pval: P-value of the meta-analysis for the gene set.
  • FDR: P-value adjusted of the meta-analysis for the gene set.
  • numDataset: Number of the datasets in which the gene set is included.

Visualization of the results: heatmap

Finally, we can represent in a heatmap the significant gene sets in order to observe how they are regulated in each of the studies. In heatmapPaths() function we have to include both the object that has been used in the meta-analysis, the result of it and the applied method. In addition, this package offers three different scaling approaches (scaling) in order to compare properly the gene set enrichment scores of the studies in the heatmap:

  • “zscor”: It calculates a z-score value for each gene, that is, the mean gene expression from each gene is subtracted from each gene expression value and then it is divided by the standard deviation.
  • “swr”: Scaling relative to reference dataset approach Lazar et al. (2013).
  • “rscale”: It uses the rescale function of the scales package to scale the gene expresion Wickham and D.Siedel (2020).
  • “none”: no scaling approach is applied.

Moreover, in regulation argument, we can choose if we want to represent the over regulated or under regulated gene sets:

  • “up”: only up-regulated gene sets are represented.
  • “down: only down-regulated gene sets are represented
  • “all”: up-regulated and down-regulated gene sets are represented.

We can choose the number of significant gene sets (numSig) that we want to be shown on the graph and the adjusted p-value from which a gene set is considered as significant (fdrSig). In addition, the gene sets that are not presented in one sample are represented in gray.

Here we present an example of the heatmap which have been obtained from the result of applying a random effects model to the object “objectMApathSim” and making use of a “zscor” scaling approach.

res <- heatmapPaths(objectMA=objectMApathSim, results,
            scaling = "zscor", regulation = "all",breaks=c(-2,2),
            fdrSig = 0.05, comES_Sig = 1, numSig=20, fontsize = 5)
## scaling using z-score...

Additional information

Calculating individual Effects size

The calculateESpath() function returns the effects size in each of the studies. Moreover, it calculates the variance of each of the effects.

#Effects <- calculateESpath(objectMApath = objectMApathSim, measure = "limma")
#Effects[[1]]["Simulated_Pathway",]
#Effects[[2]]["Simulated_Pathway",]

#Session info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] GSEMA_0.99.3     BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##   [1] Rdpack_2.6.2                DBI_1.2.3                  
##   [3] pbapply_1.7-2               GSEABase_1.69.0            
##   [5] rlang_1.1.4                 magrittr_2.0.3             
##   [7] matrixStats_1.4.1           compiler_4.4.2             
##   [9] RSQLite_2.3.9               png_0.1-8                  
##  [11] vctrs_0.6.5                 reshape2_1.4.4             
##  [13] stringr_1.5.1               SpatialExperiment_1.17.0   
##  [15] pkgconfig_2.0.3             crayon_1.5.3               
##  [17] fastmap_1.2.0               magick_2.8.5               
##  [19] XVector_0.47.0              utf8_1.2.4                 
##  [21] rmarkdown_2.29              graph_1.85.0               
##  [23] UCSC.utils_1.3.0            purrr_1.0.2                
##  [25] bit_4.5.0.1                 xfun_0.49                  
##  [27] zlibbioc_1.52.0             cachem_1.1.0               
##  [29] beachmat_2.23.4             GenomeInfoDb_1.43.2        
##  [31] jsonlite_1.8.9              progress_1.2.3             
##  [33] blob_1.2.4                  rhdf5filters_1.19.0        
##  [35] DelayedArray_0.33.3         Rhdf5lib_1.29.0            
##  [37] BiocParallel_1.41.0         prettyunits_1.2.0          
##  [39] irlba_2.3.5.1               parallel_4.4.2             
##  [41] singscore_1.27.0            R6_2.5.1                   
##  [43] RColorBrewer_1.1-3          bslib_0.8.0                
##  [45] stringi_1.8.4               limma_3.63.2               
##  [47] numDeriv_2016.8-1.1         GenomicRanges_1.59.1       
##  [49] jquerylib_0.1.4             iterators_1.0.14           
##  [51] Rcpp_1.0.13-1               SummarizedExperiment_1.37.0
##  [53] knitr_1.49                  GSVA_2.1.2                 
##  [55] IRanges_2.41.2              Matrix_1.7-1               
##  [57] tidyselect_1.2.1            abind_1.4-8                
##  [59] yaml_2.3.10                 doParallel_1.0.17          
##  [61] codetools_0.2-20            metafor_4.6-0              
##  [63] lattice_0.22-6              tibble_3.2.1               
##  [65] plyr_1.8.9                  Biobase_2.67.0             
##  [67] KEGGREST_1.47.0             evaluate_1.0.1             
##  [69] Biostrings_2.75.2           pillar_1.9.0               
##  [71] BiocManager_1.30.25         MatrixGenerics_1.19.0      
##  [73] metadat_1.2-0               foreach_1.5.2              
##  [75] stats4_4.4.2                generics_0.1.3             
##  [77] mathjaxr_1.6-0              hms_1.1.3                  
##  [79] S4Vectors_0.45.2            ggplot2_3.5.1              
##  [81] sparseMatrixStats_1.19.0    munsell_0.5.1              
##  [83] scales_1.3.0                xtable_1.8-4               
##  [85] glue_1.8.0                  pheatmap_1.0.12            
##  [87] maketools_1.3.1             tools_4.4.2                
##  [89] sys_3.4.3                   ScaledMatrix_1.15.0        
##  [91] annotate_1.85.0             locfit_1.5-9.10            
##  [93] buildtools_1.0.0            XML_3.99-0.17              
##  [95] rhdf5_2.51.1                grid_4.4.2                 
##  [97] impute_1.81.0               tidyr_1.3.1                
##  [99] rbibutils_2.3               SingleCellExperiment_1.29.1
## [101] AnnotationDbi_1.69.0        edgeR_4.5.1                
## [103] colorspace_2.1-1            nlme_3.1-166               
## [105] GenomeInfoDbData_1.2.13     BiocSingular_1.23.0        
## [107] HDF5Array_1.35.2            cli_3.6.3                  
## [109] rsvd_1.0.5                  fansi_1.0.6                
## [111] S4Arrays_1.7.1              dplyr_1.1.4                
## [113] gtable_0.3.6                sass_0.4.9                 
## [115] digest_0.6.37               BiocGenerics_0.53.3        
## [117] SparseArray_1.7.2           farver_2.1.2               
## [119] rjson_0.2.23                memoise_2.0.1              
## [121] htmltools_0.5.8.1           lifecycle_1.0.4            
## [123] httr_1.4.7                  statmod_1.5.0              
## [125] bit64_4.5.2

References

Barbie, D. A., P. Tamayo, J. S. Boehm, S. Y. Kim, S. E. Moody, I. F. Dunn, and et al. 2009. “Systematic RNA Interference Reveals That Oncogenic KRAS-Driven Cancers Require TBK1.” Nature, 108–12. https://doi.org/10.1371/journal.pcbi.1000217.
Borenstein, M., L. V. Hedges, J. P. T. Higgins, and Rothstein. 2021. “Introduction to Meta-Analysis.” John Wiley and Sons.
Foroutan, M., D. D. Bhuva, R. Lyu, K. Horan, J. Cursons, and M. J. Davis. 2018. “Single Sample Scoring of Molecular Phenotypes.” BMC Bioinformatics, 404. https://doi.org/10.1186/s12859-018-2435-4.
Hänzelmann, S., R. Castelo, and J. Guinney. 2013. “GSVA: Gene Set Variation Analysis for Microarray and RNA-Seq Data.” BMC Bioinformatics, 7. https://doi.org/10.1186/1471-2105-14-7.
Hastie, T., R. Tibshirani, B. Narasimhan, and G. Chu. 2019. “Impute: Imputation for Microarray Data.”
Hedges, L. V. 1981. “Distribution Theory for Glass’s Estimator of Effect Size and Related Estimators.” Journal of Educational Statistics, 107–28. https://doi.org/10.2307/1164588.
Lazar, C., S. Meganck, J. Taminau, and et al. 2013. “Batch Effect Removal Methods for Microarray Gene Expression Data Integration: A Survey.” Briefings in Bioinformatics, 469–90. https://doi.org/10.1093/bib/bbs037.
Lee, E., H. Y. Chuang, J. W. Kim, T. Ideker, and D. Lee. 2008. “Inferring Pathway Activity Toward Precise Disease Classification.” PLOS Computational Biology, e1000217. https://doi.org/10.1371/journal.pcbi.1000217.
Smyth, G. K., M. Ritchie, N. Thorne, Hu Yifang, and et al. 2020. “Linear Models for Microarray and RNA-Seq Data User’s Guide.”
Toro-Domínguez, D., J. A. Villatoro-García J. A., J. Martorell-Marugán, and et al. 2020. “A Survey of Gene Expression Meta-Analysis:methods and Applications.” Briefings in Bioinformatics, 1–12. https://doi.org/10.1093/bib/bbaa019.
Wickham, H., and D.Siedel. 2020. “Scale Functions for Visualization.”