GSEMA is a package designed to perform gene set enrichment
meta-analysis. The gene set enrichment meta-analysis allows for
obtaining the differentially regulated gene sets or pathways that are
shared across various studies.
Specifically, GSEMA applies a meta-analysis based on the combination of
effect sizes obtained from single sample enrichment (SSE) analysis
applied to individual the different studies. The different effect sizes
of each study for each of the gene sets are calculated using a
difference of means based on the enrichment scores obtained from SSE
analysis. GSEMA offers various methods to obtain the statistics for the
difference of means.
Subsequently, GSEMA allows for applying the various steps typical of a meta-analysis, in a similar way to the gene expression meta-analysis (Toro-Domínguez et al. (2020)), although different adjustments to the calculated statistics are performed in order to correct the possible existence of false positive bias.
This vignette provides a tutorial-style introduction to all the steps necessary to properly conduct gene set enrichment meta-analysis using the GSEMA package.
GSEMA uses a specific object as input, which is a list of nested lists where each nested list corresponds to a study. This object can be created directly by the users or they can use createObjectMApath() function to create it.
For the examples that are going to be shown, synthetic data will be used. We load the sample data into our R session.
As previously stated, the meta-analysis input in GSEMA is a list of nested lists. Each nested list contains two elements:
This object can be created directly by the user or we can make use of createObjectMApath() function, which creates the objectMApath after indicating:
createObjectMApath() function allows to create the object needed to perform meta-analysis. In this case, it is necessary to indicate as input of the function the variables that contain the experimental and reference groups:
It is important to note that if any element does not belong to the experimental or the reference group, that sample is not taken into account in the creation of meta-analysis object.
Here, we have included an example to show how exactly the function is used:
Since this function can be a bit complicated if there are many datasets, we recommend creating a vector to keep the column names of the phenodatas that contains the variable that identifies the groups to compare (namePheno argument). Moreover, we should create two others lits to indicate how to identify experimental (cases) and reference (controls) groups in these variables (expGroups and refGroups arguments).
If we look at the example phenodatas we can observed that the groups variable is “condition”. Experimental group is named as “Case” and reference group as “Healthy”.
## [1] "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy"
## [8] "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy"
## [15] "Healthy" "Case" "Case" "Case" "Case" "Case" "Case"
## [22] "Case" "Case" "Case" "Case" "Case" "Case" "Case"
## [29] "Case" "Case"
## [1] "KEGG_ABC_TRANSPORTERS"
## [2] "KEGG_ACUTE_MYELOID_LEUKEMIA"
## [3] "KEGG_ADHERENS_JUNCTION"
## [4] "KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY"
## [5] "KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM"
## [6] "KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION"
## [1] "ABCA1" "ABCA10" "ABCA12" "ABCA13" "ABCA2" "ABCA3" "ABCA4" "ABCA5"
## [9] "ABCA6" "ABCA7" "ABCA8" "ABCA9" "ABCB1" "ABCB10" "ABCB11" "ABCB4"
## [17] "ABCB5" "ABCB6" "ABCB7" "ABCB8" "ABCB9" "ABCC1" "ABCC10" "ABCC11"
## [25] "ABCC12" "ABCC2" "ABCC3" "ABCC4" "ABCC5" "ABCC6" "ABCC8" "ABCC9"
## [33] "ABCD1" "ABCD2" "ABCD3" "ABCD4" "ABCG1" "ABCG2" "ABCG4" "ABCG5"
## [41] "ABCG8" "CFTR" "TAP1" "TAP2"
In parallelization, several aspects must be considered. “n.cores” refers to the parallelization of studies or datasets. Therefore, if we have 3 studies, the maximum number for “n.cores” will be 3. internal.n.cores refers to the parallelization of single sample enrichment methods. This is especially recommended for the ssGSEA method. For Singscore and GSVA, it may also be advisable. The process is parallelized based on the samples in each study. Therefore, the larger the number of samples, the slower the process will be. The number of cores that the computer will use is the multiplication of both parameters n.cores * internal.n.cores = total cores.
We all this information we can create the vector for namePheno argument and the two list for expGroups and refGroups:
listMatrices <- list(study1Ex, study2Ex)
listPhenodata <- list(study1Pheno, study2Pheno)
phenoGroups <- c("Condition","Condition", "Condition")
phenoCases <- list("Case", "Case", "Case")
phenoControls <- list("Healthy", "Healthy", "Healthy")
objectMApathSim <- createObjectMApath(
listEX = listMatrices,
listPheno = listPhenodata, namePheno = phenoGroups,
expGroups = phenoCases, refGroups = phenoControls,
geneSets = GeneSets,
pathMethod = "Zscore",
n.cores = 1,
internal.n.cores = 1, minSize = 7)
## Applying the Zscore method
The result obtained is the proper object to perform meta-analysis (objectMApath).
GSEMA package has implemented gene set enrichment meta-analysis techniques based on the combination of effect sizes in the metaAnalysisDEpath() function. Although this function can be applied directly, it is advisable to consider some previous steps.
Before performing the meta-analysis, it is important to filter out those pathways with low expression in the datasets. This is because the results of the meta-analysis could be biased by the presence of pathways with low expression, which may lead to an increase in the number of false positives. GSEMA provides a function called **filterPaths()* that allows to filter out those pathways with low expression. The inputs of this function are:
GSEMA has implemented three different estimator for calculating the effect sizes in each study:
The metaAnalysisDEpath() function allows to perform a meta-analysis in only one step, needing only the meta-analysis object created previously.
This function has as input:
In the following example, we have applied a Random Effect model to the GSEMA object (“objectMApathSim”) we have been working with so far. In addition we have applied the “limma” method to calculate the effect size. Finally, we have allowed a 0.3 proportion of missing values in a sample and a gene set must have been contained in all studies:
results <- metaAnalysisESpath(objectMApath = objectMApathSim,
measure = "limma", typeMethod = "REM", missAllow = 0.3,
numData = length(objectMApathSim))
## Performing Random Effects Model
The output of this function is a dataframe with the results of the meta-analysis where rows are the genes and columns are the different variables provided by the meta-analysis:
## Pathway
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION
## Com.ES ES.var Qval
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION -2.07387 0.1217558 0.7608425
## tau2 Zval Pval
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION 0 -5.943425 2.791278e-09
## FDR numDatasets
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION 1.544507e-07 2
## Pathway Com.ES ES.var Qval tau2
## Simulated_Pathway Simulated_Pathway 14.17285 3.270126 1.595479 2.479981
## Zval Pval FDR numDatasets
## Simulated_Pathway 7.837454 4.662937e-15 7.740475e-13 2
The results are provided in a dataframe with the variables:
Finally, we can represent in a heatmap the significant gene sets in order to observe how they are regulated in each of the studies. In heatmapPaths() function we have to include both the object that has been used in the meta-analysis, the result of it and the applied method. In addition, this package offers three different scaling approaches (scaling) in order to compare properly the gene set enrichment scores of the studies in the heatmap:
Moreover, in regulation argument, we can choose if we want to represent the over regulated or under regulated gene sets:
We can choose the number of significant gene sets (numSig) that we want to be shown on the graph and the adjusted p-value from which a gene set is considered as significant (fdrSig). In addition, the gene sets that are not presented in one sample are represented in gray.
Here we present an example of the heatmap which have been obtained from the result of applying a random effects model to the object “objectMApathSim” and making use of a “zscor” scaling approach.
res <- heatmapPaths(objectMA=objectMApathSim, results,
scaling = "zscor", regulation = "all",breaks=c(-2,2),
fdrSig = 0.05, comES_Sig = 1, numSig=20, fontsize = 5)
## scaling using z-score...
The calculateESpath() function returns the effects size in each of the studies. Moreover, it calculates the variance of each of the effects.
#Effects <- calculateESpath(objectMApath = objectMApathSim, measure = "limma")
#Effects[[1]]["Simulated_Pathway",]
#Effects[[2]]["Simulated_Pathway",]
#Session info
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GSEMA_0.99.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Rdpack_2.6.2 DBI_1.2.3
## [3] pbapply_1.7-2 GSEABase_1.69.0
## [5] rlang_1.1.4 magrittr_2.0.3
## [7] matrixStats_1.4.1 compiler_4.4.2
## [9] RSQLite_2.3.9 png_0.1-8
## [11] vctrs_0.6.5 reshape2_1.4.4
## [13] stringr_1.5.1 SpatialExperiment_1.17.0
## [15] pkgconfig_2.0.3 crayon_1.5.3
## [17] fastmap_1.2.0 magick_2.8.5
## [19] XVector_0.47.0 utf8_1.2.4
## [21] rmarkdown_2.29 graph_1.85.0
## [23] UCSC.utils_1.3.0 purrr_1.0.2
## [25] bit_4.5.0.1 xfun_0.49
## [27] zlibbioc_1.52.0 cachem_1.1.0
## [29] beachmat_2.23.4 GenomeInfoDb_1.43.2
## [31] jsonlite_1.8.9 progress_1.2.3
## [33] blob_1.2.4 rhdf5filters_1.19.0
## [35] DelayedArray_0.33.3 Rhdf5lib_1.29.0
## [37] BiocParallel_1.41.0 prettyunits_1.2.0
## [39] irlba_2.3.5.1 parallel_4.4.2
## [41] singscore_1.27.0 R6_2.5.1
## [43] RColorBrewer_1.1-3 bslib_0.8.0
## [45] stringi_1.8.4 limma_3.63.2
## [47] numDeriv_2016.8-1.1 GenomicRanges_1.59.1
## [49] jquerylib_0.1.4 iterators_1.0.14
## [51] Rcpp_1.0.13-1 SummarizedExperiment_1.37.0
## [53] knitr_1.49 GSVA_2.1.2
## [55] IRanges_2.41.2 Matrix_1.7-1
## [57] tidyselect_1.2.1 abind_1.4-8
## [59] yaml_2.3.10 doParallel_1.0.17
## [61] codetools_0.2-20 metafor_4.6-0
## [63] lattice_0.22-6 tibble_3.2.1
## [65] plyr_1.8.9 Biobase_2.67.0
## [67] KEGGREST_1.47.0 evaluate_1.0.1
## [69] Biostrings_2.75.2 pillar_1.9.0
## [71] BiocManager_1.30.25 MatrixGenerics_1.19.0
## [73] metadat_1.2-0 foreach_1.5.2
## [75] stats4_4.4.2 generics_0.1.3
## [77] mathjaxr_1.6-0 hms_1.1.3
## [79] S4Vectors_0.45.2 ggplot2_3.5.1
## [81] sparseMatrixStats_1.19.0 munsell_0.5.1
## [83] scales_1.3.0 xtable_1.8-4
## [85] glue_1.8.0 pheatmap_1.0.12
## [87] maketools_1.3.1 tools_4.4.2
## [89] sys_3.4.3 ScaledMatrix_1.15.0
## [91] annotate_1.85.0 locfit_1.5-9.10
## [93] buildtools_1.0.0 XML_3.99-0.17
## [95] rhdf5_2.51.1 grid_4.4.2
## [97] impute_1.81.0 tidyr_1.3.1
## [99] rbibutils_2.3 SingleCellExperiment_1.29.1
## [101] AnnotationDbi_1.69.0 edgeR_4.5.1
## [103] colorspace_2.1-1 nlme_3.1-166
## [105] GenomeInfoDbData_1.2.13 BiocSingular_1.23.0
## [107] HDF5Array_1.35.2 cli_3.6.3
## [109] rsvd_1.0.5 fansi_1.0.6
## [111] S4Arrays_1.7.1 dplyr_1.1.4
## [113] gtable_0.3.6 sass_0.4.9
## [115] digest_0.6.37 BiocGenerics_0.53.3
## [117] SparseArray_1.7.2 farver_2.1.2
## [119] rjson_0.2.23 memoise_2.0.1
## [121] htmltools_0.5.8.1 lifecycle_1.0.4
## [123] httr_1.4.7 statmod_1.5.0
## [125] bit64_4.5.2