Title: | Gene Set Analysis Using the Gene Set Ordinal Association Test |
---|---|
Description: | Perform gene set enrichment analyses using the Gene set Ordinal Association Test (GOAT) algorithm and visualize your results. Koopmans, F. (2024) <doi:10.1101/2023.12.10.570979>. |
Authors: | Frank Koopmans [aut, cre] |
Maintainer: | Frank Koopmans <[email protected]> |
License: | Apache License (>= 2) |
Version: | 1.0 |
Built: | 2024-11-27 06:55:02 UTC |
Source: | CRAN |
test_genesets()
by geneset similarity (separately for each 'geneset source')cluster significant genesets from test_genesets()
by geneset similarity (separately for each 'geneset source')
cluster_genesets(x, genelist, hclust_method = "ward.D2")
cluster_genesets(x, genelist, hclust_method = "ward.D2")
x |
results from |
genelist |
should be the same as provided to |
hclust_method |
hierarchical clustering method, any of; 'ward.D', 'ward.D2' (default), 'single', 'complete', 'average' |
a list with elements genesets (param x
), similarity, hc_row, hc_col
naively darken a color by mixing in black
darken_color(color, frac = 0.1)
darken_color(color, frac = 0.1)
color |
input colors |
frac |
fraction of black; >0 and <1 |
adjusted value for input color
while the Bioconductor respository is extensive, contains data for many species and is a part of a larger infrastructure, it might contain outdated GO data when the user is not using the latest R version. If users are on an R version that is a few years old, so will the GO data from Bioconductor be.
As an alternative, we store gene2go data from NCBI (for Human genes only!) at the GOAT GitHub repository. This function allows for a convenient way to download this data and then parse the genesets.
Alternatively you can browse the file in the data branch of the GOAT GitHub repository and download these files manually,
then load them via the GOAT R function load_genesets_go_fromfile()
.
To view all available data you can open this URL in a browser; https://github.com/ftwkoopmans/goat/tree/data
New data is automatically added biannually. The first available version is 2024-01-01, the next 2024-06-01, then 2025-01-01, and so on.
download_genesets_goatrepo( output_dir, type = "GO", version = "2024-01-01", ignore_cache = FALSE )
download_genesets_goatrepo( output_dir, type = "GO", version = "2024-01-01", ignore_cache = FALSE )
output_dir |
full path to the directory where the downloaded files should be stored. Directory is created if it does not exist.
e.g. |
type |
the type of genesets to download. Currently, only "GO" is supported (default) |
version |
the dataset version. This must be a date in format YYYY-MM-DD. Example: "2024-01-01" (default). View all available versions at https://github.com/ftwkoopmans/goat/tree/data |
ignore_cache |
boolean, set to TRUE to force re-download and ignore cached data, if any. Default: FALSE |
result from respective geneset parser function. e.g. if parameter type
was set to"GO" (default), this function returns the result of load_genesets_go_fromfile()
. These data returned by this function is typically used as input for filter_genesets()
, c.f. full example at documentation for test_genesets()
# note: this example will download 2 files of approx 10MB in total # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/go" on unix systems or output_dir="C:/data/go" on Windows output_dir = tempdir() # download data files with GO annotations, version 2024-01-01 (default parameter) # these are then parsed with the load_genesets_go_fromfile() function # if the files are already available at output_dir, the download step is skipped genesets_asis = download_genesets_goatrepo(output_dir) ### for a basic example on how to use the data obtain here, ### refer to the example included at function documentation of: test_genesets()
# note: this example will download 2 files of approx 10MB in total # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/go" on unix systems or output_dir="C:/data/go" on Windows output_dir = tempdir() # download data files with GO annotations, version 2024-01-01 (default parameter) # these are then parsed with the load_genesets_go_fromfile() function # if the files are already available at output_dir, the download step is skipped genesets_asis = download_genesets_goatrepo(output_dir) ### for a basic example on how to use the data obtain here, ### refer to the example included at function documentation of: test_genesets()
Downloads OMICs-based datasets that were used in the GOAT manuscript from the GOAT GitHub page. This file is cached in the output directory and only needs to be downloaded once. Multiple datasets are included and their names include the respective PubMed identifiers (PMID).
If you encounter technical difficulties, try to;
download the file by copy/pasting this URL into your browser: https://github.com/ftwkoopmans/goat/raw/main/analyses/goat_manuscript_datasets.rda
load the data in R using the following 2 lines of code, here assuming you stored the downloaded file at C:/data/goat_manuscript_datasets.rda
load("C:/data/goat_manuscript_datasets.rda")
genelist = goat_manuscript_datasets.rda[["Wingo 2020:mass-spec:PMID32424284"]]
download_goat_manuscript_data(output_dir, ignore_cache = FALSE)
download_goat_manuscript_data(output_dir, ignore_cache = FALSE)
output_dir |
full path to the directory where the downloaded files should be stored. Directory is created if it does not exist.
e.g. |
ignore_cache |
boolean, set to TRUE to force re-download and ignore cached data, if any. Default: FALSE |
a list of genelist data tables. The names of the list represent the datasets, values in the list are data tables that can be used as a "genelist" in the GOAT R package
filter a geneset table; intersect with an array of genes-of-interest then apply cutoffs on min/max genes per geneset
filter_genesets( genesets, genelist, min_overlap = 10L, max_overlap = 1500L, max_overlap_fraction = 0.5, min_signif = NA, max_size = NA, dedupe = FALSE )
filter_genesets( genesets, genelist, min_overlap = 10L, max_overlap = 1500L, max_overlap_fraction = 0.5, min_signif = NA, max_size = NA, dedupe = FALSE )
genesets |
tibble with genesets, must contain columns 'id', 'genes' and 'ngenes' |
genelist |
tibble with genes, must contain column 'gene' and 'signif'. gene = character column, which are matched against list column 'genes' in genesets tibble. signif = boolean column (you can set all to FALSE if not performing Fisher-exact or hypergeometric test downstream) |
min_overlap |
integer, minimum number of genes in the |
max_overlap |
integer, maximum number of genes in the |
max_overlap_fraction |
analogous to |
min_signif |
expert setting for debugging and algorithm evaluation/benchmarking, NOT for regular geneset analyses. integer, minimum number of genes in the |
max_size |
integer, maximum number of genes in the geneset (i.e. prior to intersect with user's gene list provided as |
dedupe |
boolean, remove duplicate genesets (as determined after intersection with |
the input genesets
filtered for the subset of rows that match user's filter parameters
https://stackoverflow.com/a/8197703
gg_color_hue(n)
gg_color_hue(n)
n |
number of colors |
a color code (string)
note that this file lacks parent/child relations, so we only learn 'direct annotations'
go_gene2go(f, taxid_filter = 9606)
go_gene2go(f, taxid_filter = 9606)
f |
full path to gene2go file stored on the computer, e.g. previously downloaded from https://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz |
taxid_filter |
taxonomy id, integer |
a tibble with columns; source, source_version, id, name, genes, ngenes
note that we remove links between GO terms that are across GO domains (e.g. no CC to MF relations)
The only supported relations are those that match this regex;
"^(is_a:|relationship: part_of|relationship: regulates|relationship: positively_regulates|relationship: negatively_regulates)"
go_obo(f, rename_namespace = TRUE, remove_obsolete = TRUE)
go_obo(f, rename_namespace = TRUE, remove_obsolete = TRUE)
f |
full path to go.obo file stored on the computer, e.g. previously downloaded from http://current.geneontology.org/ontology/go.obo |
rename_namespace |
boolean; rename official namespace values like 'cellular_component' to CC? (analogous for BP and MF) |
remove_obsolete |
boolean; remove obsoleted terms? |
tibble with ontology terms and their relations
ASCII logo for this package
goat_logo()
goat_logo()
package logo as a string
there parameters are used by goat to efficiently perform geneset testing without bootstrapping
goat_nulldistributions
goat_nulldistributions
goat_nulldistributions
a data.frame with precomputed GOAT null distribution parameters
Print package version and logo to console
goat_print_version()
goat_print_version()
prints to console without returning a value
simple wrapper around utils::packageVersion()
goat_version()
goat_version()
package version as a string
download link: https://www.genenames.org/download/statistics-and-files/ table: "Complete dataset download links" –>> "Complete HGNC approved dataset" –>> download the "TXT" table filename is typically something like hgnc_complete_set.txt URL; https://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt
hgnc_idmap_table(filename)
hgnc_idmap_table(filename)
filename |
full path to the downloaded table (expected to be tsv format, typically has .txt or .tsv file extension) |
a long-format table with columns; hgnc_id, hgnc_symbol, type, value
table: "Total Approved Symbols" –>> "TXT" / "text file in TSV format" filename is typically something like non_alt_loci_set.txt
naively lighten a color by mixing in white
lighten_color(color, frac = 0.1)
lighten_color(color, frac = 0.1)
color |
input colors |
frac |
fraction of white; >0 and <1 |
adjusted value for input color
parse genesets in GMT format where gene identifiers are numeric Entrez gene IDs
load_genesets_gmtfile(filename, label)
load_genesets_gmtfile(filename, label)
filename |
input file for this function should be the full path to genesets defined in GMT format |
label |
a shortname for the genesets in this file, for example "GO_CC", "KEGG", "MY_DB_V1". This will be stored in the 'source' column of the resulting table. Importantly, multiple testing correction in GOAT is grouped by this 'source' column so you probably want to use a different label for each collection-of-genesets that you load. Must not be empty, only allowed characters are; upper/lower-case letter, numbers 0-9 and underscore |
tibble with columns; source (character), source_version (character), id (character), name (character), genes (list), ngenes (int)
URL: https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp#C5 download this data: KEGG subset of curated pathways –>> NCBI (Entrez) Gene IDs filename should be something like "c2.cp.kegg.v2023.1.Hs.entrez.gmt"
# TODO: update the filename to your downloaded file f = "C:/DATA/c2.cp.kegg.v2023.1.Hs.entrez.gmt" if(file.exists(f)) genesets_asis = load_genesets_gmtfile(f, label = "KEGG")
# TODO: update the filename to your downloaded file f = "C:/DATA/c2.cp.kegg.v2023.1.Hs.entrez.gmt" if(file.exists(f)) genesets_asis = load_genesets_gmtfile(f, label = "KEGG")
Download and import genesets from the GO database using the Bioconductor infrastructure.
Use the goat::load_genesets_go_fromfile
function for more fine-grained control over the GO database version that you use; it allows you to import NCBI gene2go files
load_genesets_go_bioconductor(include_child_annotations = TRUE)
load_genesets_go_bioconductor(include_child_annotations = TRUE)
include_child_annotations |
boolean; include annotations against child terms? In most situations, TRUE (default) is the desired setting |
Note that org.Hs.eg.db pulls data semi-annually from NCBI gene2go, but the GO database version returned by this function is tied to the version of the org.Hs.eg.db on your computer (this is controlled by the Bioconductor infrastructure).
The actual GO database version that is retrieved is returned by this function in the source_version
column.
table with columns; source (character), source_version (character), id (character), name (character), genes (list), ngenes (int)
This function is used to load Gene Ontology (GO) genesets from files that you manually downloaded from the links below. This enables the use of the latest data from GO (in contrast, Bioconductor GO data may lag behind current data considerably). To construct genesets from available raw data, download the "gene2go" file (the gene annotations) from below NCBI link and download the GO OBO (ontology terms and relations to respective parent/child terms) from below geneontology.org link. Provide the full path to the downloaded file to this function. Both "gzipped" and "uncompressed" files are supported.
We encourage you to rename the files after your downloaded them such that the date of download in incorporated; this ensures you can always keep track of the GO database version that was used! For example, rename the downloaded "gene2go.gz" file to "gene2go_2024-01-31.gz".
Download link for gene2go file; https://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
Download link for gene ontology OBO file; http://current.geneontology.org/ontology/go.obo
load_genesets_go_fromfile( file_gene2go, file_goobo, include_child_annotations = TRUE )
load_genesets_go_fromfile( file_gene2go, file_goobo, include_child_annotations = TRUE )
file_gene2go |
full path to the gene2go file from NCBI. Also works with the gzipped file gene2go.gz |
file_goobo |
full path to the OBO file from geneontology.org |
include_child_annotations |
boolean; include annotations against child terms? In most situations, TRUE (default) is the desired setting |
table with columns; source (character), source_version (character), id (character), name (character), genes (list), ngenes (int)
# TODO: update the filenames to your downloaded files file_gene2go = "C:/DATA/gene2go_2024-01-01.gz" file_goobo = "C:/DATA/go_2024-01-01.obo" if(file.exists(file_gene2go) && file.exists(file_goobo)) { genesets_asis = load_genesets_go_fromfile(file_gene2go, file_goobo) }
# TODO: update the filenames to your downloaded files file_gene2go = "C:/DATA/gene2go_2024-01-01.gz" file_goobo = "C:/DATA/go_2024-01-01.obo" if(file.exists(file_gene2go) && file.exists(file_goobo)) { genesets_asis = load_genesets_go_fromfile(file_gene2go, file_goobo) }
Workflow;
obtain the input file from; https://www.syngoportal.org
click "bulk download SynGO release ..." for SynGO release of interest
unzip
call this function with the full file path to the 'syngo_ontologies.xlsx' file
load_genesets_syngo(filename, gene_database = "entrez")
load_genesets_syngo(filename, gene_database = "entrez")
filename |
full path to the "syngo_ontologies.xlsx" file that was extracted from a SynGO bulk download ZIP archive |
gene_database |
gene IDs to return. must be any of; "entrez" (default), "hgnc", "ensembl" |
table with columns; source (character), source_version (character), id (character), name (character), genes (list), ngenes (int)
# TODO: update the filename to your downloaded file f = "C:/DATA/SynGO_bulk_download_release_20210225/syngo_ontologies.xlsx" if(file.exists(f)) genesets_asis = load_genesets_syngo(f)
# TODO: update the filename to your downloaded file f = "C:/DATA/SynGO_bulk_download_release_20210225/syngo_ontologies.xlsx" if(file.exists(f)) genesets_asis = load_genesets_syngo(f)
-log10 transform a vector of p-values, replacing zeros with some limit/threshold
minlog10_fixzero(x, limit = 2.22e-16)
minlog10_fixzero(x, limit = 2.22e-16)
x |
p-value vector to transform to -log10 |
limit |
value to replace zero's in |
input parameter x
transformed to -log10
pval = c(0, 10^-6, 0.001, 0.01, 1, NA, -Inf, Inf, NaN) cbind( input = pval, # default; replace zeros with typical R machine precision for doubles minlog10_default = minlog10_fixzero(pval), # alternatively, replace zero with lowest non-zero pvalue in input minlog10_limit_from_data = minlog10_fixzero(pval, limit = NA) )
pval = c(0, 10^-6, 0.001, 0.01, 1, NA, -Inf, Inf, NaN) cbind( input = pval, # default; replace zeros with typical R machine precision for doubles minlog10_default = minlog10_fixzero(pval), # alternatively, replace zero with lowest non-zero pvalue in input minlog10_limit_from_data = minlog10_fixzero(pval, limit = NA) )
Adjust p-values for all genesets, grouped by 'source' then adjust for the number of 'sources'
padjust_genesets( genesets, method = "BH", cutoff = 0.01, correct_sources = TRUE )
padjust_genesets( genesets, method = "BH", cutoff = 0.01, correct_sources = TRUE )
genesets |
tibble with genesets, must contain column 'pvalue' |
method |
method for multiple testing correction, must be any of |
cutoff |
numeric cutoff value for adjusted p-value, |
correct_sources |
apply Bonferroni adjustment to all p-values according to the number of geneset sources that were tested. Boolean parameter, set TRUE to enable (default) or FALSE to disable |
updated genesets
table
This can be convenient to prepare the significant/test/foreground set for classical ORA,
e.g. test_genesets()
with parameter method = "fisherexact"
. Note that the GOAT geneset
enrichment algorithm does not use data in the 'signif' column of the input genelist.
partition_genes( genes, col, decreasing = FALSE, use_abs = FALSE, cutoff = NULL, fraction = NULL, topn = NULL )
partition_genes( genes, col, decreasing = FALSE, use_abs = FALSE, cutoff = NULL, fraction = NULL, topn = NULL )
genes |
gene tibble where each row is a unique gene, must contain column name |
col |
column name in |
decreasing |
order |
use_abs |
use absolute values (default FALSE), e.g. when setting a threshold on effect-sizes |
cutoff |
threshold for values in |
fraction |
fraction of rows in |
topn |
number of rows in |
input table genes
with results in the "signif" column
# note: this example will download 1 files of approx 4MB # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() # Download an example gene list, i.e. one of the datasets analyzed in the GOAT manuscript. datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` # example 1: significant hits genelist = partition_genes(genelist, col="pvalue_adjust", decreasing=FALSE, cutoff=0.01) cat(sum(genelist$signif), "/", nrow(genelist), "are signif\n") # example 2: abs(effectsize) >= 5 genelist = partition_genes(genelist, col="effectsize", decreasing=TRUE, use_abs=TRUE, cutoff=5) cat(sum(genelist$signif), "/", nrow(genelist), "are signif\n") # example 3: top 10% 'best' p-values genelist = partition_genes(genelist, col="pvalue", decreasing=FALSE, fraction = 0.1) cat(sum(genelist$signif), "/", nrow(genelist), "are signif\n")
# note: this example will download 1 files of approx 4MB # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() # Download an example gene list, i.e. one of the datasets analyzed in the GOAT manuscript. datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` # example 1: significant hits genelist = partition_genes(genelist, col="pvalue_adjust", decreasing=FALSE, cutoff=0.01) cat(sum(genelist$signif), "/", nrow(genelist), "are signif\n") # example 2: abs(effectsize) >= 5 genelist = partition_genes(genelist, col="effectsize", decreasing=TRUE, use_abs=TRUE, cutoff=5) cat(sum(genelist$signif), "/", nrow(genelist), "are signif\n") # example 3: top 10% 'best' p-values genelist = partition_genes(genelist, col="pvalue", decreasing=FALSE, fraction = 0.1) cat(sum(genelist$signif), "/", nrow(genelist), "are signif\n")
plot the geneset similarity matrix as a heatmap
plot_heatmap( x, output_dir, colors = grDevices::hcl.colors(100, "Viridis", rev = FALSE), fontsize = 10 )
plot_heatmap( x, output_dir, colors = grDevices::hcl.colors(100, "Viridis", rev = FALSE), fontsize = 10 )
x |
result from |
output_dir |
set to NA to directly show the figures instead of writing them to file.
Otherwise, this is the full path to the directory where the downloaded files should be stored. Directory is created if it does not exist.
e.g. |
colors |
a vector of 100 colors to be used for the heatmap (101 breaks are computed between 0 and the max value in the distance matrix) |
fontsize |
parameter sent to pheatmap::pheatmap(); control the size of labels in the plot, defaults to 10. Note that you can also change the plot device size, see examples |
does not return a value, plots are printed to device or files depending on output_dir
parameter
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain geneset results datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) # prior to running this function, cluster the genesets clusters = cluster_genesets(result, genelist) # use the plot heatmap function and try various color palettes plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Viridis", rev = FALSE)) plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Inferno", rev = FALSE)) plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Lajolla", rev = TRUE)) plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Mako", rev = FALSE)) plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Turku", rev = TRUE)) plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Grays", rev = TRUE))
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain geneset results datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) # prior to running this function, cluster the genesets clusters = cluster_genesets(result, genelist) # use the plot heatmap function and try various color palettes plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Viridis", rev = FALSE)) plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Inferno", rev = FALSE)) plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Lajolla", rev = TRUE)) plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Mako", rev = FALSE)) plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Turku", rev = TRUE)) plot_heatmap(clusters, output_dir, colors = hcl.colors(100, "Grays", rev = TRUE))
Lollipop chart or barplot visualization of geneset enrichment testing results
plot_lollipop( x, output_dir, only_reduced = FALSE, plot_type = "lollipop", show_pvalue = FALSE, score_xaxis = "minlogp", score_color = ifelse(is.data.frame(x) && "score_type" %in% colnames(x) && is.character(x$score_type) && any(x$score_type %in% c("effectsize_up", "effectsize_down")), "updown", "minlogp"), score_color_limits = "source", score_color_updown = c("#E57373", "#5E7ABC"), max_ngenes = NA, topn = NA, padj_cutoff = NA )
plot_lollipop( x, output_dir, only_reduced = FALSE, plot_type = "lollipop", show_pvalue = FALSE, score_xaxis = "minlogp", score_color = ifelse(is.data.frame(x) && "score_type" %in% colnames(x) && is.character(x$score_type) && any(x$score_type %in% c("effectsize_up", "effectsize_down")), "updown", "minlogp"), score_color_limits = "source", score_color_updown = c("#E57373", "#5E7ABC"), max_ngenes = NA, topn = NA, padj_cutoff = NA )
x |
results from function |
output_dir |
set to NA to directly show the figures instead of writing them to file.
Otherwise, this is the full path to the directory where the downloaded files should be stored. Directory is created if it does not exist.
e.g. |
only_reduced |
only show the reduced/summarized set of significant genesets. This requires that you first applied the |
plot_type |
Options: "barplot", "lollipop" (default) |
show_pvalue |
boolean parameter that indicates whether adjusted p-values should be shown next to each data point |
score_xaxis |
type of score to show on the x-axis. Options: "minlogp" to show -log10(adjusted pvalue), which is default. Use "oddsratio" to show the enrichment of significant genes.
For further details on this score and its computation, see the |
score_color |
analogous to |
score_color_limits |
defines the limits for the color scales. options; |
score_color_updown |
array of 2 strings that describe the colors for up- and down-regulation (used when |
max_ngenes |
only plot terms with less than N genes (quick way to get rid of large/unspecific terms) |
topn |
topn terms to show after sorting genesets by p-value. For example, this makes it easy to plot the top10 GO terms. Set to NA to ignore this filter (default) |
padj_cutoff |
adjusted pvalue cutoff for terms shown in the plot. If set to NA (default), all significant genesets are used (i.e. 'signif' column in the input geneset table) |
if output_dir
is NA
, a list of ggplot2 objects. Otherwise, write plots to file and do not return any value
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/go" on unix systems or output_dir="C:/data/go" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain geneset results datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) # generate lollipop charts for each GO domain (CC/BP/MF), with geneset -log10 # adjusted p-value on the x-axis and color-coding by geneset up/down-regulation plot_lollipop( result, output_dir, plot_type = "lollipop", topn = 50, score_xaxis = "minlogp", score_color = "updown" ) # alternatively, as a barplot plot_lollipop( result, output_dir, plot_type = "barplot", topn = 50, score_xaxis = "minlogp", score_color = "updown" ) # alternatively, color-code genesets by enrichment of significant genes using # parameter `score_color="oddsratio"` . See further `score_geneset_oddsratio` # function documentation for definition/computation of this score. plot_lollipop( result, output_dir, plot_type = "lollipop", topn = 50, score_xaxis = "minlogp", score_color = "oddsratio" )
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/go" on unix systems or output_dir="C:/data/go" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain geneset results datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) # generate lollipop charts for each GO domain (CC/BP/MF), with geneset -log10 # adjusted p-value on the x-axis and color-coding by geneset up/down-regulation plot_lollipop( result, output_dir, plot_type = "lollipop", topn = 50, score_xaxis = "minlogp", score_color = "updown" ) # alternatively, as a barplot plot_lollipop( result, output_dir, plot_type = "barplot", topn = 50, score_xaxis = "minlogp", score_color = "updown" ) # alternatively, color-code genesets by enrichment of significant genes using # parameter `score_color="oddsratio"` . See further `score_geneset_oddsratio` # function documentation for definition/computation of this score. plot_lollipop( result, output_dir, plot_type = "lollipop", topn = 50, score_xaxis = "minlogp", score_color = "oddsratio" )
plot geneset distance matrix as a network
plot_network( clusters, src, show_clusters = TRUE, show_text = FALSE, topn_edges = 5, clr_default = "#29b6f6", node_color_palette = goat::gg_color_hue )
plot_network( clusters, src, show_clusters = TRUE, show_text = FALSE, topn_edges = 5, clr_default = "#29b6f6", node_color_palette = goat::gg_color_hue )
clusters |
result from |
src |
source property (e.g. "GO_CC") |
show_clusters |
boolean value |
show_text |
boolean value |
topn_edges |
topN edges to retain per geneset (typically 5~8) |
clr_default |
default color for the network, used only when |
node_color_palette |
function with 1 parameter, N, that returns N colors (default is goat function |
a ggplot2 object
For each provided geneset, a volcano plot of all genelist log2fc and p-values with respective geneset constituents highlighted
plot_volcano( x, genelist, plot = TRUE, topn_labels = 0, color_default = "#B0B0B0", color_highlight = "#ef5350", color_label = "#000000", pointsize = 2, pointalpha = 0.75, labelsize = 7 )
plot_volcano( x, genelist, plot = TRUE, topn_labels = 0, color_default = "#B0B0B0", color_highlight = "#ef5350", color_label = "#000000", pointsize = 2, pointalpha = 0.75, labelsize = 7 )
x |
a subset of the results from |
genelist |
input genelist, must contain columns 'gene', 'log2fc' and 'pvalue_adjust' (not! log transformed). If parameter topn_labels is provided, also include a character column 'symbol' that contains gene names/symbols/labels |
plot |
if |
topn_labels |
for how many genes that overlap between genelist and a geneset should we plot the gene symbol? This requires a column 'symbol' in the genelist parameter (default: 0) |
color_default |
color for genes that are not part of a geneset (default: grey) |
color_highlight |
color used to highlight geneset constituents (default: red) |
color_label |
provided that topn_labels is set, this is the color of the text labels (default: black) |
pointsize |
size of the dots, this parameter is passed to geom_point (default: 2) |
pointalpha |
alpha of the dots, this parameter is passed to geom_point (default: 0.75) |
labelsize |
provided that topn_labels is set, this is the text size (in pt) for the labels (default: 7) |
if plot==FALSE
, a list of ggplot2 objects. Otherwise, does not return any value
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain geneset results datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) ## example 1; select top10 GO CC terms from the geneset testing results result_subset = result |> filter(source == "GO_CC") |> arrange(pvalue) |> head(n = 10) pdf(paste0(output_dir, "/volcano_CC_top10.pdf"), width = 4, height = 4) plot_volcano(result_subset, genelist) dev.off() ## example 2;, select small genesets that are significant and have ## near-exclusive enrichment in either up up/down-regulated genes # first, add geneset directionality scores to our results result = score_geneset_directionality(result, genelist) # next, subset the geneset result table result_subset = result |> filter(signif & ngenes <= 50 & abs(score_directionality_rank) > 0.6) |> arrange(pvalue_adjust) # finally, create plots. Note that the genelist contains a column 'symbol' # which we use here to print labels for the topN genes per plotted geneset pdf(paste0(output_dir, "/volcano_signif_ngenes50_directionality06.pdf"), width = 4, height = 4) plot_volcano(result_subset, genelist, topn_labels = 10) dev.off()
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain geneset results datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) ## example 1; select top10 GO CC terms from the geneset testing results result_subset = result |> filter(source == "GO_CC") |> arrange(pvalue) |> head(n = 10) pdf(paste0(output_dir, "/volcano_CC_top10.pdf"), width = 4, height = 4) plot_volcano(result_subset, genelist) dev.off() ## example 2;, select small genesets that are significant and have ## near-exclusive enrichment in either up up/down-regulated genes # first, add geneset directionality scores to our results result = score_geneset_directionality(result, genelist) # next, subset the geneset result table result_subset = result |> filter(signif & ngenes <= 50 & abs(score_directionality_rank) > 0.6) |> arrange(pvalue_adjust) # finally, create plots. Note that the genelist contains a column 'symbol' # which we use here to print labels for the topN genes per plotted geneset pdf(paste0(output_dir, "/volcano_signif_ngenes50_directionality06.pdf"), width = 4, height = 4) plot_volcano(result_subset, genelist, topn_labels = 10) dev.off()
rank^2 can yield huge values, e.g. a large genelist of N=50000 genes would imply max gene score = 50000^2 = 2.5e+09. Thus we rescale these scores between 0~1000 so downstream applications (e.g. sum of 10000 gene scores) don't explode to huge numbers.
rankscore(x, sort1, sort2, sort3, colname)
rankscore(x, sort1, sort2, sort3, colname)
x |
data.frame to sort (i.e. the genelist) |
sort1 |
numeric vector of length |
sort2 |
numeric vector of length |
sort3 |
numeric vector of length |
colname |
name for result column in |
data.frame x
with added column colname
, containing gene scores between 0 and 1000
x = data.frame(gene=c(1, 2, 3, 4, 5), pvalue=c(0.01, 1, 1, 0.1, 1), effectsize=c(-2, 0.25, 0.5, 1, 0.25)) print(x, row.names = FALSE) print(rankscore(x, sort1 = -1*x$pvalue, sort2 = abs(x$effectsize), sort3 = x$gene, colname="score") |> arrange(desc(score)), row.names = FALSE)
x = data.frame(gene=c(1, 2, 3, 4, 5), pvalue=c(0.01, 1, 1, 0.1, 1), effectsize=c(-2, 0.25, 0.5, 1, 0.25)) print(x, row.names = FALSE) print(rankscore(x, sort1 = -1*x$pvalue, sort2 = abs(x$effectsize), sort3 = x$gene, colname="score") |> arrange(desc(score)), row.names = FALSE)
Gene score array, from low to high scores
rankscore_fixed_order(n)
rankscore_fixed_order(n)
n |
genelist length |
result of (1:n)^2 / (n^2 / 1000)
Analyses are performed independently per 'source' of genesets. The result of this function is the geneset table with a newly appended column 'signif_and_reduced'
reduce_genesets( clusters, simscore_threshold = 0.9, universe_fraction = 0.25, signifgenes_fraction = 0.9 )
reduce_genesets( clusters, simscore_threshold = 0.9, universe_fraction = 0.25, signifgenes_fraction = 0.9 )
clusters |
results from |
simscore_threshold |
similarity score (0~1) that is required to consider one geneset to be a "parent term" of another. Setting a lower value will yield fewer genesets / stronger summarization. Typical settings for this parameter are 0.8~0.99 (0.9 is default) |
universe_fraction |
discard genesets that cover more than X fraction of all genes in the universe (unique set of genes covered by all significant genesets). Setting this to 0.25 will deprioritize genesets that cover 25% of all genes (in significant genesets). This prevents very generic GO terms like "protein-containing complex" to be included in results. Typical settings for this parameter are 0.1~0.5 (0.25 is default) |
signifgenes_fraction |
the minimum fraction of "foreground genes" ('genes_signif' column) found across all significant genesets that should be covered by the reduced geneset collection. This parameter doesn't do anything if there are fewer than 5 "foreground genes" alltogether. Typical settings for this parameter are 0.75~0.95 (0.9 is default) |
the genesets table from the clusters
parameter, with results in column "signif_and_reduced"
Works for any filtered geneset table generated by this package, e.g. results from any of these functions;
filter_genesets()
, test_genesets()
or simplify_genesets()
.
The genelist table is required to 1) lookup gene symbols and 2) determine the ordering of genes within-geneset. The latter makes it such that the output table shows the most important genes (in context of user's data) first.
All output columns that list gene identifiers or symbols are trimmed to 25000 characters. Depending on your input gene identifier type and dataset filtering settings (e.g. max geneset size), this might lead to some truncation. For example, 15 characters per ensembl gene ID (if these are used in provided genesets instead of default NCBI Entrez) and 1 char as delimiter, only the top ~1500 ensembl gene identifiers are included. If your geneset filtering allows for larger genesets and you want to check if any gene identifiers are lost upon saving, keep an eye out for the trailing ellipsis ('...'). Because genes in each geneset of the output table are sorted by order of importance as indicated in the genelist table, this shouldn't be an issue in typical use-cases where the output table is to check the topN most important/significant genes of a geneset of interest.
save_genesets(x, genelist, filename, arrange_genes = TRUE)
save_genesets(x, genelist, filename, arrange_genes = TRUE)
x |
result from |
genelist |
same as provided for |
filename |
full path to the output file. Supported file extensions; csv, tsv, xlsx. Optionally, set to NA to not write to disk and return the result table instead |
arrange_genes |
set to TRUE (default) to arrange the genelist table by best p-value on top (if column 'pvalue' exists), alternatively by descending absolute effectsize (if no pvalue but effectize is available). Set FALSE to use sorting of the genelist table as-is |
if filename
is NA
, returns the validated and formatted geneset result table. Otherwise, writes the table to file and does not return a value
Importantly, the scope/utility of this score is limited to help users post-hoc filter for genesets that contain mostly up/down-regulated genes.
However, this might not coincide with the geneset pvalues / significance. For example,
genesets may exclusively contain genes with a positive effectsize but at the same time these can all be minor effects/values and thus the geneset
is not significant or less significant than other genesets with the exact same "directionality score".
For example, genesets may contain both up- and down-regulated genes but still be significant when testing with GOAT and using score_type='effectsize'
The scores computed with this function can help in post-hoc interpretation of GOAT results to further narrow down all significant genesets to a
subset with strong directionality. For example, after test_genesets()
we can filter the results for
A) significant genesets and B) that contain at most N genes and C) that are near-exclusively up/down-regulated.
Bringing this all together (also useful for other types of geneset testing, like ORA, score_type="pvalue", etc);
result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05)
result = score_geneset_directionality(result, genelist)
result |> filter(signif & ngenes <= 100 & abs(score_directionality_rank) >= 0.95)
score_geneset_directionality(genesets, genelist)
score_geneset_directionality(genesets, genelist)
genesets |
tibble with genesets, must contain columns 'source', 'id', 'genes' |
genelist |
tibble with genes, must contain columns 'gene', 'effectsize' |
input genesets
with results in 3 columns;
score_directionality_rank
is the weighted gene score where gene values are the sign of their effectsize and weights are linearly proportional to their inverse ranks in the input genelist.
score_directionality_rank2
is similar, but now using rank^2 gene weights to boost the influence of most-important genes in the input genelist.
score_directionality_value
uses the absolute gene effectsizes as gene weights
Note that the latter is least robust as it depends on the scaling of input data!
geneset directionality score = weighted mean of respective genes, where gene weights are 1 minus their relative rank in up/down-regulation (depending on negative/positive effectsize) and the value for each gene is -1 or 1 depending on up/down-regulation (sign of effectsize).
gene values and weights A) gene weight between 0 and 1 for the subset of upregulated genes / positive effectsizes;
r = for the subset of genes with effectsize > 0, compute gene rank (1 = highest effectsize, N = smallest effectsize that is greater than zero)
weight = 1 - r/max(r) B) analogous to (A) for the subset of genes with negative effectsize C) result per gene: value = sign of its effectsize, weight = 0 if effectsize 0, otherwise respective weights from (A) or (B)
geneset score_directionality = weighted mean over values/weights of respective genes
gs_signif = number of significant genes in geneset G that intersect with user's genelist (i.e. foreground genes in G) gs_all = number of genes in geneset G that intersect with user's genelist (i.e. foreground+background genes in G) k_signif = total number of significant genes in user's genelist k_all = total number of genes in user's genelist
gs_signif/gs_all = ratio of foreground genes in geneset G k_signif/k_all = ratio of overall foreground genes (i.e. expected value for a random geneset)
oddsratio = (gs_signif/gs_all) / (k_signif/k_all)
score_geneset_oddsratio(genesets, genelist)
score_geneset_oddsratio(genesets, genelist)
genesets |
tibble with genesets, must contain columns 'source', 'id', 'ngenes', 'ngenes_signif' |
genelist |
tibble with genes, must contain columns 'gene', 'signif' |
input genesets
with results in column "score_oddsratio"
replacement for stringr::trunc() so we don't need a package dependency for just 1 function (our code was adapter therefrom)
string_trunc_right(string, width, trim_left = FALSE)
string_trunc_right(string, width, trim_left = FALSE)
string |
string that should be truncated |
width |
desired max length |
trim_left |
instead of right trunc (default), do left instead |
truncated variant of input string
Map the the symbol column in a table to HGNC human gene IDs by matching official gene symbols and synonyms
symbol_to_entrez(x, hgnc)
symbol_to_entrez(x, hgnc)
x |
a data.table with a column symbol |
hgnc |
HGNC lookup table from |
entrez gene IDs are returned in the "gene" column of table x
. Additionally, columns "entrez_id", "hgnc_id" and "hgnc_symbol"
# TODO: update the filename to your downloaded file # download instructions in the documentation of `hgnc_idmap_table()` f = "C:/DATA/HGNC/hgnc_complete_set.txt" if(file.exists(f)) { df = data.frame(symbol = c("vamp2", "STXBP1", "UNC18", NA, "PSD95", "NOT-A-GENE")) hgnc = hgnc_idmap_table(f) df = symbol_to_entrez(df, hgnc) print(df) }
# TODO: update the filename to your downloaded file # download instructions in the documentation of `hgnc_idmap_table()` f = "C:/DATA/HGNC/hgnc_complete_set.txt" if(file.exists(f)) { df = data.frame(symbol = c("vamp2", "STXBP1", "UNC18", NA, "PSD95", "NOT-A-GENE")) hgnc = hgnc_idmap_table(f) df = symbol_to_entrez(df, hgnc) print(df) }
Perform geneset enrichment testing using any supported method
test_genesets( genesets, genelist, method, padj_method = "BH", padj_sources = TRUE, padj_cutoff = 0.01, padj_min_signifgenes = 0L, ... )
test_genesets( genesets, genelist, method, padj_method = "BH", padj_sources = TRUE, padj_cutoff = 0.01, padj_min_signifgenes = 0L, ... )
genesets |
tibble with genesets, must contain columns 'source', 'source_version', 'id', 'name', 'genes', 'ngenes', 'ngenes_signif' |
genelist |
tibble with genes, must contain column 'gene' and 'test'. gene = character column, which are matched against list column 'genes' in genesets tibble. test = boolean column (you can set all to FALSE if not performing Fisher-exact or hypergeometric test downstream) |
method |
method for overrepresentation analysis. Options: "goat", "hypergeometric", "fisherexact", "fisherexact_ease", "gsea", "idea" |
padj_method |
first step of multiple testing correction; method for p-value adjustment, passed to |
padj_sources |
second step of multiple testing correction; apply Bonferroni adjustment to all p-values according to the number of geneset sources that were tested. Boolean parameter, set TRUE to enable (default) or FALSE to disable |
padj_cutoff |
cutoff for adjusted p-value, |
padj_min_signifgenes |
if a value larger than zero is provided, this will perform additional post-hoc filtering; after p-value adjustment, set the |
... |
further parameters are passed to the respective stats method |
After application of the enrichment testing algorithm (e.g. GOAT, ORA or GSEA), multiple testing correction is applied to obtain adjusted p-values using padjust_genesets
.
That function will first apply the specified pvalue adjustment procedure in the padj_method
parameter within each 'source' in the genesets table. Second, it applies Bonferroni adjustment to all p-values according to the number of different geneset sources that were tested (or set padj_sources = FALSE
to disable).
For example, if the input is a genesets table that contains GO_CC, GO_BP and GO_MF genesets, first multiple testing correction is applied within each source (e.g. using FDR if so desired) and afterwards a Bonferroni correction is applied based on 3 repeated analyses.
Note that this is more rigorous than typical GO tools; hypothetically, one could split all GO_CC pathways into 1000 different databases/'sources' and then run enrichment testing. Consequently, the multiple testing burden is reduced if one doesn't adjust p-values for the number of 'sources' as we do here.
the input genesets
, with results stored in columns 'pvalue', 'pvalue_adjust' and 'signif'
#' # note; this example downloads data when first run, and typically takes ~60seconds ## Basic example for a complete GOAT workflow # Downloads test data to your computer and stores it at current working directory # Refer to the GitHub documentation for elaborate documentation and a worked example # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/go" on unix systems or output_dir="C:/data/go" on Windows output_dir = tempdir() # download an example gene list datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` # download GO genesets genesets_asis = download_genesets_goatrepo(output_dir) # filter genesets for sufficient overlap with the genelist, then apply GOAT genesets_filtered = filter_genesets(genesets_asis, genelist) result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) # print first 10 rows of the result table print(result |> select(source, name, ngenes, pvalue_adjust) |> utils::head(n=10))
#' # note; this example downloads data when first run, and typically takes ~60seconds ## Basic example for a complete GOAT workflow # Downloads test data to your computer and stores it at current working directory # Refer to the GitHub documentation for elaborate documentation and a worked example # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/go" on unix systems or output_dir="C:/data/go" on Windows output_dir = tempdir() # download an example gene list datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` # download GO genesets genesets_asis = download_genesets_goatrepo(output_dir) # filter genesets for sufficient overlap with the genelist, then apply GOAT genesets_filtered = filter_genesets(genesets_asis, genelist) result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) # print first 10 rows of the result table print(result |> select(source, name, ngenes, pvalue_adjust) |> utils::head(n=10))
In most cases, it's more convenient to call the more generic test_genesets
function which also deals with multiple-testing correction (per geneset source)
It is assumed that the genesets
and genelist
parameters are in sync, i.e. genesets
provided
here is the result of the filter_genesets()
function (using the same genelist
table)
Same as hypergeometric for non-EASE, but slower because stats::fisher.test isn't vectorized
Only genesets with at least 1 significant gene are subjected to statistical testing (e.g. NA is returned for genesets without significant genes)
test_genesets_fisherexact( genesets, genelist, require_nsignif = 1L, use_ease = FALSE )
test_genesets_fisherexact( genesets, genelist, require_nsignif = 1L, use_ease = FALSE )
genesets |
tibble with genesets, must contain columns 'id', 'ngenes' and 'ngenes_signif' |
genelist |
tibble with genes, must contain column 'signif'. The number of rows in this table (where signif is not NA) is assumed to be the total number of tested genes, the number of rows where signif==TRUE is assumed the total number of significant genes. |
require_nsignif |
minimum number of 'signif genes' that overlap with a geneset; |
use_ease |
use a more conservative score coined by DAVID online tools @ https://david.ncifcrf.gov/helps/functional_annotation.html#fisher |
input genesets
table with results in the "pvalue" column
test_genesets
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain input data datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) ## example: same results between Fisher-exact and hypergeometric tests result_hg = test_genesets_hypergeometric(genesets_filtered, genelist, require_nsignif = 3L) result_fe = test_genesets_fisherexact(genesets_filtered, genelist, require_nsignif = 3L) all.equal(result_hg$pvalue, result_fe$pvalue)
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain input data datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) ## example: same results between Fisher-exact and hypergeometric tests result_hg = test_genesets_hypergeometric(genesets_filtered, genelist, require_nsignif = 3L) result_fe = test_genesets_fisherexact(genesets_filtered, genelist, require_nsignif = 3L) all.equal(result_hg$pvalue, result_fe$pvalue)
In typical use-cases, one applies test_genesets()
instead with parameter method="goat"
,
which in turn will use test_genesets_goat_precomputed
(and not this function).
test_genesets_goat_bootstrap( genesets, genelist, score_type, niter = 5e+05, verbose = FALSE )
test_genesets_goat_bootstrap( genesets, genelist, score_type, niter = 5e+05, verbose = FALSE )
genesets |
see |
genelist |
see |
score_type |
see |
niter |
integer number of bootstrap iterations; at least 10000, at most 5000000 |
verbose |
boolean, create debug plots |
see test_genesets_goat_precomputed
test_genesets_goat_precomputed
that does not use previously prepared parametersIn typical use-cases, one applies test_genesets()
instead with parameter method="goat"
,
which in turn will use test_genesets_goat_precomputed
(and not this function).
test_genesets_goat_fitfunction( genesets, genelist, score_type, niter = 5e+05, verbose = FALSE )
test_genesets_goat_fitfunction( genesets, genelist, score_type, niter = 5e+05, verbose = FALSE )
genesets |
see |
genelist |
see |
score_type |
see |
niter |
integer number of bootstrap iterations; at least 10000, at most 5000000 |
verbose |
boolean, create debug plots |
Optionally, use this function to ignore precomputed/bundled null distribution estimates and perform new permutations and function fitting (e.g. because you want to test the effect of huge niter parameter, but beware of RAM requirements)
see test_genesets_goat_precomputed
In most cases, it's more convenient to call the more generic test_genesets
function which also applies multiple-testing correction (per geneset source) to the geneset p-values computed by this function.
This is the canonical geneset test function for GOAT that uses precomputed null distributions that are bundled with the GOAT package
test_genesets_goat_precomputed(genesets, genelist, score_type)
test_genesets_goat_precomputed(genesets, genelist, score_type)
genesets |
genesets data.frame, must contain columns; "source", "id", "genes", "ngenes" |
genelist |
genelist data.frame, must contain columns "gene" and "pvalue"/"effectsize" (depending on parameter |
score_type |
how to compute gene scores?
Option "pvalue" uses values from the pvalue column in |
input genesets
table with results in the "pvalue", "score_type" columns.
"zscore" column:
A standardized z-score is computed from geneset p-values + effectsize direction (up/down) if tested.
Importantly, we here return standardized z-scores because the GOAT geneset score (mean of gene scores) is relative to the respective geneset-size-matched null distributions (a skewed normal)!
In contrast, the standardized z-scores are comparable between genesets (as are the pvalues obviously).
Only if either (or both) the effectsize-up/down was tested, the direction of regulation has been tested (effectsize_abs and pvalue score types are agnositic to up/down regulation). So when score_type was set to any of effectsize/effectsize_down/effectsize_up, the z-scores are negative values in case the "score_type" output column is "effectsize_down".
test_genesets
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain input data datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) ### we here compare GOAT with precomputed null distributions against ### a GOAT function that performs bootstrapping to compute null distributions on-demand # apply goat with precomputed null (default) and goat with on-demand bootstrapping result_precomputed = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) |> # undo sorting by p-value @ test_genesets(), instead sort by stable IDs arrange(source, id) result_bootstrapped = test_genesets(genesets_filtered, genelist, method = "goat_bootstrap", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05, verbose = TRUE) |> arrange(source, id) # tables should align stopifnot(result_precomputed$id == result_bootstrapped$id) # no missing values stopifnot(is.finite(result_precomputed$pvalue) & is.finite(is.finite(result_bootstrapped$pvalue))) # compare results plot(result_precomputed$pvalue, result_bootstrapped$pvalue) abline(0, 1, col=2) plot(minlog10_fixzero(result_precomputed$pvalue), minlog10_fixzero(result_bootstrapped$pvalue)) abline(0, 1, col=2) summary(minlog10_fixzero(result_precomputed$pvalue) - minlog10_fixzero(result_bootstrapped$pvalue))
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain input data datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) ### we here compare GOAT with precomputed null distributions against ### a GOAT function that performs bootstrapping to compute null distributions on-demand # apply goat with precomputed null (default) and goat with on-demand bootstrapping result_precomputed = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) |> # undo sorting by p-value @ test_genesets(), instead sort by stable IDs arrange(source, id) result_bootstrapped = test_genesets(genesets_filtered, genelist, method = "goat_bootstrap", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05, verbose = TRUE) |> arrange(source, id) # tables should align stopifnot(result_precomputed$id == result_bootstrapped$id) # no missing values stopifnot(is.finite(result_precomputed$pvalue) & is.finite(is.finite(result_bootstrapped$pvalue))) # compare results plot(result_precomputed$pvalue, result_bootstrapped$pvalue) abline(0, 1, col=2) plot(minlog10_fixzero(result_precomputed$pvalue), minlog10_fixzero(result_bootstrapped$pvalue)) abline(0, 1, col=2) summary(minlog10_fixzero(result_precomputed$pvalue) - minlog10_fixzero(result_bootstrapped$pvalue))
In most cases, it's more convenient to call the more generic test_genesets
function which also deals with multiple-testing correction (per geneset source)
test_genesets_gsea( genesets, genelist, score_type = NULL, parallel_threads = 1L, gseaParam = 1, nPermSimple = 50000, gsea_genelist_col = NULL, gsea_scoretype = NULL, random_seed = 123 )
test_genesets_gsea( genesets, genelist, score_type = NULL, parallel_threads = 1L, gseaParam = 1, nPermSimple = 50000, gsea_genelist_col = NULL, gsea_scoretype = NULL, random_seed = 123 )
genesets |
data.frame/tibble with geneset and gene columns |
genelist |
data.frame/tibble with gene and score columns. Should contain columns gene and either pvalue or effectsize, depending on |
score_type |
how to compute gene scores? options: "pvalue", "effectsize", "custom".
Option "pvalue" uses -log10 transformed values from the pvalue column in |
parallel_threads |
number of threads to use for parallel processing. Set to 0 to automatically select all available processors/cores, set to 1 to disable (default) or to N to use N processes. Note that multiprocessing sometimes breaks on RStudio + Windows, hence this parameter is set to 1 to disable multiprocessing by default for now |
gseaParam |
passed to |
nPermSimple |
passed to |
gsea_genelist_col |
optional, only used for |
gsea_scoretype |
optional, only used for |
random_seed |
the random seed that is passed to |
input genesets
table with results in the "pvalue", "score_type" and "gsea_nes" columns
test_genesets
In most cases, it's more convenient to call the more generic test_genesets
function which also deals with multiple-testing correction (per geneset source)
It is assumed that the genesets
and genelist
parameters are in sync, i.e. genesets
provided
here is the result of the filter_genesets()
function (using the same genelist
table)
Only genesets with at least 1 significant gene are subjected to statistical testing (e.g. NA is returned for genesets without significant genes)
test_genesets_hypergeometric(genesets, genelist, require_nsignif = 1L)
test_genesets_hypergeometric(genesets, genelist, require_nsignif = 1L)
genesets |
tibble with genesets, must contain columns 'id', 'ngenes' and 'ngenes_signif' |
genelist |
tibble with genes, must contain column 'signif'. The number of rows in this table (where signif is not NA) is assumed to be the total number of tested genes, the number of rows where signif==TRUE is assumed the total number of significant genes. |
require_nsignif |
minimum number of 'signif genes' that overlap with a geneset; |
input genesets
table with results in the "pvalue" column
test_genesets
refer to the goat::treemap_plot() function for a complete example
treemap_data( geneset_ids, genesets, genesets_test_result, simplify = "none", toplevel_max_ngenes = NA )
treemap_data( geneset_ids, genesets, genesets_test_result, simplify = "none", toplevel_max_ngenes = NA )
geneset_ids |
vector of geneset identifiers |
genesets |
entire geneset table; typically the complete GO database |
genesets_test_result |
geneset testing results; the output from |
simplify |
strategy for reducing the genesets returned in the treemap. Options; "leaf_only" (most stringent, returns only leafs in the tree structure) "prune_singletons" (remove parent terms that have exactly 1 child) "pvalue" (remove parent terms where the child term p-value is at least 4 times better) "none" (default; return all significant genesets that are not a "grouping term" in the treemap) |
toplevel_max_ngenes |
groups in the treemap should not have more than this many genes ('ngenes' in geneset test results). If not set, this defaults to 50% of the total number of unique genes in the geneset test results |
data structure needed for treemap_plot()
simple wrapper around the treemap R package. To customize this plot, copy/paste its code and tweak parameters as desired
treemap_plot(x, label_group = FALSE)
treemap_plot(x, label_group = FALSE)
x |
|
label_group |
set TRUE to show only group-level labels |
a ggplot2 object constructed by treemap::treemap()
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain geneset results datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) # subset GO CC results x = result |> filter(signif & source == "GO_CC") tm = treemap_data( geneset_ids = x$id, genesets = genesets_filtered, genesets_test_result = x, simplify = "leaf_only" # options: none/leaf_only/prune_singletons/pvalue ) treemap_plot(tm$treemap_plotdata)
# note; this example downloads data when first run, and typically takes ~60seconds # store the downloaded files in the following directory. Here, the temporary file # directory is used. Alternatively, consider storing this data in a more permanent location. # e.g. output_dir="~/data/goat" on unix systems or output_dir="C:/data/goat" on Windows output_dir = tempdir() ## first run the default example from test_genesets() to obtain geneset results datasets = download_goat_manuscript_data(output_dir) genelist = datasets$`Wingo 2020:mass-spec:PMID32424284` genesets_asis = download_genesets_goatrepo(output_dir) genesets_filtered = filter_genesets(genesets_asis, genelist) result = test_genesets(genesets_filtered, genelist, method = "goat", score_type = "effectsize", padj_method = "bonferroni", padj_cutoff = 0.05) # subset GO CC results x = result |> filter(signif & source == "GO_CC") tm = treemap_data( geneset_ids = x$id, genesets = genesets_filtered, genesets_test_result = x, simplify = "leaf_only" # options: none/leaf_only/prune_singletons/pvalue ) treemap_plot(tm$treemap_plotdata)