| Title: | Transcriptional Disruption Score for Gene Sets |
|---|---|
| Description: | Computes pathway-level transcriptional disruption scores from differential expression p-values using permutation tests with Generalized Pareto Distribution (GPD) tail extrapolation and false discovery rate (FDR) correction. Reference: Guo P, Li H (2026) DSGE: Disruption Score of Gene Expression for Gene-Set Enrichment Analysis. <https://github.com/LHJLab/DSGE>. |
| Authors: | Pingwei Guo [aut, cre], Huijuan Li [aut] |
| Maintainer: | Pingwei Guo <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.3.1 |
| Built: | 2026-06-30 13:19:14 UTC |
| Source: | https://github.com/cran/DSGEr |
Computes the mean absolute z-score for all genes passing filters from differential expression analysis results. Higher DSGE indicates stronger global transcriptional perturbation.
calc_dsge(pvalue, base_mean = NULL, base_mean_cutoff = 0.1)calc_dsge(pvalue, base_mean = NULL, base_mean_cutoff = 0.1)
pvalue |
Numeric vector of p-values from differential expression analysis (e.g., DESeq2, edgeR, Seurat FindMarkers). |
base_mean |
Numeric vector of mean expression values (same length as pvalue), e.g. DESeq2 baseMean or Seurat avg_log2FC corresponding expression level. If NULL, no expression filtering is applied. |
base_mean_cutoff |
Expression filter threshold, default 0.1. Genes with mean expression at or below this value are excluded as lowly expressed. Ignored when base_mean = NULL. |
A list with elements:
dsge |
scalar, genome-wide DSGE |
n_genes |
integer, number of genes passing filters |
z_scores |
named numeric vector of per-gene raw z-scores |
# Generate random p-values (simulating differential expression results) set.seed(42) pvals <- runif(1000) z <- calc_dsge(pvals) head(z) # With baseMean filtering base_mean <- rexp(1000) z <- calc_dsge(pvals, base_mean) head(z)# Generate random p-values (simulating differential expression results) set.seed(42) pvals <- runif(1000) z <- calc_dsge(pvals) head(z) # With baseMean filtering base_mean <- rexp(1000) z <- calc_dsge(pvals, base_mean) head(z)
Tests whether a target gene set has a significantly higher DSGE than expected by chance. Generates a null distribution via sampling without replacement and computes an empirical right-tail p-value.
dsge_perm_test( gene_list, pvalue, base_mean, gene_names, base_mean_cutoff = 0.1, n_perm = 10000L, seed = NULL, progress = TRUE, heterogeneity = FALSE, directional = FALSE, direction_vec = NULL, use_std = TRUE, use_gpd = TRUE, gpd_threshold = 0.99, gpd_method = "mle", safety_margin = 1.6, nds_top_frac = 0.25 )dsge_perm_test( gene_list, pvalue, base_mean, gene_names, base_mean_cutoff = 0.1, n_perm = 10000L, seed = NULL, progress = TRUE, heterogeneity = FALSE, directional = FALSE, direction_vec = NULL, use_std = TRUE, use_gpd = TRUE, gpd_threshold = 0.99, gpd_method = "mle", safety_margin = 1.6, nds_top_frac = 0.25 )
gene_list |
Character vector of target gene identifiers (must match values in gene_names). |
pvalue |
Numeric vector of p-values from differential expression analysis (e.g., DESeq2, edgeR, Seurat FindMarkers) for all genes in the background pool. |
base_mean |
Numeric vector of mean expression values, same length as pvalue. Pass NULL to skip expression-level filtering. |
gene_names |
Character vector of gene names, same length as pvalue, must be unique. |
base_mean_cutoff |
baseMean filter threshold, default 0.1. |
n_perm |
Number of permutations, default 10000. |
seed |
Optional integer random seed for reproducibility. |
progress |
Whether to show a progress bar, default TRUE. |
heterogeneity |
Whether to compute perturbation heterogeneity
indices (Gini, CV) and two-sided p-values. Default |
directional |
Whether to compute Normalized Direction Score
(NDS) for each pathway. Default |
direction_vec |
Numeric vector of direction indicators
(e.g., log fold changes), same length as |
use_std |
Whether to compute and use standardised DSGE.
Default
When |
use_gpd |
Whether to use GPD tail extrapolation for extreme
p-values. Default |
gpd_threshold |
Tail quantile threshold for GPD fitting, between 0 and 1. Default 0.99. Lower = more tail samples (lower variance, higher bias); higher = fewer samples (higher variance, lower bias). |
gpd_method |
GPD estimation method passed to
|
safety_margin |
Safety margin for GPD support-constrained adjustment.
Default |
nds_top_frac |
Fraction of most-perturbed genes retained for
NDS calculation. Default |
Algorithm steps:
Filter gene pool: baseMean > cutoff, non-missing p-value.
Convert p-values to per-gene raw z-scores.
Match gene_list to the filtered gene pool; compute
observed DSGE (mean z-score).
Generate null distribution: randomly sample (without replacement) equally-sized gene sets from the pool, computing random DSGE via the same mean z-score formula, repeated n_perm times.
Empirical right-tail p-value: count(DSGE_null > DSGE_obs) / n_perm.
A list with elements:
observed |
observed DSGE value |
n_genes |
number of target genes matched |
null |
permutation null distribution vector |
p_value |
right-tail p-value (GPD tail extrapolation when observed falls above the 'gpd_threshold' quantile (default 99th); empirical ECDF otherwise) |
ecdf |
empirical cumulative distribution function of the null |
dsge_std |
(only when |
nds |
(only when |
gini_observed |
(only when |
cv_observed |
(only when |
null_gini |
(only when |
null_cv |
(only when |
het_p_value |
(only when |
# Toy example with simulated data set.seed(42) pvals <- runif(500) base_mean <- rexp(500) gene_names <- paste0("gene", 1:500) forebrain_like <- paste0("gene", 1:30) dsge_perm_test(forebrain_like, pvals, base_mean, gene_names, n_perm = 100, seed = 42, progress = FALSE)# Toy example with simulated data set.seed(42) pvals <- runif(500) base_mean <- rexp(500) gene_names <- paste0("gene", 1:500) forebrain_like <- paste0("gene", 1:30) dsge_perm_test(forebrain_like, pvals, base_mean, gene_names, n_perm = 100, seed = 42, progress = FALSE)
Lines starting with ! in GAF files contain metadata (database
version, date, column definitions, etc.). This function extracts them
for inspecting file version and provenance.
get_gaf_header(file)get_gaf_header(file)
file |
Path to the GAF file. |
Character vector, one string per header line (with leading
! removed).
# Create a temporary GAF file for demonstration gaf_file <- tempfile(fileext = ".gaf") writeLines(c( "!gaf-version: 2.2", "!generated-by: R example", paste0("UniProtKB\tP12345\tGENE_A\t\tGO:0003674\t", "PMID:123456\tIDA\t\tF\tGene A\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t"), paste0("UniProtKB\tP67890\tGENE_B\t\tGO:0005575\t", "PMID:789012\tIBA\t\tC\tGene B\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t") ), gaf_file) get_gaf_header(gaf_file)# Create a temporary GAF file for demonstration gaf_file <- tempfile(fileext = ".gaf") writeLines(c( "!gaf-version: 2.2", "!generated-by: R example", paste0("UniProtKB\tP12345\tGENE_A\t\tGO:0003674\t", "PMID:123456\tIDA\t\tF\tGene A\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t"), paste0("UniProtKB\tP67890\tGENE_B\t\tGO:0005575\t", "PMID:789012\tIBA\t\tC\tGene B\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t") ), gaf_file) get_gaf_header(gaf_file)
Splits the parsed GAF data.frame returned by
read_gaf() into a named list by GO term. Supports
filtering by qualifier, evidence code, ontology aspect, and minimum
pathway size. Optionally attaches GO term names from an OBO file.
get_pathway_genes( gaf_data, genes = c("db_object_id", "db_object_symbol"), unique = TRUE, min_size = 5L, qualifier = NULL, evidence = NULL, aspect = NULL, go_names = NULL )get_pathway_genes( gaf_data, genes = c("db_object_id", "db_object_symbol"), unique = TRUE, min_size = 5L, qualifier = NULL, evidence = NULL, aspect = NULL, go_names = NULL )
gaf_data |
A |
genes |
Character vector of column names identifying genes.
Default |
unique |
Logical. Whether to remove duplicate gene entries within
a GO term. Default |
min_size |
Minimum gene count threshold; pathways with fewer
genes are discarded. Default |
qualifier |
Qualifier filter, e.g. |
evidence |
Evidence code filter, e.g. |
aspect |
Ontology aspect filter: |
go_names |
A |
A named list where each element name is a GO term ID
(e.g. "GO:0005515") and the value is a data.frame of
associated genes. The list is sorted alphabetically by GO ID.
# Create a minimal GAF data.frame for demonstration toy_gaf <- data.frame( db_object_id = c("1", "2", "3", "1", "2"), db_object_symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_A", "GENE_D"), go_id = c("GO:0003674", "GO:0003674", "GO:0005575", "GO:0005575", "GO:0005575"), aspect = c("F", "F", "C", "C", "C"), evidence_code = c("IDA", "IBA", "IDA", "IDA", "IEA"), stringsAsFactors = FALSE ) pathway_genes <- get_pathway_genes(toy_gaf, min_size = 1) pathway_genes[["GO:0003674"]] # Create a temporary GAF file for demonstration gaf_file <- tempfile(fileext = ".gaf") writeLines(c( "!gaf-version: 2.2", paste0("UniProtKB\tP12345\tGENE_A\t\tGO:0003674\t", "PMID:123456\tIDA\t\tF\tGene A\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t"), paste0("UniProtKB\tP67890\tGENE_B\t\tGO:0005575\t", "PMID:789012\tIBA\t\tC\tGene B\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t") ), gaf_file) gaf <- read_gaf(gaf_file) # Create a temporary OBO file obo_file <- tempfile(fileext = ".obo") writeLines(c( "format-version: 1.2", "", "[Term]", "id: GO:0003674", "name: molecular_function", "namespace: molecular_function", "", "[Term]", "id: GO:0005575", "name: cellular_component", "namespace: cellular_component" ), obo_file) go <- read_obo(obo_file) # Build pathway-gene mapping with GO names pathway_genes <- get_pathway_genes(gaf, go_names = go, min_size = 1) pathway_genes[["GO:0003674"]]# Create a minimal GAF data.frame for demonstration toy_gaf <- data.frame( db_object_id = c("1", "2", "3", "1", "2"), db_object_symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_A", "GENE_D"), go_id = c("GO:0003674", "GO:0003674", "GO:0005575", "GO:0005575", "GO:0005575"), aspect = c("F", "F", "C", "C", "C"), evidence_code = c("IDA", "IBA", "IDA", "IDA", "IEA"), stringsAsFactors = FALSE ) pathway_genes <- get_pathway_genes(toy_gaf, min_size = 1) pathway_genes[["GO:0003674"]] # Create a temporary GAF file for demonstration gaf_file <- tempfile(fileext = ".gaf") writeLines(c( "!gaf-version: 2.2", paste0("UniProtKB\tP12345\tGENE_A\t\tGO:0003674\t", "PMID:123456\tIDA\t\tF\tGene A\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t"), paste0("UniProtKB\tP67890\tGENE_B\t\tGO:0005575\t", "PMID:789012\tIBA\t\tC\tGene B\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t") ), gaf_file) gaf <- read_gaf(gaf_file) # Create a temporary OBO file obo_file <- tempfile(fileext = ".obo") writeLines(c( "format-version: 1.2", "", "[Term]", "id: GO:0003674", "name: molecular_function", "namespace: molecular_function", "", "[Term]", "id: GO:0005575", "name: cellular_component", "namespace: cellular_component" ), obo_file) go <- read_obo(obo_file) # Build pathway-gene mapping with GO names pathway_genes <- get_pathway_genes(gaf, go_names = go, min_size = 1) pathway_genes[["GO:0003674"]]
Extracts GO term to gene mappings from an OrgDb annotation
package (e.g. org.Hs.eg.db) or an AnnotationHub record.
Returns a named list in the same format as
get_pathway_genes(), ready to pass to
pathway_dsge().
get_pathway_genes_db( orgdb, keytype = "ENTREZID", gene_id_col = "db_object_id", gene_symbol_col = "db_object_symbol", min_size = 5L, aspect = NULL, evidence = NULL, attach_go_names = TRUE, use_goall = FALSE )get_pathway_genes_db( orgdb, keytype = "ENTREZID", gene_id_col = "db_object_id", gene_symbol_col = "db_object_symbol", min_size = 5L, aspect = NULL, evidence = NULL, attach_go_names = TRUE, use_goall = FALSE )
orgdb |
An |
keytype |
Key type used to query the |
gene_id_col |
Name for the gene ID column in the output
data.frames. Default |
gene_symbol_col |
Name for the gene symbol column in the output
data.frames. Default |
min_size |
Minimum gene count per pathway; pathways below this
are discarded. Default |
aspect |
Ontology aspect filter. |
evidence |
Evidence code filter (e.g. |
attach_go_names |
Logical. Whether to fetch GO term names and
ontology classifications via |
use_goall |
Logical. If |
This function provides an alternative to get_pathway_genes()
for users who prefer Bioconductor's annotation infrastructure over
GAF + OBO files.
A named list where each element is a data.frame with
columns go_name (if attached), go_namespace
(if attached), gene_id_col, and gene_symbol_col.
The list names are GO term IDs. This matches the output format of
get_pathway_genes() and can be passed directly to
pathway_dsge().
This function requires the AnnotationDbi package. The
GO.db package is required when attach_go_names = TRUE
(the default). Both are Bioconductor packages; install with
BiocManager::install(c("AnnotationDbi", "GO.db")).
get_pathway_genes for the GAF-based equivalent.
# Requires additional Bioconductor database packages if (require("org.Hs.eg.db", quietly = TRUE)) { pw <- get_pathway_genes_db(org.Hs.eg.db) str(pw[1:2]) }# Requires additional Bioconductor database packages if (require("org.Hs.eg.db", quietly = TRUE)) { pw <- get_pathway_genes_db(org.Hs.eg.db) str(pw[1:2]) }
Extracts KEGG pathway to gene mappings from an OrgDb annotation
package (e.g. org.Hs.eg.db) or an AnnotationHub record.
Returns a named list in the same format as
get_pathway_genes(), ready to pass to
pathway_dsge().
get_pathway_genes_kegg( orgdb, keytype = "ENTREZID", gene_id_col = "kegg_gene_id", gene_symbol_col = "kegg_gene_symbol", min_size = 5L, attach_path_names = TRUE )get_pathway_genes_kegg( orgdb, keytype = "ENTREZID", gene_id_col = "kegg_gene_id", gene_symbol_col = "kegg_gene_symbol", min_size = 5L, attach_path_names = TRUE )
orgdb |
An |
keytype |
Key type used to query the |
gene_id_col |
Name for the gene ID column in the output
data.frames. Default |
gene_symbol_col |
Name for the gene symbol column in the output
data.frames. Default |
min_size |
Minimum gene count per pathway; pathways below this
are discarded. Default |
attach_path_names |
Logical. Whether to fetch pathway names via
|
KEGG pathway IDs are stored in the PATH column of OrgDb objects.
IDs include an organism prefix (e.g. "hsa00010" for human,
"mmu00010" for mouse). The organism code is auto-detected from
the OrgDb species and a built-in lookup table of 15 common model
organisms.
Pathway names are fetched online via KEGGREST::keggList().
When the KEGGREST package is not available or network access
fails, names are set to NA with a warning.
A named list where each element is a data.frame with
columns kegg_name (if attached), organism_code,
gene_id_col, and gene_symbol_col. The list names are
full KEGG pathway IDs including the organism prefix (e.g.
"hsa00010"). S3 class "kegg_pathway" is set on list
elements for source auto-detection by pathway_dsge().
Requires the AnnotationDbi package. The KEGGREST
package is required only when attach_path_names = TRUE.
Install with:
BiocManager::install(c("AnnotationDbi", "KEGGREST")).
get_pathway_genes_db for the GO-based equivalent.
get_pathway_genes_reactome for the Reactome equivalent.
# Requires additional Bioconductor database packages if (require("org.Hs.eg.db", quietly = TRUE)) { pw <- get_pathway_genes_kegg(org.Hs.eg.db) str(pw[1:2]) }# Requires additional Bioconductor database packages if (require("org.Hs.eg.db", quietly = TRUE)) { pw <- get_pathway_genes_kegg(org.Hs.eg.db) str(pw[1:2]) }
Extracts Reactome pathway to gene mappings, using reactome.db
for pathway-to-gene associations and an OrgDb for gene symbol
resolution. Returns a named list ready to pass to
pathway_dsge().
get_pathway_genes_reactome( orgdb, keytype = "ENTREZID", gene_id_col = "reactome_gene_id", gene_symbol_col = "reactome_gene_symbol", min_size = 5L, attach_path_names = TRUE, species_prefix = "R-HSA" )get_pathway_genes_reactome( orgdb, keytype = "ENTREZID", gene_id_col = "reactome_gene_id", gene_symbol_col = "reactome_gene_symbol", min_size = 5L, attach_path_names = TRUE, species_prefix = "R-HSA" )
orgdb |
An |
keytype |
Key type used to query the |
gene_id_col |
Name for the gene ID column in the output
data.frames. Default |
gene_symbol_col |
Name for the gene symbol column in the output
data.frames. Default |
min_size |
Minimum gene count per pathway; pathways below this
are discarded. Default |
attach_path_names |
Logical. Whether to fetch pathway names via
|
species_prefix |
Reactome species prefix for filtering pathways.
Default |
Reactome pathway IDs follow the pattern R-HSA-<number>
(e.g. R-HSA-177929). Pathway names are fetched locally from
the reactome.db Bioconductor package in a single batched query.
A named list where each element is a data.frame with
columns reactome_name (if attached), gene_id_col, and
gene_symbol_col. The list names are Reactome pathway IDs
(e.g. "R-HSA-177929"). S3 class "reactome_pathway"
is set on list elements for source auto-detection by
pathway_dsge().
Requires reactome.db and AnnotationDbi. Install
with: BiocManager::install(c("AnnotationDbi", "reactome.db")).
get_pathway_genes_db for the GO-based equivalent.
get_pathway_genes_kegg for the KEGG equivalent.
# Requires additional Bioconductor database packages if (require("org.Hs.eg.db", quietly = TRUE) && require("reactome.db", quietly = TRUE)) { pw <- get_pathway_genes_reactome(org.Hs.eg.db) str(pw[1:2]) }# Requires additional Bioconductor database packages if (require("org.Hs.eg.db", quietly = TRUE) && require("reactome.db", quietly = TRUE)) { pw <- get_pathway_genes_reactome(org.Hs.eg.db) str(pw[1:2]) }
Computes DSGE for each GO pathway, generates permutation null distributions grouped by number of matched genes, computes p-values using GPD tail extrapolation, and applies Benjamini-Hochberg FDR correction.
pathway_dsge( pathway_genes, pvalue, base_mean = NULL, gene_names, gene_id_col = NULL, source = NULL, base_mean_cutoff = 0.1, min_size = 5L, max_size = 500L, n_perm = 10000L, seed = NULL, return_null = FALSE, progress = TRUE, heterogeneity = FALSE, directional = FALSE, direction_vec = NULL, use_std = TRUE, use_gpd = TRUE, gpd_threshold = 0.99, gpd_method = "mle", safety_margin = 1.6, n_cores = 1L, dsge_std = NULL, p_adjust_method = "BY", nds_top_frac = 0.25 )pathway_dsge( pathway_genes, pvalue, base_mean = NULL, gene_names, gene_id_col = NULL, source = NULL, base_mean_cutoff = 0.1, min_size = 5L, max_size = 500L, n_perm = 10000L, seed = NULL, return_null = FALSE, progress = TRUE, heterogeneity = FALSE, directional = FALSE, direction_vec = NULL, use_std = TRUE, use_gpd = TRUE, gpd_threshold = 0.99, gpd_method = "mle", safety_margin = 1.6, n_cores = 1L, dsge_std = NULL, p_adjust_method = "BY", nds_top_frac = 0.25 )
pathway_genes |
Named list returned by
|
pvalue |
Numeric vector of p-values from differential expression analysis (e.g., DESeq2, edgeR, Seurat FindMarkers) for all genes in the background pool. |
base_mean |
Numeric vector of mean expression values, same length as pvalue. Pass NULL to skip expression-level filtering. |
gene_names |
Character vector of gene names (same length as pvalue), must be unique. |
gene_id_col |
Column name in pathway_genes data.frames used to
match gene names. When |
source |
Pathway source. One of |
base_mean_cutoff |
baseMean filter threshold, default 0.1. |
min_size |
Minimum number of matched genes; pathways below this are not tested. Default 5. |
max_size |
Maximum number of matched genes; pathways above this are excluded. Default 500. Set to Inf to retain all. |
n_perm |
Number of permutations per size group, default 10000. |
seed |
Optional integer random seed. |
return_null |
Whether to include null distribution data in the
result (for plotting). Default |
progress |
Whether to show a progress bar, default TRUE. |
heterogeneity |
Whether to compute perturbation heterogeneity
indices (Gini, CV) and two-sided p-values. Default |
directional |
Whether to compute Normalized Direction Score
(NDS) for each pathway. Default |
direction_vec |
Numeric vector of direction indicators
(e.g., log fold changes), same length as |
use_std |
Whether to compute and use standardised DSGE.
Default
When |
use_gpd |
Whether to use GPD tail extrapolation for extreme
p-values. Default |
gpd_threshold |
Tail quantile threshold for GPD fitting, between 0 and 1. Default 0.99. Lower = more tail samples (lower variance, higher bias); higher = fewer samples (higher variance, lower bias). |
gpd_method |
GPD estimation method passed to
|
safety_margin |
Safety margin for GPD support-constrained adjustment.
Default |
n_cores |
Number of CPU cores for parallel null distribution
generation. Default |
dsge_std |
[Deprecated]. Use |
p_adjust_method |
Multiple testing correction method. Passed to
|
nds_top_frac |
Fraction of most-perturbed genes retained for
NDS calculation. Default |
Pathways sharing the same number of matched genes reuse the same null distribution, greatly reducing computation.
**Algorithm steps (9 steps)**
Build gene pool: filter DESeq2 results (baseMean > cutoff), compute per-gene raw z-scores.
Match pathway genes: match each pathway's genes to
the gene pool via the gene_id_col column.
Filter pathways: retain pathways with
min_size <= n_matched <= max_size.
Compute observed DSGE: per pathway as the mean of member gene z-scores: DSGE = mean(z_i).
Generate size-grouped null distributions: each unique matched-gene count gets its own permutation run (vectorized batch sampling), with simultaneous GPD tail fitting.
Compute standardised DSGE (step 6): when
use_std = TRUE, compute dsge_std = (observed -
mean(null)) / sd(null) for each pathway, using the
size-grouped null distribution.
Compute p-values: when use_std = TRUE,
empirical ECDF of the standardised null vs. dsge_std; when
use_std = FALSE, GPD tail extrapolation (above the
gpd_threshold quantile, default 99th) + empirical ECDF on the raw null distribution.
BH FDR correction: Benjamini-Hochberg multiple testing correction on all pathway p-values.
Sort and return: ordered by p_adj ascending.
By default, a data.frame sorted by p_adj
ascending. The pathway ID and name columns depend on the source:
For GO: go_id, go_name, aspect
For KEGG: kegg_id, kegg_name
For REACTOME: reactome_id, reactome_name
All sources include: n_pathway, n_matched,
dsge, p_value, p_adj.
For GO sources, aspect is the ontology classification:
"BP" (Biological Process), "MF" (Molecular Function),
"CC" (Cellular Component).
When heterogeneity = TRUE, additionally includes:
gini, cv, het_p_value, het_p_adj.
When directional = TRUE, additionally includes:
nds (Normalized Direction Score, ranging from -1 to +1).
When use_std = TRUE (default), additionally includes:
dsge_std = (observed DSGE - mean(null)) / sd(null),
standardised using the size-grouped null distribution.
When return_null = TRUE, returns a list with:
table |
the data.frame described above |
null_raw |
named list, keys are pathway sizes (as character), values are DSGE null distribution vectors |
null_gpd |
named list, keys are pathway sizes, values are GPD fit parameters (or NULL) |
When both heterogeneity = TRUE and return_null = TRUE,
the list also includes:
null_gini_raw |
named list, keys are pathway sizes, values are Gini null distribution vectors |
null_cv_raw |
named list, keys are pathway sizes, values are CV null distribution vectors |
# Toy example with simulated data set.seed(42) pvals <- runif(500) base_mean <- rexp(500) gene_names <- paste0("gene", 1:500) pw <- list( pathway_A = data.frame(db_object_symbol = paste0("gene", 1:20), go_name = "Pathway A", stringsAsFactors = FALSE), pathway_B = data.frame(db_object_symbol = paste0("gene", 21:40), go_name = "Pathway B", stringsAsFactors = FALSE) ) result <- pathway_dsge(pw, pvals, base_mean, gene_names, min_size = 1, n_perm = 100, seed = 42) head(result)# Toy example with simulated data set.seed(42) pvals <- runif(500) base_mean <- rexp(500) gene_names <- paste0("gene", 1:500) pw <- list( pathway_A = data.frame(db_object_symbol = paste0("gene", 1:20), go_name = "Pathway A", stringsAsFactors = FALSE), pathway_B = data.frame(db_object_symbol = paste0("gene", 21:40), go_name = "Pathway B", stringsAsFactors = FALSE) ) result <- pathway_dsge(pw, pvals, base_mean, gene_names, min_size = 1, n_perm = 100, seed = 42) head(result)
For pathways in a pathway_dsge() result, plots the
density of the permutation null distribution with a dashed red line
marking the observed DSGE. If GPD tail fitting is available for the
size group, the tail region is highlighted in semi-transparent orange.
plot_dsge( result, n = 9L, pathway_ids = NULL, go_ids = NULL, col_null = "#2166AC", col_tail = "#E41A1C", col_obs = "#D73027", col_thr = "#999999", safety_margin = NULL, cex_main = 0.85, use_std = NULL )plot_dsge( result, n = 9L, pathway_ids = NULL, go_ids = NULL, col_null = "#2166AC", col_tail = "#E41A1C", col_obs = "#D73027", col_thr = "#999999", safety_margin = NULL, cex_main = 0.85, use_std = NULL )
result |
List returned by |
n |
When |
pathway_ids |
Optional character vector of pathway IDs to plot
(e.g. GO IDs, KEGG IDs, or Reactome IDs depending on source). When
provided, directly plots these pathways, overriding |
go_ids |
[Deprecated]. Use |
col_null |
Color of the null distribution density curve.
Default |
col_tail |
Color of the GPD tail region highlight.
Default |
col_obs |
Color of the observed DSGE vertical line.
Default |
col_thr |
Color of the GPD threshold vertical line.
Default |
safety_margin |
Safety margin for visual GPD tail extension
in the plot. Defaults to the value from |
cex_main |
Title font scaling. Default |
use_std |
Whether to plot standardised DSGE. Defaults to
|
No return value; called for its side effect (plotting).
# Build a result object with simulated data for demonstration set.seed(42) pvals <- runif(500) base_mean <- rexp(500) gene_names <- paste0("gene", 1:500) pw <- list( pw1 = data.frame(db_object_symbol = paste0("gene", 1:30), go_name = "Pathway 1", stringsAsFactors = FALSE), pw2 = data.frame(db_object_symbol = paste0("gene", 31:60), go_name = "Pathway 2", stringsAsFactors = FALSE) ) result <- pathway_dsge(pw, pvals, base_mean, gene_names, min_size = 1, n_perm = 100, seed = 42, return_null = TRUE) plot_dsge(result, n = 2)# Build a result object with simulated data for demonstration set.seed(42) pvals <- runif(500) base_mean <- rexp(500) gene_names <- paste0("gene", 1:500) pw <- list( pw1 = data.frame(db_object_symbol = paste0("gene", 1:30), go_name = "Pathway 1", stringsAsFactors = FALSE), pw2 = data.frame(db_object_symbol = paste0("gene", 31:60), go_name = "Pathway 2", stringsAsFactors = FALSE) ) result <- pathway_dsge(pw, pvals, base_mean, gene_names, min_size = 1, n_perm = 100, seed = 42, return_null = TRUE) plot_dsge(result, n = 2)
Plots a focused volcano plot showing only the genes belonging to a specified GO pathway. Each point is one gene from the pathway, with log2 fold change on the x-axis and statistical significance (-log10 p-value) on the y-axis. This allows visual inspection of the direction, magnitude, and distribution of perturbation within a single pathway.
plot_dsge_volcano( de_results, dsge_result = NULL, pathway_genes, go_id, logFC_col = "log2FoldChange", pval_col = "pvalue", padj_col = NULL, gene_col = "geneName", gene_id_col = "db_object_symbol", threshold = 0.05, padj_threshold = 0.05, lfc_threshold = NULL, color_up = "#CC3333", color_down = "#3366CC", color_nom = "#CC9933", color_ns = "#AAAAAA", alpha_sig = 0.9, alpha_ns = 0.5, label = NULL, label_genes = NULL, label_sig = FALSE, cex_label = 0.7, cex_point = 1.6, xlab = NULL, ylab = NULL, main = NULL, go_name = NULL, ... )plot_dsge_volcano( de_results, dsge_result = NULL, pathway_genes, go_id, logFC_col = "log2FoldChange", pval_col = "pvalue", padj_col = NULL, gene_col = "geneName", gene_id_col = "db_object_symbol", threshold = 0.05, padj_threshold = 0.05, lfc_threshold = NULL, color_up = "#CC3333", color_down = "#3366CC", color_nom = "#CC9933", color_ns = "#AAAAAA", alpha_sig = 0.9, alpha_ns = 0.5, label = NULL, label_genes = NULL, label_sig = FALSE, cex_label = 0.7, cex_point = 1.6, xlab = NULL, ylab = NULL, main = NULL, go_name = NULL, ... )
de_results |
Data.frame of differential expression results. Must contain at least log2FC, p-value, and gene identifier columns. |
dsge_result |
Optional result from |
pathway_genes |
A named list of pathway-gene mappings, as used in
|
go_id |
A single character string specifying which pathway (GO ID)
to plot. Must be a name in |
logFC_col |
Column name in |
pval_col |
Column name in |
padj_col |
Optional column name in |
gene_col |
Column name in |
gene_id_col |
Column name in |
threshold |
p-value significance threshold for the horizontal
reference line. Default |
padj_threshold |
Adjusted p-value threshold for significance
classification. Default |
lfc_threshold |
Numeric vector of logFC thresholds for vertical
reference lines (e.g. |
color_up |
Color for significantly up-regulated genes.
Default |
color_down |
Color for significantly down-regulated genes.
Default |
color_nom |
Color for nominally significant genes (raw p-value
below |
color_ns |
Color for non-significant genes.
Default |
alpha_sig |
Transparency for significant points.
Default |
alpha_ns |
Transparency for non-significant points.
Default |
label |
Whether to label genes with their names. Auto-set to
|
label_genes |
Optional character vector of specific gene names
within the pathway to label. Overrides |
label_sig |
When |
cex_label |
Text size for gene labels. Default |
cex_point |
Point size for significant genes. Non-significant
genes are plotted at |
xlab, ylab
|
Axis labels. Auto-generated when |
main |
Plot title. Default |
go_name |
Optional GO term name for the title. When |
... |
Additional arguments passed to |
Invisibly returns the subset of de_results for the
pathway genes (data.frame).
# Build input objects with simulated data for demonstration set.seed(42) de_results <- data.frame( log2FoldChange = rnorm(100, mean = 0, sd = 1.5), pvalue = runif(100), geneName = paste0("GENE", 1:100), stringsAsFactors = FALSE ) pw <- list( "GO:0042101" = data.frame( db_object_symbol = paste0("GENE", 1:15), go_name = "T cell receptor complex", stringsAsFactors = FALSE ) ) dsge <- pathway_dsge(pw, de_results$pvalue, de_results$pvalue, de_results$geneName, min_size = 1, n_perm = 100, seed = 42) plot_dsge_volcano(de_results, dsge, pw, go_id = "GO:0042101", logFC_col = "log2FoldChange", pval_col = "pvalue", gene_col = "geneName")# Build input objects with simulated data for demonstration set.seed(42) de_results <- data.frame( log2FoldChange = rnorm(100, mean = 0, sd = 1.5), pvalue = runif(100), geneName = paste0("GENE", 1:100), stringsAsFactors = FALSE ) pw <- list( "GO:0042101" = data.frame( db_object_symbol = paste0("GENE", 1:15), go_name = "T cell receptor complex", stringsAsFactors = FALSE ) ) dsge <- pathway_dsge(pw, de_results$pvalue, de_results$pvalue, de_results$geneName, min_size = 1, n_perm = 100, seed = 42) plot_dsge_volcano(de_results, dsge, pw, go_id = "GO:0042101", logFC_col = "log2FoldChange", pval_col = "pvalue", gene_col = "geneName")
Efficiently reads a GAF 2.x format gene ontology annotation file using
data.table::fread(). Comment header lines starting with
! are automatically skipped. Returns a data.frame with all 17
standard GAF columns.
read_gaf(file, col_names = GAF_COLUMNS, ...)read_gaf(file, col_names = GAF_COLUMNS, ...)
file |
Path to the GAF file. |
col_names |
Character vector of column names. Defaults to the
standard 17 GAF 2.2 column names. Pass |
... |
Additional arguments passed to |
A data.frame containing the GAF annotation data.
# Create a temporary GAF file for demonstration gaf_file <- tempfile(fileext = ".gaf") writeLines(c( "!gaf-version: 2.2", paste0("UniProtKB\tP12345\tGENE_A\t\tGO:0003674\t", "PMID:123456\tIDA\t\tF\tGene A\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t"), paste0("UniProtKB\tP67890\tGENE_B\t\tGO:0005575\t", "PMID:789012\tIBA\t\tC\tGene B\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t") ), gaf_file) gaf <- read_gaf(gaf_file) head(gaf)# Create a temporary GAF file for demonstration gaf_file <- tempfile(fileext = ".gaf") writeLines(c( "!gaf-version: 2.2", paste0("UniProtKB\tP12345\tGENE_A\t\tGO:0003674\t", "PMID:123456\tIDA\t\tF\tGene A\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t"), paste0("UniProtKB\tP67890\tGENE_B\t\tGO:0005575\t", "PMID:789012\tIBA\t\tC\tGene B\t\tprotein\t", "taxon:9606\t20240101\tGO\t\t") ), gaf_file) gaf <- read_gaf(gaf_file) head(gaf)
Parses a Gene Ontology OBO format file (version 1.2 / 1.4), extracting
the id, name, and namespace from each [Term]
stanza. [Typedef] stanzas are skipped.
read_obo(file)read_obo(file)
file |
Path to the OBO file (e.g. |
A data.frame with columns:
id |
GO identifier (e.g. |
name |
Human-readable term name (e.g. |
namespace |
Ontology classification: |
# Create a temporary OBO file for demonstration obo_file <- tempfile(fileext = ".obo") writeLines(c( "format-version: 1.2", "", "[Term]", "id: GO:0003674", "name: molecular_function", "namespace: molecular_function", "", "[Term]", "id: GO:0005575", "name: cellular_component", "namespace: cellular_component", "", "[Term]", "id: GO:0008150", "name: biological_process", "namespace: biological_process" ), obo_file) go_names <- read_obo(obo_file) head(go_names)# Create a temporary OBO file for demonstration obo_file <- tempfile(fileext = ".obo") writeLines(c( "format-version: 1.2", "", "[Term]", "id: GO:0003674", "name: molecular_function", "namespace: molecular_function", "", "[Term]", "id: GO:0005575", "name: cellular_component", "namespace: cellular_component", "", "[Term]", "id: GO:0008150", "name: biological_process", "namespace: biological_process" ), obo_file) go_names <- read_obo(obo_file) head(go_names)