| Title: | Comprehensive Database for Pathway Enrichment Analysis |
|---|---|
| Description: | Provides access to large-scale genomics data from South Dakota State University's bioinformatics database, a unified platform for pathway analysis of over 13,000 organisms. It includes gene mappings, gene characteristics, and pathway mapping data from KEGG, GOBP, GOCC, and many other pathway databases. It also provides helper functions for processing RNA-Seq data for differential expression analysis and pathway enrichment analysis, some of which are adapted from code in Integrated Differential Expression & Pathway analysis (iDEP), developed by Ge, S.X., Son, E.W. & Yao, R. (2018) <doi:10.1186/s12859-018-2486-6>. |
| Authors: | Aidan Frederick [aut, cre], Xijin Ge [fnd] |
| Maintainer: | Aidan Frederick <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.0 |
| Built: | 2026-06-03 18:35:02 UTC |
| Source: | https://github.com/cran/pathdb |
Retrieves a database (.db file) from the SDSU bioinformatics database and opens a SQLite connection to it.
    connect_database(species_id = NULL)
species_id |
ID of species selected (Loads organism info data if NULL) |
SQLite connection to the downloaded file
    # Connect to organism information database
    conn <- connect_database()

    # Query information using connection
    x <- DBI::dbGetQuery(
      conn = conn,
      statement = "select * from orgInfo;"
    )
    head(x)

    # Disconnect from database file
    DBI::dbDisconnect(conn = conn)

    # Connect to species information database (e.g. Indian Cobra)
    conn <- connect_database(species_id = 99)

    # Query information using connection
    x <- DBI::dbGetQuery(
      conn = conn,
      statement = "select * from geneInfo;"
    )
    head(x)

    # Disconnect from database file
    DBI::dbDisconnect(conn = conn)
Queries the database to map user-provided gene identifiers to Ensembl/Entrez IDs. To ensure best matching and conversion, please verify that all gene identifiers have no whitespace and are at least 2 characters long.
    convert_id(genes, data = NULL, species_id, id_type = "ens")
genes |
A vector or character string of gene identifiers to convert. |
data |
Optional data frame or matrix. If provided, the function attempts
to match |
species_id |
Numeric. The ID of the species for the database connection. |
id_type |
Character. The type of ID to convert to:
|
A data frame.
If data is NULL: Returns a mapping table with original IDs and
IDs of selected type.
If data is provided: Returns data merged with the IDs of selected
type. Returns NULL if species_id is missing or no matches are found.
Any whitespace found in original IDs will be removed.
    # CAUTION: The human database is very large; running these examples
    # requires downloading the human database.

    # View our experimental gene IDs
    head(rownames(hypoxia_reads))

    # Convert IDs to Ensembl format for further analysis
    ens_conv <- convert_id(genes = rownames(hypoxia_reads), species_id = 96)

    # Yields a conversion table for our genes
    head(ens_conv)

    # Can also convert to Entrez IDs, if needed
    entrez_conv <- convert_id(genes = rownames(hypoxia_reads),
                              species_id = 96,
                              id_type = "entrez")

    # Yields a conversion table for our genes
    head(entrez_conv)

    # We want to automatically convert our IDs within our data
    ens_hypoxia <- convert_id(genes = rownames(hypoxia_reads),
                              species_id = 96,
                              data = hypoxia_reads)

    # Original data
    head(hypoxia_reads)

    # Converted data
    head(ens_hypoxia)
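The advice above about whitespace and identifier length can be applied before calling convert_id. The following is a minimal base R sketch of such a pre-check; `clean_gene_ids` is a hypothetical helper written for illustration, not a function exported by the package, and the gene symbols are toy values.

```r
# Hypothetical helper (not part of pathdb): tidy identifiers before conversion.
clean_gene_ids <- function(ids) {
  ids <- as.character(ids)
  # Remove every whitespace character, including internal spaces and tabs
  ids <- gsub("[[:space:]]+", "", ids)
  # Drop missing values and identifiers shorter than 2 characters
  ids <- ids[!is.na(ids) & nchar(ids) >= 2]
  # Remove duplicates while keeping the original order
  unique(ids)
}

# Toy input with stray whitespace, a one-character ID, an NA and a duplicate
raw_ids <- c(" TP53", "BRCA1 ", "A", "HIF 1A", NA, "TP53")
clean_ids <- clean_gene_ids(raw_ids)
print(clean_ids)
```

The cleaned vector could then be passed as the `genes` argument. Note that convert_id is documented to strip whitespace itself, so this step mainly helps with spotting identifiers that are too short to match.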
Retrieves gene information (e.g., Ensembl IDs, positions) for a specific species, optionally filtered by a list of user-provided gene identifiers after converting to Ensembl IDs.
    get_genes(species_id, genes = NULL)
species_id |
Numeric. The ID of the desired species. |
genes |
A vector or list of gene identifiers to filter by.
If |
A data frame containing gene information (from the geneInfo table).
If genes are provided, the result is filtered to match the converted Ensembl IDs.
    # CAUTION: The human database is very large; running these examples
    # requires downloading the human database.

    # We have gene IDs that are not commonly recognized
    head(rownames(hypoxia_reads))

    # Retrieve gene information for genes in our sample
    # Converts to Ensembl IDs first
    genes <- get_genes(species_id = 96, genes = rownames(hypoxia_reads))
    head(genes)

    # Retrieve all genes for desired species
    all_genes <- get_genes(species_id = 96)
    head(all_genes)

    # This is the same as running get_table(96, "geneInfo")
    all(get_genes(96) == get_table(96, "geneInfo"), na.rm = TRUE)
Retrieves pathway information for a specific species and optionally filters for specific genes.
    get_pathways(species_id, genes = NULL, category = "GOBP")
species_id |
Numeric. The ID of the desired species
(e.g., from |
genes |
A vector or column of a data frame containing gene IDs of interest.
If |
category |
Character. A single pathway category or a vector of categories/databases (e.g. KEGG, GOBP, GOCC). Using all categories is not recommended: some species have many, which can lead to performance issues. |
The function first retrieves the pathway and pathwayInfo tables for the
specified species. If a list of genes is provided, it converts the IDs to
Ensembl IDs, matches them against the pathway map, and joins the results
with pathway metadata.
A data frame containing pathway information. If genes are provided,
the data frame is filtered to include only pathways containing those genes
and joined with gene mapping data.
    # CAUTION: The human database is very large; running these examples
    # requires downloading the human database.

    # Get GOBP pathways for our genes of interest
    path_info <- get_pathways(
      species_id = 96,
      genes = rownames(hypoxia_reads),
      category = "GOBP"
    )
    head(path_info)
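The matching and joining steps described for get_pathways can be pictured with two small tables. This is only a sketch on toy data: the table contents and the column names (`gene`, `pathway_id`, `description`) are assumptions made for illustration and may not match the database schema or the package's internal code.

```r
# Toy stand-ins for the pathway map and pathway metadata tables.
# Column names and IDs are hypothetical, not pathdb's actual schema.
pathway_map <- data.frame(
  gene = c("ENSG01", "ENSG02", "ENSG02", "ENSG03"),
  pathway_id = c("P1", "P1", "P2", "P3"),
  stringsAsFactors = FALSE
)
pathway_info <- data.frame(
  pathway_id = c("P1", "P2", "P3"),
  description = c("response to hypoxia", "glycolysis", "cell cycle"),
  stringsAsFactors = FALSE
)

# Genes of interest, assumed to be converted to Ensembl-style IDs already
genes_of_interest <- c("ENSG02", "ENSG99")

# Step 1: keep only the map rows whose gene is in the list of interest
matched <- pathway_map[pathway_map$gene %in% genes_of_interest, ]

# Step 2: attach pathway metadata by joining on the pathway ID
annotated <- merge(matched, pathway_info, by = "pathway_id")
print(annotated)
```

Genes with no pathway annotation (here "ENSG99") simply drop out of the result, which is consistent with the documented behavior of returning only pathways that contain the supplied genes.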
Retrieves a specific table from the database for a selected species.
    get_table(species_id = NULL, table = NULL)
species_id |
Numeric. The selected species ID.
If |
table |
Character. The name of the table to retrieve (e.g., "geneInfo", "pathway").
If |
A data frame containing the data from the selected table.
See list_tables to view the tables available for a species.
    # Retrieve geneInfo table for Indian Cobra Species
    cobra_genes <- get_table(species_id = 99, table = "geneInfo")

    # View table
    head(cobra_genes)
Results of performing differential expression analysis (DESeq2) on gene counts gathered in the following experiment: RNAseq transcriptomic profile of glioblastoma stem-like cells derived from U87MG cell line treated with a selective A3 adenosine receptor antagonist (MRS1220) under hypoxia.
    hypoxia_deseq

A data frame with 13,818 rows and 6 columns:
baseMean: Mean of normalized counts for all samples
log2FoldChange: Log2 fold change between treated and control
lfcSE: Standard error estimate for the log2 fold change estimate
stat: Wald statistic
pvalue: Wald test p-value
padj: Benjamini-Hochberg adjusted p-value
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE100146
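To make the Benjamini-Hochberg column concrete, the sketch below recomputes adjusted p-values on a toy results table with `stats::p.adjust` and then selects genes by significance and effect size. The column names follow DESeq2's standard results output and are assumed here; the values and the thresholds (0.05 and an absolute log2 fold change of 1) are illustrative choices, not recommendations from the package.

```r
# Toy differential expression results (values invented for illustration).
# Column names follow DESeq2's usual output and are an assumption here.
toy_res <- data.frame(
  log2FoldChange = c(2.1, -0.4, -1.6, 0.2),
  pvalue = c(0.001, 0.02, 0.03, 0.5),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Benjamini-Hochberg adjustment of the raw p-values
toy_res$padj <- p.adjust(toy_res$pvalue, method = "BH")

# Keep genes that are significant after adjustment and change at least 2-fold
significant <- toy_res[
  toy_res$padj < 0.05 & abs(toy_res$log2FoldChange) >= 1,
]
print(significant)
```

The same filter could be applied to hypoxia_deseq, provided its columns carry these names.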
Gene counts gathered in the following experiment: RNAseq transcriptomic profile of glioblastoma stem-like cells derived from U87MG cell line treated with a selective A3 adenosine receptor antagonist (MRS1220) under hypoxia.
    hypoxia_reads

A data frame with 35,238 rows and 4 columns:
Treatment replication 1 counts
Treatment replication 2 counts
Control replication 1 counts
Control replication 2 counts
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE100146
An example TERM2GENE mapping tailored for the hypoxia_deseq dataset,
demonstrating the T2G_prep function's output.
    hypoxia_T2G
A data frame with 2 columns:
Pathway ID or description
Ensembl Gene ID
Lists all available tables within the database for a specific species.
    list_tables(species_id = NULL)
species_id |
Numeric. The selected species ID.
If |
A character vector of table names available in the database connection.
    # List all tables available for species 99 (Indian Cobra)
    list_tables(species_id = 99)
Retrieves pathway category options (e.g. KEGG, GOBP, etc.) for a given species. May take longer for well-documented species (e.g. human).
    path_categories(species_id = NULL)
species_id |
Numeric. The ID of a desired species from database, found
using |
Data frame of pathway categories for given species
    # Get pathway categories for species 99 (Indian Cobra)
    categories <- path_categories(species_id = 99)
    head(categories)
Retrieves pathway mapping information for a specific species, filtered by one or more pathway categories (e.g., "GOBP", "KEGG"). Optionally, the results can be further restricted to a specific list of genes.
    path_filter(species_id, genes = NULL, category = "GOBP")
species_id |
Numeric. The ID of the species to search for. |
genes |
Character vector (optional). A vector of gene IDs to filter the pathways.
If |
category |
Character or character vector. The pathway category or categories
to filter by (e.g., |
A data frame containing the pathway mapping information (such as gene, pathway ID, and description) for the specified categories and genes.
    # Get all GO Biological Process pathways for Human (ID 96)
    gobp_paths <- path_filter(species_id = 96, category = "GOBP")

    # Get KEGG pathways for specific genes in a dataset
    data(hypoxia_reads)
    kegg_paths <- path_filter(
      species_id = 96,
      genes = rownames(hypoxia_reads)[1:100],
      category = "KEGG"
    )
Performs pre-processing, missing value imputation, filtering, and transformation on gene expression count data.
    process_data(
      data,
      missing_value = "geneMedian",
      min_cpm = 0.5,
      n_min_samples = 1,
      rescale = FALSE
    )
data |
A numeric matrix or data frame (> 1 columns) of gene expression counts. |
missing_value |
Character. Method to handle missing values. Options:
|
min_cpm |
Numeric. Minimum counts per million threshold for filtering genes. |
n_min_samples |
Numeric. Minimum number of samples that must meet
the |
rescale |
Logical. TRUE allows for rescaling if values are exceedingly large. |
The processed and transformed data matrix.
    # Check example data
    summary(pathdb::hypoxia_reads)
    nrow(pathdb::hypoxia_reads)

    # YOU decide how your data is transformed.
    # Here, we want to:
    #   Replace missing values with median
    #   Set minimum counts-per-million of 0.4
    #   Meet CPM threshold in 2 samples
    #   Keep raw counts
    hypox_filtered <- process_data(data = pathdb::hypoxia_reads,
                                   missing_value = "geneMedian",
                                   min_cpm = 0.4,
                                   n_min_samples = 2)

    # Check filtered data
    summary(hypox_filtered)
    nrow(hypox_filtered)
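For readers unfamiliar with counts-per-million filtering, the sketch below shows one common way the `min_cpm` and `n_min_samples` thresholds can interact with per-gene median imputation. It is an illustration in base R under stated assumptions, not the package's implementation: `filter_by_cpm` is a hypothetical function, and process_data may order or define these steps differently.

```r
# Hypothetical sketch (not pathdb's implementation) of median imputation
# followed by a counts-per-million (CPM) filter.
filter_by_cpm <- function(counts, min_cpm = 0.5, n_min_samples = 1) {
  counts <- as.matrix(counts)
  # Replace each missing value with the median of that gene's observed counts
  for (i in seq_len(nrow(counts))) {
    missing <- is.na(counts[i, ])
    if (any(missing)) {
      counts[i, missing] <- median(counts[i, !missing])
    }
  }
  # CPM: scale every sample (column) by its library size
  cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6
  # Keep genes that reach min_cpm in at least n_min_samples samples
  keep <- rowSums(cpm >= min_cpm) >= n_min_samples
  counts[keep, , drop = FALSE]
}

# Toy count matrix: geneB is barely expressed, geneC has one missing value
toy_counts <- matrix(
  c(500000, 400000, 450000, 520000,
         0,      0,      1,      0,
    499999,     NA, 549999, 480000),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("trt1", "trt2", "ctl1", "ctl2"))
)
filtered <- filter_by_cpm(toy_counts, min_cpm = 0.5, n_min_samples = 2)
print(rownames(filtered))
```

In this toy matrix geneB reaches the CPM threshold in only one sample, so requiring two samples removes it, while the raw counts of the remaining genes are returned unchanged apart from the imputed value.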
Searches the organism database for species matching a query string.
    search_species(query, name_type = "all")
query |
Character. The species name, partial name, or ID to search for. |
name_type |
Character. The type of name to search against. Options:
|
A data frame containing information for all matching species. Throws an error if no species are found.
    # Search all names for "Human"
    search_species(query = "Human", name_type = "all")

    # Search primary names for "Human"
    search_species(query = "Human", name_type = "primary")

    # Search academic names for "Homo sapiens"
    search_species(query = "Homo sapiens", name_type = "academic")

    # Search by species ID
    search_species(query = 96, name_type = "id")
Prepares background genes for enrichment analysis functions in the format of TERM2GENE data, using pathway information from various databases. Requires a species ID and can optionally filter for a specific vector of genes.
    T2G_prep(species_id = NULL, category = "GOBP", genes = NULL)
species_id |
Numeric. The ID of a desired species from database, found
using |
category |
Character. A single pathway category or a vector of categories/databases (e.g. KEGG, GOBP, GOCC). Using all categories is not recommended: some species have many, which can lead to performance issues. |
genes |
Character. A character vector of genes to add to query |
A data frame containing TERM2GENE Data (pathways to genes)
    # CAUTION: The human database is very large; running these examples
    # requires downloading the human database.

    # Prepare background genes mapping for Hypoxia dataset
    # Useful for pathway enrichment analysis of our data
    bg_genes <- T2G_prep(
      species_id = 96,
      category = "KEGG",
      genes = rownames(hypoxia_deseq)
    )
    head(bg_genes)
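To show how a TERM2GENE table is typically consumed, here is a small base R sketch of an over-representation test built on the hypergeometric distribution (`stats::phyper`). The table contents, the column names `term` and `gene`, and the helper `ora` are all hypothetical; dedicated enrichment packages perform this kind of test with further corrections, so treat this only as an outline of the idea.

```r
# Toy TERM2GENE table: one row per (pathway, gene) pair.
# Column names and IDs are hypothetical.
t2g <- data.frame(
  term = c("P1", "P1", "P1", "P2", "P2", "P3"),
  gene = c("g1", "g2", "g3", "g3", "g4", "g5"),
  stringsAsFactors = FALSE
)
background <- unique(t2g$gene)  # all annotated genes (5 here)
hits <- c("g1", "g2")           # genes of interest, e.g. significant genes

# Hypothetical helper: over-representation p-value for one pathway
ora <- function(term_id) {
  in_term <- t2g$gene[t2g$term == term_id]
  k <- sum(hits %in% in_term)   # hits inside the pathway
  m <- length(in_term)          # pathway size
  n <- length(background) - m   # background genes outside the pathway
  # P(X >= k) under the hypergeometric distribution
  phyper(k - 1, m, n, length(hits), lower.tail = FALSE)
}

p_values <- vapply(unique(t2g$term), ora, numeric(1))
print(p_values)
```

For pathway P1, both hits fall inside a pathway of three genes drawn from a background of five, giving a probability of 3/10; pathways containing no hits get a p-value of 1.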