| Title: | Convert Gene IDs Between Each Other and Fetch Annotations from Biomart |
|---|---|
| Description: | Gene Symbols or Ensembl Gene IDs are converted using the Bimap interface in 'AnnotationDbi' in convertId2(), but that function is provided only as a fallback mechanism for the most common use cases in data analysis. The main function in the package is convert.bm(), which queries BioMart using the full capacity of the API provided through the 'biomaRt' package. Presets and defaults are provided for convenience, but all "marts", "filters" and "attributes" can be set by the user. Function convert.alias() converts Gene Symbols to Aliases and vice versa, and function likely_symbol() attempts to determine the most likely current Gene Symbol. |
| Authors: | Vidal Fey [aut, cre], Henrik Edgren [aut] |
| Maintainer: | Vidal Fey <[email protected]> |
| License: | GPL-3 |
| Version: | 0.3.0 |
| Built: | 2026-05-31 07:08:15 UTC |
| Source: | https://github.com/cran/convertid |
Add values to cache
.addToCache(bfc, result, hash)
bfc |
Object of class BiocFileCache, created by a call to BiocFileCache::BiocFileCache() |
result |
character; name of the file written to cache |
hash |
unique hash representing a query. |
Check whether value in cache exists
.checkInCache(bfc, hash, verbose = FALSE)
bfc |
Object of class BiocFileCache, created by a call to BiocFileCache::BiocFileCache() |
hash |
unique hash representing a query. |
verbose |
logical; should additional verbose output be printed? Not currently used. |
This function returns TRUE if a record with the requested hash already exists in the file cache, otherwise FALSE.
Read values from cache
.readFromCache(bfc, hash)
bfc |
Object of class BiocFileCache, created by a call to BiocFileCache::BiocFileCache() |
hash |
unique hash representing a query. |
convert.alias() attempts to find all possible symbol-alias combinations for a given gene symbol, i.e.,
it assumes the input ID to be either an Alias or a Symbol and performs multiple queries to find all possible
counterparts. The input IDs are converted to title and upper case before querying and all possibilities are tested.
There are species presets for Human and Mouse annotations.
convert.alias(id, species = c("Human", "Mouse"), db = NULL)
id |
( |
species |
( |
db |
( |
A data.frame with two columns:
| 'SYMBOL': The official gene symbol. |
| 'ALIAS': All possible aliases. |
convert.alias("TRPV4")
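The case handling described above (input IDs converted to title and upper case before querying) can be illustrated with a small base R sketch. The helper name `case_variants` is hypothetical and not part of the package; it only shows which spellings such a lookup might try, under the assumption that "title case" means first character upper case and the rest lower case.

```r
# Hypothetical helper (not part of convertid): build the case variants
# that a symbol/alias lookup might try for each input ID.
case_variants <- function(id) {
  id <- as.character(id)
  upper <- toupper(id)
  # Assumed meaning of "title case": first character upper, rest lower
  title <- paste0(toupper(substring(id, 1, 1)), tolower(substring(id, 2)))
  # Keep the original spelling first and drop repeated variants
  unique(c(id, upper, title))
}

variants <- case_variants("trpv4")
print(variants)
```

Human symbols are usually written in upper case and mouse symbols in title case, which is a plausible reason for testing both forms.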
convert.bm() is a wrapper for get.bm() which in turn makes use of getBM() from the biomaRt package.
It takes a matrix or data frame with the IDs to be converted in one column or as row names as input and returns a data frame with additional
annotations after cleaning the fetched annotations and merging them with the input data frame.
convert.bm(dat, id = "ID", biom.data.set = c("human", "mouse"), biom.mart = c("ensembl", "mouse", "snp", "funcgen", "plants"), host = "https://www.ensembl.org", biom.filter = "ensembl_gene_id", biom.attributes = c("ensembl_gene_id", "hgnc_symbol", "description"), biom.cache = rappdirs::user_cache_dir("biomaRt"), use.cache = TRUE, sym.col = "hgnc_symbol", rm.dups = FALSE, verbose = FALSE)
dat |
|
id |
|
biom.data.set |
|
biom.mart |
|
host |
|
biom.filter |
|
biom.attributes |
|
biom.cache |
|
use.cache |
( |
sym.col |
|
rm.dups |
|
verbose |
( |
Wrapped around 'get.bm'.
A data frame with the retrieved information.
Vidal Fey
## Not run:
dat <- data.frame(ID=c("ENSG00000111199", "ENSG00000134121", "ENSG00000176102", "ENSG00000171611"))
bm <- convert.bm(dat)
bm
## End(Not run)
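The "clean and merge" step described for convert.bm() can be sketched offline in base R. This is only an illustration of the general idea, not the package's implementation: the `annot` table stands in for fetched annotations, and its symbols and descriptions are placeholders rather than real BioMart output.

```r
# Input data frame with the IDs in the "ID" column
dat <- data.frame(
  ID = c("ENSG00000111199", "ENSG00000134121", "ENSG00000176102"),
  value = c(1.2, 0.4, 3.1),
  stringsAsFactors = FALSE
)

# Stand-in for fetched annotations (placeholder values, NOT real BioMart
# output). One ID is returned twice and one ID is missing.
annot <- data.frame(
  ensembl_gene_id = c("ENSG00000111199", "ENSG00000111199", "ENSG00000134121"),
  hgnc_symbol = c("SYM_A", "SYM_A", "SYM_B"),
  description = c("placeholder A", "placeholder A", "placeholder B"),
  stringsAsFactors = FALSE
)

# Clean: drop exact duplicate annotation rows
annot <- unique(annot)

# Merge: keep every input row, even when no annotation was found
merged <- merge(dat, annot, by.x = "ID", by.y = "ensembl_gene_id",
                all.x = TRUE, sort = FALSE)
print(merged)
```

Using `all.x = TRUE` keeps unannotated input rows with `NA` in the annotation columns, so the result still lines up with the original data.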
convertId2() uses the Bimap interface in AnnotationDbi to extract information from
annotation packages. The function is limited to Human and Mouse annotations and is provided only as
a fallback mechanism for the most common use cases in data analysis. Please use the Biomart interface
function convert.bm() for more flexibility.
convertId2(id, species = c("Human", "Mouse"))
id |
( |
species |
( |
A named character vector where the input IDs are the names and the query results the values.
convertId2("ENSG00000111199")
convertId2("TRPV4")
get.bm() is a user-friendly wrapper for getBM() from the biomaRt package with default
settings for Human and Mouse.
It sets all needed variables and performs the query.
get.bm(values, biom.data.set = c("human", "mouse"), biom.mart = c("ensembl", "mouse", "snp", "funcgen", "plants"), host = "https://www.ensembl.org", biom.filter = "ensembl_gene_id", biom.attributes = c("ensembl_gene_id", "hgnc_symbol", "description"), biom.cache = rappdirs::user_cache_dir("biomaRt"), use.cache = TRUE, verbose = FALSE)
values |
|
biom.data.set |
|
biom.mart |
|
host |
|
biom.filter |
|
biom.attributes |
|
biom.cache |
|
use.cache |
( |
verbose |
( |
A data frame with the retrieved information.
Vidal Fey
## Not run:
val <- c("ENSG00000111199", "ENSG00000134121", "ENSG00000176102", "ENSG00000171611")
bm <- get.bm(val)
bm
## End(Not run)
likely_symbol() downloads the latest version of the HGNC gene symbol database as a text
file and queries it to obtain symbol aliases, previous symbols and all symbols currently in use. Optionally
treating the input ID as an Alias, a Symbol or a Previous Symbol, it performs multiple queries and
compares the results of all possible combinations to determine a likely current Symbol.
The downloaded HGNC table is cached for the duration of the R session to avoid repeated downloads.
likely_symbol(syms, alias_sym = TRUE, prev_sym = TRUE, orgnsm = "human", hgnc = NULL, hgnc_url = NULL, output = c("likely", "symbols", "all"), index_threshold = 10L, refresh = FALSE, verbose = TRUE)
syms |
( |
alias_sym |
( |
prev_sym |
( |
orgnsm |
( |
hgnc |
( |
hgnc_url |
( |
output |
( |
index_threshold |
( |
refresh |
( |
verbose |
( |
The HGNC table is downloaded once per R session and cached in a package-level environment. Subsequent calls
reuse the cached table without any network access. If the cached table is more than 3 days old, a warning message
is emitted recommending a refresh, since the HGNC database is updated monthly. To force a fresh download within
the same session, use refresh = TRUE or start a new R session.
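A session-level cache of this kind can be sketched with an environment in base R. The names `.session_cache` and `get_cached_table`, and the `loader` argument, are assumptions made for this illustration and are not the package's internals; the loader here is a stub that performs no download.

```r
# Hypothetical session cache (illustrative names, not convertid internals)
.session_cache <- new.env(parent = emptyenv())

get_cached_table <- function(loader, refresh = FALSE, max_age_days = 3) {
  missing_tab <- !exists("table", envir = .session_cache, inherits = FALSE)
  if (refresh || missing_tab) {
    # Load once and remember when the table was obtained
    assign("table", loader(), envir = .session_cache)
    assign("stamp", Sys.time(), envir = .session_cache)
  }
  age <- as.numeric(difftime(Sys.time(),
                             get("stamp", envir = .session_cache),
                             units = "days"))
  if (age > max_age_days) {
    warning("Cached table is more than ", max_age_days,
            " days old; consider refresh = TRUE.")
  }
  get("table", envir = .session_cache)
}

# Stub loader that counts how often it is called (no network access)
counter <- new.env()
counter$n <- 0
loader <- function() {
  counter$n <- counter$n + 1
  data.frame(symbol = "GENE1", stringsAsFactors = FALSE)
}

t1 <- get_cached_table(loader)                  # loads the table
t2 <- get_cached_table(loader)                  # reuses the cached copy
t3 <- get_cached_table(loader, refresh = TRUE)  # forces a reload
```

Storing the timestamp next to the table is what makes the age check possible without touching the network.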
When the number of unique input symbols is at or above index_threshold, inverted indices (hash tables)
are pre-built from the HGNC table so that each per-symbol lookup is O(1) rather than O(nrow(hgnc)), giving
roughly a 50-100x speedup for batch inputs. For small inputs the original row-scan is retained to avoid the
index-building overhead.
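The difference between a row scan and a pre-built inverted index can be sketched in base R using an environment as a hash table. The toy table, its made-up symbols, the pipe-separated `alias_symbol` column and all function names below are assumptions for illustration only, not the package's code.

```r
# Toy stand-in for the HGNC table; symbols and aliases are made up
hgnc <- data.frame(
  symbol = c("GENE1", "GENE2", "GENE3"),
  alias_symbol = c("AL1|AL2", "", "AL3"),
  stringsAsFactors = FALSE
)

# Row scan: every lookup walks over all rows of the table
scan_lookup <- function(alias, tab) {
  hits <- vapply(strsplit(tab$alias_symbol, "|", fixed = TRUE),
                 function(a) alias %in% a, logical(1))
  tab$symbol[hits]
}

# Inverted index: alias -> symbols, stored in a hashed environment
build_index <- function(tab) {
  idx <- new.env(hash = TRUE, parent = emptyenv())
  aliases <- strsplit(tab$alias_symbol, "|", fixed = TRUE)
  for (i in seq_len(nrow(tab))) {
    for (a in aliases[[i]]) {
      prev <- if (exists(a, envir = idx, inherits = FALSE)) {
        get(a, envir = idx, inherits = FALSE)
      } else {
        character(0)
      }
      assign(a, c(prev, tab$symbol[i]), envir = idx)
    }
  }
  idx
}

index_lookup <- function(alias, idx) {
  if (exists(alias, envir = idx, inherits = FALSE)) {
    get(alias, envir = idx, inherits = FALSE)
  } else {
    character(0)
  }
}

# Build the index only when there are enough unique inputs to pay for it
lookup_all <- function(aliases, tab, index_threshold = 10L) {
  if (length(unique(aliases)) >= index_threshold) {
    idx <- build_index(tab)
    lapply(aliases, index_lookup, idx = idx)
  } else {
    lapply(aliases, scan_lookup, tab = tab)
  }
}

res_scan  <- lookup_all(c("AL2", "AL3"), hgnc)                        # row scan
res_index <- lookup_all(c("AL2", "AL3"), hgnc, index_threshold = 1L)  # index
```

Both paths return the same answers; only the cost per lookup differs, which is why the threshold can be tuned freely.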
A data.frame with the following columns depending on the output setting.
output="likely":
| 'likely_symbol' |
| 'input_symbol' |
output="symbols":
| 'current_symbols' |
| 'likely_symbol' |
| 'input_symbol' |
| 'all_symbols' |
output="all":
| 'orig_input' |
| 'organism' |
| 'current_symbols' |
| 'likely_symbol' |
| 'input_symbol' |
| 'all_symbols' |
Only fully implemented for Human for now.
## Not run:
# Single symbol lookup (uses row-scan, no index overhead)
likely_symbol("CCBL1")
# Second call reuses the cached HGNC table (no download)
likely_symbol("KAAT1")
# Force a fresh download within the same session
likely_symbol("CCBL1", refresh = TRUE)
# Batch lookup (an index is built once the number of unique symbols reaches index_threshold)
likely_symbol(c("ABCC4", "ACPP", "KIAA1524"))
# Supply a pre-loaded table to bypass cache and download entirely
likely_symbol(c("ABCC4", "ACPP"), hgnc = my_hgnc_table)
## End(Not run)
todisp2() uses Biomart by employing get.bm() to retrieve Gene Symbols for a set of Ensembl
Gene IDs. It is mainly meant as a fast way to convert IDs in standard gene expression analysis output to Symbols,
e.g., for visualisation, which is why the input ID type is hard-coded to ENSG IDs. If Biomart is not available,
the function can fall back to convertId2() or to a user-provided data frame with corresponding ENSG IDs and
Symbols.
todisp2(ensg, lab = NULL, biomart = TRUE, biom.data.set = "hsapiens_gene_ensembl", biom.mart = "ensembl", host = "https://www.ensembl.org", biom.filter = "ensembl_gene_id", biom.attributes = c("ensembl_gene_id", "hgnc_symbol"), biom.cache = rappdirs::user_cache_dir("biomaRt"), use.cache = TRUE, keep.original = TRUE, verbose = FALSE)
ensg |
( |
lab |
( |
biomart |
( |
biom.data.set |
|
biom.mart |
|
host |
|
biom.filter |
|
biom.attributes |
|
biom.cache |
|
use.cache |
( |
keep.original |
( |
verbose |
( |
A character vector of Gene Symbols.
## Not run:
val <- c("ENSG00000111199", "ENSG00000134121", "ENSG00000176102", "ENSG00000171611")
sym <- todisp2(val)
sym
## End(Not run)
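The fallback to a user-provided lookup table can be sketched offline in base R. Everything here is an assumption for illustration: the helper `to_symbols`, the column names `ensg` and `symbol`, the placeholder symbols, and in particular the reading of `keep.original` as "keep the input ID where no symbol is found".

```r
ensg <- c("ENSG00000111199", "ENSG00000134121", "ENSG00000176102")

# Hypothetical lookup table; column names and symbols are placeholders
lab <- data.frame(
  ensg = c("ENSG00000111199", "ENSG00000134121"),
  symbol = c("SYM_A", "SYM_B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map IDs to symbols via the lookup table
to_symbols <- function(ensg, lab, keep.original = TRUE) {
  sym <- lab$symbol[match(ensg, lab$ensg)]
  if (keep.original) {
    # Assumed behaviour: fall back to the input ID where no symbol exists
    missing_sym <- is.na(sym)
    sym[missing_sym] <- ensg[missing_sym]
  }
  sym
}

sym <- to_symbols(ensg, lab)
print(sym)
```

Returning a vector of the same length and order as the input keeps the result usable as plot labels.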
Takes a data frame with Ensembl gene IDs (and optionally gene symbols) and returns a deduplicated data frame with unified HGNC symbols, using a priority-based reconciliation of BioMart and AnnotationDbi results.
unify_gene_ids(genes, ensg_col = "ensembl_gene_id", symbol_col = NULL, host = "https://www.ensembl.org", biomart_fallback = c("https://uswest.ensembl.org", "https://asia.ensembl.org", "https://useast.ensembl.org"), keep_intermediates = FALSE, verbose = FALSE)
genes |
A data frame with at minimum an Ensembl gene ID column or a character vector of Ensembl gene IDs. |
ensg_col |
Name of the column containing Ensembl gene IDs.
Default: |
symbol_col |
Name of the column containing gene symbols, or |
host |
BioMart host URL. Default: |
biomart_fallback |
Character vector of fallback BioMart host URLs to try
if the primary host fails. Set to |
keep_intermediates |
Logical; if |
verbose |
Logical; if |
Requires the Bioconductor packages org.Hs.eg.db and AnnotationDbi. These are not hard dependencies but will be checked at runtime with an informative error if missing.
Deduplication passes
The function performs two sequential deduplication passes via the internal
dedup_gene_ids() function:
1. Deduplicate by gene_name (if available) or ensembl_gene_id, resolving multiple ENSG IDs mapping to the same gene name.
2. Deduplicate by hgnc_symbol, resolving cases where multiple gene names resolve to the same symbol.
Symbol assignment priority
The guiding principle is that AnnotationDbi confirmation outranks BioMart ordering. AnnotationDbi (org.Hs.eg.db) reflects a stable, versioned annotation database, while BioMart returns the current Ensembl release which may be ahead of annotations used to build real-world count matrices. Preferring AnnotationDbi-confirmed IDs therefore maximises compatibility with count matrices from sequencing providers whose pipelines are not frequently updated.
Within each group of rows sharing a gene_name, the following priority
order is applied until a single row is selected:
1. Pre-filter: If any row has hgnc_symbol_2 == gene_name (AnnotationDbi confirms the symbol), rows with hgnc_symbol_2 == NA are discarded first. This ensures that an AnnotationDbi-confirmed row is never passed over in favour of an unconfirmed one merely because the latter happens to have hgnc_symbol == gene_name from BioMart.
2. BioMart symbol match: Rows where hgnc_symbol == gene_name (and is not a raw ENSG placeholder).
3. AnnotationDbi symbol match: Rows where hgnc_symbol_2 == gene_name (and is not a raw ENSG placeholder).
4. Both sources agree: Rows where hgnc_symbol == hgnc_symbol_2, indicating cross-source confirmation.
5. BioMart ENSG confirmation: Rows whose ensembl_gene_id matches the first entry in the ///-separated ensg_2 list returned by AnnotationDbi. Note that ensg_2 list ordering is not considered a reliable preference signal on its own; this filter is intentionally placed after source-agreement filters.
6. Drop ENSG placeholders: Rows where hgnc_symbol is still a raw ENSG ID are deprioritised.
7. Last resort: When all disambiguation fields (hgnc_symbol_2, ensg_2) are NA across the entire group, the first row is taken. When rows are otherwise identical in all metadata, the newer ENSG ID (as returned by BioMart) is preferred as the more current annotation.
The second pass (by hgnc_symbol) applies the same principle but
additionally prefers rows whose hgnc_symbol matches gene_name,
and uses AnnotationDbi ENSG confirmation as a tiebreaker before falling back
to x[1, ].
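The first-pass priority chain can be sketched as a sequence of filters, each applied only if it leaves at least one row. This is a simplified illustration under stated assumptions, not the internal dedup_gene_ids() function: the helper names, the ENSG pattern, the handling of the /// separator, and the toy IDs and symbols are all made up for the example.

```r
# Apply a filter only if it leaves at least one row; NA counts as "no"
keep_if_any <- function(x, keep) {
  keep <- !is.na(keep) & keep
  if (any(keep)) x[keep, , drop = FALSE] else x
}

is_ensg <- function(s) grepl("^ENSG[0-9]+$", s)

# Hypothetical selector for one group of rows sharing a gene_name
pick_row <- function(x) {
  # Pre-filter: if AnnotationDbi confirms gene_name for any row,
  # discard rows without an AnnotationDbi symbol
  confirmed <- !is.na(x$hgnc_symbol_2) & x$hgnc_symbol_2 == x$gene_name
  if (any(confirmed)) x <- x[!is.na(x$hgnc_symbol_2), , drop = FALSE]
  # BioMart symbol match
  x <- keep_if_any(x, x$hgnc_symbol == x$gene_name & !is_ensg(x$hgnc_symbol))
  # AnnotationDbi symbol match
  x <- keep_if_any(x, x$hgnc_symbol_2 == x$gene_name & !is_ensg(x$hgnc_symbol_2))
  # Both sources agree
  x <- keep_if_any(x, x$hgnc_symbol == x$hgnc_symbol_2)
  # ENSG ID equals the first entry of the "///"-separated ensg_2 list
  first_ensg <- trimws(sub("///.*$", "", x$ensg_2))
  x <- keep_if_any(x, x$ensembl_gene_id == first_ensg)
  # Deprioritise rows whose symbol is still a raw ENSG placeholder
  x <- keep_if_any(x, !is_ensg(x$hgnc_symbol))
  # Last resort: take the first remaining row
  x[1, , drop = FALSE]
}

# Toy group; IDs and symbols are made up for illustration
grp <- data.frame(
  ensembl_gene_id = c("ENSG00000000001", "ENSG00000000002"),
  gene_name       = c("GENE1", "GENE1"),
  hgnc_symbol     = c("GENE1", "ENSG00000000002"),
  hgnc_symbol_2   = c(NA, "GENE1"),
  ensg_2          = c(NA, "ENSG00000000002 /// ENSG00000000009"),
  stringsAsFactors = FALSE
)

picked <- pick_row(grp)
print(picked)
```

In this toy group the first row matches gene_name through BioMart only, while the second is confirmed by the AnnotationDbi column, so the pre-filter removes the first row before the BioMart match is even considered.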
ENSG placeholder resolution
After the filter chain, any remaining rows where hgnc_symbol is a raw
ENSG placeholder are fixed: if hgnc_symbol_2 is available it is used;
otherwise gene_name is used (or ensembl_gene_id in ENSG-only
mode). This allows rows with ENSG placeholders from BioMart to be correctly
resolved in the second pass via their hgnc_symbol_2 value.
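The placeholder fix described above can be sketched in base R as follows. The function name `resolve_placeholders`, the regular expression used to detect raw ENSG IDs, and the toy rows are assumptions for illustration; the column names are those used in this section.

```r
# Hypothetical helper: replace raw ENSG placeholders in hgnc_symbol
resolve_placeholders <- function(df) {
  is_placeholder <- grepl("^ENSG[0-9]+$", df$hgnc_symbol)
  # Without a gene_name column (ENSG-only mode) fall back to the ENSG ID
  fallback <- if ("gene_name" %in% names(df)) df$gene_name else df$ensembl_gene_id
  # Prefer the AnnotationDbi symbol when it is available
  replacement <- ifelse(is.na(df$hgnc_symbol_2), fallback, df$hgnc_symbol_2)
  df$hgnc_symbol[is_placeholder] <- replacement[is_placeholder]
  df
}

# Toy rows; IDs and symbols are made up for illustration
df <- data.frame(
  ensembl_gene_id = c("ENSG00000000001", "ENSG00000000002", "ENSG00000000003"),
  gene_name       = c("OLDNAME1", "GENE2", "GENE3"),
  hgnc_symbol     = c("ENSG00000000001", "ENSG00000000002", "GENE3"),
  hgnc_symbol_2   = c("GENE1", NA, "GENE3"),
  stringsAsFactors = FALSE
)

resolved <- resolve_placeholders(df)
print(resolved$hgnc_symbol)
```

Rows that already carry a real symbol are left untouched, so the step can be applied to the whole table without a prior subset.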
BioMart fallback
BioMart queries are attempted with graceful fallback through mirror hosts.
If all hosts fail the function proceeds with AnnotationDbi results only.
If both BioMart and AnnotationDbi fail entirely, the input is returned with
ENSG IDs used as hgnc_symbol values.
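Trying mirror hosts in turn can be sketched with tryCatch() in base R. The helper `try_hosts` and the simulated `fake_query` function are hypothetical and make no network calls; in real use the query function would wrap the actual BioMart request.

```r
# Hypothetical helper: return the first successful result, or NULL
try_hosts <- function(hosts, query_fun) {
  for (h in hosts) {
    res <- tryCatch(query_fun(h), error = function(e) NULL)
    if (!is.null(res)) return(list(host = h, result = res))
  }
  NULL  # all hosts failed; the caller proceeds without BioMart results
}

# Simulated query: only one mirror "responds" (no network access)
fake_query <- function(host) {
  if (host != "https://asia.ensembl.org") stop("host unavailable: ", host)
  data.frame(ensembl_gene_id = "ENSG00000000003", stringsAsFactors = FALSE)
}

hosts <- c("https://www.ensembl.org", "https://uswest.ensembl.org",
           "https://asia.ensembl.org", "https://useast.ensembl.org")

out <- try_hosts(hosts, fake_query)
all_failed <- try_hosts(hosts, function(host) stop("down"))
```

Returning NULL rather than raising an error lets the caller decide how to continue when every host is unreachable.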
A deduplicated data frame with unified HGNC symbols in the
hgnc_symbol column, plus hgnc_symbol_2 and ensg_2
columns from the AnnotationDbi lookups.
## Not run:
# Example input: two-column data frame with Ensembl IDs and gene symbols,
# as typically produced by a sequencing provider's count matrix annotation
my_genes <- data.frame(
  gene_id = c("ENSG00000000003", "ENSG00000000419", "ENSG00000000460", "ENSG00000012048", "ENSG00000075624", "ENSG00000111640", "ENSG00000141510", "ENSG00000146648"),
  gene_name = c("TSPAN6", "DPM1", "FIRRM", "BRCA1", "ACTB", "GAPDH", "TP53", "EGFR"),
  stringsAsFactors = FALSE
)
# With gene symbols (full mode)
result <- unify_gene_ids(my_genes, ensg_col = "gene_id", symbol_col = "gene_name", verbose = TRUE)
# ENSG-only (e.g. from count matrix row names, no symbol column available)
ensg_only <- data.frame(
  ensembl_gene_id = my_genes$gene_id,
  stringsAsFactors = FALSE
)
result_ensg <- unify_gene_ids(ensg_only, verbose = TRUE)
## End(Not run)