Title: | Identify and Correct Invalid HGNC Human Gene Symbols and MGI Mouse Gene Symbols |
---|---|
Description: | Contains functions for identifying and correcting HGNC human gene symbols and MGI mouse gene symbols which have been converted to date format by Excel, withdrawn, or aliased. Also contains functions for reversibly converting between HGNC symbols and valid R names. |
Authors: | Sehyun Oh [aut], Ayush Aggarwal [aut], Markus Riester [aut], Levi Waldron [aut, cre] |
Maintainer: | Levi Waldron <[email protected]> |
License: | GPL (>= 2.0) |
Version: | 0.8.15 |
Built: | 2024-12-17 06:59:10 UTC |
Source: | CRAN |
This function simply prepends "affy." to the probeset IDs to create valid R names. Reverse operation is done by the rToAffy function.
affyToR(x)
x |
vector of Affymetrix probeset identifiers, or any identifier which may begin with a digit. |
a character vector that is simply x with "affy." prepended to each value.
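The behaviour described above can be reproduced with base R alone. The following is a minimal sketch, not the package's source code; the probeset IDs are illustrative values, and `make.names()` is used only to show why IDs that start with a digit are not syntactically valid R names.

```r
# Sketch only: mimics the documented behaviour of affyToR() and rToAffy()
# with base R. The probeset IDs below are illustrative.
probes <- c("1007_s_at", "201746_at")

# IDs that begin with a digit are not syntactically valid R names:
# make.names() has to alter them.
is_valid <- make.names(probes) == probes

# Prepending "affy." makes the names start with a letter.
prefixed <- paste0("affy.", probes)

# Stripping the prefix restores the original identifiers.
restored <- sub("^affy\\.", "", prefixed)
```

Because the prefix is a fixed string, the conversion is trivially reversible, which is what allows the two functions to act as inverses of each other.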
This function identifies gene symbols which are outdated or may have been mogrified by Excel or other spreadsheet programs. If output is assigned to a variable, it returns a data.frame of the same number of rows as the input, with a second column indicating whether the symbols are valid and a third column with a corrected gene list.
checkGeneSymbols( x, chromosome = NULL, unmapped.as.na = TRUE, map = NULL, species = "human", expand.ambiguous = FALSE )
x |
A character vector of gene symbols to check for modified or outdated values |
chromosome |
An optional integer vector containing the chromosome number of each gene
provided through the argument |
unmapped.as.na |
If |
map |
Specify if you do not want to use the default maps provided by setting
species equal to "mouse" or "human". Map can be any other data.frame with colnames
identical to |
species |
A character vector of length 1, either "human" (default) or "mouse".
If |
expand.ambiguous |
If |
The function will return a data.frame of the same number of rows as the input, with corrections possible from map.
mouse.table for the mouse lookup table, hgnc.table for the human lookup table
library(HGNChelper)

## Human
human <- c("FN1", "TP53", "UNKNOWNGENE", "7-Sep", "9/7", "1-Mar", "Oct4",
           "4-Oct", "OCT4-PG4", "C19ORF71", "C19orf71")
checkGeneSymbols(human)

## Mouse
mouse <- c("1-Feb", "Pzp", "A2m")
checkGeneSymbols(mouse, species="mouse")

## expand.ambiguous
## Human
human <- "AAVS1"
checkGeneSymbols(human, expand.ambiguous=FALSE)
checkGeneSymbols(human, expand.ambiguous=TRUE)

## Mouse
mouse <- c("Cpamd8", "Mug2")
checkGeneSymbols(mouse, species = "mouse", expand.ambiguous = FALSE)
checkGeneSymbols(mouse, species = "mouse", expand.ambiguous = TRUE)

## Updating the map
if (interactive()) {
    currentHumanMap <- getCurrentHumanMap()
    checkGeneSymbols(human, map=currentHumanMap)
    # You should save this if you are going to use it multiple times,
    # then load it from file rather than burdening HGNC's servers.
    save(currentHumanMap, file="currentHumanMap.rda", compress="bzip2")
    load("currentHumanMap.rda")
    checkGeneSymbols(human, map=currentHumanMap)
}
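To make the lookup idea concrete, here is a minimal base-R sketch of map-based checking. It is not the package's implementation: `check_toy()`, `toy_map`, and the output column names `approved` and `suggested` are hypothetical names chosen for this illustration, and the toy map contains only a handful of hand-picked rows using the Symbol and Approved.Symbol column names of the lookup tables.

```r
# Toy lookup table (illustrative rows only, not the packaged hgnc.table).
toy_map <- data.frame(
  Symbol          = c("TP53", "FN1", "7-Sep", "SEPT7"),
  Approved.Symbol = c("TP53", "FN1", "SEPTIN7", "SEPTIN7"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per input symbol.
check_toy <- function(x, map) {
  idx <- match(x, map$Symbol)            # NA where the symbol is not in the map
  suggested <- map$Approved.Symbol[idx]  # NA for unmapped symbols
  data.frame(
    x = x,
    approved = !is.na(suggested) & suggested == x,
    suggested = suggested,
    stringsAsFactors = FALSE
  )
}

res <- check_toy(c("FN1", "7-Sep", "UNKNOWNGENE"), toy_map)
```

In this sketch an unmapped symbol yields `NA` in the third column, and the result always has one row per input symbol.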
This function identifies gene symbols which may have been mogrified by Excel or other spreadsheet programs. If output is assigned to a variable, it returns a vector of the same length where symbols which could be mapped have been mapped.
findExcelGeneSymbols( x, mog.map = read.csv(system.file("extdata/mog_map.csv", package = "HGNChelper"), as.is = TRUE), regex = "impossibletomatch^" )
x |
Vector of gene symbols to check for mogrified values |
mog.map |
Map of known mogrifications. This should be a dataframe with two columns: original and mogrified, containing the correct and incorrect symbols, respectively. |
regex |
Regular expression, recognized by the base::grep function which is called with ignore.case=TRUE, to identify mogrified symbols. The default regex will not match anything. The regex in the examples is an attempt to match all Excel-mogrified HGNC human gene symbols. It is not necessary for all matches to have a corresponding entry in mog.map$mogrified; values in x which are matched by this regex but are not found in mog.map$mogrified simply will not be corrected. |
If the return value of the function is assigned to a variable, the function will return a vector of the same length as the input, with the corrections available in mog.map applied.
## Available maps from this package:
human <- read.csv(system.file("extdata/mog_map.csv",
                              package = "HGNChelper"), as.is=TRUE)
mouse <- read.csv(system.file("extdata/HGNChelper_mog_map_MGI_AMC_2016_03_30.csv",
                              package = "HGNChelper"), as.is=TRUE)

## This regex is based on that provided by Zeeberg et al.,
## Mistaken Identifiers: Gene name errors can be introduced
## inadvertently when using Excel in bioinformatics. BMC
## Bioinformatics 2004, 5:80.
re <- "[0-9]\\-(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)|[0-9]\\.[0-9][0-9]E\\+[0-9][0-9]"
findExcelGeneSymbols(c("2-Apr", "APR2"), mog.map=human, regex=re)
findExcelGeneSymbols(c("1-Feb", "Feb1"), mog.map=mouse)
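The detection step on its own can be sketched with base R. The snippet below only flags values that look like Excel date or scientific-notation conversions, using `grepl()` with `ignore.case = TRUE` as the regex argument description indicates; it does not perform any correction, and the input vector is illustrative.

```r
# Pattern for "digit-MONTH" dates and "d.ddE+dd" scientific notation.
re <- paste0(
  "[0-9]\\-(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)",
  "|[0-9]\\.[0-9][0-9]E\\+[0-9][0-9]"
)

x <- c("2-Apr", "APR2", "7-Sep", "TP53", "2.31E+13")

# TRUE where a value looks like a spreadsheet-mangled symbol.
flagged <- grepl(re, x, ignore.case = TRUE)

# The suspicious entries, for manual review.
suspicious <- x[flagged]
```

Flagging and correcting are separate concerns: a value can match the pattern without having a known original symbol, in which case it can only be reported.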
Valid human and mouse gene symbols can be updated frequently. Use these functions to get the most current lists of valid symbols, which you can then use as an input to the map argument of checkGeneSymbols. Make sure to change the default species="human" argument to checkGeneSymbols if you are doing this for mouse. Use getCurrentHumanMap for HGNC human gene symbols from https://www.genenames.org/ and getCurrentMouseMap for MGI mouse gene symbols from https://www.informatics.jax.org/downloads/reports/MGI_EntrezGene.rpt.
getCurrentHumanMap()
getCurrentMouseMap()
A data.frame that can be used for the map argument of the checkGeneSymbols function.
## Not run:
## human
new.hgnc.table <- getCurrentHumanMap()
checkGeneSymbols(c("3-Oct", "10-3", "tp53"), map=new.hgnc.table)

## mouse
new.mouse.table <- getCurrentMouseMap()
## Set species to NULL or "mouse"
checkGeneSymbols(c("Gm46568", "1-Feb"), map=new.mouse.table, species="mouse")
## End(Not run)
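Since each call downloads a fresh table, it can be worth caching the result on disk. The helper below is a hypothetical sketch and is not part of HGNChelper: `cached_map()`, its arguments, and the 30-day default are assumptions made for illustration. It accepts any zero-argument function that returns a map.

```r
# Hypothetical helper (not provided by HGNChelper): return a cached map if the
# file exists and is recent enough, otherwise call fetch() and cache the result.
cached_map <- function(fetch, path, max_age_days = 30) {
  fresh_enough <- file.exists(path) &&
    as.numeric(difftime(Sys.time(), file.mtime(path), units = "days")) < max_age_days
  if (fresh_enough) {
    return(readRDS(path))
  }
  map <- fetch()       # e.g. a download from the upstream server
  saveRDS(map, path)   # store for later sessions
  map
}

# Possible usage (requires network access and the HGNChelper package):
# human_map <- cached_map(HGNChelper::getCurrentHumanMap, "hgnc_map.rds")
# HGNChelper::checkGeneSymbols(c("3-Oct", "tp53"), map = human_map)
```

Passing the download function as an argument keeps the helper independent of species, so the same sketch would apply to a mouse map.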
A data.frame with the first column providing a gene symbol or known alias (including withdrawn symbols), second column providing the approved HGNC human gene symbol.
Symbol: All valid, Excel-mogrified, and withdrawn symbols
Approved.Symbol: Approved symbols
hgnc.table
An object of class data.table (inherits from data.frame) with 103939 rows and 3 columns.
Extracted from https://storage.googleapis.com/public-download-files/hgnc/tsv/tsv/hgnc_complete_set.txt and system.file("extdata/mog_map.csv", package="HGNChelper")
data("hgnc.table", package="HGNChelper")
head(hgnc.table)
A data.frame with the first column providing a gene symbol or known alias (including withdrawn symbols), second column providing the approved MGI mouse gene symbol.
Symbol: All valid, Excel-mogrified, and withdrawn symbols
Approved.Symbol: Approved symbols
mouse.table
An object of class data.frame with 790110 rows and 2 columns.
Extracted from http://www.informatics.jax.org/downloads/reports/MGI_EntrezGene.rpt and system.file("extdata/HGNChelper_mog_map_MGI_AMC_2016_03_30.csv", package="HGNChelper")
data("mouse.table", package="HGNChelper")
head(mouse.table)
This function simply strips the "affy." added by the affyToR function.
rToAffy(x)
x |
the character vector returned by the affyToR function. |
a character vector of Affymetrix probeset identifiers.
This function reverses the actions of the symbolToR function.
rToSymbol(x)
x |
the character vector returned by the symbolToR function. |
a character vector of HGNC gene symbols, which are not in general valid R names.
This function reversibly converts HGNC gene symbols to valid R names by prepending "symbol.", and making the following substitutions: "-" to "hyphen", "@" to "ampersand", and "/" to "forwardslash".
symbolToR(x)
x |
vector of HGNC symbols |
a vector of valid R names, of the same length as x, which can be converted to the same HGNC symbols using the rToSymbol function.
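The substitutions described for symbolToR and rToSymbol can be sketched in base R as follows. This is an illustration of the documented scheme, not the package's source code; `to_r()` and `from_r()` are hypothetical names, and the sketch assumes that no input symbol already contains the literal words "hyphen", "ampersand", or "forwardslash".

```r
# Hypothetical forward conversion: replace special characters, add the prefix.
to_r <- function(x) {
  x <- gsub("-", "hyphen", x, fixed = TRUE)
  x <- gsub("@", "ampersand", x, fixed = TRUE)
  x <- gsub("/", "forwardslash", x, fixed = TRUE)
  paste0("symbol.", x)
}

# Hypothetical reverse conversion: strip the prefix, undo the substitutions.
from_r <- function(x) {
  x <- sub("^symbol\\.", "", x)
  x <- gsub("hyphen", "-", x, fixed = TRUE)
  x <- gsub("ampersand", "@", x, fixed = TRUE)
  gsub("forwardslash", "/", x, fixed = TRUE)
}

symbols <- c("HLA-A", "HOXA@", "TP53")
r_names <- to_r(symbols)
round_trip <- from_r(r_names)
```

Under the stated assumption the round trip returns the original symbols, and every converted name passes unchanged through `make.names()`.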