Title: | Removes the Cell-Cycle Effect from Single-Cell RNA-Sequencing Data |
---|---|
Description: | Implements a method for identifying and removing the cell-cycle effect from scRNA-Seq data. The description of the method is in Barron M. and Li J. (2016) <doi:10.1038/srep33892>. Identifying and removing the cell-cycle effect from single-cell RNA-Sequencing data. Submitted. Different from previous methods, ccRemover implements a mechanism that formally tests whether a component is cell-cycle related or not, and thus while it often thoroughly removes the cell-cycle effect, it preserves other features/signals of interest in the data. |
Authors: | Jun Li [aut, cre], Martin Barron [aut] |
Maintainer: | Jun Li <[email protected]> |
License: | GPL-3 |
Version: | 1.0.4 |
Built: | 2024-12-17 06:52:15 UTC |
Source: | CRAN |
This function is only used internally inside ccRemover. The function calculates the difference in average load between the cell-cycle and control genes. Bootstrap resampling is then used to provide a score for each component. Please see the original manuscript for the mathematical details.
bootstrap_diff(xy, xn, nboot = 200, bar = TRUE)
xy |
The data for the genes which are annotated to the cell-cycle, i.e.
those genes for which "if_cc" is TRUE. |
xn |
The data for the genes which are not annotated to the cell-cycle,
i.e. the control genes, for which "if_cc" is FALSE. |
nboot |
The number of bootstrap repetitions to be carried out on each iteration to determine the significance of each component. |
bar |
Whether to display a progress bar or not. The progress bar will
not work in R Markdown environments, so this option may be turned off there. The
default value is TRUE. |
A data frame containing the loadings for each component on the cell-cycle and control genes as well as the difference between the loadings and the bootstrapped statistic for each loading.
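As a rough illustration of the idea described above, the following base-R sketch scores components by the difference in average absolute load between cell-cycle and control genes, then rescales that difference by its bootstrap standard deviation. It is not the package's implementation: the toy data, the helper `load_diff`, and the exact definitions of the load and the score are assumptions made for illustration only. The manuscript gives the actual definitions.

```r
set.seed(1)
n_cells <- 30

# Toy, centered data (assumption): rows are genes, columns are cells.
# The 40 "cell-cycle" genes share a periodic signal; the 160 control genes are noise.
phase <- sin(seq(0, 2 * pi, length.out = n_cells))
xy <- outer(rnorm(40, mean = 2), phase) +
  matrix(rnorm(40 * n_cells, sd = 0.5), nrow = 40)
xn <- matrix(rnorm(160 * n_cells, sd = 0.5), nrow = 160)
xy <- xy - rowMeans(xy)
xn <- xn - rowMeans(xn)

# Leading directions of variation across cells, taken from all genes
pcs <- svd(rbind(xy, xn))$v[, 1:3]

# Hypothetical helper: average absolute load on each component,
# cell-cycle genes minus control genes
load_diff <- function(xy, xn, pcs) {
  colMeans(abs(xy %*% pcs)) - colMeans(abs(xn %*% pcs))
}
observed <- load_diff(xy, xn, pcs)

# Bootstrap: resample genes with replacement within each group
nboot <- 50
boot <- replicate(nboot, {
  load_diff(xy[sample(nrow(xy), replace = TRUE), , drop = FALSE],
            xn[sample(nrow(xn), replace = TRUE), , drop = FALSE],
            pcs)
})

# Score each component by its observed difference relative to the bootstrap spread
result <- data.frame(component = 1:3,
                     diff = observed,
                     score = observed / apply(boot, 1, sd))
print(result)
```

In this toy setting, a component with a large score would be the kind of candidate that a cutoff such as the `cutoff` argument of ccRemover is meant to flag.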
ccRemover returns a data matrix with the effects of the cell-cycle removed.
ccRemover(dat, cutoff = 3, max_it = 4, nboot = 200, ntop = 10, bar = TRUE)
dat |
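A list containing a data frame x of gene expression values and a vector if_cc indicating which genes are annotated to the cell-cycle (as constructed in the example below). It is recommended that the elements of x are log-transformed and centered
for each gene. |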
cutoff |
The significance cutoff for identifying sources of variation related to the cell-cycle. The default value is 3, which roughly corresponds to a p-value of 0.01. |
max_it |
The maximum number of iterations for the algorithm. The default value is 4. |
nboot |
The number of bootstrap repetitions to be carried out on each iteration to determine the significance of each component. |
ntop |
The number of components tested at each iteration as potential cell-cycle effects. The default value is 10. |
bar |
Whether to display a progress bar or not. The progress bar will
not work in R Markdown environments, so this option may be turned off there. The
default value is TRUE. |
Implements the algorithm described in Barron, M. & Li, J. "Identifying and removing the cell-cycle effect from scRNA-Sequencing data" (2016), Scientific Reports. This function takes a normalized, log-transformed and centered matrix of scRNA-seq expression data and a list of genes which are known to be related to the cell-cycle effect. It then captures the main sources of variation in the data and determines which of these are related to the cell-cycle before removing those that are. Please see the original manuscript for further details.
A data matrix with the effects of the cell-cycle removed.
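The recommended input format (log-transformed, gene-centered values plus a cell-cycle indicator) can be sketched in base R as follows. The raw values, the `log2(raw + 1)` transform with a pseudocount of 1, and the choice of which genes are flagged are all assumptions made for illustration; only the list layout with `x` and `if_cc` follows the package example.

```r
set.seed(2)

# Hypothetical raw expression values: 6 genes (rows) by 5 cells (columns)
raw <- matrix(rpois(30, lambda = 20), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("cell", 1:5)))

# Log-transform (pseudocount of 1 is an assumption), then center each gene at zero
x <- log2(raw + 1)
x <- x - rowMeans(x)

# Flag the cell-cycle genes; here the first two genes are assumed to be annotated
if_cc <- rep(FALSE, nrow(x))
if_cc[1:2] <- TRUE

# Bundle into the list layout used in the package example
dat <- list(x = x, if_cc = if_cc)
str(dat)
```

Centering by row, rather than by column, keeps each gene's mean at zero across cells, which matches the "centered for each gene" recommendation.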
set.seed(10)
# Load in example data
data(t.cell_data)
head(t.cell_data[,1:5])
# Center data and select small sample for example
t_cell_data_cen <- t(scale(t(t.cell_data[,1:20]), center=TRUE, scale=FALSE))
# Extract gene names
gene_names <- rownames(t_cell_data_cen)
# Determine which genes are annotated to the cell-cycle
cell_cycle_gene_indices <- gene_indexer(gene_names, species = "mouse", name_type = "symbol")
# Create "if_cc" vector
if_cc <- rep(FALSE,nrow(t_cell_data_cen))
if_cc[cell_cycle_gene_indices] <- TRUE
# Move data into list
dat <- list(x=t_cell_data_cen, if_cc=if_cc)
# Run ccRemover
## Not run:
xhat <- ccRemover(dat, cutoff = 3, max_it = 4, nboot = 200, ntop = 10)
## End(Not run)
# Run ccRemover with reduced bootstrap repetitions for example only
xhat <- ccRemover(dat, cutoff = 3, max_it = 4, nboot = 20, ntop = 10)
head(xhat[,1:5])
# Run ccRemover with more components considered
## Not run:
xhat <- ccRemover(dat, cutoff = 3, max_it = 4, nboot = 200, ntop = 15)
## End(Not run)
This data contains expression levels (log-transformed and centered) for 50 cells and 2000 genes. The 50 cells are randomly assigned to two cell types and three cell-cycle stages. 400 genes are assigned as cell-cycle genes, and the other 1600 genes are control genes. For descriptions of how we generated this data, please refer to the paper.
data(dat)
A list that contains the following attributes (only x and if.cc are used by ccRemover.main):
x
the data matrix. rows are genes, and columns are cells. These should be treated as log-transformed and centered (each row has mean 0) expression levels.
if.cc
a logical vector (TRUE or FALSE) denoting whether each gene is cell-cycle related (TRUE) or a control gene (FALSE).
n
the number of cells. n=ncol(x).
p
the number of genes. p=nrow(x).
pc
the number of cell-cycle genes. pc=sum(if.cc).
ct
cell types. a vector of values 1 and 2.
cc
cell-cycle stages. a vector of values 1, 2, or 3.
A simulated dataset used to demonstrate the application of ccRemover
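To make the relationships between these attributes concrete, the sketch below builds a small list with the same field names from random values. The sizes and the random draws are assumptions for illustration; this is not how the packaged data set was generated (the paper describes that).

```r
set.seed(3)

# Toy dimensions (assumption): 20 genes and 12 cells
x <- matrix(rnorm(20 * 12), nrow = 20)
x <- x - rowMeans(x)  # center each gene (row) at zero

# First 4 genes are treated as cell-cycle genes, the remaining 16 as controls
if.cc <- rep(c(TRUE, FALSE), times = c(4, 16))

toy <- list(
  x     = x,
  if.cc = if.cc,
  n     = ncol(x),     # number of cells
  p     = nrow(x),     # number of genes
  pc    = sum(if.cc),  # number of cell-cycle genes
  ct    = sample(1:2, ncol(x), replace = TRUE),  # cell types
  cc    = sample(1:3, ncol(x), replace = TRUE)   # cell-cycle stages
)
str(toy)
```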
Determines which of the genes contained in the dataset are annotated to the cell-cycle. This is a preprocessing function for ccRemover. Genes can be either mouse or human, and names can be official gene symbols or Ensembl, Entrez or Unigene IDs.
gene_indexer(gene_names, species = NULL, name_type = NULL)
gene_names |
A vector containing the gene names for the dataset. |
species |
The species which the gene names are from. Either
|
name_type |
The type of gene name considered either, Ensembl gene IDS
( |
A vector containing the indices of genes which are annotated to the cell-cycle.
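Conceptually, the returned indices behave like the result of matching the data set's gene names against a reference list. The base-R sketch below shows that idea only; it does not call gene_indexer or reproduce its species and name-type handling, and both the gene names and the very short reference list `cc_reference` are hypothetical.

```r
# Hypothetical gene names for a small data set (mouse gene symbols)
gene_names <- c("Actb", "Ccnb1", "Gapdh", "Cdk1", "Top2a", "Alb")

# Hypothetical, very short reference list of cell-cycle genes
cc_reference <- c("Cdk1", "Ccnb1", "Top2a", "Mki67")

# Indices of the data set's genes that appear in the reference list
cell_cycle_gene_indices <- which(gene_names %in% cc_reference)

# Turn the indices into a logical "if_cc" vector, as in the package example
if_cc <- rep(FALSE, length(gene_names))
if_cc[cell_cycle_gene_indices] <- TRUE
print(data.frame(gene = gene_names, if_cc = if_cc))
```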
set.seed(10)
# Load in example data
data(t.cell_data)
head(t.cell_data[,1:5])
# Center example data
t_cell_data_cen <- t(scale(t(t.cell_data), center=TRUE, scale=FALSE))
# Extract gene names
gene_names <- rownames(t_cell_data_cen)
# Determine which genes are annotated to the cell-cycle
cell_cycle_gene_indices <- gene_indexer(gene_names = gene_names, species = "mouse", name_type = "symbol")
# Create "if_cc" vector
if_cc <- rep(FALSE,nrow(t_cell_data_cen))
if_cc[cell_cycle_gene_indices] <- TRUE
# Can allow the function to automatically detect the name type
cell_cycle_gene_indices <- gene_indexer(gene_names = gene_names, species = NULL, name_type = NULL)
This is an internal function for use by "bootstrap_diff" only.
get_diff(xy, xn)
xy |
The data for the genes which are annotated to the cell-cycle, i.e.
those genes for which "if_cc" is TRUE. |
xn |
The data for the genes which are not annotated to the cell-cycle,
i.e. the control genes, for which "if_cc" is FALSE. |
A data frame containing the loadings for each component on the cell-cycle and control genes.
This data set contains Homo sapiens genes which are annotated to the cell-cycle. These genes were retrieved from BioMart and are intended for use with the "gene_indexer" function. The data set contains the gene names in four different formats: Ensembl Gene IDs (1838 values), HGNC symbols (1740 values), Entrez Gene IDs (1744 values) and Unigene IDs (1339 values).
data("HScc_genes")
A data set with the following attributes
human_cell_cycle_genes
A data frame with four columns corresponding to each of the different ID formats.
A data set containing genes annotated to the cell-cycle in different ID formats
This data set contains Mus musculus genes which are annotated to the cell-cycle. These genes were retrieved from BioMart and are intended for use with the "gene_indexer" function. The data set contains the gene names in four different formats: Ensembl Gene IDs (1433 values), MGI symbols (1422 values), Entrez Gene IDs (1435 values) and Unigene IDs (1102 values).
data("MMcc_genes")
A data set with the following attributes
mouse_cell_cycle_genes
A data frame with four columns corresponding to each of the different ID formats.
A dataset containing genes annotated to the cell-cycle in different ID formats
This data contains expression levels (log-transformed normalized count values) for 81 cells and 14,147 genes. The data was normalized using ERCC spike-ins. This data was generated by Mahata, B. et al. (2014). The processed data was retrieved from the supplementary material of Buettner et al. (2015); for descriptions of how the data was processed, please refer to their paper.
data(t.cell_data)
A data set with the following attributes
t.cell_data
the data matrix. rows are genes, and columns are cells, as in the examples, where the row names are used as gene names. These should be treated as log-transformed and normalized expression levels.
A scRNA-Seq dataset with gene expression levels for 81 T-helper cells