Title: | Simultaneous Detection of Clusters and Cluster-Specific Genes in High-Throughput Transcriptome Data |
---|---|
Description: | Implements a new method 'ClussCluster' described in Ge Jiang and Jun Li, "Simultaneous Detection of Clusters and Cluster-Specific Genes in High-throughput Transcriptome Data" (Unpublished). Simultaneously performs clustering analysis and signature gene selection on high-dimensional transcriptome data sets. To do so, 'ClussCluster' adds a Lasso-type regularization penalty term to the objective function of K-means so that cell-type-specific signature genes can be identified while clustering the cells. |
Authors: | Li Jun [cre], Jiang Ge [aut], Wang Chuanqi [ctb] |
Maintainer: | Li Jun <[email protected]> |
License: | GPL-3 |
Version: | 0.1.0 |
Built: | 2024-11-08 06:27:44 UTC |
Source: | CRAN |
ClussCluster
Takes single-cell transcriptome data and returns an object containing the cell types and the type-specific signature gene sets.
ClussCluster_Gap selects the tuning parameter using a permutation approach. The tuning parameter controls the L1 bound on w, the feature weights.
ClussCluster(x, nclust = NULL, centers = NULL, ws = NULL, nepoch.max = 10, theta = NULL, seed = 1, nstart = 20, iter.max = 50, verbose = FALSE)
ClussCluster_Gap(x, nclust = NULL, B = 20, centers = NULL, ws = NULL, nepoch.max = 10, theta = NULL, seed = 1, nstart = 20, iter.max = 50, verbose = FALSE)
x |
An nxp data matrix. There are n cells and p genes. |
nclust |
Number of clusters desired if the cluster centers are not provided. If both are provided, nclust must equal the number of cluster centers. |
centers |
A set of initial (distinct) cluster centres if the number of clusters ( |
ws |
One or multiple candidate tuning parameters to be evaluated and compared. Determines the sparsity of the selected genes. Should be greater than 1. |
nepoch.max |
The maximum number of epochs. In one epoch, each cell will be evaluated to determine if its label needs to be updated. |
theta |
Optional argument. If provided, |
seed |
This seed is used wherever K-means is used. |
nstart |
Argument passed to |
iter.max |
Argument passed to |
verbose |
Print the updates inside every epoch? If TRUE, the updates of cluster label and the value of objective function will be printed out. |
B |
Number of permutation samples. |
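The ws argument bounds the L1 norm of the feature weights, which is what makes the selected gene set sparse. The following is a minimal sketch of one common way such a bound produces exact zeros: soft-thresholding combined with a unit L2 norm, as used in sparse K-means. It is an assumption that ClussCluster updates its weights in a similar way; the helper names soft and l1_bounded_weights and the input vector a are hypothetical.

```r
# Soft-thresholding operator: shrinks each entry toward zero by delta.
soft <- function(a, delta) sign(a) * pmax(abs(a) - delta, 0)

# Hypothetical helper: find w proportional to soft(a, delta) with
# ||w||_2 = 1 and ||w||_1 <= s, using bisection on delta.
# Assumes s >= 1 and that the largest |a| is unique.
l1_bounded_weights <- function(a, s, tol = 1e-8) {
  unit <- function(v) v / sqrt(sum(v^2))
  w <- unit(a)
  if (sum(abs(w)) <= s) return(w)   # bound already satisfied, no shrinkage
  lo <- 0
  hi <- max(abs(a))
  while (hi - lo > tol) {
    mid <- (lo + hi) / 2
    w <- unit(soft(a, mid))
    if (sum(abs(w)) > s) lo <- mid else hi <- mid
  }
  unit(soft(a, hi))                 # hi is the smallest feasible delta found
}

a <- c(5, 3, 1, 0.5, 0.2)           # made-up per-gene scores
w <- l1_bounded_weights(a, s = 1.2)
print(round(w, 3))
```

A smaller bound s forces more entries of w to zero, so fewer genes are selected; a larger bound keeps more genes.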
The input is the normalized, log-transformed number of reads mapped to genes, e.g., log(RPKM+1) or log(TPM+1), where RPKM stands for Reads Per Kilobase of transcript per Million mapped reads and TPM stands for Transcripts Per Million. The data must NOT be centered.
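As an illustration of the expected input, the sketch below builds a log(TPM+1) matrix from raw counts in base R. The count matrix and gene lengths are simulated placeholders, and the cells-by-genes layout follows the description of x above; this is not part of the package.

```r
set.seed(1)
n <- 5   # cells
p <- 8   # genes

# Hypothetical raw read counts (cells in rows, genes in columns)
counts <- matrix(rpois(n * p, lambda = 20), n, p)
# Hypothetical gene lengths in kilobases
gene_len_kb <- runif(p, 0.5, 5)

# Reads per kilobase: divide each gene (column) by its length
rate <- sweep(counts, 2, gene_len_kb, "/")
# TPM: rescale each cell (row) so that it sums to one million
tpm <- rate / rowSums(rate) * 1e6
# Log-transform; the result is deliberately NOT centered
x <- log(tpm + 1)
```

Because the data are not centered, zero counts remain exactly zero after the transformation.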
ClussCluster returns a list containing the optimal tuning parameter, s; the group labels of the clustering, theta; and the type-specific weights of genes, w.
ClussCluster_Gap returns a list containing a vector of candidate tuning parameters, ws; the corresponding values of the objective function, O; a matrix of values of the objective function for each permuted data set and tuning parameter, O_b; the gap statistics and their one standard deviations, Gap and sd.Gap; the result given by ClussCluster, run; and the tuning parameters with the largest Gap statistic and within one standard deviation of the largest Gap statistic, bestw and onesd.bestw.
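The relationship between O, O_b, Gap, sd.Gap, bestw and onesd.bestw can be sketched as follows. The gap form used here (log objective on the observed data minus the mean log objective on the permuted data) is borrowed from sparse K-means and is an assumption, not a confirmed description of ClussCluster_Gap. All numbers are made up, and O_b is assumed to have one row per permutation.

```r
set.seed(1)
ws <- c(2, 4, 6, 8)            # candidate tuning parameters (made up)
O  <- c(10, 40, 55, 60)        # objective on the observed data (made up)
B  <- 20                       # number of permutation samples

# Objective on permuted data: B rows, one column per tuning parameter (made up)
O_b <- matrix(runif(B * length(ws), min = 5, max = 15), nrow = B)

# Assumed gap form: observed log objective minus mean permuted log objective
Gap    <- log(O) - colMeans(log(O_b))
sd.Gap <- apply(log(O_b), 2, sd)   # the package may scale this differently

# Tuning parameter with the largest gap statistic
i.best <- which.max(Gap)
bestw  <- ws[i.best]

# Smallest tuning parameter whose gap is within one sd of the largest gap
onesd.bestw <- ws[min(which(Gap >= Gap[i.best] - sd.Gap[i.best]))]
```

Choosing onesd.bestw instead of bestw favors a sparser gene set when the gap statistics are statistically close.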
data(Hou_sim)
hou.dat <- Hou_sim$x
run.ft <- filter_gene(hou.dat)
hou.test <- ClussCluster(run.ft$dat.ft, nclust = 3, ws = 4, verbose = FALSE)
Filters out genes that are not suitable for differential expression analysis.
filter_gene(dfname, minmean = 2, n0prop = 0.2, minsd = 1)
dfname |
name of the expression data frame |
minmean |
minimum mean expression for each gene |
n0prop |
minimum proportion of zero expression (count) for each gene |
minsd |
minimum standard deviation of expression for each gene |
Takes an expression data frame that has been properly normalized but NOT centered. It returns a list whose slot dat.ft is the data set that satisfies the pre-set thresholds on minimum mean (minmean), standard deviation (minsd), and proportion of zeros (n0prop) for each gene.
If the data has already been centered, one can still apply the filters on mean and sd, but not n0prop.
a list containing the data set with genes satisfying the thresholds, dat.ft; the name of dat.ft; and the indices of the kept genes, index.
dat <- matrix(rnbinom(300 * 60, mu = 2, size = 1), 300, 60)
dat_filtered <- filter_gene(dat, minmean = 2, n0prop = 0.2, minsd = 1)
This data contains expression levels (normalized and log-transformed) for 33 cells and 100 genes.
data(Hou_sim)
An object containing the following variables:
x
An expression data frame of 33 HCC cells on 100 genes.
y
Numerical group indicator of all cells.
gnames
Gene names of all genes.
snames
Cell names of all cells.
groups
Cell group names.
note
A simple note of the data set.
This data set contains raw expression levels (log-transformed but not centered) for 33 HCC cells and 100 genes. The 33 cells belong to three different subpopulations and exhibit different biological characteristics. For a description of how this data set was generated, please refer to the paper.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65364
Hou, Yu, et al. "Single-cell triple omics sequencing reveals genetic, epigenetic, and transcriptomic heterogeneity in hepatocellular carcinomas." Cell research 26.3 (2016): 304-319.
data(Hou_sim)
data <- Hou_sim$x
ClussCluster
Plots the number of signature genes against the tuning parameters if multiple tuning parameters are evaluated in the object. If only one is included, then plot_ClussCluster returns a Venn diagram and a heatmap at this particular tuning parameter.
plot_ClussCluster(object, m = 10, snames = NULL, gnames = NULL, ...)
top.m.hm(object, m, snames = NULL, gnames = NULL, ...)
object |
An object that is obtained by applying the ClussCluster function to the data set. |
m |
The number of top signature genes selected to produce the heatmap. |
snames |
The names of the cells. |
gnames |
The names of the genes. |
... |
Additional parameters, sent to the method. |
The input is the normalized, log-transformed number of reads mapped to genes, e.g., log(RPKM+1) or log(TPM+1), where RPKM stands for Reads Per Kilobase of transcript per Million mapped reads and TPM stands for Transcripts Per Million. The data must NOT be centered.
If multiple tuning parameters are evaluated in the object, the number of signature genes is computed for each cluster and is plotted against the tuning parameters. Each color and line type corresponds to a cell type.
If only one tuning parameter is evaluated, two plots are produced: a Venn diagram of the cell-type-specific genes, and a heatmap of the data restricted to the cells and the top m signature genes. See the paper for more details.
a ggplot2 object of the heatmap with top signature genes selected by ClussCluster
data(Hou_sim)
run.cc <- ClussCluster(Hou_sim$x, nclust = 3, ws = c(2.4, 5, 8.8))
plot_ClussCluster(run.cc, m = 5, snames = Hou_sim$snames, gnames = Hou_sim$gnames)
ClussCluster_Gap
Plots the gap statistics and number of genes selected as the tuning parameter varies.
plot_ClussCluster_Gap(object)
object |
object obtained from |
ClussCluster
Prints out the results of ClussCluster
print_ClussCluster(object)
object |
An object that is obtained by applying the ClussCluster function to the data set. |
ClussCluster_Gap
Prints out the results of ClussCluster_Gap: the gap statistics and the number of genes selected for each candidate tuning parameter.
print_ClussCluster_Gap(object)
object |
An object that is obtained by applying the ClussCluster_Gap function to the data set. |
An example data set containing expression levels for 60 cells and 200 genes. The 60 cells belong to 4 cell types with 15 cells each. Each cell type is uniquely associated with 30 signature genes: the first cell type is associated with the first 30 genes, the second cell type with the next 30 genes, and so on. The remaining 80 genes show indistinct expression patterns among the four cell types and are considered noise genes.
data(sim_dat)
A data frame with 60 cells on 200 genes.
A simulated dataset used to demonstrate the application of ClussCluster.
data(sim_dat)
head(sim_dat)
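The block structure described for sim_dat (4 cell types of 15 cells, 30 signature genes per type, 80 noise genes) can be mimicked with a toy matrix in base R. This is only a sketch of the layout, assuming cells in rows and genes in columns; the normal noise model and the shift of 2 are arbitrary choices and are not how sim_dat was generated.

```r
set.seed(1)
n_types        <- 4
cells_per_type <- 15
sig_per_type   <- 30
p              <- 200
n              <- n_types * cells_per_type        # 60 cells

# Cell type label for each cell: 1,1,...,2,2,...,4
labels <- rep(seq_len(n_types), each = cells_per_type)

# Baseline: every gene is pure noise in every cell
toy <- matrix(rnorm(n * p), n, p)

# Raise the k-th block of 30 genes in cells of type k (arbitrary shift)
for (k in seq_len(n_types)) {
  genes <- ((k - 1) * sig_per_type + 1):(k * sig_per_type)
  toy[labels == k, genes] <- toy[labels == k, genes] + 2
}

# Genes not tied to any cell type
n_noise <- p - n_types * sig_per_type             # 80
```

A matrix like toy has the same shape as sim_dat, so it can be handy for checking plotting or filtering code before running it on real data.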