This vignette provides a manual for the analysis of single-cell RNA-seq data with three different methods:
RaceID (Grun et al. 2015) (the current version is RaceID3 (Herman, Sagar, and Grun 2018)) is a method for cell type identification from single-cell RNA-seq data by unsupervised learning. An initial clustering is followed by outlier identification based on a backgorund model of combined technical and biological variability in single-cell RNA-seq data with unique molecular identifiers.
StemID (Grun et al. 2016) (the current version is StemID2 (Herman, Sagar, and Grun 2018)) permits inference of a lineage tree based on clusters identified by RaceID. The current version of RaceID (RaceID3) and StemID (StemID2) are published together with the FateID algorithm (Herman, Sagar, and Grun 2018).
The current version of the RaceID package implements additional improvements in rare cell type detection, offers batch effect removal utilities, optional imputing of gene expression, and substantially decreases runtime as well as memory usage.
VarID (Grun 2020) (the current version is VarID2 (Rosalez-Alvarez et al. 2023)) is implemented as part of the RaceID package. VarID2 allows the quantification of local gene expression variability and the deconvolution into technical and biological noise components, and provides an alternative clustering approach based on community detection. VarID2 analysis is independent of RaceID but can be fully integrated into RaceID objects.
RaceID is integrated with the FateID method for the prediction of cell fate probabilities (Herman, Sagar, and Grun 2018), which is available as a separate package.
Input to RaceID is a matrix of raw expression values (unique
molecular identifiers with or without Poisson correction (Grun, Kester, and Oudenaarden 2014)) with cells
as column and genes as rows. This matrix can be provided as a matrix
object, a data.frame or a sparse matrix produced by the
Matrix
package.
The RaceID
package comes with sample data
intestinalData
for murine intestinal epethelial cells
stored in sparse matrix format. The dataset was published previously
(Grun et al. 2016).
RaceID and StemID functions have various input and output parameters,
which are explained in detail on the help
pages. Here, we
mostly use default parameters, which represent a good choice for common
datasets.
To start the analysis, a RaceID single-cell sequencing
(SCseq
) object is initialized with a count matrix.
The first step is the application of filtering for the purpose of quality control. Cells with a relatively low total number of transcripts are discarded.
In this example, we filter out cells with <2,000 transcipts. The
function allows further filtering of genes by choosing the input
parameters minexpr
and minnumber
, i.e. only
genes with at least minexpr
transcripts in at least
minnumber
cells are retained. The filtered and normalized
expression matrix (normalized to the minimum total transcript count
across all cells retained after filtering) can be retrieved by the
getfdata
function:
The filterdata
function allows the detection and removal
of batch effects by different methods as outlined below in addtional
examples. Alternatively, individual genes or groups of genes correlating
to specific genes can be removed with the FGenes
and
CGenes
arguments. This frequently allows a minimally
invasive removal of batch effects or of particularly highly expressed
genes with an unwanted dominating effect on the clustering.
Next, the distance matrix is computed as the input for clustering and
outlier identification. This can be done with or without imputing gene
expression from nearest neighbours (see example below for imputing).
RaceID offers various alternative metric, e.g. spearman, kendall, and
euclidean, as well as measure of proportionality (rho and phi from the
propr
package).
This function computes the distance matrix based on highly variable
genes by default. If all genes should be used, then the parameter
FSelect
needs to be set to FALSE
. On this
distance matrix clustering is performed:
To infer the initial cluster number, this function computes the
average within-cluster dispersion up to a number of clusters specified
by the clustnr
arguments, which equals 30 by default. If
more major populations are expected, this parameter needs to be set to
higher values which will increase the run time. The initial cluster
number only serves as a rough estimate, since the subsequent outlier
identification will infer additional clusters. The internal inference of
the cluster number and the evaluation of cluster stability by the
computation of Jaccard’s similarity is done on all cells by default. For
large datasets it is reasonable to subsample a limited number of cells,
by setting the samp
argument, e.g., to 1,000. In this case,
the cluster number is inferred based on this smaller subset and
Jacccard’s similarity is not computed by bootstrapping but by
sub-sampling. For k-medoids clustering, subsetting will imply almost
deterministic clustering partitions, if the sample size approaches the
size of the dataset. Therefore, samp
should be signicantly
smaller then the size of the dataset. Otherwise, bootstrapping is better
for assessing the cluster stability. Subsampling can be expected to give
a good estimate of the number of major clusters. Additional small
clusters which might have been missed by the sampling can be
reintroduces during the outlier identification step.
The inferred cluster number can be inspected in a saturation plot, which shows the decrease of the average within-cluster dispersion with increasing cluster number. If this decrease becomes constant, saturation is reached. The automatically chosen cluster number is detected such that the decrease is equal to the decrease upon further increasing the cluster number within the error bars:
The average within-cluster dispersion can also by plotted:
The cluster stability as assessed by Jaccard’s similarity should also be inspected:
In this example, the automated criterion overestimated the cluster number leading to instability as indicated by low Jaccard’s similarity. Based on visual inspection of the average within-cluster dispersion as a function of the cluster number, we manually set the cluster number to 7 without recomputing the saturation behaviour.
This function perform k-medoids clustering by default. K-means
clustering or hierarchical clustering can be chosen with the
FUNcluster
argument. For very large datasets, hierarchical
clustering leads to significantly smaller run time.
Subsequently, outliers in the initial k-medoids clusters are
identified based on an internally computed background model for the
expected gene expression variability and the assumption that transcript
counts follow a negative binomial distribution defined by the mean and
the variance of the expression of each gene in a cluster. Outlier genes
will be in the tail of this distribution at a p-value defined by the
probthr
parameter (1e-3 by default), and outlier cells
require the presence of a number of outlier genes defined by the
outlg
parameter (2 by default).
In contrast to previous versions, outlier genes are inferred from non-normalized transcript counts, which follow a negative binomial distribution modeling the joint technical and biological variability. The assumption of a negative binomial distribution was demonstrated for raw transcript (UMI) count data, but is not strictly valid for normalized expression values (Grun, Kester, and Oudenaarden 2014). Hence, RaceID does not require normalization, since the correlation-based metric for the computation of the distance object is also independent of the normalization. Normalizaion is only relevant when using, e.g., the euclidean metric for the derivation of the distance matrix. RaceID applies a simple size normalization for data representation and follow-up analyses.
The background noise model can be inspected:
The number of outliers as a function of the p-value can be plotted:
Another way of checking the presence of outliers is the inspection of a barplot of p-values across all cells in each cluster:
A heatmap of cell-to-cell distances grouped be the final clusters inferred based on the original clusters and the outliers allows visual inspection of the clusters:
This function is not recommended for very large datasets, since it produces similarly large plotting output.
The best way of visualising the detetcted cell types is plotting cells and clusters in a two-dimensional reduction representaion. RaceID can compute a t-SNE map
or a k-nearest neighbour graph layout utilizing the Fruchterman-Rheingold algorithm:
In this example, the number of nearest neighbours was chosen to be
10. In general, different values for knn
should be tested
to find an ideal layout.
RaceID further allows computation of a umap dimensional reduction representation:
Parameters for umap can be given as additional argument (see help function).
The t-SNE map can be plotted by
The Fruchterman-Rheingold layout can be plotted by
The umap representation can be plotted by
Maps can be changed for both t-SNE and Fruchterman-Rheingold
algorithm by initializing the rseed
argument of the
comptsne
or compfr
function with a random
number.
The dimensional reduction maps can be inspected, e.g., for the localization of (a subset of) samples included in the analysis:
types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
subset <- types[grep("IV|V",types)]
plotsymbolsmap(sc,types,subset=subset,cex=1)
Expression of a gene of interest or aggregated expression for a group of genes can be highlighted in the dimensional reduction representation:
g <- c("Apoa1", "Apoa1bp", "Apoa2", "Apoa4", "Apoa5")
plotexpmap(sc,g,n="Apoa genes",logsc=TRUE,cex=1)
It is also possible to highight expression only for a subset of cells, e.g. a particular batch or sample:
sample <- colnames(sc@ndata)[grep("^I5d",colnames(sc@ndata))]
plotexpmap(sc,"Lyz1",cells=sample,logsc=TRUE,cex=1)
For the murine intestinal example data, inspetion of known marker genes suggests that cluster 2 and 3 correspnd to Lgr5-expressing intestinal stem cells. Cluster 2 is proliferative as indicated by up-regulation of Mki67, Cluster 1 comprises transiently amplifying progenitors biased towards absorptive entorytes in cluster 4 marked by Apolipoproteins. Cluster 7 comprises Lysozyme-expressing Paneth cells while Mucus-producing goblet cells constitute clusters 6 and 10.
With the help of known marker genes, e.g., Chgb for enteroendocrine cells, Muc2, Agr2, and Clca3 for goblet cells, and Defa20 for Paneth cells, rare cell types among the detected RaceID clusters can be annotated:
genes <- c("Lyz1","Defa20","Agr2","Clca3","Muc2","Chgb","Neurog3","Apoa1","Aldob","Lgr5","Clca4","Mki67","Pcna")
plotmarkergenes(sc,genes=genes)
To inspect clusters in more detail, differentially expressed genes can be inferred by an internal approach akin to DESeq2 (Love, Huber, and Anders 2014) but with a dispersion estimated globally from the background model of RaceID.
For instance, to obtain differentially expressed genes within cluster 9 compared to all other cells:
## mean.ncl mean.cl fc pv padj
## Apoa1 0.6362058 21.398348 33.634315 0.000000e+00 0.000000e+00
## Apoa4 0.5702229 13.588646 23.830412 0.000000e+00 0.000000e+00
## Fabp1 0.7444607 21.348669 28.676691 0.000000e+00 0.000000e+00
## Fabp2 1.6023268 27.105795 16.916521 0.000000e+00 0.000000e+00
## Reg1 0.2360083 6.856378 29.051431 0.000000e+00 0.000000e+00
## Reg3b 0.4229290 7.360573 17.403803 9.665345e-88 3.371594e-85
## Gstm3 0.4901265 7.645545 15.599126 2.408636e-85 7.201821e-83
## Rbp2 0.1837525 4.865970 26.481103 1.340014e-73 3.505811e-71
## Reg3g 0.5903128 6.381428 10.810248 4.488428e-63 1.043809e-60
## Plac8 2.2049822 12.625809 5.726037 1.040516e-56 2.177800e-54
## Spink3 0.1808018 3.908387 21.616969 6.661021e-56 1.267411e-53
## Sis 0.2079850 4.014474 19.301750 3.388549e-55 5.910194e-53
## Anpep 0.1337270 3.602236 26.937242 4.383642e-55 7.057664e-53
## Guca2b 0.4864784 5.159079 10.604950 1.758228e-54 2.628551e-52
## Slc5a1 0.3030379 4.362130 14.394669 9.178274e-54 1.280675e-51
## Ces2e 0.8445971 6.651895 7.875820 9.650062e-53 1.262349e-50
## Apoc2 0.2539176 4.025073 15.851887 4.568302e-51 5.391473e-49
## Aldob 1.4071637 8.711907 6.191111 4.636718e-51 5.391473e-49
## Aldh1a1 0.1080674 2.965912 27.445004 4.730617e-47 5.211148e-45
## Crip1 1.3463788 7.794775 5.789437 4.182806e-45 4.377307e-43
## Mep1b 0.1180375 2.923125 24.764369 7.826703e-45 7.800614e-43
## Mttp 0.3015435 3.758237 12.463333 4.122038e-43 3.921557e-41
## Cyp3a11 0.1544745 3.071999 19.886768 6.232458e-43 5.671537e-41
## Clec2h 0.1374732 2.930410 21.316236 1.195432e-42 1.042516e-40
## Ckmt1 0.7801578 5.336224 6.839929 1.340511e-42 1.122276e-40
The differentially expressed genes (in this example only the up-regulated ones with a fold change >1) can be plottted in a heatmap, which can highlight the clusters and samples of origin:
types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
genes <- head(rownames(dg)[dg$fc>1],10)
plotmarkergenes(sc,genes,samples=types)
The heatmap can also be ordered by cell names (i.e. by batch or
sample) by setting order.cells
to TRUE
. With
the input parameters cl
and cells
, the heatmap
can be restricted to a subset of clusters or cells, respectively.
In this example, no inter-sample differences are apparent and all samples contribute to each cluster.
It is also possible to plot gene expression across clusters or samples using a dot plot, where the diameter represents the fraction of cells in a sample or cluster expressing a gene, and the color indicates the expression level or z-score.
For example, the following commmand plots expression z-scores across clusters 2, 3, 1 and 4:
Expression on a log2 scale across samples I, II, and III can be plotted by
samples <- sub("(\\d.+)$","", colnames(sc@ndata))
fractDotPlot(sc, genes, samples=samples, subset=c("I","II","III"), logscale=TRUE)
A differential gene expression analysis between two defined sets of cell, e.g., two (groups of) clusters can be performed:
A <- names(sc@cpart)[sc@cpart %in% c(2,3)]
B <- names(sc@cpart)[sc@cpart %in% c(4)]
x <- diffexpnb(getfdata(sc,n=c(A,B)), A=A, B=B )
plotdiffgenesnb(x,pthr=.05,lthr=.5,mthr=-1,Aname="Cl.2,3",Bname="Cl.4",show_names=TRUE,padj=TRUE)
See the paragraphs below for additional options of RaceID analyses and parameter choices ideal for analysing large datasets.
StemID is an algorithm for the inference of lineage trees and
differentiation trajectories based on pseudo-temporal ordering of
single-cell transcriptomes. It utilizies the clusters predicted by
RaceID and thus requires to run RaceID first. The algorithm was
originally published along with RaceID2 (Grun et
al. 2016) and the improved current version StemID2 was published
later (Herman, Sagar, and Grun 2018). In a
nutshell, StemID infers links between clusters which are more populated
with intermediate single-cell transcriptomes than expected by chance. To
assign cells to inter-cluster links, two fundamentally different
strategies are available (see nmode
argument below). The
first strategy considers the projection of a vector connecting a cluster
medoid to a member cell of the same cluster onto the links from the
medoid of its cluster to the medoids of all other clusters. The longest
projection identifies the link this cell is assigned to and defines the
projection coordinate. The second (nearest neighbour) mode identifies
for a cell in a given cluster the number of k nearest neighbours in each
other cluster and assigns the cell to the link with the cluster where
the average distance to these k nearest neighbours is minimized. The
coordinate on the link is inferred as in the first mode. A faster
approximate version of the first mode is also implemented.
As a first step, a lineage tree object for the StemID analysis needs to be initialized with an SCseq object obtained from a RaceID analysis:
Next, the transcriptome entropy of cell needs to be calculated. This is used by the StemID algorithm for the prediction of the stem cell type, based on maximum transcriptome entropy and maximum number of links to other clusters.
In the subsequent step, cells are projected onto inter-cluster links.
Cells are assigned to a link based on minimum distance to k nearest
neighbours (nmode=TRUE
) or based on the maximum projection
coordinate (nmode=FALSE
). Only clusters with
>cthr
cells are included in the analysis. If
fr=TRUE
then the Fruchterman-Rheingold layout will be used
for representation of the inferred lineage tree, and if
um=TRUE
a UMAP representation will be used. Otherwise,
representation will be done in t-SNE space. The computation of the
lineage graph is independent of the dimensional reduction method which
is only used for visualization.
If projections are used for link determenation
(nmode=FALSE
), the derivation of link significance is done
by comparing to the link population after randomizing cell positions
within the boundaries imposed by the gene expression manifold. This is
done by bootstrapping using 500 randomizations by default. More
randomizations are possible, but obviously linearly increase
runtime.
Based on this information, a lineage graph is computed to approximate the lineage tree (a tree structure is not strictly imposed).
Finally, link p-values are computed and a threshold pthr
is applied on the p-values:
The resulting graph can be plotted, overlaid with a dimensional
reduction representation (Fruchterman-Rheingold or t-SNE, see
projcells
). To retain only the more populated links, a
cutoff scthr
on the linkscore can be applied, e.g. 0.2:
To predict the stem cell, the StemID score can be computed and visualized:
StemID offers a number of plotting functions to inspect the results. RaceID performs clustering using Pearson’s correlation as a metric by default. The StemID projections require a Euclidean space and thus an initial embedding into a high-dimensional space is performed by classical multidimensional scaling. To inspect how well cell-to-cell distances are preserved, a histogram of the log-ratios between the original and transformed distances can be plotted:
The StemID prediction can be compared to a minimal spanning tree of the cluster medoids:
The cell projections onto the links can be directly compared with this minimal spanning tree:
All linkscores and fold enrichments of the population on a link can be plotted as heatmaps:
The (z-scores of the) projections of all cells from a given cluster across all links can be plotted as heatmap, e.g. for cluster 3:
All cells on two different branches originating from the same cluster, e.g. cluster 3 cells on the links to cluster 1 and 8, can be extracted for the computation of differentially expressed genes:
## Prss32 Nip7 Dhrs4 Srsf3 Tm4sf5 Tsc22d1
## 3.224859 2.814688 2.368598 2.335748 2.200576 2.187001
The cells on the two branches can be plotted as additional clusters in the dimensional reduction representation:
Since the randomizations of cell positions for the derivation of link significance require long computation time, and the projection-based method leads to some weak links which are potentially false positives (and can be filtered out based on linkscore), the nearest-neighbour-based method has now been selected to be the default mode of StemID. This method is more robust and fast even on large datasets. The downside is that it will miss some weak links, i.e. lead to more false negatives in comparison to the projection mode.
First, a lineage tree object needs to be initialized followed by the calculation of the transcriptome entropy of each cell.
Next, cell projection are calculated with the parameter
nmode=TRUE
, which is also the default value:
The knn
parameter determines how many nearest neighbours
are considered in each cluster for determining the link assignment: the
distance two each cluster is calculated as the average across the
distance of a cell to the knn
nearest neighbours within
each other cluster, and the cell is assigned to the link with the
cluster minimizing this distance. Now, the lineage tree is inferred and
the p-values for the links are calculated based on a binomial model:
The resulting lineage graph can be inspected and reveals the expected trajectories connecting the stem cells (cluster 2 and 3 of cycling and quiescent cells, respectively) to enterocytes (cluster 4) via transiently amplifying progenitors (cluster 1), to Paneth cells (cluster 7), and to goblet cells (cluster 6). The StemID score suggests stem cell identity for clusters 2 and 3:
RaceID offers the possibility of batch correction utilizing an
internal method or the published mnnCorrect
function from
the batchelor
package (Haghverdi et
al. 2018). The batchelor
package needs to be
separately installed from bioconductor if this mode is desired. In order
to apply batch effect removal, a list with a vector of cell ids for each
batch needs to be defined, e.g.:
n <- colnames(intestinalData)
b <- list(n[grep("^I5",n)],n[grep("^II5",n)],n[grep("^III5",n)],n[grep("^IV5",n)],n[grep("^V5",n)])
This list is provided as input to the filterdata
function, and the bmode
argument is initialized with the
desired method, i.e. mnnCorrect
or RaceID
. The
latter method simply compares the local neigbourhood, i.e. the set of k
nearest neighbours, for each cell between two batches and identifies the
neighbourhood of the two batches with the smallest average distance. A
differential gene expression analysis between the closest neighbourhoods
of two batches yields batch associated genes. The next batch is then
compared in the same way to the merged dataset of the previous batches.
Batches are compared and successively merged according to the order they
are provided in b
. An additional input parameter
knn
controls the number of nearest neighbours, i.e. the
size of the neighbourhood.
The filterdata
function will identify all
batch-associated genes, which are stored in the
filterpar$BGenes
slot of the SCseq
object. All
genes that correlate to a batch gene are removed for the computation of
a distance object. This is a minimally invasive strategy in comparison
to mnnCorrect
, which works well if batches are very
similar, such as datasets produced from the same material using the same
single-cell RNA-seq technology.
RaceID also offers optional imputing of gene expression, which can be
useful if gene expression differences between cell types or states are
governed only by lowly expressed genes and are difficult to resolve by
clustering based on raw counts. If imputing is desired, the
knn
argument needs to be initialized with the number of
nearest neighbours used for imputing gene expression :
Now, for each cell the knn
nearest neighbours are used
to infer a local negative binomial for each gene, utilizing a weighted
mean expression and the internal RaceID noise model to obtain the
corresponding negative binomial. The weights are derived by quadratic
programming, computing the expression vector of a cell as a linear
combination of its knn
nearest neighbours. The cell itself
contributes with the same weight as the aggregated weights of the
nearest neighbours to the weighted mean expression. With the help of
this negative binomial the tail probability of each gene is derived
across all knn
nearest neighbours. The geometric means of
these tail probabilities are finally applied as a weights for each
nearest neighbours in the calculation of the imputed gene expression.
This strategy ensures that gene expression can only be learned from
nearest neighbours following the same transcript count
distributions.
After this, all steps remain the same. Imputing often helps to improve cluster discrimination and stability. Importantly, distances derived from imputed gene expression are only used for clustering. The outlier identification relies on unimputed gene expression, and hence can correct spurious clusters produced from imputed values.
If batch effect removal has been applied, the remaining batch effect can be checked by plotting symbols representig the sample of origin:
Ideally, all samples should intermingle in each clusters. Imputed
gene expression can be plotted by setting the imputed
argument to TRUE
. Otherwise, unimputed values are
shown.
plotexpmap(sc,"Mki67",imputed=TRUE,fr=TRUE)
plotmarkergenes(sc,c("Clca4","Mki67","Defa24","Defa20","Agr2","Apoa1"),imputed=TRUE,samples=types)
An expression matrix with imputed expression values can be extracted for further analysis:
RaceID can also regress out variability associated with particular
sources such as batches or cell cycle. If batch effect removal has been
done by the filterdata
function with
bmode="RaceID"
then this function can further regress out
residual variability remaining after batch associated genes have been
discarded for the distance computation. In the case, the argument
Batch
has to be set to TRUE
and
vars
can be left empty if no further sources of variability
should be regressed out. Batch effects can also be regressed out
directly without prior removal using the filterdata
function:
sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
vars <- data.frame(row.names=colnames(intestinalData),batch=sub("(\\_\\d+)$","",colnames(intestinalData)))
sc <- varRegression(sc,vars)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)
However, regression also leads to loss of biological variation and this step is only recommended if variability associated with a particular variable is a strong confounding factor and cannot be removed by other means.
After running the filterdata
function, a prior
dimensional reduction using PCA or ICA can be performed using the
CCcorrect
function. This function can also be provided with
a list vset
of sets of genes, and principal components with
loadings enriched in any of these sets will be discarded. Another
options is to provide a list CGenes
of genes, and sets to
be tested for enrichment in each component are derived as the groups of
all genes correlating to a component in FGenes. The
CCcorrect
function predicts a dimension for the prior
dimensional reduction based on an ellbow function of the explained
variability as a function of the number of components. This can be
inspected by the plotdimsat
function and manually
adjusted:
sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- CCcorrect(sc,dimR=TRUE)
plotdimsat(sc)
plotdimsat(sc,change=FALSE)
sc <- filterdata(sc,mintotal=2000)
sc <- CCcorrect(sc,nComp=9)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)
Rerunning CCcorrect
requires to run
filterdata
first, because otherwise the dimensional
reduction scores in the dimRed
slot will be subject to a
second dimensional reduction, which is not desired. All sub-sequent
steps remain unaltered.
RaceID provides the option to run a random forests based
reclassifiction, in order to obtain a stable clustering partition. This
can be done on the final clustering after running
findoutliers
:
sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- rfcorrect(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)
However, this is normally not required, since the improved outlier
detection of the current version leads to stable clusters which do not
change substantially after applying this function. Running
rfcorrect
takes very long for large datasets and should be
omitted in this case.
For large dataset (>25k cells), we recommend using the VarID2 approach (see below) for density based clustering which does not require computation and storage of a huge distance matrix. For datasets below this size we recommend running RaceID/StemID with specific parameter settings. Since the correlation metric does not require dataset normalization, this approach has certained advantages compared to VarID2. However, the latter can also be run with a correlation matrix. In the following, we discuss a few paramters critical for the runtime of RaceID/StemID on large datasets.
Preferentially, clustering should be done by
FUNcluster="kmedoids"
but "hclust"
often gives
similar results and is significantly faster.
For the determination of the cluster number and the inference of
Jaccard’s similarity, sub-sampling should be applied by setting the
subset size samp
to an integer number. A good choice could
be 10-25% of the total number of cells. For large datasets this should
be sufficient to discriminate the most abundant cluster, and additional
smaller clusters will automatically be identified in the next step using
the findoutliers
function. It is important that the
clustnr
argument is much larger then the expected number of
clusters. If stability analysis is not wanted, one can set
bootnr=1
. If clustering is re-run with a specific cluster
number, e.g. cln=12
, then the saturation criterion should
be disabled by setting sat=FALSE
.
If the cluster granularity is too fine or too coarse, the
probthr
argment can be decreased or increased,
respectively, e.g.:
It is adviced to change the perplexity
argument to
larger values, when computing the t-SNE map for large datasets,
e.g. perplexity=200
. However, a large perplexity will
return an error for small datasets.
The Fruchterman-Rheingold layout critically depends on the number
knn
of nearest neighbours, and different values should be
tested:
For a sub-sequent StemID analysis of very large datasets, the nearest-neighbour mode (see above) will help to reduce the runtime.
To inspect pseudotemporal expression profiles, functions provided by
the FateID
package can be utilized. First, the trajectory
needs to be defined based on a sequence of clusters. This sequence
should ideally correspond to a trajectory predicted by StemID2, i.e. the
clusters should be connected by a series of significant links. However,
non-linked clusters can also be included. A pseudo-temporally ordered
vector of cells along a StemID trajectory can be extracted with the
cellsfromtree
function from an Ltree
object:
The list n
contains a vector n$f
of cell
ids ordered by their projection coordinate along the trajectory
reflecting differentiation pseudotime. This vector and a filtered or
unfiltered expression matrix can be used as input for pseudo-temporal
gene expression analysis. The filtered expression data used for RaceID3
can be extracted with the getfdata
function:
Additional filtering and subsetting of the gene expression matrix for
cells on the trajectory, n$f
, is done in the next step,
utilizing functions from the FateID
package:
The filterset
function can be used to eliminate lowly
expressed genes on the trajectory from the subsequent analysis and has
two additional arguments to discard genes, which are not expressed at a
level of minexpr
or higher in at least
minnumber
of cells. The function returns a filtered
expression data frame with genes as rows and cells as columns in the
same order as in the input vector n
. In the next step, a
self-organizing map (SOM) of the pseudo-temporal expression profiles is
computed:
This map provides a grouping of similar expression profiles into
modules. The first input argument is again an expression data frame. In
this case, we use the filtered expression table generated by the
filterset function to retain only genes that are expressed on the
trajectory under consideration. Pseudo-temporal expression profiles
along the differentiation trajectory of interest are computed after
smoothing by local regression with smoothing parameter
alpha
.
This function returns a list of the following three components, i. e.
a som object returned by the function som
of the package
som
, a data frame x
with smoothed and
normalized expression profiles, and a data frame zs
of
z-score transformed pseudo-temporal expression profiles.
The SOM is then processed by another function to group the nodes of the SOM into larger modules and to produce additional z-score transformed and binned expression data frames for display:
The first input argument is given by the SOM computed by the function
getsom
. The function has two additional input parameters to
control the grouping of the SOM nodes into larger modules. The parameter
corthr
defines a correlation threshold. If the correlation
of the z-scores of the average normalized pseudo-temporal expression
profiles of neighboring nodes in the SOM exceeds this threshold, genes
of the neighboring nodes are merged into a larger module. Only modules
with at least minsom
genes are kept. The function returns a
list of various data frames with normalized, z-score-transformed, or
binned expression along with the assignment of genes to modules of the
SOM (see man pages for details).
The output of the processed SOM can be plotted using the plotheatmap
function. First, in order to highlight the clustering partition
y
the same color scheme as in the SCseq
object
can be used:
Now, the different output data frames of the procsom
function can be plotted.
Plot average z-score for all modules derived from the SOM:
Plot z-score profile of each gene ordered by SOM modules:
Plot normalized expression profile of each gene ordered by SOM modules:
Plot binarized expression profile of each gene (z-score < -1, -1 < z-score < 1, z-score > 1):
In order to inspect genes within individual modules of the SOM, these
genes can be extracted given the number of the module. The module
numbers are contained in the return value nodes
of the
procsom
function and can be extracted, e. g. for module
number 24:
The average pseudo-temporal expression profile of this group can be
plotted by the function plotexpression
:
In the same way it is possible to plot expression profiles of individual genes, e.g.:
It is also possible to highlight the data points as specific symbols, for example reflecting batches, by using the types argument:
VarID2 is a novel algorithm utilizing a k nearest neighbour graph
derived from transcriptome similarities (Rosalez-Alvarez et al. 2023). The key idea is
the inference of locally homogenous neighbourhoods of cells in cell
state space, i.e., all k nearest neighbours follow the same transcript
count distributions as the cell to which they are nearest neighbours
(“central” cell). The computation starts with k nearest neighbours and
infers a negative binomial distribution for the transcript counts of
each gene, derived from the mean-variance dependence computed locally
across a central cell and its k nearest neighbours. The mean expression
is derived from a weighted average across the k nearest neighbours, with
weights inferred using quadratic programming, representing the central
cell as a linear combination of its neighbours. The contribution of the
central cell itself can be controlled by a tunable scaling parameter
alpha
. Typical values of alpha
should be in
the range of 0 to 10 (default 1). If alpha
is
NULL
, then it is inferred by a local optimization, i.e., it
is minimized under the constraint that the gene expression in the
central cell does not deviate more then one standard deviation from the
predicted weigthed mean across its k nearest neighbours, where the
standard deviation is calculated from the predicted mean using the local
background model (the average dependence of the variance on the mean
expression). With the negative binomial distribution inferred from the
local means and mean-variance dependence, the probabilities of observed
transcript counts for all genes can be computed in each neighbour, and,
after multiple testing correction, the geometric mean across the
nb
(deafult 3) genes with the lowest probability serves as
a proxy for the link probability (i.e., the probability of neighbours
being in the same state). If this probability falls below a user-defined
threshold, the link to this neighbour is pruned After this procedure is
applied to the nearest neighbours of each cell, a pruned k nearest
neighbour graph with homogenous neighbourhoods in cell state space is
obtained, in which the remaining links indicate that neighbours follow
the same transcript count distribution for all genes. These local
neighbourhoods can be used to compute local properties such as gene
expression variability, which in turn can be correlated to cell state or
predicted fate, derived, e.g., from FateID analysis, respectively.
Although VarID2 analysis can be performed independently of the RaceID data object, an integration with a RaceID analysis permits convenient visualization and processing of the results. For this reason, we here demonstrate an integrated pipeline, but explain at each step, how computation can be performed given minimal input data, i.e., a transcript count matrix.
The VarID2 method requires unique molecular identifier (UMI) counts, which have been shown to follow a negative binomial distribution, in order for the background model to be valid.
Given such a UMI matrix with cells as columns and genes as rows, we could first initialize a RaceID object, and perform cell and gene filtering to retain only expressed genes and high quality cells.
We here demonstrate the functionalities of VarID2 on our single-cell transcriptome data of intestinal epithelial cells (Grun et al. 2016).
sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=1000,FGenes=grep("^Gm\\d",rownames(intestinalData),value=TRUE),CGenes=grep("^(mt|Rp(l|s))",rownames(intestinalData),value=TRUE))
In this example, all cells with a raw UMI count <1,000 are
discarded, and all gene IDs starting with Gm (genes without official
gene symbol, uncharacterized loci) are removed. Moreover, all
mitochondrial and ribosomal sub-unit encoding genes as well as all genes
correlating with these classes are filtered out. Although VarID2
analysis can be performed with any given distance matrix, the preferred
(default) mode does not require a distance matrix, and relies on
k-nearest neighbor analysis in PCA space. For large datasets, the
computation of a distance matrix is prohibitive due to a memory
requirement scaling with N^2 if N is the number of cells. In this case,
the pruneKnn
function can be executed with
large=TRUE
(default setting). Here, the input expression
matrix is first optionally corrected for variability associated with
total transcript count per cell by a negative binomial regression (if
regNB=TRUE
as by default; if regNB=FALSE
,
transcript counts in each cell are normalized to one, multipled by the
minimal total transcript count across all cells, followed by adding a
pseudocount of 0.1 and taking the logarithm). The mean dependence of the
regression parameters is smoothed, and Pearson residual are computed.
The matrix of Pearson residuals (default setting), or, alternatively,
the raw count matrix then undergoes a dimensional reduction by a
principle component analysis (PCA), including the top
pcaComp
principle components (PCs). The number of top PCs
used for dimensional reduction is determined by an elbow criterion
(requiring that the change in explained variability upon further
increasing the number of PCs has become linear). However, the number of
PCs can also be adjusted manually (parameter pcaComp
). This
reduced PC matrix is subsequently used for a fast k nearest neighbour
search by the get.knn
function from the FNN
package based on euclidean distances. The algorithm
parameter of the pruneKnn
function enables the selection of
the search algorithm to apply. In general, it is recommended to run
VarID2 with large=TRUE
and regNB=TRUE
(default
settings). A RaceID object can be utilized to facilitate the analysis,
but in this mode the compdist
function does not need to be
executed:
The pruneKnn
function allows normalization by a full
negative binomial regression (akin to (Hafemeister and Satija 2019)), or,
alternatively, by analyical Pearson residuals as proposed by Lause et
al. (Lause, Berens, and Kobak 2020)
(default). In the latter case, the dispersion parameter can either be
inferred by maximum likelihood, or approximated as θ=10 (default). The function has a
number of optional parameters controlling the outcome of the this step.
For instance, genes
allows limiting the analysis to a
subset of genes, knn
defines the number of nearest
neighbours, alpha
is the scale parameter for the central
cell, and FSelect
allows internal feature selection. The
background model is calculated internally based on the local
mean-variance relation in the same way as done in RaceID and feature
selection follows the same logic. The function further permits
multithreading and the number of cores can be defines using the
no_cores
argument. If this is set to NULL
, the
function detects the number of available cores and uses this number
minus two to leave resources for other tasks running on the same
machine. Calling help
for this function explains all
parameters in detail (as for all other functions).
The coefficients of the negative binomial regression (or the
coefficients of the analytical model, if offsetModel=TRUE
)
can be inspected. Intercept:
Coefficient β of total log UMI count:
Dispersion parameter θ:
In this example, the analytical model was used (no regression was calculated) and θ=10 was used (Lause, Berens, and Kobak 2020).
To check if the mean-dependence of the variance has been successfully removed by the normalization, the Pearson residuals can be plotted as a function of the mean:
The number of PCs selected based on the elbow criterion can be inspected by plotting the percentage variability explained by each principle component:
Alternatively, the difference between the percentage variability explained by PC i and PC i + 1 can be plotted on a logarithmic scale:
The pruned k nearest neighbour information can be used as input for graph inference, Louvain clustering, and the derivation of a Fruchterman-Rheingold graph layout:
Alternatively, community detection can be performed with the Leiden algorithm at variying resolution. However, the leiden package requires preinstallation of python libraries (https://cran.r-project.org/package=leiden). This can be done, e.g., by using python reticulate.
install.packages("reticulate")
reticulate::use_python("/usr/bin/python3", required=TRUE)
#confirm that leiden, igraph, and python are available (should return TRUE).
reticulate::py_module_available("leidenalg") && reticulate::py_module_available("igraph")
reticulate::py_available()
With this graphCluster
can be run using leiden
clustering at defined resolution:
The function returns a list of three components: a vector
partition
with a clustering partition, a data.frame
fr
containing a Fruchterman-Rheingold layout computed from
the pruned k nearest neighbour graph, and an integer number
residual.cluster
. If clusters with less than
min.size
(input parameter to graphCluster
)
were obtained, all these clusters are aggregated into the cluster with
the largest cluster number in the partition. If this aggregation was
done, the cluster nuber is stored in residual.cluster
(which is NULL
otherwise). By setting the
use.weights
argument of graphCluster
to
TRUE
(default), clustering can be performed on the pruned
knn graph with weighted links, where weights correspond to the link
probabilities. This may lead to the emergence of too many clusters
(which could happen, e.g., due to variability in the link probabilities
when integrating different experimental batches). In this case, the
use.weights
parameter can be set to FALSE
, and
clustering is performed on a pruned knn graph with equal weights for all
links. The pvalue
parameter determines the probability
cutoff for link pruning, i.e., links with probabilities lower than
pvalue
are removed.
To visualize the clustering with the help of RaceID, the
SCseq
object can be updated with the results:
This function call will initialize the cpart
and the
cluster$kpart
slots with the clustering partition in
cl$partition
, and the fr
slot with the
Fruchterman-Rheingold layout.
The clustering can readily be inspected in the Fruchterman-Rheingold layout:
After computing a t-SNE map, the clustering can be also be visualized in t-SNE space:
Alternatively, a umap representation can be computed for visualization:
The pruned k nearest neighbour graph allows the derivation of
transition probabilities between clusters, based on the remaining links
connecting two clusters (at a minimum probability given by
pvalue
) and the inferred probabilities of these links:
If RaceID is used for visualization and has been updated with the
clustering using the updateSC
function (see above), the
transition probabilities can be visualized in the dimensional reduction
representation (fr=TRUE
for Fruchterman-Rheingold or
um=TRUE
for umap representation, default is t-SNE):
The argument prthr
allows discarding all links with a
transition probability <= prthr
and the th
cthr
argument sets a cluster size threshold. Only clusters
with > cthr
cells are shown.
The Individual neighbourhood of a cell, e.g, cell i=20
,
can be inspected with the inspectKnn
function, which
returns a list object with information on gene expression and outlier
p-values across all original k nearest neighbours.
If the function is called with plotSymbol=TRUE
, a
dimensional reduction representation is returned (um=TRUE
for umap), highlighting nearest neighbours and outliers which are
subject to pruning:
One element of the returned list object is an outlier p-value across all k nearest neighbours:
## I5d_25 I5d_31 V5d_29 I5d_7 II5d_16 V5d_55 V5d_11 I5d_13 II5d_3
## Dmbt1 1 1 1 1 1 1 1 4.203964e-10 1
## Lgals2 1 1 1 1 1 1 1 1.001882e-01 1
## Cox5b 1 1 1 1 1 1 1 1.000000e+00 1
## Elf3 1 1 1 1 1 1 1 1.000000e+00 1
## Serpinb6a 1 1 1 1 1 1 1 1.000000e+00 1
## Ndufs8 1 1 1 1 1 1 1 1.000000e+00 1
## III5d_2 II5d_43 I5d_11 IV5d_16 V5d_44 V5d_48 V5d_26 I5d_71 I5d_9
## Dmbt1 1 1 1 1 1 1 1 1 1
## Lgals2 1 1 1 1 1 1 1 1 1
## Cox5b 1 1 1 1 1 1 1 1 1
## Elf3 1 1 1 1 1 1 1 1 1
## Serpinb6a 1 1 1 1 1 1 1 1 1
## Ndufs8 1 1 1 1 1 1 1 1 1
## I5d_61 III5d_67 II5d_44 III5d_91 II5d_72 V5d_16 V5d_96
## Dmbt1 0.391412 1 0.0310523 0.0008214995 1 1 1.577005e-11
## Lgals2 1.000000 1 1.0000000 0.1276183067 1 1 6.164142e-02
## Cox5b 1.000000 1 1.0000000 1.0000000000 1 1 1.000000e+00
## Elf3 1.000000 1 1.0000000 1.0000000000 1 1 1.000000e+00
## Serpinb6a 1.000000 1 1.0000000 1.0000000000 1 1 1.000000e+00
## Ndufs8 1.000000 1 1.0000000 1.0000000000 1 1 1.000000e+00
## III5d_87
## Dmbt1 0.0002871395
## Lgals2 1.0000000000
## Cox5b 1.0000000000
## Elf3 1.0000000000
## Serpinb6a 1.0000000000
## Ndufs8 1.0000000000
Another element contains the corresponding normalized gene expression counts:
## I5d_25 I5d_31 V5d_29 I5d_7 II5d_16 V5d_55 V5d_11
## Dmbt1 0.000000 0.000000 1.001958 0.000000 1.001958 16.521861 2.007853
## Lgals2 4.031579 6.071431 8.127667 6.071431 5.049473 20.824484 6.071431
## Cox5b 3.017717 5.049473 8.127667 4.031579 3.017717 14.397368 3.017717
## Elf3 0.000000 3.017717 4.031579 0.000000 1.001958 0.000000 0.000000
## Serpinb6a 0.000000 2.007853 0.000000 0.000000 2.007853 2.007853 0.000000
## Ndufs8 0.000000 1.001958 1.001958 0.000000 0.000000 0.000000 0.000000
## I5d_13 II5d_3 III5d_2 II5d_43 I5d_11 IV5d_16 V5d_44
## Dmbt1 58.126707 15.457411 0.000000 6.071431 7.097484 12.290360 2.007853
## Lgals2 49.489744 16.521861 0.000000 20.824484 13.341696 10.200553 12.290360
## Cox5b 11.243324 9.162012 0.000000 11.243324 8.127667 6.071431 3.017717
## Elf3 2.007853 4.031579 2.007853 3.017717 5.049473 3.017717 1.001958
## Serpinb6a 0.000000 3.017717 0.000000 0.000000 2.007853 0.000000 1.001958
## Ndufs8 2.007853 0.000000 1.001958 4.031579 2.007853 0.000000 2.007853
## V5d_48 V5d_26 I5d_71 I5d_9 I5d_61 III5d_67 II5d_44
## Dmbt1 7.097484 2.007853 0.000000 2.007853 18.664133 0.000000 24.099582
## Lgals2 8.127667 6.071431 5.049473 6.071431 25.200659 1.001958 24.099582
## Cox5b 5.049473 0.000000 7.097484 6.071431 9.162012 2.007853 6.071431
## Elf3 2.007853 1.001958 1.001958 0.000000 8.127667 3.017717 3.017717
## Serpinb6a 0.000000 0.000000 1.001958 1.001958 1.001958 0.000000 2.007853
## Ndufs8 1.001958 0.000000 0.000000 0.000000 2.007853 0.000000 3.017717
## III5d_91 II5d_72 V5d_16 V5d_96 III5d_87 mu sd
## Dmbt1 30.778221 6.071431 0.000000 64.479391 33.043723 2.5055710 2.1985583
## Lgals2 48.279339 14.397368 1.001958 50.705898 18.664133 7.9128263 5.4987806
## Cox5b 13.341696 8.127667 2.007853 27.417123 7.097484 4.9477707 3.7253325
## Elf3 4.031579 2.007853 0.000000 10.200553 1.001958 1.0414053 1.1874937
## Serpinb6a 2.007853 1.001958 0.000000 6.071431 1.001958 0.4475373 0.7031934
## Ndufs8 3.017717 4.031579 1.001958 7.097484 2.007853 0.7428016 0.9551797
## pv
## Dmbt1 1.577005e-11
## Lgals2 6.164142e-02
## Cox5b 1.000000e+00
## Elf3 1.000000e+00
## Serpinb6a 1.000000e+00
## Ndufs8 1.000000e+00
If the same function is executed with plotSymbol=FALSE
,
the local mean-variance dependence is plotted along with a
loess-regression, a second order polynomial fit (used for pruning), and
the background model of the local variability (used for biological noise
inference, see below):
Alternatively, the coefficient of variation (CV) can be plotted with the same model fits:
The plot also highlights the local variability associated with cell-to-cell variability of total transcript counts, as calculated directly from the mean and variance of total transcript counts (turquoise) or from a local fit of a gamma distribution (orange).
The inference of homogenous local neighbourhoods, given by a cell and
its remaining nearest neighbours after pruning, permits the computation
of local quantities across such sets of similar cells. To investigate
the dynamics of gene expression noise across cell states, we derive
local estimates of gene expression variability. The systematic
dependence of gene expression variance on the mean expression is a
central challenge for the derivation of differential variability between
cell states. VarID2 solves this problem by deconvoluting binomial
sampling noise, cell-to-cell variability in total transcript counts (as
a consequence of cell-to-cell variability in sequencing efficiency, but
also due to global biological variability in total RNA content, e.g.,
during different phases of the cell cycle), and residual variability,
which we address as biological noise. This deconvolution is achieved by
fitting a Gamma distribution to the local disrtibution of total
transcript counts, yielding a technical noise paramater rt
for each central cell, followed by a maximum a posterior inference of
the reidual variability, i.e., biological noise, epsilon
for each gene in every neighbourhood. The local mean expression
mu
within each neighbourhood is computed as the average of
non-normalized transcript count across pruned local neighbourhoods.
The noise inference can be done on a transcript count matrix, filtered by expression, e.g., retaining only cells with a minimum of 5 UMI counts in at least 20 cells:
x <- getFilteredCounts(sc,minexpr=5,minnumber=20)
noise <- compTBNoise(res,x,pvalue=0.01,gamma = 0.5,no_cores=1)
The most important argument of the compTBNoise
is the
gamma
parameter which determines the strength of the prior
for the suppression of spurious inflation of noise levels at low
expression. The p-value
argument again determines the link
probability threshold for pruning. For neighbourhoods with zero
transcript count noise estimates are replaced by NA
in
noise$epsilon
.
The prior parameter gamma
should be chosen such that the
correlation between mean noise across cells and total UMI count per cell
is minimal. This dependence can be analysed with the function
plotUMINoise
:
If a positive correlation is observed, gamma
should be
increased in order to weaken the prior. If the correlation is negative,
gamma
should be decreased in order to increase the strength
of the prior.
The RaceID SCseq
object can be updated with the
biological noise estimates:
This initializes a slot noise
of the SCseq
object with a gene-by-cell matrix of the biological noise estimates
noise$epsilon
. The noise levels of a given gene (or group
of genes), can then be visualized in the same way as gene expression
with the help of RaceID functions.
Gene expression (on logarithmic scale):
Biological noise (on logarithmic scale):
This plot indicates that cluster 2 comprises Clca4+ intestinal stem cells.
A scatter plot of noise versus gene expression allows inspection of the dependence between noise and gene expression:
The biological noise can also be plotted in a cluster heatmap and
directly compared to gene expression. After deriving a gene expression
heatmap, the order of genes stored in ph
(returned by the
pheatmap
function of the pheatmap
package),
can be applied to the noise heatmap in order to permit direct
comparability:
genes <- c("Lyz1","Agr2","Clca3","Apoa1","Aldob","Clca4","Mki67","Pcna")
ph <- plotmarkergenes(sc,genes=genes,noise=FALSE)
Based on marker gene expression, clusters 1, 5, and 6, can be annotated as enterocytes, goblet and Paneth cells. The gene expression can also be visualized in a dot plot highlighting the fraction of cells in a cluster expressing a gene (size of dots) together with the average expresson level (dot color):
With the clustering derived by graphCluster
and the
noise
object returned by the compTBNoise
function, genes with differential noise levels in a cluster (or a set of
clusters) set
versus a background set bgr
(by
default all clusters not in set
) can be inferred:
## mu.set mu.bgr mu.all eps.set eps.bgr eps.all log2FC
## Alpi 2.622500 0.4345075 0.7970891 0.7606782 0.09054539 0.20159597 2.175339
## Rbp2 3.303502 0.3869806 0.8702899 0.4695719 0.03668366 0.10841943 2.059037
## Clca4 2.204035 8.7989254 7.7060579 1.1162821 0.20998329 0.36016995 1.972216
## Rpl3 13.953534 22.1451900 20.7877156 0.3698101 0.02435900 0.08160519 1.917567
## Clec2h 1.968947 0.2804979 0.5602980 0.5452161 0.07208925 0.15049313 1.906625
## Ces2a 2.065667 0.4755221 0.7390318 0.3476483 0.03348812 0.08554895 1.745654
## pvalue
## Alpi 5.366725e-19
## Rbp2 6.225010e-14
## Clca4 1.022393e-06
## Rpl3 4.377632e-12
## Clec2h 1.776301e-14
## Ces2a 4.138734e-16
The output lists average (non-normalized) expression mu
and noise levels eps
for clusters in set
and
bgr
as well as the log2 fold change in noise levels,
log2FC
, between set
and bgr
and a
multiple testing-corrected p-value (Bonferroni correction, multiplied by
the maximal number knn
of nearest neighbours).
This functions applies a Wilcoxon test to compare variability in a
cluster (or set of clusters) versus the remaining clusters or a given
background set of clusters. To avoid spurious signals with small effect
size, a uniform random variable (by default, from [0:ps]
with ps=0.1
) is added to the noise estimates.
The top varaible genes can be plotted in a heatmap:
genes <- head(rownames(ngenes),50)
ph <- plotmarkergenes(sc,genes=genes,noise=TRUE,cluster_rows=FALSE,zsc=TRUE)
One can also cluster the cells and/or genes in the heatmap based on the noise quantification:
The gene expression itself can be represented with the same order of
cells and genes by utilizing the ph
output:
plotmarkergenes(sc,genes=ph$tree_row$labels[ ph$tree_row$order ],noise=FALSE,cells=ph$tree_col$labels[ ph$tree_col$order ], order.cells=TRUE,cluster_rows=FALSE)
It is also possible to order genes by their noise levels across a
(set of) cluster or the entire dataset, if the cl
argument
is omitted:
## Reg3b Hspa1b Tmem38b Hspa1a H2afx Reg3g
## 1.4435890 1.0905636 1.0844315 0.8998721 0.8227457 0.7793737
Differential noise between clusters in set
and
bgr
can also be visualized as MA plot:
For comparison, differentially expressed genes between the same groups can be inferred and visualized in a similar manner:
The distribution of gene expression (for individual genes or groups of genes) across a set of clusters can be visualized as violin plots:
Similarly, the distribution of gene expression noise can be plotted:
To further investigate transcriptome variability and related
quantities, the function quantKnn
allows computation of
average noise levels across cells, cell-to-cell transcriptome
correlation, and total UMI counts.
These quantities can be inspected in the dimensional reduction representation, or as boxplots across cluster, with the level of a selected reference cluster highlighted as horizontal line. For instance, the Lgr5+ stem cell cluster 2 can be used as a reference:
Average biological noise across all genes:
Avergae Spearman’s transcriptome correlation across all cells within pruned k nearest neighbourhoods:
Total UMI count averaged across cells within pruned k nearest neighbourhoods:
Dependencies between different quantities can be inspected in a scatter plot, highlighting a cluster of interest:
For instance, the dependency between biological noise and UMI counts:
A residual dependency of the noise estimate on the total UMI count
may indicate that the prior is either not strong enough to suppress
noise inflation at low expression (negative slope) or too strong
(positive slope). In this case, the parameter gamma
of the
function compTBNoise
should be increased or decreased,
respectively.
Dependency between transcriptome correlation and average biological noise:
To study dynamics of gene expression and biological noise during differentiation, VarID2 facilitates pseudo-time analysis. Selection of a linear trajectry connecting a defined set of clusters could be based on the transition probabilities obtained from VarID2.
For examples, clusters 1, 5, and 3, represent the differentiation trajectory from Lgr5+ stem cell towards Apoa1+ enterocytes.
To order cells pseudo-temporally, VarID2 offers a wrapper function
pseudoTime
, which calls the Slingshot method (Street et al. 2018) on a desired dimensional
reduction, which can be a subset from the precomputed dimensional
reductions of the RaceID object (sc@fr
,
sc@tsne
, or sc@umap
, given by the
map
parameter), or a dimensional reduction (t-SNE or UMAP)
computed for the subset of cells on the trajectory. This set is assumed
to be ordered along the trajectory, with the initial cluster as first
entry. This function is not designed to recover branched trajetories;
single lineages could be defined based on the output of the
plotTrPtobs
function as an ordered set of clusters.
pseudoTime
will always fit a single trajectory to the
clusters specified in set
according to the indicated order.
The slingsot package needs to be installed from Bioconductor. If the
package is not installed, the function falls back to principal curve
inference using the princurve
package.
# ordered set of clusters on the trajectory
set <- c(2,3,1)
# if slingshot is available, run with useSlingshot=TRUE (default)
pt <- pseudoTime(sc,m="umap",set=set,useSlingshot=FALSE)
Inferred pseudo-time can be highlighted in the inferred dimensional reduction representation:
The trajectory can also be overlaid with the cluster annotation:
With the derived pseudo-time a filtered count matrix with cells ordered by pseudo-time can be obtained:
Using methods from the FateID
package, pseudo-time
expression profiles can be derived by self-organizing maps (SOM) and
grouped into modules:
library(FateID)
s1d <- getsom(fs,nb=50,alpha=1)
ps <- procsom(s1d,corthr=.85,minsom=0)
part <- pt$part
ord <- pt$ord
plotheatmap(ps$all.z, xpart=part[ord], xcol=sc@fcol, ypart=ps$nodes, xgrid=FALSE, ygrid=TRUE, xlab=TRUE)
Individual pseudo-time expression profiles can be plotted for genes of interest, e.g.,
It is also possible to plot profiles for a group of genes into the same window:
genes <- c("Mki67","Pcna","Apoa1")
plotexpressionProfile(fs,y=part,g=genes,n=ord,alpha=1,col=rainbow(length(genes)),lwd=2)
From the SOM all genes within a given module, e.g., module 1, can be extracted:
Similarly, gene modules with correlated noise profiles can be derived:
fsn <- extractCounts(sc,minexpr=5,minnumber=20,pt=pt,noise=TRUE)
s1dn <- getsom(fsn,nb=25,alpha=1)
psn <- procsom(s1dn,corthr=.85,minsom=0)
plotheatmap(psn$all.z, xpart=part[ord], xcol=sc@fcol, ypart=ps$nodes, xgrid=FALSE, ygrid=TRUE, xlab=TRUE)
Noise profiles of individual genes or groups of genes:
genes <- c("Mki67","Pcna","Apoa1")
plotexpressionProfile(fsn,y=part,g=genes,n=ord,alpha=1,col=rainbow(length(genes)),lwd=2,ylab="Noise")
genes <- getNode(psn,1)
plotexpressionProfile(fsn,y=part,g=head(genes,10),n=ord,alpha=1,lwd=2,ylab="Noise")
For further details on plotting functions for pseudo-time analysis
see vignette("FateID")
. Gene expression and noise modules
can now be co-analyzed to characterize the inter-dependence of gene
expression and noise dynamics during differentiation.
VarID2 offers the possibility to remove batch effects and other
confounding sources of variability. To achieve this, batch variables and
other latent variables can be included as dependent variables in the
negative binomial regression, utilized in the pruneKnn
function if the parameter regNB
is set to
TRUE
. Alternatively, batch effect correction can ben done
by the harmony
package, if the bmethod
parameter of pruneKnn
is set to "harmony"
.
These batch correction options only exist if large
is set
to TRUE
. The batch
parameter is given by a
categorical vector of batch names.
sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=1000,FGenes=grep("^Gm\\d",rownames(intestinalData),value=TRUE),CGenes=grep("^(mt|Rp(l|s))",rownames(intestinalData),value=TRUE))
expData <- getExpData(sc)
batch <- sub("5d.+","",colnames(expData))
names(batch) <- colnames(expData)
head(batch)
require(Matrix)
S_score <- colMeans(sc@ndata[intersect(cc_genes$s,rownames(sc@ndata)),])
G2M_score <- colMeans(sc@ndata[intersect(cc_genes$g2m,rownames(sc@ndata)),])
regVar <- data.frame(S_score=S_score, G2M_score=G2M_score)
rownames(regVar) <- colnames(expData)
res <- pruneKnn(expData,no_cores=1,batch=batch,regVar=regVar)
cl <- graphCluster(res,pvalue=0.01)
probs <- transitionProbs(res,cl)
x <- getFilteredCounts(sc,minexpr=5,minnumber=5)
noise <- compTBNoise(res,x,pvalue=0.01,no_cores=1)
sc <- updateSC(sc,res=res,cl=cl,noise=noise)
sc <- compumap(sc,min_dist=0.5)
sc <- comptsne(sc,perplexity=50)
plotmap(sc,cex=1)
plotmap(sc,fr=TRUE,cex=1)
plotmap(sc,um=TRUE,cex=1)
plotsymbolsmap(sc,batch,um=TRUE,cex=1)
plotexpmap(sc,"Mki67",um=TRUE,cex=1,log=TRUE)
plotexpmap(sc,"Pcna",um=TRUE,cex=1,log=TRUE)
To regress out additional latent variables associated with distinct
phases of the cell cycle, the parameter regVar
was
initialized with a data.frame of cell cycle S phase and G2/M phase
scores (aggregated normalized gene expression of cell cycle phase
markers). Column names of regVar
indicate the latent
variable names, and row names correspond to cell IDs, i.e. column names
of expData
. Interaction terms between the batch variable
and the UMI count or the other latent variables in regVar
are included.
See previous paragraphs for more advice on how to explore the output data.
Seurat
object can be converted into an
SCseq
object for VarID2 analysis, using the function
Seurat2SCseq
. This function expects a class
Seurat
object from the package as input and converts this
into a SCseq
object. The function transfers the counts,
initializes ndata
and fdata
without further
filtering, transfers the PCA cell embeddings from the
Seurat
object to dimRed
, transfers the
clustering partition, and umap
and tsne
dimensional reduction (if available). CAUTION: Cluster numbers in
RaceID
start at 1 by default. Hence, all
Seurat
cluster numbers are shifted by 1.
library(Seurat)
library(RaceID)
Se <- CreateSeuratObject(counts = intestinalData, project = "intestine", min.cells = 3, min.features = 200)
Se <- NormalizeData(Se, normalization.method = "LogNormalize", scale.factor = 10000)
Se <- FindVariableFeatures(Se, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(Se)
Se <- ScaleData(Se, features = all.genes)
Se <- RunPCA(Se, features = VariableFeatures(object = Se))
Se <- RunUMAP(Se, dims = 1:30, verbose = FALSE)
Se <- RunTSNE(Se, dims = 1:30, verbose = FALSE)
Se <- FindNeighbors(Se, dims = 1:10)
Se <- FindClusters(Se, resolution = 0.5)
DimPlot(Se, label = TRUE) + NoLegend()
res <- pruneKnn(Se)
## without pruning (fast)
## res <- pruneKnn(Se,do.prune=FALSE,no_cores=1)
sc <- Seurat2SCseq(Se)
plotmap(sc,um=TRUE)
noise <- compTBNoise(res,getExpData(sc),no_cores=1)
sc <- updateSC(sc,noise=noise)
plotexpmap(sc,"Clca4",um=TRUE,cex=1,log=TRUE)
plotexpmap(sc,"Clca4",um=TRUE,noise=TRUE,cex=1)
See previous paragraphs for more advice on how to explore the output data.
For bug reports and any questions related to RaceID and StemID please email directly to link.
Processed count data from the Cell Ranger pipeline (or equivalent
formats) can be read from the working directory (or any other path) and
transformed into a sparse count matrix, which can be loaded into an
SCseq
object. This requires the Matrix
package, which is installed as a dependency together with
RaceID
. The following chunk of code can be used:
require(Matrix)
require(RaceID)
x <- readMM("matrix.mtx")
f <- read.csv("features.tsv",sep="\t",header=FALSE)
b <- read.csv("barcodes.tsv",sep="\t",header=FALSE)
rownames(x) <- f[,1]
colnames(x) <- b[,1]
sc <- SCseq(x)
Datasets with >25k cells should be analyzed with VarID2 instead of
RaceID, which is more suitable for large datasets and does not require a
distance matrix. The memory requirement of a distance matrix scales with
N^2 as a function of the number of cells N. VarID2 should be run with
with parameters large=TRUE
and regNB=TRUE
of
the pruneKnn
function in order to achieve optimal
performance (default parameters). To increase processing speed VarID2
can be run with multiple cores. However, each core requires additional
memory, and with too many cores VarID2 will run out of memory. To reduce
computation time, the number of genes used for the negative binomial
regression of total UMI count per cell can be limited to, e.g., 2000 by
setting the ngenes
parameter.
Example for mouse data, including filtering of mitochondrial genes, ribosomal genes, and Gm* genes (and genes correlating to these groups):
require(Matrix)
require(RaceID)
x <- readMM("matrix.mtx")
f <- read.csv("features.tsv",sep="\t",header=FALSE)
b <- read.csv("barcodes.tsv",sep="\t",header=FALSE)
rownames(x) <- f[,1]
colnames(x) <- b[,1]
sc <- SCseq(x)
sc <- filterdata(sc,mintotal=1000,CGenes=rownames(x)[grep("^(mt|Rp(l|s)|Gm\\d)",rownames(x))])
expData <- getExpData(sc)
res <- pruneKnn(expData,no_cores=5)
cl <- graphCluster(res,pvalue=0.01)
probs <- transitionProbs(res,cl)
## compute noise from corrected variance
noise <- compTBNoise(res,expData,pvalue=0.01,no_cores=5)
sc <- updateSC(sc,res=res,cl=cl,noise=noise)
sc <- comptsne(sc)
sc <- compumap(sc)
This could be a consequence of technical noise, e.g., if experimental
batches with strong differences in total UMI counts were co-analyzed. In
such cases the background models of RaceID or VarID2 could indicate the
presence of many spurious outliers. In RaceID, the number of outliers
can be reduced by setting the the probthr
parameter of the
findoutliers
function to values closer to zero. This is the
probability threshold for outlier detection. In VarID2 the
pvalue
parameter of the graphCluster
function
can be set to lower values. This is the probability threshold for
pruning links of the knn network. In the extreme case that no outliers
(or pruning) are desired, probthr
(or pvalue
)
should be set to zero. Moreover, the use.weights
parameter
of the graphCluster
function can be set to
FALSE
; in this case, Louvain clustering is performed on a
(pruned) knn-network with equal weights for all links. By default, links
are weighted by the link probability, and if strong technical
differences lead to low link probabilities, this could be the reason for
the emergence of too many clusters.