Title: | Efficient and Precise Single-Cell Reference Atlas Mapping |
---|---|
Description: | Implements the Symphony single-cell reference building and query mapping algorithms and additional functions described in Kang et al <https://www.nature.com/articles/s41467-021-25957-x>. |
Authors: | Joyce Kang [aut, cre] , Ilya Korsunsky [aut] , Soumya Raychaudhuri [aut] |
Maintainer: | Joyce Kang <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.1 |
Built: | 2024-12-08 07:19:50 UTC |
Source: | CRAN |
Function for building a Symphony reference starting from expression matrix
buildReference( exp_ref, metadata_ref, vars = NULL, K = 100, verbose = FALSE, do_umap = TRUE, do_normalize = TRUE, vargenes_method = "vst", vargenes_groups = NULL, topn = 2000, tau = 0, theta = 2, save_uwot_path = NULL, d = 20, additional_genes = NULL, umap_min_dist = 0.1, seed = 111 )
buildReference( exp_ref, metadata_ref, vars = NULL, K = 100, verbose = FALSE, do_umap = TRUE, do_normalize = TRUE, vargenes_method = "vst", vargenes_groups = NULL, topn = 2000, tau = 0, theta = 2, save_uwot_path = NULL, d = 20, additional_genes = NULL, umap_min_dist = 0.1, seed = 111 )
exp_ref |
Reference gene expression (genes by cells) |
metadata_ref |
Reference cell metadata (cells by attributes) |
vars |
Reference variables to Harmonize over e.g. c('donor', 'technology') |
K |
Number of soft cluster centroids in model |
verbose |
Verbose output |
do_umap |
Perform UMAP visualization on harmonized reference embedding |
do_normalize |
Perform log(CP10K+1) normalization |
vargenes_method |
Variable gene selection method (either 'vst' or 'mvp') |
vargenes_groups |
Name of metadata column specifying groups for variable gene selection. If not NULL, calculate topn variable genes in each group separately, then pool |
topn |
Number of variable genes to subset by |
tau |
Tau parameter for Harmony step |
theta |
Theta parameter(s) for Harmony step |
save_uwot_path |
Absolute path to save the uwot model (used if do_umap is TRUE) |
d |
Number of PC dimensions |
additional_genes |
Any custom genes (e.g. marker genes) to include in addition to variable genes |
umap_min_dist |
umap parameter (see uwot documentation for details) |
seed |
Random seed |
Symphony reference object. Integrated embedding is stored in the $Z_corr slot. Other slots include cell-level metadata ($meta_data), variable genes means and standard deviations ($vargenes), loadings from PCA ($loadings), original PCA embedding ($Z_orig), reference compression terms ($cache), betas from Harmony integration ($betas), cosine normalized soft cluster centroids ($centroids), centroids in PC space ($centroids_pc), and optional umap coordinates ($umap$embedding).
Function for building a Symphony reference from a Harmony object. Useful if you would like your code to be more modular. Note that you must have saved vargenes_means_sds and PCA loadings.
buildReferenceFromHarmonyObj( harmony_obj, metadata, vargenes_means_sds, pca_loadings, verbose = TRUE, do_umap = TRUE, save_uwot_path = NULL, umap_min_dist = 0.1, seed = 111 )
buildReferenceFromHarmonyObj( harmony_obj, metadata, vargenes_means_sds, pca_loadings, verbose = TRUE, do_umap = TRUE, save_uwot_path = NULL, umap_min_dist = 0.1, seed = 111 )
harmony_obj |
Harmony object (output from HarmonyMatrix()) |
metadata |
Reference cell metadata (cells by attributes) |
vargenes_means_sds |
Variable genes in dataframe with columns named ('symbol', 'mean', 'stddev') |
pca_loadings |
Gene loadings from PCA (e.g. irlba(ref_exp_scaled, nv = 20)$u) |
verbose |
Verbose output |
do_umap |
Perform UMAP visualization on harmonized reference embedding |
save_uwot_path |
Absolute path to save the uwot model (if do_umap is TRUE) |
umap_min_dist |
UMAP parameter (see uwot documentation for details) |
seed |
Random seed |
Symphony reference object. Integrated embedding is stored in the $Z_corr slot. Other slots include cell-level metadata ($meta_data), variable genes means and standard deviations ($vargenes), loadings from PCA or other dimensional reduction such as CCA ($loadings), original PCA embedding ($Z_orig), reference compression terms ($cache), betas from Harmony integration ($betas), cosine-normalized soft cluster centroids ($centroids), centroids in PC space ($centroids_pc), and optional umap coordinates ($umap$embedding).
Calculates the k-NN correlation, which measures how well the sorted ordering of k nearest reference neighbors in a gold standard embedding correlate with the ordering for the same reference cells in an alternative embedding (i.e. from reference mapping). NOTE: it is very important for the order of reference cells (cols) in gold_ref matches that of alt_ref (same for matching columns of gold_query and alt_query).
calcknncorr(gold_ref, alt_ref, gold_query, alt_query, k = 500)
calcknncorr(gold_ref, alt_ref, gold_query, alt_query, k = 500)
gold_ref |
Reference cells in gold standard embedding (PCs by cells) |
alt_ref |
Reference cells in alternative embedding (PCs by cells) |
gold_query |
Query cells in gold standard embedding (PCs by cells) |
alt_query |
Query cells in alternative embedding (PCs by cells) |
k |
Number of reference neighbors to use for kNN-correlation calculation |
Vector of k-NN correlations for query cells
Calculates the k-NN correlation within the query cells only, which measures how well the sorted ordering of k nearest query neighbors in a query de novo PCA embedding correlate with the ordering for the cells in the reference mapping embedding.
calcknncorrWithinQuery( query, var = NULL, k = 100, topn = 2000, d = 20, distance = "euclidean" )
calcknncorrWithinQuery( query, var = NULL, k = 100, topn = 2000, d = 20, distance = "euclidean" )
query |
Query object (returned from mapQuery) |
var |
Query metadata batch variable (PCA is calculated within each batch separately); if NULL, do not split by batch |
k |
Number of neighbors to use for kNN-correlation calculation |
topn |
number of variable genes to calculate within each query batch for query PCA |
d |
number of dimensions for query PCA within each query batch |
distance |
either 'euclidean' or 'cosine' |
Vector of within-query k-NN correlations for query cells
Per-cell Confidence Score: Calculates the weighted Mahalanobis distance for the query cells to reference clusters. Returns a vector of distance scores, one per query cell. Higher distance metric indicates less confidence.
calcPerCellMappingMetric( reference, query, Z_orig = TRUE, metric = "mahalanobis" )
calcPerCellMappingMetric( reference, query, Z_orig = TRUE, metric = "mahalanobis" )
reference |
Reference object as returned by Symphony buildReference() |
query |
Query object as returned by Symphony mapQuery() |
Z_orig |
Define reference distribution using original PCA embedding or harmonized PC embedding |
metric |
Uses Mahalanobis by default, but added as a parameter for potential future use |
A vector of per-cell mapping metric scores for each cell.
Per-cluster Confidence Score: Calculates the Mahalanobis distance from user-defined query clusters to their nearest reference centroid after initial projection into reference PCA space. All query cells in a cluster get the same score. Higher distance indicates less confidence. Due to the instability of estimating covariance with small numbers of cells, we do not assign a score to clusters smaller than u * d, where d is the dimensionality of the embedding and u is specified.
calcPerClusterMappingMetric( reference, query, query_cluster_labels, metric = "mahalanobis", u = 2, lambda = 0 )
calcPerClusterMappingMetric( reference, query, query_cluster_labels, metric = "mahalanobis", u = 2, lambda = 0 )
reference |
Reference object as returned by Symphony buildReference() |
query |
Query object as returned by Symphony mapQuery() |
query_cluster_labels |
Vector of user-defined labels denoting clusters / putative novel cell type to calculate the score for |
metric |
Uses Mahalanobis by default, but added as a parameter for potential future use |
u |
Do not assign scores to clusters smaller than u * d (see above description) |
lambda |
Optional ridge parameter added to covariance diagonal to help stabilize numeric estimates |
A data.frame of per-cluster mapping metric scores for each user-specified query cluster.
Function for evaluating F1 by cell type, adapted from automated cell type identifiaction benchmarking paper (Abdelaal et al. Genome Biology, 2019)
evaluate(true, predicted)
evaluate(true, predicted)
true |
vector of true labels |
predicted |
vector of predicted labels |
A list of results with confusion matrix ($Conf), median F1-score ($MedF1), F1 scores per class ($F1), and accuracy ($Acc).
Function to find variable genes using mean variance relationship method
findVariableGenes( X, groups, min_expr = 0.1, max_expr = Inf, min_dispersion = 0, max_dispersion = Inf, num.bin = 20, binning.method = "equal_width", return_top_n = 0 )
findVariableGenes( X, groups, min_expr = 0.1, max_expr = Inf, min_dispersion = 0, max_dispersion = Inf, num.bin = 20, binning.method = "equal_width", return_top_n = 0 )
X |
expression matrix |
groups |
vector of groups |
min_expr |
min expression cutoff |
max_expr |
max expression cutoff |
min_dispersion |
min dispersion cutoff |
max_dispersion |
max dispersion cutoff |
num.bin |
number of bins to use for scaled analysis |
binning.method |
how bins are computed |
return_top_n |
returns top n genes |
A data.frame of variable genes
Predict annotations of query cells from the reference using k-NN method
knnPredict( query_obj, ref_obj, train_labels, k = 5, save_as = "cell_type_pred_knn", confidence = TRUE, seed = 0 )
knnPredict( query_obj, ref_obj, train_labels, k = 5, save_as = "cell_type_pred_knn", confidence = TRUE, seed = 0 )
query_obj |
Symphony query object |
ref_obj |
Symphony reference object |
train_labels |
vector of labels to train |
k |
number of neighbors |
save_as |
string that result column will be named in query metadata |
confidence |
return k-NN confidence scores (proportion of neighbors voting for the predicted annotation) |
seed |
random seed (k-NN has some stochasticity in the case of ties) |
Symphony query object, with predicted reference labels stored in the 'save_as' slot of the query$meta_data
Function for mapping query cells to a Symphony reference
mapQuery( exp_query, metadata_query, ref_obj, vars = NULL, verbose = TRUE, do_normalize = TRUE, do_umap = TRUE, sigma = 0.1 )
mapQuery( exp_query, metadata_query, ref_obj, vars = NULL, verbose = TRUE, do_normalize = TRUE, do_umap = TRUE, sigma = 0.1 )
exp_query |
Query gene expression (genes by cells) |
metadata_query |
Query metadata (cells by attributes) |
ref_obj |
Reference object as returned by Symphony buildReference() |
vars |
Query batch variable(s) to integrate over (column names in metadata) |
verbose |
Verbose output |
do_normalize |
Perform log(CP10K+1) normalization on query expression |
do_umap |
Perform umap projection into reference UMAP (if reference includes a uwot model) |
sigma |
Fuzziness parameter for soft clustering (sigma = 1 is hard clustering) |
Symphony query object. Mapping embedding is in the $Z slot. Other slots include query expression matrix ($exp), query cell-level metadata ($meta_data), query cell embedding in pre-Harmonized reference PCs ($Zq_pca), query cell soft cluster assignments ($R), and query cells in reference UMAP coordinates ($umap).
Log(CP10k+1) normalized counts matrix (genes by cells) for 10x PBMCs dataset for vignette.
pbmcs_exprs_small
pbmcs_exprs_small
: Sparse matrix (dgCMatrix): dimensions 1,764 genes by 1,200 cells
Metadata for 10x PBMCs dataset for vignette.
pbmcs_meta_small
pbmcs_meta_small
: A data frame with 1,200 cells and 7 metadata fields.
unique cell ID
dataset (3pv1, 3pv2, or 5p)
number of UMIs
number of genes
percent mito genes
cell type assigned in Symphony publication
cell subtype assigned in Symphony publication
Function to plot reference, colored by cell type
plotReference( reference, as.density = TRUE, bins = 10, bandwidth = 1.5, title = "Reference", color.by = "cell_type", celltype.colors = NULL, show.legend = TRUE, show.labels = TRUE, show.centroids = FALSE )
plotReference( reference, as.density = TRUE, bins = 10, bandwidth = 1.5, title = "Reference", color.by = "cell_type", celltype.colors = NULL, show.legend = TRUE, show.labels = TRUE, show.centroids = FALSE )
reference |
Symphony reference object (must have UMAP stored) |
as.density |
if TRUE, plot as density; if FALSE, plot as individual cells |
bins |
for density, nbins parameter for stat_density_2d |
bandwidth |
for density, bandwidth parameter for stat_density_2d |
title |
Plot title |
color.by |
metadata column name for phenotype labels |
celltype.colors |
custom color mapping |
show.legend |
Show cell type legend |
show.labels |
Show cell type labels |
show.centroids |
Plot soft cluster centroid locations |
A ggplot object.
Calculate standard deviations by row
rowSDs(A, row_means = NULL, weights = NULL)
rowSDs(A, row_means = NULL, weights = NULL)
A |
expression matrix (genes by cells) |
row_means |
row means |
weights |
weights for weighted standard dev calculation |
A vector of row standard deviations
Runs a standard PCA pipeline on query (1 batch). Assumes query_exp is already normalized.
runPCAQueryAlone(query_exp, topn = 2000, d = 20, seed = 1)
runPCAQueryAlone(query_exp, topn = 2000, d = 20, seed = 1)
query_exp |
Query expression matrix (genes x cells) |
topn |
Number of variable genes to use |
d |
Number of dimensions |
seed |
random seed |
A matrix of PCs by cells
Scale data with given mean and standard deviations
scaleDataWithStats(A, mean_vec, sd_vec, margin = 1, thresh = 10)
scaleDataWithStats(A, mean_vec, sd_vec, margin = 1, thresh = 10)
A |
expression matrix (genes by cells) |
mean_vec |
vector of mean values |
sd_vec |
vector of standard deviation values |
margin |
1 for row-wise calculation |
thresh |
threshold to clip max values |
A matrix of scaled expression values.
Function to find variable genes using variance stabilizing transform (vst) method
vargenes_vst(object, groups, topn, loess.span = 0.3)
vargenes_vst(object, groups, topn, loess.span = 0.3)
object |
expression matrix |
groups |
finds variable genes within each group then pools |
topn |
Return top n genes |
loess.span |
Loess span parameter used when fitting the variance-mean relationship |
A data.frame of variable genes, with means and standard deviations.