Package 'symphony' reference manual

Title:	Efficient and Precise Single-Cell Reference Atlas Mapping
Description:	Implements the Symphony single-cell reference building and query mapping algorithms and additional functions described in Kang et al <https://www.nature.com/articles/s41467-021-25957-x>.
Authors:	Joyce Kang [aut, cre] , Ilya Korsunsky [aut] , Soumya Raychaudhuri [aut]
Maintainer:	Joyce Kang <[email protected]>
License:	GPL (>= 3)
Version:	0.1.1
Built:	2025-03-08 06:36:04 UTC
Source:	CRAN

Function for building a Symphony reference starting from expression matrix

Description

Function for building a Symphony reference starting from expression matrix

Usage

buildReference(
  exp_ref,
  metadata_ref,
  vars = NULL,
  K = 100,
  verbose = FALSE,
  do_umap = TRUE,
  do_normalize = TRUE,
  vargenes_method = "vst",
  vargenes_groups = NULL,
  topn = 2000,
  tau = 0,
  theta = 2,
  save_uwot_path = NULL,
  d = 20,
  additional_genes = NULL,
  umap_min_dist = 0.1,
  seed = 111
)
buildReference(
  exp_ref,
  metadata_ref,
  vars = NULL,
  K = 100,
  verbose = FALSE,
  do_umap = TRUE,
  do_normalize = TRUE,
  vargenes_method = "vst",
  vargenes_groups = NULL,
  topn = 2000,
  tau = 0,
  theta = 2,
  save_uwot_path = NULL,
  d = 20,
  additional_genes = NULL,
  umap_min_dist = 0.1,
  seed = 111
)

Arguments

`exp_ref`	Reference gene expression (genes by cells)
`metadata_ref`	Reference cell metadata (cells by attributes)
`vars`	Reference variables to Harmonize over e.g. c('donor', 'technology')
`K`	Number of soft cluster centroids in model
`verbose`	Verbose output
`do_umap`	Perform UMAP visualization on harmonized reference embedding
`do_normalize`	Perform log(CP10K+1) normalization
`vargenes_method`	Variable gene selection method (either 'vst' or 'mvp')
`vargenes_groups`	Name of metadata column specifying groups for variable gene selection. If not NULL, calculate topn variable genes in each group separately, then pool
`topn`	Number of variable genes to subset by
`tau`	Tau parameter for Harmony step
`theta`	Theta parameter(s) for Harmony step
`save_uwot_path`	Absolute path to save the uwot model (used if do_umap is TRUE)
`d`	Number of PC dimensions
`additional_genes`	Any custom genes (e.g. marker genes) to include in addition to variable genes
`umap_min_dist`	umap parameter (see uwot documentation for details)
`seed`	Random seed

Value

Symphony reference object. Integrated embedding is stored in the $Z_corr slot. Other slots include cell-level metadata ($meta_data), variable genes means and standard deviations ($vargenes), loadings from PCA ($loadings), original PCA embedding ($Z_orig), reference compression terms ($cache), betas from Harmony integration ($betas), cosine normalized soft cluster centroids ($centroids), centroids in PC space ($centroids_pc), and optional umap coordinates ($umap$embedding).

Function for building a Symphony reference from a Harmony object. Useful if you would like your code to be more modular. Note that you must have saved vargenes_means_sds and PCA loadings.

Description

Function for building a Symphony reference from a Harmony object. Useful if you would like your code to be more modular. Note that you must have saved vargenes_means_sds and PCA loadings.

Usage

buildReferenceFromHarmonyObj(
  harmony_obj,
  metadata,
  vargenes_means_sds,
  pca_loadings,
  verbose = TRUE,
  do_umap = TRUE,
  save_uwot_path = NULL,
  umap_min_dist = 0.1,
  seed = 111
)
buildReferenceFromHarmonyObj(
  harmony_obj,
  metadata,
  vargenes_means_sds,
  pca_loadings,
  verbose = TRUE,
  do_umap = TRUE,
  save_uwot_path = NULL,
  umap_min_dist = 0.1,
  seed = 111
)

Arguments

`harmony_obj`	Harmony object (output from HarmonyMatrix())
`metadata`	Reference cell metadata (cells by attributes)
`vargenes_means_sds`	Variable genes in dataframe with columns named ('symbol', 'mean', 'stddev')
`pca_loadings`	Gene loadings from PCA (e.g. irlba(ref_exp_scaled, nv = 20)$u)
`verbose`	Verbose output
`do_umap`	Perform UMAP visualization on harmonized reference embedding
`save_uwot_path`	Absolute path to save the uwot model (if do_umap is TRUE)
`umap_min_dist`	UMAP parameter (see uwot documentation for details)
`seed`	Random seed

Value

Symphony reference object. Integrated embedding is stored in the $Z_corr slot. Other slots include cell-level metadata ($meta_data), variable genes means and standard deviations ($vargenes), loadings from PCA or other dimensional reduction such as CCA ($loadings), original PCA embedding ($Z_orig), reference compression terms ($cache), betas from Harmony integration ($betas), cosine-normalized soft cluster centroids ($centroids), centroids in PC space ($centroids_pc), and optional umap coordinates ($umap$embedding).

Calculates the k-NN correlation, which measures how well the sorted ordering of k nearest reference neighbors in a gold standard embedding correlate with the ordering for the same reference cells in an alternative embedding (i.e. from reference mapping). NOTE: it is very important for the order of reference cells (cols) in gold_ref matches that of alt_ref (same for matching columns of gold_query and alt_query).

Description

Calculates the k-NN correlation, which measures how well the sorted ordering of k nearest reference neighbors in a gold standard embedding correlate with the ordering for the same reference cells in an alternative embedding (i.e. from reference mapping). NOTE: it is very important for the order of reference cells (cols) in gold_ref matches that of alt_ref (same for matching columns of gold_query and alt_query).

Usage

calcknncorr(gold_ref, alt_ref, gold_query, alt_query, k = 500)
calcknncorr(gold_ref, alt_ref, gold_query, alt_query, k = 500)

Arguments

`gold_ref`	Reference cells in gold standard embedding (PCs by cells)
`alt_ref`	Reference cells in alternative embedding (PCs by cells)
`gold_query`	Query cells in gold standard embedding (PCs by cells)
`alt_query`	Query cells in alternative embedding (PCs by cells)
`k`	Number of reference neighbors to use for kNN-correlation calculation

Value

Vector of k-NN correlations for query cells

Calculates the k-NN correlation within the query cells only, which measures how well the sorted ordering of k nearest query neighbors in a query de novo PCA embedding correlate with the ordering for the cells in the reference mapping embedding.

Description

Calculates the k-NN correlation within the query cells only, which measures how well the sorted ordering of k nearest query neighbors in a query de novo PCA embedding correlate with the ordering for the cells in the reference mapping embedding.

Usage

calcknncorrWithinQuery(
  query,
  var = NULL,
  k = 100,
  topn = 2000,
  d = 20,
  distance = "euclidean"
)
calcknncorrWithinQuery(
  query,
  var = NULL,
  k = 100,
  topn = 2000,
  d = 20,
  distance = "euclidean"
)

Arguments

`query`	Query object (returned from mapQuery)
`var`	Query metadata batch variable (PCA is calculated within each batch separately); if NULL, do not split by batch
`k`	Number of neighbors to use for kNN-correlation calculation
`topn`	number of variable genes to calculate within each query batch for query PCA
`d`	number of dimensions for query PCA within each query batch
`distance`	either 'euclidean' or 'cosine'

Value

Vector of within-query k-NN correlations for query cells

Per-cell Confidence Score: Calculates the weighted Mahalanobis distance for the query cells to reference clusters. Returns a vector of distance scores, one per query cell. Higher distance metric indicates less confidence.

Description

Per-cell Confidence Score: Calculates the weighted Mahalanobis distance for the query cells to reference clusters. Returns a vector of distance scores, one per query cell. Higher distance metric indicates less confidence.

Usage

calcPerCellMappingMetric(
  reference,
  query,
  Z_orig = TRUE,
  metric = "mahalanobis"
)
calcPerCellMappingMetric(
  reference,
  query,
  Z_orig = TRUE,
  metric = "mahalanobis"
)

Arguments

`reference`	Reference object as returned by Symphony buildReference()
`query`	Query object as returned by Symphony mapQuery()
`Z_orig`	Define reference distribution using original PCA embedding or harmonized PC embedding
`metric`	Uses Mahalanobis by default, but added as a parameter for potential future use

Value

A vector of per-cell mapping metric scores for each cell.

Per-cluster Confidence Score: Calculates the Mahalanobis distance from user-defined query clusters to their nearest reference centroid after initial projection into reference PCA space. All query cells in a cluster get the same score. Higher distance indicates less confidence. Due to the instability of estimating covariance with small numbers of cells, we do not assign a score to clusters smaller than u * d, where d is the dimensionality of the embedding and u is specified.

Description

Per-cluster Confidence Score: Calculates the Mahalanobis distance from user-defined query clusters to their nearest reference centroid after initial projection into reference PCA space. All query cells in a cluster get the same score. Higher distance indicates less confidence. Due to the instability of estimating covariance with small numbers of cells, we do not assign a score to clusters smaller than u * d, where d is the dimensionality of the embedding and u is specified.

Usage

calcPerClusterMappingMetric(
  reference,
  query,
  query_cluster_labels,
  metric = "mahalanobis",
  u = 2,
  lambda = 0
)
calcPerClusterMappingMetric(
  reference,
  query,
  query_cluster_labels,
  metric = "mahalanobis",
  u = 2,
  lambda = 0
)

Arguments

`reference`	Reference object as returned by Symphony buildReference()
`query`	Query object as returned by Symphony mapQuery()
`query_cluster_labels`	Vector of user-defined labels denoting clusters / putative novel cell type to calculate the score for
`metric`	Uses Mahalanobis by default, but added as a parameter for potential future use
`u`	Do not assign scores to clusters smaller than u * d (see above description)
`lambda`	Optional ridge parameter added to covariance diagonal to help stabilize numeric estimates

Value

A data.frame of per-cluster mapping metric scores for each user-specified query cluster.

Function for evaluating F1 by cell type, adapted from automated cell type identifiaction benchmarking paper (Abdelaal et al. Genome Biology, 2019)

Description

Function for evaluating F1 by cell type, adapted from automated cell type identifiaction benchmarking paper (Abdelaal et al. Genome Biology, 2019)

Usage

evaluate(true, predicted)
evaluate(true, predicted)

Arguments

`true`	vector of true labels
`predicted`	vector of predicted labels

Value

A list of results with confusion matrix ($Conf), median F1-score ($MedF1), F1 scores per class ($F1), and accuracy ($Acc).

Function to find variable genes using mean variance relationship method

Description

Function to find variable genes using mean variance relationship method

Usage

findVariableGenes(
  X,
  groups,
  min_expr = 0.1,
  max_expr = Inf,
  min_dispersion = 0,
  max_dispersion = Inf,
  num.bin = 20,
  binning.method = "equal_width",
  return_top_n = 0
)
findVariableGenes(
  X,
  groups,
  min_expr = 0.1,
  max_expr = Inf,
  min_dispersion = 0,
  max_dispersion = Inf,
  num.bin = 20,
  binning.method = "equal_width",
  return_top_n = 0
)

Arguments

`X`	expression matrix
`groups`	vector of groups
`min_expr`	min expression cutoff
`max_expr`	max expression cutoff
`min_dispersion`	min dispersion cutoff
`max_dispersion`	max dispersion cutoff
`num.bin`	number of bins to use for scaled analysis
`binning.method`	how bins are computed
`return_top_n`	returns top n genes

Value

A data.frame of variable genes

Predict annotations of query cells from the reference using k-NN method

Description

Predict annotations of query cells from the reference using k-NN method

Usage

knnPredict(
  query_obj,
  ref_obj,
  train_labels,
  k = 5,
  save_as = "cell_type_pred_knn",
  confidence = TRUE,
  seed = 0
)
knnPredict(
  query_obj,
  ref_obj,
  train_labels,
  k = 5,
  save_as = "cell_type_pred_knn",
  confidence = TRUE,
  seed = 0
)

Arguments

`query_obj`	Symphony query object
`ref_obj`	Symphony reference object
`train_labels`	vector of labels to train
`k`	number of neighbors
`save_as`	string that result column will be named in query metadata
`confidence`	return k-NN confidence scores (proportion of neighbors voting for the predicted annotation)
`seed`	random seed (k-NN has some stochasticity in the case of ties)

Value

Symphony query object, with predicted reference labels stored in the 'save_as' slot of the query$meta_data

Function for mapping query cells to a Symphony reference

Description

Function for mapping query cells to a Symphony reference

Usage

mapQuery(
  exp_query,
  metadata_query,
  ref_obj,
  vars = NULL,
  verbose = TRUE,
  do_normalize = TRUE,
  do_umap = TRUE,
  sigma = 0.1
)
mapQuery(
  exp_query,
  metadata_query,
  ref_obj,
  vars = NULL,
  verbose = TRUE,
  do_normalize = TRUE,
  do_umap = TRUE,
  sigma = 0.1
)

Arguments

`exp_query`	Query gene expression (genes by cells)
`metadata_query`	Query metadata (cells by attributes)
`ref_obj`	Reference object as returned by Symphony buildReference()
`vars`	Query batch variable(s) to integrate over (column names in metadata)
`verbose`	Verbose output
`do_normalize`	Perform log(CP10K+1) normalization on query expression
`do_umap`	Perform umap projection into reference UMAP (if reference includes a uwot model)
`sigma`	Fuzziness parameter for soft clustering (sigma = 1 is hard clustering)

Value

Symphony query object. Mapping embedding is in the $Z slot. Other slots include query expression matrix ($exp), query cell-level metadata ($meta_data), query cell embedding in pre-Harmonized reference PCs ($Zq_pca), query cell soft cluster assignments ($R), and query cells in reference UMAP coordinates ($umap).

Log(CP10k+1) normalized counts matrix (genes by cells) for 10x PBMCs dataset for vignette.

Description

Log(CP10k+1) normalized counts matrix (genes by cells) for 10x PBMCs dataset for vignette.

Usage

pbmcs_exprs_small
pbmcs_exprs_small

Format

: Sparse matrix (dgCMatrix): dimensions 1,764 genes by 1,200 cells

Metadata for 10x PBMCs dataset for vignette.

Description

Metadata for 10x PBMCs dataset for vignette.

Usage

pbmcs_meta_small
pbmcs_meta_small

Format

: A data frame with 1,200 cells and 7 metadata fields.

cell_id: unique cell ID
donor: dataset (3pv1, 3pv2, or 5p)
nUMI: number of UMIs
nGene: number of genes
percent_mito: percent mito genes
cell_type: cell type assigned in Symphony publication
cell_type_broad: cell subtype assigned in Symphony publication

Function to plot reference, colored by cell type

Description

Function to plot reference, colored by cell type

Usage

plotReference(
  reference,
  as.density = TRUE,
  bins = 10,
  bandwidth = 1.5,
  title = "Reference",
  color.by = "cell_type",
  celltype.colors = NULL,
  show.legend = TRUE,
  show.labels = TRUE,
  show.centroids = FALSE
)
plotReference(
  reference,
  as.density = TRUE,
  bins = 10,
  bandwidth = 1.5,
  title = "Reference",
  color.by = "cell_type",
  celltype.colors = NULL,
  show.legend = TRUE,
  show.labels = TRUE,
  show.centroids = FALSE
)

Arguments

`reference`	Symphony reference object (must have UMAP stored)
`as.density`	if TRUE, plot as density; if FALSE, plot as individual cells
`bins`	for density, nbins parameter for stat_density_2d
`bandwidth`	for density, bandwidth parameter for stat_density_2d
`title`	Plot title
`color.by`	metadata column name for phenotype labels
`celltype.colors`	custom color mapping
`show.legend`	Show cell type legend
`show.labels`	Show cell type labels
`show.centroids`	Plot soft cluster centroid locations

Value

A ggplot object.

Calculate standard deviations by row

Description

Calculate standard deviations by row

Usage

rowSDs(A, row_means = NULL, weights = NULL)
rowSDs(A, row_means = NULL, weights = NULL)

Arguments

`A`	expression matrix (genes by cells)
`row_means`	row means
`weights`	weights for weighted standard dev calculation

Value

A vector of row standard deviations

Runs a standard PCA pipeline on query (1 batch). Assumes query_exp is already normalized.

Description

Runs a standard PCA pipeline on query (1 batch). Assumes query_exp is already normalized.

Usage

runPCAQueryAlone(query_exp, topn = 2000, d = 20, seed = 1)
runPCAQueryAlone(query_exp, topn = 2000, d = 20, seed = 1)

Arguments

`query_exp`	Query expression matrix (genes x cells)
`topn`	Number of variable genes to use
`d`	Number of dimensions
`seed`	random seed

Value

A matrix of PCs by cells

Scale data with given mean and standard deviations

Description

Scale data with given mean and standard deviations

Usage

scaleDataWithStats(A, mean_vec, sd_vec, margin = 1, thresh = 10)
scaleDataWithStats(A, mean_vec, sd_vec, margin = 1, thresh = 10)

Arguments

`A`	expression matrix (genes by cells)
`mean_vec`	vector of mean values
`sd_vec`	vector of standard deviation values
`margin`	1 for row-wise calculation
`thresh`	threshold to clip max values

Value

A matrix of scaled expression values.

symphony

Description

Efficient single-cell reference atlas mapping (Kang et al.)

Function to find variable genes using variance stabilizing transform (vst) method

Description

Function to find variable genes using variance stabilizing transform (vst) method

Usage

vargenes_vst(object, groups, topn, loess.span = 0.3)
vargenes_vst(object, groups, topn, loess.span = 0.3)

Arguments

`object`	expression matrix
`groups`	finds variable genes within each group then pools
`topn`	Return top n genes
`loess.span`	Loess span parameter used when fitting the variance-mean relationship

Value

A data.frame of variable genes, with means and standard deviations.

Package 'symphony'

Help Index

Function for building a Symphony reference starting from expression matrix

Description

Usage

Arguments

Value

Function for building a Symphony reference from a Harmony object. Useful if you would like your code to be more modular. Note that you must have saved vargenes_means_sds and PCA loadings.

Description

Usage

Arguments

Value

Description

Usage

Arguments

Value

Calculates the k-NN correlation within the query cells only, which measures how well the sorted ordering of k nearest query neighbors in a query de novo PCA embedding correlate with the ordering for the cells in the reference mapping embedding.

Description

Usage

Arguments

Value

Per-cell Confidence Score: Calculates the weighted Mahalanobis distance for the query cells to reference clusters. Returns a vector of distance scores, one per query cell. Higher distance metric indicates less confidence.

Description

Usage

Arguments

Value

Description

Usage

Arguments

Value

Function for evaluating F1 by cell type, adapted from automated cell type identifiaction benchmarking paper (Abdelaal et al. Genome Biology, 2019)

Description

Usage

Arguments

Value

Function to find variable genes using mean variance relationship method

Description

Usage

Arguments

Value

Predict annotations of query cells from the reference using k-NN method

Description

Usage

Arguments

Value

Function for mapping query cells to a Symphony reference

Description

Usage

Arguments

Value

Log(CP10k+1) normalized counts matrix (genes by cells) for 10x PBMCs dataset for vignette.

Description

Usage

Format

Metadata for 10x PBMCs dataset for vignette.

Description

Usage

Format

Function to plot reference, colored by cell type

Description

Usage

Arguments

Value

Calculate standard deviations by row

Description

Usage

Arguments

Value

Runs a standard PCA pipeline on query (1 batch). Assumes query_exp is already normalized.

Description

Usage

Arguments

Value

Scale data with given mean and standard deviations

Description

Usage

Arguments

Value

symphony

Description