In this tutorial, we will construct a Symphony reference from two PBMC datasets from 2 technologies (10x 3’v1 and 3’v2), then map a third dataset from a new technology (10x 5’) with Symphony. The analysis follows from Fig. 2 of the paper (but with downsampled datasets to fit within CRAN limits on subdirectory size).
Install Symphony with standard commands.
Once Symphony is installed, load it up!
Get the expression and metadata.
## [1] 1767 1200
## [1] 1200 7
## cell_id donor nUMI nGene percent_mito cell_type
## 17642 fivePrime_GATCGCGAGACCGGAT 5p 6580 2020 0.05714286 B
## 4820 threepfresh_GATCGTACAAAGGTGC 3pv2 4426 1207 0.03366471 T_CD8
## 13347 fivePrime_AACTCAGGTTACGCGC 5p 4750 1670 0.05642105 B
## 10634 threepv1_CGTGATGAGTCCTC 3pv1 4121 1228 0.01504489 T_CD4
## cell_type_broad
## 17642 B
## 4820 T
## 13347 B
## 10634 T
Subset the dataset into reference and query.
There are two options for how to build a Symphony reference. Option 1
(buildReferenceFromHarmonyObj
) is the more modular option,
meaning that the user has more control over the preprocessing steps
prior to reference compression. Option 2 (buildReference
)
builds a reference starting from expression, automating the procedure
more but offering less flexibility.
We’ll demonstrate both options below.
This option consists of more steps than Option 2 but allows your code to be more modular and flexible if you want to do your own preprocessing steps before the Harmony integration step. We recommend this option for most users.
It is important to generate vargenes_means_sds
(containing variable gene means and standard deviations used to scale
the genes) as well as save the loadings for the PCA step.
Starting with the reference expression,
## 5 x 2 sparse Matrix of class "dgCMatrix"
## threepfresh_GATCGTACAAAGGTGC threepv1_CGTGATGAGTCCTC
## LYZ 1.181536 1.766987
## FTL 2.678021 2.370840
## HLA-DRA 1.708152 .
## CD74 1.181536 1.766987
## S100A4 1.708152 2.113817
Select variable genes and subset reference expression by variable genes (the command below will select the top 1,000 genes per batch, then pool them)
var_genes = vargenes_vst(ref_exp_full, groups = as.character(ref_metadata[['donor']]), topn = 1000)
ref_exp = ref_exp_full[var_genes, ]
dim(ref_exp)
## [1] 1600 770
Calculate and save the mean and standard deviations for each gene
vargenes_means_sds = tibble(symbol = var_genes, mean = Matrix::rowMeans(ref_exp))
vargenes_means_sds$stddev = rowSDs(ref_exp, vargenes_means_sds$mean)
head(vargenes_means_sds)
## # A tibble: 6 × 3
## symbol mean stddev
## <chr> <dbl> <dbl>
## 1 LYZ 1.78 1.91
## 2 HLA-DRA 1.87 1.69
## 3 FTL 3.55 1.22
## 4 CD74 2.54 1.53
## 5 S100A9 1.62 1.83
## 6 S100A4 2.53 1.46
Scale data using calculated gene means and standard deviations
Run PCA (using SVD), save gene loadings (s$u
)
set.seed(0)
s = irlba(ref_exp_scaled, nv = 20)
Z_pca_ref = diag(s$d) %*% t(s$v) # [pcs by cells]
loadings = s$u
Run Harmony integration. It is important to set
return_object = TRUE
.
set.seed(0)
ref_harmObj = harmony::HarmonyMatrix(
data_mat = t(Z_pca_ref), ## PCA embedding matrix of cells
meta_data = ref_metadata, ## dataframe with cell labels
theta = c(2), ## cluster diversity enforcement
vars_use = c('donor'), ## variable to integrate out
nclust = 100, ## number of clusters in Harmony model
max.iter.harmony = 20, ## max number of iterations
return_object = TRUE, ## return the full Harmony model object
do_pca = FALSE ## don't recompute PCs
)
## Warning in harmony::HarmonyMatrix(data_mat = t(Z_pca_ref), meta_data =
## ref_metadata, : HarmonyMatrix is deprecated and will be removed in the future
## from the API in the future
## Warning: Warning: The parameters do_pca and npcs are deprecated. They will be ignored for this function call and please remove parameters do_pca and npcs and pass to harmony cell_embeddings directly.
## This warning is displayed once per session.
## Warning: Warning: The parameter max.iter.harmony is replaced with parameter max_iter. It will be ignored for this function call and please use parameter max_iter in future function calls.
## This warning is displayed once per session.
## Transposing data matrix
## Initializing state using k-means centroids initialization
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony 4/10
## Harmony converged after 4 iterations
To run the next function buildReferenceFromHarmonyObj()
,
you need to input the saved gene loadings (loadings
) and
vargenes_means_sds
.
# Compress a Harmony object into a Symphony reference
reference = buildReferenceFromHarmonyObj(
ref_harmObj, # output object from HarmonyMatrix()
ref_metadata, # reference cell metadata
vargenes_means_sds, # gene names, means, and std devs for scaling
loadings, # genes x PCs matrix
verbose = TRUE, # verbose output
do_umap = TRUE, # set to TRUE to run UMAP
save_uwot_path = './testing_uwot_model_1') # file path to save uwot model
## Save metadata, vargenes (S), and loadings (U)
## Save R, Z_orig, Z_corr, and betas from Harmony object
## Calculate final L2 normalized reference centroids (Y_cos)
## Calculate reference compression terms (Nr and C)
## UMAP
## Saved uwot model
## Finished nicely.
Save Symphony reference for later mapping (modify with your desired output path)
Let’s take a look at what the reference object contains: * meta_data: metadata * vargenes: variable genes, means, and standard deviations used for scaling * loadings: gene loadings for projection into pre-Harmony PC space * R: Soft cluster assignments * Z_orig: Pre-Harmony PC embedding * Z_corr: Harmonized PC embedding * centroids: locations of final Harmony soft cluster centroids * cache: pre-calculated reference-dependent portions of the mixture model * umap: UMAP coordinates * save_uwot_path: path to saved uwot model (for query UMAP projection into reference UMAP coordinates)
## List of 12
## $ meta_data :'data.frame': 770 obs. of 7 variables:
## ..$ cell_id : chr [1:770] "threepfresh_GATCGTACAAAGGTGC" "threepv1_CGTGATGAGTCCTC" "threepv1_AAGAATCTTGGAGG" "threepfresh_CTGAAACCACACTGCG" ...
## ..$ donor : chr [1:770] "3pv2" "3pv1" "3pv1" "3pv2" ...
## ..$ nUMI : int [1:770] 4426 4121 1535 11956 6273 1840 2702 4260 3436 3966 ...
## ..$ nGene : int [1:770] 1207 1228 560 2685 1615 584 945 1188 1360 1486 ...
## ..$ percent_mito : num [1:770] 0.0337 0.015 0.0261 0.0278 0.0289 ...
## ..$ cell_type : chr [1:770] "T_CD8" "T_CD4" "B" "DC" ...
## ..$ cell_type_broad: chr [1:770] "T" "T" "B" "DC" ...
## $ vargenes : tibble [1,600 × 3] (S3: tbl_df/tbl/data.frame)
## ..$ symbol: chr [1:1600] "LYZ" "HLA-DRA" "FTL" "CD74" ...
## ..$ mean : Named num [1:1600] 1.78 1.87 3.55 2.54 1.62 ...
## .. ..- attr(*, "names")= chr [1:1600] "LYZ" "HLA-DRA" "FTL" "CD74" ...
## ..$ stddev: Named num [1:1600] 1.91 1.69 1.22 1.53 1.83 ...
## .. ..- attr(*, "names")= chr [1:1600] "LYZ" "HLA-DRA" "FTL" "CD74" ...
## $ loadings : num [1:1600, 1:20] 0.1247 0.0811 0.1181 0.0621 0.118 ...
## $ R : num [1:100, 1:770] 6.51e-11 3.37e-02 1.56e-05 3.22e-02 5.23e-10 ...
## $ Z_orig : num [1:20, 1:770] -4.332 0.976 -3.186 0.138 -0.154 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:20] "PC_1" "PC_2" "PC_3" "PC_4" ...
## .. ..$ : chr [1:770] "4820" "10634" "8538" "4091" ...
## $ Z_corr : num [1:20, 1:770] -4.329 0.985 -3.516 -0.836 0.22 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:20] "harmony_1" "harmony_2" "harmony_3" "harmony_4" ...
## .. ..$ : chr [1:770] "4820" "10634" "8538" "4091" ...
## $ betas : num [1:3, 1:20, 1:100] -4.67507 0.01562 -0.01562 6.11835 0.00125 ...
## $ centroids : num [1:20, 1:100] -0.21222 0.27757 0.36455 0.10133 0.00315 ...
## $ cache :List of 2
## ..$ : num [1:100, 1] 1.95 7.88 7.58 7.88 5.51 ...
## ..$ : num [1:100, 1:20] -9.1 -38.4 -32.3 -38.5 -20.8 ...
## $ centroids_pc :'data.frame': 100 obs. of 20 variables:
## ..$ harmony_1 : num [1:100] -4.66 -4.87 -4.26 -4.89 -3.77 ...
## ..$ harmony_2 : num [1:100] 6.1 0.792 3.162 0.796 -11.914 ...
## ..$ harmony_3 : num [1:100] 8.01 -3.73 -1.18 -3.73 4.5 ...
## ..$ harmony_4 : num [1:100] 2.227 -0.324 0.359 -0.315 -0.282 ...
## ..$ harmony_5 : num [1:100] 0.0693 0.191 -0.1894 0.1869 1.2772 ...
## ..$ harmony_6 : num [1:100] 1.979 -0.296 0.787 -0.295 -0.565 ...
## ..$ harmony_7 : num [1:100] 0.877 0.893 -1.706 0.909 0.639 ...
## ..$ harmony_8 : num [1:100] 1.619 0.882 -2.36 0.902 -0.449 ...
## ..$ harmony_9 : num [1:100] -3.4439 -0.0297 1.8631 -0.0938 -2.056 ...
## ..$ harmony_10: num [1:100] -4.17497 0.09616 0.11491 0.10254 -0.00614 ...
## ..$ harmony_11: num [1:100] -2.1483 0.0609 -1.569 0.0746 -0.5361 ...
## ..$ harmony_12: num [1:100] -3.309 0.0603 0.1691 0.0619 1.6341 ...
## ..$ harmony_13: num [1:100] 2.337 0.331 -0.327 0.324 0.2 ...
## ..$ harmony_14: num [1:100] -6.7439 -0.0648 -0.0452 -0.0676 4.0906 ...
## ..$ harmony_15: num [1:100] 7.585 -0.264 0.208 -0.3 0.456 ...
## ..$ harmony_16: num [1:100] 0.8259 0.0305 0.4656 0.0137 -1.2136 ...
## ..$ harmony_17: num [1:100] -3.95 -0.108 -0.497 -0.113 1.07 ...
## ..$ harmony_18: num [1:100] 9.8882 -0.0421 0.7514 -0.0343 -3.4906 ...
## ..$ harmony_19: num [1:100] 6.7551 -0.076 -0.3102 -0.0697 2.8904 ...
## ..$ harmony_20: num [1:100] -5.88 -0.356 0.523 -0.387 -4.708 ...
## $ umap :List of 1
## ..$ embedding: num [1:770, 1:2] -0.0665 1.0889 -5.5889 -7.7594 -1.0674 ...
## .. ..- attr(*, "scaled:center")= num [1:2] -1.88 3.74
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:770] "4820" "10634" "8538" "4091" ...
## .. .. ..$ : chr [1:2] "UMAP1" "UMAP2"
## $ save_uwot_path: chr "./testing_uwot_model_1"
The harmonized embedding is located in the Z_corr
slot
of the reference object.
## [1] 20 770
## 4820 10634 8538 4091 11682
## harmony_1 -4.3288149 -4.2445957 -4.4774239 8.8143345 12.4672375
## harmony_2 0.9850685 1.2294724 -13.1961863 -2.5585248 -0.1101657
## harmony_3 -3.5156035 -3.2315005 4.6187725 0.2711107 0.3434141
## harmony_4 -0.8360186 -0.5159932 0.2887552 -0.2665001 -5.2331532
## harmony_5 0.2204747 -0.8677291 1.3867310 -3.9301075 -0.1100130
Visualize reference UMAP
This option computes a reference object starting from expression in a unified pipeline, automating the preprocessing steps.
# Build reference
set.seed(0)
reference = symphony::buildReference(
ref_exp_full,
ref_metadata,
vars = c('donor'), # variables to integrate over
K = 100, # number of Harmony clusters
verbose = TRUE, # verbose output
do_umap = TRUE, # can set to FALSE if want to run umap separately later
do_normalize = FALSE, # set to TRUE if input counts are not normalized yet
vargenes_method = 'vst', # method for variable gene selection ('vst' or 'mvp')
vargenes_groups = 'donor', # metadata column specifying groups for variable gene selection
topn = 1000, # number of variable genes to choose per group
d = 20, # number of PCs
save_uwot_path = './testing_uwot_model_2' # file path to save uwot model
)
## Finding variable genes using vst method
## Total 1600 genes for downstream steps
## Scaling and PCA
## Running Harmony integration
## Warning in harmony::HarmonyMatrix(data_mat = t(Z_pca_ref), meta_data =
## metadata_ref, : HarmonyMatrix is deprecated and will be removed in the future
## from the API in the future
## Warning: Warning: The parameter tau is deprecated. It will be ignored for this function call and please remove parameter tau in future function calls. Advanced users can set value of parameter tau by using parameter .options and function harmony_options().
## This warning is displayed once per session.
## Transposing data matrix
## Initializing state using k-means centroids initialization
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony 4/10
## Harmony converged after 4 iterations
## Computing reference compression terms
## Running UMAP
## Saved uwot model
# Save reference (modify with your desired output path)
saveRDS(reference, './testing_reference2.rds')
Visualize reference UMAP
In order to map a new query dataset onto the reference, you will need a reference object saved from the steps above, as well as query cell expression and metadata.
The query dataset is assumed to have been normalized in the same manner as the reference cells (here, default is log(CP10k+1) normalization).
# Read in Symphony reference to map to
reference = readRDS('./testing_reference1.rds')
# Map query
query = mapQuery(query_exp, # query gene expression (genes x cells)
query_metadata, # query metadata (cells x attributes)
reference, # Symphony reference object
do_normalize = FALSE, # perform log(CP10k) normalization on query
do_umap = TRUE) # project query cells into reference UMAP
## Scaling and synchronizing query gene expression
## Found 1600 out of 1600 reference variable genes in query dataset
## Project query cells using reference gene loadings
## Clustering query cells to reference centroids
## Correcting query batch effects
## UMAP
## All done!
Note: Symphony assumes that the query is normalized in the same manner as the reference. Our default implementation currently uses log(CP10k+1) normalization.
Let’s take a look at what the query object contains: * Z: query cells in reference Harmonized embedding * Zq_pca: query cells in pre-Harmony reference PC embedding (prior to correction) * R: query cell soft cluster assignments * Xq: query cell design matrix for correction step * umap: query cells projected into reference UMAP coordinates (using uwot) * meta_data: metadata
## List of 7
## $ exp :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:99120] 1 2 3 5 10 13 14 15 26 27 ...
## .. ..@ p : int [1:431] 0 267 497 727 942 1197 1384 1608 1877 2047 ...
## .. ..@ Dim : int [1:2] 1767 430
## .. ..@ Dimnames:List of 2
## .. .. ..$ : chr [1:1767] "LYZ" "FTL" "HLA-DRA" "CD74" ...
## .. .. ..$ : chr [1:430] "fivePrime_GATCGCGAGACCGGAT" "fivePrime_AACTCAGGTTACGCGC" "fivePrime_ACATCAGGTCTTGTCC" "fivePrime_GCCTCTATCATGTCCC" ...
## .. ..@ x : num [1:99120] 2.15 3.9 5.41 3.4 3.17 ...
## .. ..@ factors : list()
## $ meta_data:'data.frame': 430 obs. of 7 variables:
## ..$ cell_id : chr [1:430] "fivePrime_GATCGCGAGACCGGAT" "fivePrime_AACTCAGGTTACGCGC" "fivePrime_ACATCAGGTCTTGTCC" "fivePrime_GCCTCTATCATGTCCC" ...
## ..$ donor : chr [1:430] "5p" "5p" "5p" "5p" ...
## ..$ nUMI : int [1:430] 6580 4750 4432 5831 5558 5360 4749 5152 4929 4194 ...
## ..$ nGene : int [1:430] 2020 1670 1668 2027 1890 1388 1661 1805 1417 1575 ...
## ..$ percent_mito : num [1:430] 0.0571 0.0564 0.0399 0.0326 0.0587 ...
## ..$ cell_type : chr [1:430] "B" "B" "B" "T_CD4" ...
## ..$ cell_type_broad: chr [1:430] "B" "B" "B" "T" ...
## $ Z : num [1:20, 1:430] -3.377 -13.151 5.843 -0.133 0.29 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:20] "harmony_1" "harmony_2" "harmony_3" "harmony_4" ...
## .. ..$ : chr [1:430] "17642" "13347" "13638" "17934" ...
## $ Zq_pca : num [1:20, 1:430] -3.089 -14.01 6.805 1.207 -0.721 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:20] "PC_1" "PC_2" "PC_3" "PC_4" ...
## .. ..$ : chr [1:430] "17642" "13347" "13638" "17934" ...
## $ R : num [1:100, 1:430] 2.75e-09 2.83e-11 3.09e-11 2.77e-11 1.55e-03 ...
## $ Xq :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:860] 0 1 0 1 0 1 0 1 0 1 ...
## .. ..@ p : int [1:431] 0 2 4 6 8 10 12 14 16 18 ...
## .. ..@ Dim : int [1:2] 2 430
## .. ..@ Dimnames:List of 2
## .. .. ..$ : NULL
## .. .. ..$ : NULL
## .. ..@ x : num [1:860] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..@ factors : list()
## $ umap : num [1:430, 1:2] -6.12 -6.04 -5.82 2.36 -6.3 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:430] "17642" "13347" "13638" "17934" ...
## .. ..$ : chr [1:2] "UMAP1" "UMAP2"
Predict query cell types using k-NN. Setting confidence = TRUE also returns the prediction confidence scores (proportion of neighbors with winning vote).
query = knnPredict(query, # query object
reference, # reference object
reference$meta_data$cell_type, # reference cell labels for training
k = 5, # number of reference neighbors to use for prediction
confidence = TRUE)
Query cell type predictions are now in the cell_type_pred_knn column.
## cell_id donor nUMI nGene percent_mito cell_type
## 17642 fivePrime_GATCGCGAGACCGGAT 5p 6580 2020 0.05714286 B
## 13347 fivePrime_AACTCAGGTTACGCGC 5p 4750 1670 0.05642105 B
## 13638 fivePrime_ACATCAGGTCTTGTCC 5p 4432 1668 0.03993682 B
## 17934 fivePrime_GCCTCTATCATGTCCC 5p 5831 2027 0.03258446 T_CD4
## 14432 fivePrime_AGGGATGGTGGGTATG 5p 5558 1890 0.05865419 B
## 18442 fivePrime_GGGACCTCATGCCACG 5p 5360 1388 0.03750000 T_CD4
## cell_type_broad cell_type_pred_knn cell_type_pred_knn_prob
## 17642 B B 1.0
## 13347 B B 1.0
## 13638 B B 1.0
## 17934 T T_CD4 1.0
## 14432 B B 1.0
## 18442 T T_CD4 0.8
# Sync the column names for both data frames
reference$meta_data$cell_type_pred_knn = NA
reference$meta_data$cell_type_pred_knn_prob = NA
reference$meta_data$ref_query = 'reference'
query$meta_data$ref_query = 'query'
# Add the UMAP coordinates to the metadata
meta_data_combined = rbind(query$meta_data, reference$meta_data)
umap_combined = rbind(query$umap, reference$umap$embedding)
umap_combined_labels = cbind(meta_data_combined, umap_combined)
# Plot UMAP visualization of all cells
plotBasic(umap_combined_labels, title = 'Reference and query cells', color.by = 'ref_query')
Plot the reference and query side by side.
plotBasic(umap_combined_labels, title = 'Reference and query cells',
color.mapping = pbmc_colors, facet.by = 'ref_query')
And that’s a wrap! If you run into issues or have questions about Symphony or this tutorial, please open an issue on GitHub.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 ggrastr_1.0.2 ggthemes_5.1.0 ggplot2_3.5.1
## [5] dplyr_1.1.4 data.table_1.16.2 irlba_2.3.5.1 Matrix_1.7-1
## [9] harmony_1.2.1 Rcpp_1.0.13-1 symphony_0.1.1 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [4] class_7.3-22 stringi_1.8.4 lattice_0.22-6
## [7] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.1
## [10] grid_4.4.2 fastmap_1.2.0 jsonlite_1.8.9
## [13] purrr_1.0.2 fansi_1.0.6 scales_1.3.0
## [16] RhpcBLASctl_0.23-42 codetools_0.2-20 jquerylib_0.1.4
## [19] cli_3.6.3 rlang_1.1.4 RcppAnnoy_0.0.22
## [22] uwot_0.2.2 cowplot_1.1.3 munsell_0.5.1
## [25] withr_3.0.2 RANN_2.6.2 cachem_1.1.0
## [28] yaml_2.3.10 ggbeeswarm_0.7.2 tools_4.4.2
## [31] colorspace_2.1-1 buildtools_1.0.0 vctrs_0.6.5
## [34] R6_2.5.1 lifecycle_1.0.4 stringr_1.5.1
## [37] vipor_0.4.7 beeswarm_0.4.0 pkgconfig_2.0.3
## [40] pillar_1.9.0 bslib_0.8.0 gtable_0.3.6
## [43] glue_1.8.0 highr_0.11 xfun_0.49
## [46] tibble_3.2.1 tidyselect_1.2.1 sys_3.4.3
## [49] knitr_1.48 farver_2.1.2 htmltools_0.5.8.1
## [52] labeling_0.4.3 maketools_1.3.1 Cairo_1.6-2
## [55] compiler_4.4.2