This vignette introduces the PRECAST workflow for the integrative analysis of multiple spatial transcriptomics datasets. The workflow consists of three steps: preprocessing and model setting, model fitting with PRECAST, and downstream analysis (visualization and combined differential expression analysis).
We demonstrate the use of PRECAST on three simulated Visium datasets, which can be downloaded to the current working directory with the following commands:
githubURL <- "https://github.com/feiyoung/PRECAST/blob/main/vignettes_data/data_simu.rda?raw=true"
download.file(githubURL,"data_simu.rda",mode='wb')
Then load the data into R.
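A minimal sketch of the loading step, assuming the download command above saved the file as data_simu.rda in the working directory and that the file stores an object named data_simu. The helper load_rda() is a hypothetical name introduced here for illustration and is not part of PRECAST; it loads into a separate environment so that existing workspace objects are not overwritten.

```r
# Hypothetical helper: load an .rda file into a fresh environment and
# return that environment, so nothing in the workspace is overwritten.
load_rda <- function(path) {
  stopifnot(file.exists(path))
  env <- new.env()
  load(path, envir = env)
  env
}

rda_path <- "data_simu.rda"
if (file.exists(rda_path)) {
  env <- load_rda(rda_path)
  print(ls(env))               # names of the objects stored in the file
  data_simu <- env$data_simu   # assumes the file stores an object named data_simu
}
```

Calling load("data_simu.rda") directly also works; it places the stored objects in the current workspace.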
The package can be loaded with the command:
First, we view the three simulated spatial transcriptomics datasets from the Visium platform.
Check the content of data_simu.
We show how to create a PRECASTObject object step by step. First, we
create a Seurat list object using the count matrix and meta data of each
data batch. Although data_simu is a prepared Seurat list
object, we re-create the same object, seuList, to show the details.
Note that the spatial coordinates must be stored in the meta data and named row and col, which benefits the
identification of spatial coordinates by PRECAST.
## Get the gene-by-spot read count matrices
## (in SeuratObject v5 the `slot` argument is deprecated in favour of `layer`)
countList <- lapply(data_simu, function(x) {
  assay <- DefaultAssay(x)
  GetAssayData(x, assay = assay, slot = 'counts')
})
## Check the spatial coordinates: Yes, they are named as "row" and "col"!
head(data_simu[[1]]@meta.data)
## Get the meta data of each spot for each data batch
metadataList <- lapply(data_simu, function(x) x@meta.data)
## Ensure that the row names of each metadata table in metadataList match
## the column names of the corresponding count matrix in countList
M <- length(countList)
for (r in seq_len(M)) {
  row.names(metadataList[[r]]) <- colnames(countList[[r]])
}
## Create the Seurat list object
seuList <- list()
for (r in seq_len(M)) {
  seuList[[r]] <- CreateSeuratObject(counts = countList[[r]], meta.data = metadataList[[r]], project = "PRECASTsimu")
}
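Before building the Seurat objects, it can help to confirm that each metadata table lines up with its count matrix and carries the coordinate columns. The sketch below is a base R illustration on made-up toy data; check_spot_metadata() is a hypothetical helper, not a function from PRECAST or Seurat.

```r
# Hypothetical helper (not part of PRECAST or Seurat): check that a metadata
# table is aligned with a gene-by-spot count matrix and carries coordinates.
# Returns a character vector of problems; an empty vector means no problem.
check_spot_metadata <- function(counts, meta, coord_cols = c("row", "col")) {
  problems <- character(0)
  if (!identical(rownames(meta), colnames(counts))) {
    problems <- c(problems, "row names of metadata differ from column names of counts")
  }
  missing_cols <- setdiff(coord_cols, colnames(meta))
  if (length(missing_cols) > 0) {
    problems <- c(problems,
                  paste("missing coordinate columns:", paste(missing_cols, collapse = ", ")))
  }
  problems
}

# Toy data: 2 genes x 3 spots (made-up values for illustration only)
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
meta <- data.frame(row = c(1, 1, 2), col = c(1, 2, 1),
                   row.names = c("s1", "s2", "s3"))
check_spot_metadata(counts, meta)   # empty vector: nothing to fix
```

With the objects of this vignette, the same check could be applied to every batch with Map(check_spot_metadata, countList, metadataList).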
Next, we use CreatePRECASTObject() to create a
PRECASTObject based on the Seurat list object seuList. This
function will do three things:
(1) Filter low-quality spots and genes, controlled by the arguments premin.features and premin.spots, respectively; spots in the raw data (seuList) are retained if they have at least premin.features nonzero-count features (genes), and genes in the raw data (seuList) are retained if they are detected in at least premin.spots spots. To ease presentation, we denote the filtered Seurat list object as data_filter1.
(2) Select the top 2,000 variable genes (by setting gene.number=2000) for each data batch using the FindSVGs() function in the DR.SC package for spatially variable genes or the FindVariableFeatures() function in the Seurat package for highly variable genes. Next, we prioritize genes based on the number of times they were selected as variable genes across all samples and choose the top 2,000 genes. We denote the resulting Seurat list object as data_filter2, where only 2,000 genes are retained.
(3) Filter spots and genes in data_filter2 again, controlled by the arguments postmin.features and postmin.spots, respectively; spots are retained if they have nonzero counts in at least postmin.features genes, and features (genes) are retained if they have nonzero counts in at least postmin.spots spots. Usually, no genes are filtered because these genes are variable genes.
If the argument customGenelist is not NULL, then this function only does (3) based on the customGenelist gene list.
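The filtering and gene prioritization described above can be sketched in base R on toy data. This is only an illustration of the idea under stated assumptions, not the code that CreatePRECASTObject() runs: the helper names filter_counts() and select_top_genes() are hypothetical, both filtering criteria are assumed to be evaluated on the unfiltered matrix, and ties between genes are assumed to be broken alphabetically.

```r
# Illustrative sketch only; these helpers are hypothetical and are not the
# PRECAST implementation.

# Filter a gene-by-spot count matrix, in the spirit of steps (1) and (3).
# Assumption: both criteria are evaluated on the unfiltered matrix.
filter_counts <- function(counts, min_features, min_spots) {
  nonzero <- counts > 0
  keep_spots <- colSums(nonzero) >= min_features  # spots with enough detected genes
  keep_genes <- rowSums(nonzero) >= min_spots     # genes detected in enough spots
  counts[keep_genes, keep_spots, drop = FALSE]
}

# Rank genes by how many samples selected them, in the spirit of step (2).
# Assumption: ties are broken alphabetically by gene name.
select_top_genes <- function(gene_lists, n_top) {
  votes <- table(unlist(lapply(gene_lists, unique)))
  ord <- order(-as.integer(votes), names(votes))
  head(names(votes)[ord], n_top)
}

# Toy count matrix: 3 genes x 4 spots (made-up values, filled column by column)
counts <- matrix(c(1, 0, 0,
                   2, 3, 0,
                   0, 1, 0,
                   4, 5, 0),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), paste0("s", 1:4)))
filtered <- filter_counts(counts, min_features = 2, min_spots = 1)
dim(filtered)  # 2 genes x 2 spots remain

# Variable genes selected in three hypothetical samples
per_sample <- list(c("a", "b", "c"), c("b", "c", "d"), c("c", "e"))
select_top_genes(per_sample, n_top = 2)  # "c" "b"
```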
In this simulated dataset, we do not need to select genes, so we
set customGenelist=row.names(seuList[[1]]), which serves as
the user-defined gene list. Users can retain the raw Seurat list object
by setting rawData.preserve = TRUE.
Add the adjacency matrix list and the parameter settings of PRECAST. More
model setting parameters can be found in model_set().
## check the number of genes/features after filtering step
PRECASTObj@seulist
## seuList is NULL since the default value of `rawData.preserve` is FALSE.
PRECASTObj@seuList
## Add adjacency matrix list for a PRECASTObj object to prepare for PRECAST model fitting.
PRECASTObj <- AddAdjList(PRECASTObj, platform = "Visium")
## Add a model setting in advance for a PRECASTObj object:
## verbose = TRUE prints progress information from the algorithm;
## coreNum sets how many cores PRECAST uses. If you run PRECAST for several
## numbers of clusters, you can use multiple cores; otherwise, set it to 1.
PRECASTObj <- AddParSetting(PRECASTObj, Sigma_equal = FALSE, maxIter = 30, verbose = TRUE,
                            coreNum = 1)
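To give a rough idea of what an adjacency matrix over spots encodes, the base R sketch below marks two spots as neighbors when they lie within a fixed distance of each other. This is an assumption made for illustration: radius_adjacency() is a hypothetical helper, and the neighbor rule that AddAdjList() applies for a given platform may differ.

```r
# Hypothetical helper (not the PRECAST implementation): build a 0/1 adjacency
# matrix in which two distinct spots are neighbors when their Euclidean
# distance in (row, col) space is at most `radius`.
radius_adjacency <- function(coords, radius) {
  d <- as.matrix(dist(coords))   # pairwise Euclidean distances
  adj <- (d <= radius) * 1       # 1 where spots are within the radius
  diag(adj) <- 0                 # a spot is not its own neighbor
  adj
}

# Toy 2 x 2 lattice of spots (made-up coordinates)
coords <- data.frame(row = c(1, 1, 2, 2), col = c(1, 2, 1, 2))
adj <- radius_adjacency(coords, radius = 1)
rowSums(adj)   # number of neighbors of each spot
```

A dense matrix is fine for a toy example; for thousands of spots a sparse representation would be the more practical choice.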
For the function PRECAST(), users can specify the number of
clusters K as a single integer, or set K to an integer vector, in which case
the modified BIC (MBIC) is used to determine K. For convenience,
we give a single K here.
Run for multiple K. Here, we set K=6:7.
## Reset parameters by increasing cores.
PRECASTObj2 <- AddParSetting(PRECASTObj, Sigma_equal = FALSE, maxIter = 30, verbose = TRUE,
                             coreNum = 2)
set.seed(2023)
PRECASTObj2 <- PRECAST(PRECASTObj2, K=6:7)
resList2 <- PRECASTObj2@resList
PRECASTObj2 <- SelectModel(PRECASTObj2)
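On Linux, the following shell command lifts the stack size limit; run it in the terminal before starting R if a parallel run stops with a C stack usage error:
ulimit -s unlimited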
Besides, users can also use a different initialization method by setting
int.model, for example int.model=NULL;
see the functions AddParSetting() and
model_set() for more details.
Select the best model and re-organize the results by using
SelectModel(). Even when K is not a vector,
it is still necessary to run SelectModel() to re-organize
the results in PRECASTObj. The selected best K is 7, as shown by
the command str(PRECASTObj@resList).
## check the fitted results: resList holds one list of fitted results for each K that was run.
str(PRECASTObj@resList)
## backup the fitted results in resList
resList <- PRECASTObj@resList
# PRECASTObj@resList <- resList
PRECASTObj <- SelectModel(PRECASTObj)
## check the best and re-organized results
str(PRECASTObj@resList) ## The selected best K is 7
Use ARI to check the performance of clustering:
true_cluster <- lapply(PRECASTObj@seulist, function(x) x$true_cluster)
str(true_cluster)
mclust::adjustedRandIndex(unlist(PRECASTObj@resList$cluster), unlist(true_cluster))
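For readers unfamiliar with the adjusted Rand index (ARI), the sketch below computes it in base R from the contingency table of two labelings, using the standard formula. The function name adjusted_rand_index() and the toy labels are introduced here for illustration; the vignette itself relies on mclust::adjustedRandIndex().

```r
# Adjusted Rand index from the contingency table of two labelings
# (base R sketch of the standard formula).
adjusted_rand_index <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)
  choose2 <- function(n) n * (n - 1) / 2      # number of pairs among n items
  sum_cells <- sum(choose2(tab))              # pairs grouped together in both labelings
  sum_rows <- sum(choose2(rowSums(tab)))      # pairs grouped together in x
  sum_cols <- sum(choose2(colSums(tab)))      # pairs grouped together in y
  expected <- sum_rows * sum_cols / choose2(length(x))
  maximum <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

truth <- c(1, 1, 2, 2, 3, 3)
relabeled <- c("b", "b", "a", "a", "c", "c")
adjusted_rand_index(truth, relabeled)               # 1: same partition, different labels
adjusted_rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))   # -0.5: worse than chance agreement
```

Because the index depends only on which spots are grouped together, the cluster labels returned by PRECAST do not need to match the label names used in true_cluster.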
We provide two methods to correct batch effects at the gene
expression level. Method (1) uses only the PRECAST results to obtain the
batch-corrected gene expression; it applies if the species of the data is unknown or
if fewer than five housekeeping genes overlap between the variable genes
in PRECASTObj@seulist and the genes in the database.
Method (2) uses both housekeeping genes and the PRECAST
results to obtain the batch-corrected gene expression.
PRECASTObj@seulist.
Integrate the samples with the function
IntegrateSpaData(). Because this is simulated data, we use
Method (1) by setting species='unknown'.
First, users can choose a color scheme using
chooseColors().
Show the spatial scatter plot for clusters
p12 <- SpaPlot(seuInt, batch=NULL, cols=cols_cluster, point_size=2, combine=TRUE)
p12
# users can plot each sample by setting combine=FALSE
Users can re-plot the above figures for specific needs by returning a list of ggplot objects. For example, we plot only the spatial heatmap of the first two data batches.
pList <- SpaPlot(seuInt, batch=NULL, cols=cols_cluster, point_size=2, combine=FALSE, title_name=NULL)
drawFigs(pList[1:2], layout.dim = c(1,2), common.legend = TRUE, legend.position = 'right', align='hv')
Show the spatial UMAP/tSNE RGB plot
seuInt <- AddUMAP(seuInt)
SpaPlot(seuInt, batch=NULL,item='RGB_UMAP',point_size=1, combine=TRUE, text_size=15)
## Plot tSNE RGB plot
#seuInt <- AddTSNE(seuInt)
#SpaPlot(seuInt, batch=NULL,item='RGB_TSNE',point_size=2, combine=T, text_size=15)
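The idea behind an RGB plot can be sketched as follows: a three-component embedding is rescaled so that each component drives one color channel, and spots with similar embeddings receive similar colors. This is a base R illustration on made-up values under that assumption; embedding_to_rgb() is a hypothetical helper and the exact mapping used by SpaPlot() may differ.

```r
# Hypothetical helper (not the PRECAST implementation): map a three-column
# embedding to hex colors by rescaling each column to [0, 1].
embedding_to_rgb <- function(embedding) {
  stopifnot(is.matrix(embedding), ncol(embedding) == 3, nrow(embedding) >= 2)
  rescale01 <- function(v) {
    rng <- range(v)
    if (rng[1] == rng[2]) return(rep(0.5, length(v)))  # constant column: mid level
    (v - rng[1]) / (rng[2] - rng[1])
  }
  scaled <- apply(embedding, 2, rescale01)   # one rescaled column per channel
  rgb(scaled[, 1], scaled[, 2], scaled[, 3])
}

# Toy embedding of three spots (made-up values)
emb <- rbind(c(0, 0, 0), c(1, 2, 3), c(0.5, 1, 1.5))
spot_colors <- embedding_to_rgb(emb)
spot_colors[1:2]   # "#000000" "#FFFFFF": the two extreme spots
```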
Show the tSNE plot based on the extracted features from PRECAST to check the performance of integration.
seuInt <- AddTSNE(seuInt, n_comp = 2)
p1 <- dimPlot(seuInt, item='cluster', font_family='serif', cols=cols_cluster) # Times New Roman
p2 <- dimPlot(seuInt, item='batch', point_size = 1, font_family='serif')
drawFigs(list(p1, p2), common.legend=FALSE, align='hv')
# It is noted that only sample batch 1 has cluster 4, and only sample batch 2 has cluster 7.
Show the UMAP plot based on the extracted features from PRECAST.
Users can also use the visualization functions in Seurat package:
library(Seurat)
p1 <- DimPlot(seuInt[,1: 4226], reduction = 'position', cols=cols_cluster, pt.size =1) # plot the first data batch: first 4226 spots.
p2 <- DimPlot(seuInt, reduction = 'tSNE',cols=cols_cluster, pt.size=1)
drawFigs(list(p1, p2), layout.dim = c(1,2), common.legend = TRUE)
Combined differential expression analysis
dat_deg <- FindAllMarkers(seuInt)
library(dplyr)
n <- 2
## keep the top n marker genes of each cluster, ranked by average log2 fold change
top_markers <- dat_deg %>%
  group_by(cluster) %>%
  top_n(n = n, wt = avg_log2FC)
head(top_markers)
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] PRECAST_1.6.5 gtools_3.9.5 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.22 splines_4.4.2
#> [3] later_1.4.1 tibble_3.2.1
#> [5] polyclip_1.10-7 fastDummies_1.7.4
#> [7] lifecycle_1.0.4 rstatix_0.7.2
#> [9] globals_0.16.3 lattice_0.22-6
#> [11] MASS_7.3-61 backports_1.5.0
#> [13] magrittr_2.0.3 plotly_4.10.4
#> [15] sass_0.4.9 jquerylib_0.1.4
#> [17] yaml_2.3.10 httpuv_1.6.15
#> [19] Seurat_5.1.0 sctransform_0.4.1
#> [21] spam_2.11-0 sp_2.1-4
#> [23] spatstat.sparse_3.1-0 reticulate_1.40.0
#> [25] cowplot_1.1.3 pbapply_1.7-2
#> [27] buildtools_1.0.0 RColorBrewer_1.1-3
#> [29] abind_1.4-8 zlibbioc_1.52.0
#> [31] Rtsne_0.17 GenomicRanges_1.59.1
#> [33] purrr_1.0.2 BiocGenerics_0.53.3
#> [35] GenomeInfoDbData_1.2.13 IRanges_2.41.2
#> [37] S4Vectors_0.45.2 ggrepel_0.9.6
#> [39] irlba_2.3.5.1 listenv_0.9.1
#> [41] spatstat.utils_3.1-1 maketools_1.3.1
#> [43] goftest_1.2-3 RSpectra_0.16-2
#> [45] spatstat.random_3.3-2 fitdistrplus_1.2-1
#> [47] parallelly_1.40.1 leiden_0.4.3.1
#> [49] codetools_0.2-20 DelayedArray_0.33.3
#> [51] scuttle_1.17.0 tidyselect_1.2.1
#> [53] UCSC.utils_1.3.0 farver_2.1.2
#> [55] viridis_0.6.5 ScaledMatrix_1.15.0
#> [57] matrixStats_1.4.1 stats4_4.4.2
#> [59] spatstat.explore_3.3-3 jsonlite_1.8.9
#> [61] BiocNeighbors_2.1.2 Formula_1.2-5
#> [63] progressr_0.15.1 ggridges_0.5.6
#> [65] survival_3.7-0 scater_1.35.0
#> [67] tools_4.4.2 ica_1.0-3
#> [69] Rcpp_1.0.13-1 glue_1.8.0
#> [71] gridExtra_2.3 SparseArray_1.7.2
#> [73] xfun_0.49 MatrixGenerics_1.19.0
#> [75] ggthemes_5.1.0 GenomeInfoDb_1.43.2
#> [77] dplyr_1.1.4 formatR_1.14
#> [79] fastmap_1.2.0 fansi_1.0.6
#> [81] digest_0.6.37 rsvd_1.0.5
#> [83] R6_2.5.1 mime_0.12
#> [85] colorspace_2.1-1 scattermore_1.2
#> [87] tensor_1.5 spatstat.data_3.1-4
#> [89] utf8_1.2.4 tidyr_1.3.1
#> [91] generics_0.1.3 data.table_1.16.4
#> [93] httr_1.4.7 htmlwidgets_1.6.4
#> [95] S4Arrays_1.7.1 uwot_0.2.2
#> [97] pkgconfig_2.0.3 gtable_0.3.6
#> [99] lmtest_0.9-40 SingleCellExperiment_1.29.1
#> [101] XVector_0.47.0 sys_3.4.3
#> [103] htmltools_0.5.8.1 carData_3.0-5
#> [105] dotCall64_1.2 SeuratObject_5.0.2
#> [107] scales_1.3.0 Biobase_2.67.0
#> [109] png_0.1-8 spatstat.univar_3.1-1
#> [111] knitr_1.49 reshape2_1.4.4
#> [113] nlme_3.1-166 cachem_1.1.0
#> [115] zoo_1.8-12 stringr_1.5.1
#> [117] KernSmooth_2.23-24 miniUI_0.1.1.1
#> [119] vipor_0.4.7 GiRaF_1.0.1
#> [121] pillar_1.9.0 grid_4.4.2
#> [123] vctrs_0.6.5 RANN_2.6.2
#> [125] ggpubr_0.6.0 promises_1.3.2
#> [127] car_3.1-3 BiocSingular_1.23.0
#> [129] DR.SC_3.4 beachmat_2.23.4
#> [131] xtable_1.8-4 cluster_2.1.8
#> [133] beeswarm_0.4.0 evaluate_1.0.1
#> [135] cli_3.6.3 compiler_4.4.2
#> [137] rlang_1.1.4 crayon_1.5.3
#> [139] ggsignif_0.6.4 future.apply_1.11.3
#> [141] mclust_6.1.1 plyr_1.8.9
#> [143] ggbeeswarm_0.7.2 stringi_1.8.4
#> [145] viridisLite_0.4.2 deldir_2.0-4
#> [147] BiocParallel_1.41.0 munsell_0.5.1
#> [149] lazyeval_0.2.2 spatstat.geom_3.3-4
#> [151] CompQuadForm_1.4.3 Matrix_1.7-1
#> [153] RcppHNSW_0.6.0 patchwork_1.3.0
#> [155] future_1.34.0 ggplot2_3.5.1
#> [157] shiny_1.10.0 SummarizedExperiment_1.37.0
#> [159] ROCR_1.0-11 broom_1.0.7
#> [161] igraph_2.1.2 bslib_0.8.0