--- title: 'PRECAST: simulation' author: "Wei Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{PRECAST: simulation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette introduces the PRECAST workflow for the analysis of integrating multiple spatial transcriptomics dataset. The workflow consists of three steps * Independent preprocessing and model setting * Probabilistic embedding, clustering and alignment using PRECAST model * Downstream analysis (i.e. visualization of clusters and embeddings, combined differential expression analysis) We demonstrate the use of PRECAST to three simulated Visium data that are [here](https://github.com/feiyoung/PRECAST/tree/main/vignettes_data), which can be downloaded to the current working path by the following command: ```{r eval=FALSE} githubURL <- "https://github.com/feiyoung/PRECAST/blob/main/vignettes_data/data_simu.rda?raw=true" download.file(githubURL,"data_simu.rda",mode='wb') ``` Then load to R ```{r eval = FALSE} load("data_simu.rda") ``` The package can be loaded with the command: ```{r eval = FALSE} library(PRECAST) library(Seurat) ``` ## Load the simulated data First, we view the the three simulated spatial transcriptomics data with Visium platform. ```{r eval = FALSE} data_simu ## a list including three Seurat object with default assay: RNA ``` Check the content in `data_simu`. ```{r eval= FALSE} head(data_simu[[1]]) ``` ```{r eval= FALSE} row.names(data_simu[[1]])[1:10] ``` ## Create a PRECASTObject object We show how to create a PRECASTObject object step by step. First, we create a Seurat list object using the count matrix and meta data of each data batch. Although `data_simu` is a prepared Seurat list object, we re-create a same objcet seuList to show the details. * Note: the spatial coordinates must be contained in the meta data and named as `row` and `col`, which benefits the identification of spaital coordinates by PRECAST. ```{r eval = FALSE} ## Get the gene-by-spot read count matrices countList <- lapply(data_simu, function(x){ assay <- DefaultAssay(x) GetAssayData(x, assay = assay, slot='counts') } ) ## Check the spatial coordinates: Yes, they are named as "row" and "col"! head(data_simu[[1]]@meta.data) ## Get the meta data of each spot for each data batch metadataList <- lapply(data_simu, function(x) x@meta.data) ## ensure the row.names of metadata in metaList are the same as that of colnames count matrix in countList M <- length(countList) for(r in 1:M){ row.names(metadataList[[r]]) <- colnames(countList[[r]]) } ## Create the Seurat list object seuList <- list() for(r in 1:M){ seuList[[r]] <- CreateSeuratObject(counts = countList[[r]], meta.data=metadataList[[r]], project = "PRECASTsimu") } ``` ### Prepare the PRECASTObject with preprocessing step. Next, we use `CreatePRECASTObject()` to create a PRECASTObject based on the Seurat list object `seuList`. This function will do three things: - (1) Filter low-quality spots and genes, controlled by the arguments `premin.features` and `premin.spots`, respectively; the spots are retained in raw data (seuList) with at least premin.features number of nonzero-count features (genes), and the genes are retained in raw data (seuList) with at least `premin.spots` number of spots. To ease presentation, we denote the filtered Seurat list object as data_filter1. - (2) Select the top 2,000 variable genes (by setting `gene.number=2000`) for each data batch using `FindSVGs()` function in `DR.SC` package for spatially variable genes or `FindVariableFeatures()` function in `Seurat` package for highly variable genes. Next, we prioritized genes based on the number of times they were selected as variable genes in all samples and chose the top 2,000 genes. Then denote the Seurat list object as data_filter2, where only 2,000 genes are retained. - (3) Conduct strict quality control for data_filter2 by filtering spots and genes, controlled by the arguments `postmin.features` and `postmin.spots`, respectively; the spots are retained with at least `post.features` nonzero counts across genes; the features (genes) are retained with at least `postmin.spots` number of nonzero-count spots. Usually, no genes are filltered because these genes are variable genes. If the argument `customGenelist` is not `NULL`, then this function only does (3) based on `customGenelist` gene list. In this simulated dataset, we don't require to select genes, thus, we set `customGenelist=row.names(seuList[[1]])`, representing the user-defined gene list. User can retain the raw seurat list object by setting `rawData.preserve = TRUE`. ```{r eval = FALSE} ## Create PRECASTObject set.seed(2022) PRECASTObj <- CreatePRECASTObject(seuList, customGenelist=row.names(seuList[[1]])) ## User can retain the raw seuList by the following commond. ## PRECASTObj <- CreatePRECASTObject(seuList, customGenelist=row.names(seuList[[1]]), rawData.preserve = TRUE) ``` ## Fit PRECAST using simulated data ### Add the model setting Add adjacency matrix list and parameter setting of PRECAST. More model setting parameters can be found in `model_set()`. ```{r eval = FALSE} ## check the number of genes/features after filtering step PRECASTObj@seulist ## seuList is null since the default value `rawData.preserve` is FALSE. PRECASTObj@seuList ## Add adjacency matrix list for a PRECASTObj object to prepare for PRECAST model fitting. PRECASTObj <- AddAdjList(PRECASTObj, platform = "Visium") ## Add a model setting in advance for a PRECASTObj object: verbose =TRUE helps outputing the information in the algorithm; coreNum set the how many cores are used in PRECAST. If you run PRECAST for multiple number of clusters, you can set multiple cores; otherwise, set it to 1. PRECASTObj <- AddParSetting(PRECASTObj, Sigma_equal=FALSE, maxIter=30, verbose=TRUE, coreNum =1) ``` ### Fit PRECAST For function `PRECAST`, users can specify the number of clusters $K$ or set `K` to be an integer vector by using modified BIC(MBIC) to determine $K$. For convenience, we give a single K here. ```{r eval = FALSE} ### Given K set.seed(2022) PRECASTObj <- PRECAST(PRECASTObj, K=7) ```
**Other options** Run for multiple K. Here, we set `K=6:9`. ```{r eval =FALSE} ## Reset parameters by increasing cores. PRECASTObj2 <- AddParSetting(PRECASTObj, Sigma_equal=FALSE, maxIter=30, verbose=TRUE, coreNum =2) set.seed(2023) PRECASTObj2 <- PRECAST(PRECASTObj2, K=6:7) resList2 <- PRECASTObj2@resList PRECASTObj2 <- SelectModel(PRECASTObj2) ``` * Note: For parallel compuation based on Rcpp on Linux, users require to use the following system command to set the C_stack unlimited in case of R Error: C stack usage is too close to the limit. `ulimit -s unlimited` Besides, user can also use different initialization method by setting `int.model`, for example, set `int.model=NULL`; see the functions `AddParSetting()` and `model_set()` for more details.
Select a best model and re-organize the results by useing `SelectModel()`. Even though `K` is not a vector, it is also necessary to run `SelectModel()` to re-organize the results in `PRECASTObj`. The selected best K is 7 by using command `str(PRECASTObj@resList)`. ```{r eval = FALSE} ## check the fitted results: there are four list for the fitted results of each K (6:9). str(PRECASTObj@resList) ## backup the fitted results in resList resList <- PRECASTObj@resList # PRECASTObj@resList <- resList PRECASTObj <- SelectModel(PRECASTObj) ## check the best and re-organized results str(PRECASTObj@resList) ## The selected best K is 7 ``` Use ARI to check the performance of clustering: ```{r eval = FALSE} true_cluster <- lapply(PRECASTObj@seulist, function(x) x$true_cluster) str(true_cluster) mclust::adjustedRandIndex(unlist(PRECASTObj@resList$cluster), unlist(true_cluster)) ``` We provide two methods to correct the batch effects in gene expression level. Method (1) is using only PRECAST results to obtain the batch corrected gene expressions if the species of data is unknown or the number of overlapped housekeeping genes between the variable genes in `PRECASTObj@seulist` and the genes in database is less than five. Method (2) is using bouth housekeeping gene and PRECAST results to obtain the batch corrected gene expressions. - Note: to obtain batch corrected gene expressions based on housekeeping genes as the negative control, users must specify the species of data source and use gene symbol names in `PRECASTObj@seulist`. Integrate the two samples by the function `IntegrateSpaData`. Because this is a simulated data, we use Method (1) by setting `species='unknown'`. ```{r eval = FALSE} seuInt <- IntegrateSpaData(PRECASTObj, species='unknown') seuInt ## The low-dimensional embeddings obtained by PRECAST are saved in PRECAST reduction slot. ``` ## Visualization First, user can choose a beautiful color schema using `chooseColors()`. ```{r eval = FALSE} cols_cluster <- chooseColors(palettes_name = 'Nature 10', n_colors = 7, plot_colors = TRUE) ``` Show the spatial scatter plot for clusters ```{r eval = FALSE, fig.height=5, fig.width=7} p12 <- SpaPlot(seuInt, batch=NULL, cols=cols_cluster, point_size=2, combine=TRUE) p12 # users can plot each sample by setting combine=FALSE ``` Users can re-plot the above figures for specific need by returning a ggplot list object. For example, we only plot the spatial heatmap of first two data batches. ```{r eval = FALSE, fig.height=2.6, fig.width=7} pList <- SpaPlot(seuInt, batch=NULL, cols=cols_cluster, point_size=2, combine=FALSE, title_name=NULL) drawFigs(pList[1:2], layout.dim = c(1,2), common.legend = TRUE, legend.position = 'right', align='hv') ``` Show the spatial UMAP/tNSE RGB plot ```{r eval = FALSE, fig.height=5, fig.width=6} seuInt <- AddUMAP(seuInt) SpaPlot(seuInt, batch=NULL,item='RGB_UMAP',point_size=1, combine=TRUE, text_size=15) ## Plot tSNE RGB plot #seuInt <- AddTSNE(seuInt) #SpaPlot(seuInt, batch=NULL,item='RGB_TSNE',point_size=2, combine=T, text_size=15) ``` Show the tSNE plot based on the extracted features from PRECAST to check the performance of integration. ```{r eval = FALSE, fig.height=8, fig.width=6} seuInt <- AddTSNE(seuInt, n_comp = 2) p1 <- dimPlot(seuInt, item='cluster', font_family='serif', cols=cols_cluster) # Times New Roman p2 <- dimPlot(seuInt, item='batch', point_size = 1, font_family='serif') drawFigs(list(p1, p2), common.legend=FALSE, align='hv') # It is noted that only sample batch 1 has cluster 4, and only sample batch 2 has cluster 7. ``` Show the UMAP plot based on the extracted features from PRECAST. ```{r eval = FALSE, fig.height=4, fig.width=6} dimPlot(seuInt, reduction = 'UMAP3', item='cluster', cols=cols_cluster, font_family='serif') ``` Users can also use the visualization functions in Seurat package: ```{r eval = FALSE, fig.height=3, fig.width=8} library(Seurat) p1 <- DimPlot(seuInt[,1: 4226], reduction = 'position', cols=cols_cluster, pt.size =1) # plot the first data batch: first 4226 spots. p2 <- DimPlot(seuInt, reduction = 'tSNE',cols=cols_cluster, pt.size=1) drawFigs(list(p1, p2), layout.dim = c(1,2), common.legend = TRUE) ``` Combined differential expression analysis ```{r eval = FALSE} dat_deg <- FindAllMarkers(seuInt) library(dplyr) n <- 2 dat_deg %>% group_by(cluster) %>% top_n(n = n, wt = avg_log2FC) -> top10 head(top10) ```
**Session Info** ```{r} sessionInfo() ```