--- title: "Tutorial 2: Using `metadata_ondisc_matrix` and `multimodal_ondisc_matrix`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Tutorial 2: Using `metadata_ondisc_matrix` and `multimodal_ondisc_matrix`} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Tutorial 1 covered `ondisc_matrix`, the core class implemented by `ondisc`. Thus tutorial covers `metadata_ondisc_matrix` and `multimodal_ondisc_matrix`, two additional classes provided by the package. `metadata_ondisc_matrix` stores cell-specific and feature-specific covariate matrices alongside the expression matrix, and `multimodal_ondisc_matrix` stores multiple `metadata_ondisc_matrices` representing different cellular modalities. Together, `metadata_ondisc_matrix` and `multimodal_ondisc_matrix` facilitate feature selection, quality control, and other common single-cell data preprocessing tasks. We begin by loading the package. ```{r setup} library(ondisc) ``` # The `metadata_ondisc_matrix` class A `metadata_ondisc_matrix` object consists of three components: (i) an `ondisc_matrix` representing the expression data, (ii) a data frame storing the cell-specific covariates, and (iii) a data frame storing the feature-specific covariates. The easiest way to initialize a `metadata_ondisc_matrix` is by calling `create_ondisc_matrix_from_mtx` on an mtx file and associated metadata files, setting the optional parameter `return_metadata_ondisc_matrix` to `TRUE`. Below, we reproduce the example from Tutorial 1, this time returning a `metadata_ondisc_matrix` instead of a list. ```{r} # Set paths to the .mtx and .tsv files raw_data_dir <- system.file("extdata", package = "ondisc") mtx_fp <- paste0(raw_data_dir, "/gene_expression.mtx") barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv") features_fp <- paste0(raw_data_dir, "/genes.tsv") # Specify directory in which to store the .h5 file temp_dir <- tempdir() # Initialize metadata_ondisc_matrix expressions <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp, barcodes_fp = barcodes_fp, features_fp = features_fp, on_disk_dir = temp_dir, return_metadata_ondisc_matrix = TRUE) ``` The variable `expressions` is an object of class `metadata_ondisc_matrix`; `expressions` contains the fields `ondisc_matrix`, `cell_covariates`, and `feature_covariates`. ```{r} # Print the variable expressions ``` We alternately can initialize a `metadata_ondisc_matrix` by calling the constructor function of the `metadata_ondisc_matrix` class; see documentation (via ?metadata_ondisc_matrix) for details. # The `multimodal_ondisc_matrix` class The `multimodal_ondisc_matrix` class is used to represent multimodal data. `multimodal_ondisc_matrix` objects have two fields: (i) a named list of `metadata_ondisc_matrices` representing different modalities, and (ii) a global (i.e., cross-modality) cell-specific covariate matrix. The `ondisc` package ships with example CRISPR perturbation data, which we use to initialize a new perturbation modality via a call to `create_ondisc_matrix_from_mtx`. ```{r} # Set paths to the perturbation .mtx and .tsv files mtx_fp <- paste0(raw_data_dir, "/perturbation.mtx") barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv") features_fp <- paste0(raw_data_dir, "/guides.tsv") # Initialize metadata_ondisc_matrix perturbations <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp, barcodes_fp = barcodes_fp, features_fp = features_fp, on_disk_dir = temp_dir, return_metadata_ondisc_matrix = TRUE) ``` Like `expressions`, the variable `perturbations` is an object of class `metadata_ondisc_matrix`. However, because `perturbations` represents logical perturabtion data instead of integer gene expression data, the cell-specific and feature-specific covariates of `perturbations` differ from those of `expressions`. ```{r} # These matrices have different columns head(expressions@cell_covariates) head(perturbations@cell_covariates) ``` The `expressions` and `perturbations` data are multimodal -- they are collected from the same set of cells. We can create a `multimodal_ondisc_matrix` by passing a named list of `metadata_ondisc_matrix` objects -- in this case, `expressions` and `perturbations` -- to the constructor function of the `multimodal_ondisc_matrix` class. ```{r} modality_list <- list(expressions = expressions, perturbations = perturbations) crispr_experiment <- multimodal_ondisc_matrix(modality_list) ``` The variable `crispr_experiment` is an object of class `multimodal_ondisc_matrix`. The column names of the global covariate matrix are derived from the names of the modalities. ```{r} # print variable crispr_experiment # show the global covariate matrix head(crispr_experiment@global_cell_covariates) ``` The figure below summarizes the relationship between `ondisc_matrix`, `metadata_ondisc_matrix`, and `multimodal_ondisc_matrix`. ```{r classes, echo=FALSE, fig.cap="**Figure**: Classes provided by the package. a) `ondisc_matrix`, b) `metadata_ondisc_matrix`, c) `multimodal_ondisc_matrix` ", out.width = '55%'} knitr::include_graphics("classes_cropped.jpg") ``` # Querying basic information We can use the functions `get_feature_ids`, `get_feature_names`, and `get_cell_barcodes` to obtain the feature IDs, feature names (if applicable), and cell barcodes, respectively, of a `metadata_ondisc_matrix` or a `multimodal_ondisc_matrix`. `get_feature_ids` and `get_feature_names` return a list when called on a `multimodal_ondisc_matrix`, as the different modalities contain different features. ```{r} # metadata_ondisc_matrix cell_barcodes <- get_cell_barcodes(expressions) feature_ids <- get_feature_ids(expressions) feature_names <- get_feature_names(expressions) # multimodal_ondisc_matrix cell_barcodes <- get_cell_barcodes(crispr_experiment) feature_ids <- get_feature_ids(crispr_experiment) ``` We likewise can use `dim`, `nrow`, and `ncol` to query the dimension, number of rows, and number of columns of a `metadata_ondisc_matrix` or `multimodal_ondisc_matrix`. `dim` and `nrow` again return lists when called on a `multimodal_ondisc_matrix`. ```{r} # metadata_ondisc_matrix dim(expressions) nrow(expressions) ncol(expressions) # multimodal_ondisc_matrix dim(crispr_experiment) nrow(crispr_experiment) ncol(crispr_experiment) ``` # Subsetting Similar to `ondisc_matrices`, `metadata_ondisc_matrices` and `multimodal_ondisc_matrices` can be subset using the `[` operator. `metadata_ondisc_matrices` can be subset either by feature or cell, while `multimodal_ondisc_matrices` can be subset by cell only. ```{r} # metadata_ondisc_matrix # keep cells 100 - 150 expressions_sub <- expressions[,100:150] # keep genes ENSG00000188305, ENSG00000257284, ENSG00000251655 expressions_sub <- expressions[c("ENSG00000188305", "ENSG00000257284", "ENSG00000251655"),] # multimodal_ondisc_matrix # keep all cells except 1 - 100 crispr_experiment_sub <- crispr_experiment[,-c(1:100)] ``` As with `ondisc_matrices`, the original objects remain unchanged. ```{r} expressions crispr_experiment ``` # Notes and tips - To save and load a `metadata_ondisc_matrix` or `multimodal_ondisc_matrix`, simply use the functions `saveRDS` and `readRDS`, respectively. - It is not possible to apply the double-bracket "pull submatrix into memory" operator to a `metadata_ondisc_matrix` or `multimodal_ondisc_matrix`. To apply the double-bracket operator to an `ondisc_matrix` stored *inside* a `metadata_ondisc_matrix` or `multimodal_ondisc_matrix`, simply access the desired `ondisc_matrix` first. - The `metadata_ondisc_matrix` and `multimodal_ondisc_matrix` classes are inspired by analogous classes in the `SingleCellExperiment` and `Seurat` packages. However, unlike `SingleCellExperiment` and `Seurat`, `ondisc` does not (currently) support storing the dense matrix of normalized UMI counts. We view this as a feature: storing the normalized expression matrix is almost never necessary and is hugely expensive (in space and time) for large-scale data.