Tutorial 2: Using metadata_ondisc_matrix and multimodal_ondisc_matrix

Tutorial 1 covered ondisc_matrix, the core class implemented by ondisc. Thus tutorial covers metadata_ondisc_matrix and multimodal_ondisc_matrix, two additional classes provided by the package. metadata_ondisc_matrix stores cell-specific and feature-specific covariate matrices alongside the expression matrix, and multimodal_ondisc_matrix stores multiple metadata_ondisc_matrices representing different cellular modalities. Together, metadata_ondisc_matrix and multimodal_ondisc_matrix facilitate feature selection, quality control, and other common single-cell data preprocessing tasks.

We begin by loading the package.

library(ondisc)

The metadata_ondisc_matrix class

A metadata_ondisc_matrix object consists of three components: (i) an ondisc_matrix representing the expression data, (ii) a data frame storing the cell-specific covariates, and (iii) a data frame storing the feature-specific covariates. The easiest way to initialize a metadata_ondisc_matrix is by calling create_ondisc_matrix_from_mtx on an mtx file and associated metadata files, setting the optional parameter return_metadata_ondisc_matrix to TRUE. Below, we reproduce the example from Tutorial 1, this time returning a metadata_ondisc_matrix instead of a list.

# Set paths to the .mtx and .tsv files
raw_data_dir <- system.file("extdata", package = "ondisc")
mtx_fp <- paste0(raw_data_dir, "/gene_expression.mtx")
barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
features_fp <- paste0(raw_data_dir, "/genes.tsv")

# Specify directory in which to store the .h5 file
temp_dir <- tempdir()

# Initialize metadata_ondisc_matrix
expressions <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
                                              barcodes_fp = barcodes_fp,
                                              features_fp = features_fp,
                                              on_disk_dir = temp_dir,
                                              return_metadata_ondisc_matrix = TRUE)
#> |========                                                                 | 11%|=================                                                        | 23%|==========================                                               | 36%|====================================                                     | 48%|=============================================                            | 61%|======================================================                   | 73%|===============================================================          | 86%|=========================================================================| 98%|=========================================================================| 100%
#> |========                                                                 | 11%|=================                                                        | 23%|==========================                                               | 36%|====================================                                     | 48%|=============================================                            | 61%|======================================================                   | 73%|===============================================================          | 86%|=========================================================================| 98%|=========================================================================| 100%
#> Writing CSC data.
#> Writing CSR data.

The variable expressions is an object of class metadata_ondisc_matrix; expressions contains the fields ondisc_matrix, cell_covariates, and feature_covariates.

# Print the variable
expressions
#> A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 300 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#>  A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.

We alternately can initialize a metadata_ondisc_matrix by calling the constructor function of the metadata_ondisc_matrix class; see documentation (via ?metadata_ondisc_matrix) for details.

The multimodal_ondisc_matrix class

The multimodal_ondisc_matrix class is used to represent multimodal data. multimodal_ondisc_matrix objects have two fields: (i) a named list of metadata_ondisc_matrices representing different modalities, and (ii) a global (i.e., cross-modality) cell-specific covariate matrix. The ondisc package ships with example CRISPR perturbation data, which we use to initialize a new perturbation modality via a call to create_ondisc_matrix_from_mtx.

# Set paths to the perturbation .mtx and .tsv files
mtx_fp <- paste0(raw_data_dir, "/perturbation.mtx")
barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
features_fp <- paste0(raw_data_dir, "/guides.tsv")

# Initialize metadata_ondisc_matrix
perturbations <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
                                               barcodes_fp = barcodes_fp,
                                               features_fp = features_fp,
                                               on_disk_dir = temp_dir,
                                               return_metadata_ondisc_matrix = TRUE)
#> 
#> Writing CSC data.
#> Writing CSR data.

Like expressions, the variable perturbations is an object of class metadata_ondisc_matrix. However, because perturbations represents logical perturabtion data instead of integer gene expression data, the cell-specific and feature-specific covariates of perturbations differ from those of expressions.

# These matrices have different columns
head(expressions@cell_covariates)
#>   n_nonzero n_umis     p_mito
#> 1        43    214 0.04672897
#> 2        26    169 0.00000000
#> 3        22    116 0.05172414
#> 4        37    258 0.08139535
#> 5        36    224 0.08035714
#> 6        31    147 0.07482993
head(perturbations@cell_covariates)
#>   n_nonzero
#> 1        21
#> 2        11
#> 3        17
#> 4        14
#> 5         7
#> 6        11

The expressions and perturbations data are multimodal – they are collected from the same set of cells. We can create a multimodal_ondisc_matrix by passing a named list of metadata_ondisc_matrix objects – in this case, expressions and perturbations – to the constructor function of the multimodal_ondisc_matrix class.

modality_list <- list(expressions = expressions, perturbations = perturbations)
crispr_experiment <- multimodal_ondisc_matrix(modality_list)

The variable crispr_experiment is an object of class multimodal_ondisc_matrix. The column names of the global covariate matrix are derived from the names of the modalities.

# print variable
crispr_experiment
#> A multimodal_ondisc_matrix with the following modalities:
#> 1. expressions A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 300 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#>  A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
#> 2. perturbations A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 250 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero.
#>  A feature covariate matrix with columns n_nonzero.
#> Plus, a global cell-covariate matrix containing 4 covariates and 900 cells.

# show the global covariate matrix
head(crispr_experiment@global_cell_covariates)
#>   expressions_n_nonzero expressions_n_umis expressions_p_mito
#> 1                    43                214         0.04672897
#> 2                    26                169         0.00000000
#> 3                    22                116         0.05172414
#> 4                    37                258         0.08139535
#> 5                    36                224         0.08035714
#> 6                    31                147         0.07482993
#>   perturbations_n_nonzero
#> 1                      21
#> 2                      11
#> 3                      17
#> 4                      14
#> 5                       7
#> 6                      11

The figure below summarizes the relationship between ondisc_matrix, metadata_ondisc_matrix, and multimodal_ondisc_matrix.

**Figure**: Classes provided by the package. a) `ondisc_matrix`, b) `metadata_ondisc_matrix`, c) `multimodal_ondisc_matrix`

Figure: Classes provided by the package. a) ondisc_matrix, b) metadata_ondisc_matrix, c) multimodal_ondisc_matrix

Querying basic information

We can use the functions get_feature_ids, get_feature_names, and get_cell_barcodes to obtain the feature IDs, feature names (if applicable), and cell barcodes, respectively, of a metadata_ondisc_matrix or a multimodal_ondisc_matrix. get_feature_ids and get_feature_names return a list when called on a multimodal_ondisc_matrix, as the different modalities contain different features.

# metadata_ondisc_matrix
cell_barcodes <- get_cell_barcodes(expressions)
feature_ids <- get_feature_ids(expressions)
feature_names <- get_feature_names(expressions)

# multimodal_ondisc_matrix
cell_barcodes <- get_cell_barcodes(crispr_experiment)
feature_ids <- get_feature_ids(crispr_experiment)

We likewise can use dim, nrow, and ncol to query the dimension, number of rows, and number of columns of a metadata_ondisc_matrix or multimodal_ondisc_matrix. dim and nrow again return lists when called on a multimodal_ondisc_matrix.

# metadata_ondisc_matrix
dim(expressions)
#> [1] 300 900
nrow(expressions)
#> [1] 300
ncol(expressions)
#> [1] 900

# multimodal_ondisc_matrix
dim(crispr_experiment)
#> $expressions
#> [1] 300 900
#> 
#> $perturbations
#> [1] 250 900
nrow(crispr_experiment)
#> $expressions
#> [1] 300
#> 
#> $perturbations
#> [1] 250
ncol(crispr_experiment)
#> [1] 900

Subsetting

Similar to ondisc_matrices, metadata_ondisc_matrices and multimodal_ondisc_matrices can be subset using the [ operator. metadata_ondisc_matrices can be subset either by feature or cell, while multimodal_ondisc_matrices can be subset by cell only.

# metadata_ondisc_matrix
# keep cells 100 - 150
expressions_sub <- expressions[,100:150]
# keep genes ENSG00000188305, ENSG00000257284, ENSG00000251655
expressions_sub <- expressions[c("ENSG00000188305", "ENSG00000257284", "ENSG00000251655"),]

# multimodal_ondisc_matrix
# keep all cells except 1 - 100
crispr_experiment_sub <- crispr_experiment[,-c(1:100)]

As with ondisc_matrices, the original objects remain unchanged.

expressions
#> A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 300 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#>  A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
crispr_experiment
#> A multimodal_ondisc_matrix with the following modalities:
#> 1. expressions A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 300 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#>  A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
#> 2. perturbations A metadata_ondisc_matrix with the following components:
#>  An ondisc_matrix with 250 features and 900 cells.
#>  A cell covariate matrix with columns n_nonzero.
#>  A feature covariate matrix with columns n_nonzero.
#> Plus, a global cell-covariate matrix containing 4 covariates and 900 cells.

Notes and tips

  • To save and load a metadata_ondisc_matrix or multimodal_ondisc_matrix, simply use the functions saveRDS and readRDS, respectively.

  • It is not possible to apply the double-bracket “pull submatrix into memory” operator to a metadata_ondisc_matrix or multimodal_ondisc_matrix. To apply the double-bracket operator to an ondisc_matrix stored inside a metadata_ondisc_matrix or multimodal_ondisc_matrix, simply access the desired ondisc_matrix first.

  • The metadata_ondisc_matrix and multimodal_ondisc_matrix classes are inspired by analogous classes in the SingleCellExperiment and Seurat packages. However, unlike SingleCellExperiment and Seurat, ondisc does not (currently) support storing the dense matrix of normalized UMI counts. We view this as a feature: storing the normalized expression matrix is almost never necessary and is hugely expensive (in space and time) for large-scale data.