metadata_ondisc_matrix
and
multimodal_ondisc_matrix
Tutorial 1 covered
ondisc_matrix
, the core class implemented by
ondisc
. Thus tutorial covers
metadata_ondisc_matrix
and
multimodal_ondisc_matrix
, two additional classes provided
by the package. metadata_ondisc_matrix
stores cell-specific
and feature-specific covariate matrices alongside the expression matrix,
and multimodal_ondisc_matrix
stores multiple
metadata_ondisc_matrices
representing different cellular
modalities. Together, metadata_ondisc_matrix
and
multimodal_ondisc_matrix
facilitate feature selection,
quality control, and other common single-cell data preprocessing
tasks.
We begin by loading the package.
metadata_ondisc_matrix
classA metadata_ondisc_matrix
object consists of three
components: (i) an ondisc_matrix
representing the
expression data, (ii) a data frame storing the cell-specific covariates,
and (iii) a data frame storing the feature-specific covariates. The
easiest way to initialize a metadata_ondisc_matrix
is by
calling create_ondisc_matrix_from_mtx
on an mtx file and
associated metadata files, setting the optional parameter
return_metadata_ondisc_matrix
to TRUE
. Below,
we reproduce the example from Tutorial 1, this time returning a
metadata_ondisc_matrix
instead of a list.
# Set paths to the .mtx and .tsv files
raw_data_dir <- system.file("extdata", package = "ondisc")
mtx_fp <- paste0(raw_data_dir, "/gene_expression.mtx")
barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
features_fp <- paste0(raw_data_dir, "/genes.tsv")
# Specify directory in which to store the .h5 file
temp_dir <- tempdir()
# Initialize metadata_ondisc_matrix
expressions <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
barcodes_fp = barcodes_fp,
features_fp = features_fp,
on_disk_dir = temp_dir,
return_metadata_ondisc_matrix = TRUE)
#> |======== | 11%|================= | 23%|========================== | 36%|==================================== | 48%|============================================= | 61%|====================================================== | 73%|=============================================================== | 86%|=========================================================================| 98%|=========================================================================| 100%
#> |======== | 11%|================= | 23%|========================== | 36%|==================================== | 48%|============================================= | 61%|====================================================== | 73%|=============================================================== | 86%|=========================================================================| 98%|=========================================================================| 100%
#> Writing CSC data.
#> Writing CSR data.
The variable expressions
is an object of class
metadata_ondisc_matrix
; expressions
contains
the fields ondisc_matrix
, cell_covariates
, and
feature_covariates
.
# Print the variable
expressions
#> A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 300 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#> A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
We alternately can initialize a metadata_ondisc_matrix
by calling the constructor function of the
metadata_ondisc_matrix
class; see documentation (via
?metadata_ondisc_matrix) for details.
multimodal_ondisc_matrix
classThe multimodal_ondisc_matrix
class is used to represent
multimodal data. multimodal_ondisc_matrix
objects have two
fields: (i) a named list of metadata_ondisc_matrices
representing different modalities, and (ii) a global (i.e.,
cross-modality) cell-specific covariate matrix. The ondisc
package ships with example CRISPR perturbation data, which we use to
initialize a new perturbation modality via a call to
create_ondisc_matrix_from_mtx
.
# Set paths to the perturbation .mtx and .tsv files
mtx_fp <- paste0(raw_data_dir, "/perturbation.mtx")
barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
features_fp <- paste0(raw_data_dir, "/guides.tsv")
# Initialize metadata_ondisc_matrix
perturbations <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
barcodes_fp = barcodes_fp,
features_fp = features_fp,
on_disk_dir = temp_dir,
return_metadata_ondisc_matrix = TRUE)
#>
#> Writing CSC data.
#> Writing CSR data.
Like expressions
, the variable
perturbations
is an object of class
metadata_ondisc_matrix
. However, because
perturbations
represents logical perturabtion data instead
of integer gene expression data, the cell-specific and feature-specific
covariates of perturbations
differ from those of
expressions
.
# These matrices have different columns
head(expressions@cell_covariates)
#> n_nonzero n_umis p_mito
#> 1 43 214 0.04672897
#> 2 26 169 0.00000000
#> 3 22 116 0.05172414
#> 4 37 258 0.08139535
#> 5 36 224 0.08035714
#> 6 31 147 0.07482993
head(perturbations@cell_covariates)
#> n_nonzero
#> 1 21
#> 2 11
#> 3 17
#> 4 14
#> 5 7
#> 6 11
The expressions
and perturbations
data are
multimodal – they are collected from the same set of cells. We can
create a multimodal_ondisc_matrix
by passing a named list
of metadata_ondisc_matrix
objects – in this case,
expressions
and perturbations
– to the
constructor function of the multimodal_ondisc_matrix
class.
modality_list <- list(expressions = expressions, perturbations = perturbations)
crispr_experiment <- multimodal_ondisc_matrix(modality_list)
The variable crispr_experiment
is an object of class
multimodal_ondisc_matrix
. The column names of the global
covariate matrix are derived from the names of the modalities.
# print variable
crispr_experiment
#> A multimodal_ondisc_matrix with the following modalities:
#> 1. expressions A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 300 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#> A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
#> 2. perturbations A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 250 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero.
#> A feature covariate matrix with columns n_nonzero.
#> Plus, a global cell-covariate matrix containing 4 covariates and 900 cells.
# show the global covariate matrix
head(crispr_experiment@global_cell_covariates)
#> expressions_n_nonzero expressions_n_umis expressions_p_mito
#> 1 43 214 0.04672897
#> 2 26 169 0.00000000
#> 3 22 116 0.05172414
#> 4 37 258 0.08139535
#> 5 36 224 0.08035714
#> 6 31 147 0.07482993
#> perturbations_n_nonzero
#> 1 21
#> 2 11
#> 3 17
#> 4 14
#> 5 7
#> 6 11
The figure below summarizes the relationship between
ondisc_matrix
, metadata_ondisc_matrix
, and
multimodal_ondisc_matrix
.
We can use the functions get_feature_ids
,
get_feature_names
, and get_cell_barcodes
to
obtain the feature IDs, feature names (if applicable), and cell
barcodes, respectively, of a metadata_ondisc_matrix
or a
multimodal_ondisc_matrix
. get_feature_ids
and
get_feature_names
return a list when called on a
multimodal_ondisc_matrix
, as the different modalities
contain different features.
# metadata_ondisc_matrix
cell_barcodes <- get_cell_barcodes(expressions)
feature_ids <- get_feature_ids(expressions)
feature_names <- get_feature_names(expressions)
# multimodal_ondisc_matrix
cell_barcodes <- get_cell_barcodes(crispr_experiment)
feature_ids <- get_feature_ids(crispr_experiment)
We likewise can use dim
, nrow
, and
ncol
to query the dimension, number of rows, and number of
columns of a metadata_ondisc_matrix
or
multimodal_ondisc_matrix
. dim
and
nrow
again return lists when called on a
multimodal_ondisc_matrix
.
# metadata_ondisc_matrix
dim(expressions)
#> [1] 300 900
nrow(expressions)
#> [1] 300
ncol(expressions)
#> [1] 900
# multimodal_ondisc_matrix
dim(crispr_experiment)
#> $expressions
#> [1] 300 900
#>
#> $perturbations
#> [1] 250 900
nrow(crispr_experiment)
#> $expressions
#> [1] 300
#>
#> $perturbations
#> [1] 250
ncol(crispr_experiment)
#> [1] 900
Similar to ondisc_matrices
,
metadata_ondisc_matrices
and
multimodal_ondisc_matrices
can be subset using the
[
operator. metadata_ondisc_matrices
can be
subset either by feature or cell, while
multimodal_ondisc_matrices
can be subset by cell only.
# metadata_ondisc_matrix
# keep cells 100 - 150
expressions_sub <- expressions[,100:150]
# keep genes ENSG00000188305, ENSG00000257284, ENSG00000251655
expressions_sub <- expressions[c("ENSG00000188305", "ENSG00000257284", "ENSG00000251655"),]
# multimodal_ondisc_matrix
# keep all cells except 1 - 100
crispr_experiment_sub <- crispr_experiment[,-c(1:100)]
As with ondisc_matrices
, the original objects remain
unchanged.
expressions
#> A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 300 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#> A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
crispr_experiment
#> A multimodal_ondisc_matrix with the following modalities:
#> 1. expressions A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 300 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero, n_umis, p_mito.
#> A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
#> 2. perturbations A metadata_ondisc_matrix with the following components:
#> An ondisc_matrix with 250 features and 900 cells.
#> A cell covariate matrix with columns n_nonzero.
#> A feature covariate matrix with columns n_nonzero.
#> Plus, a global cell-covariate matrix containing 4 covariates and 900 cells.
To save and load a metadata_ondisc_matrix
or
multimodal_ondisc_matrix
, simply use the functions
saveRDS
and readRDS
, respectively.
It is not possible to apply the double-bracket “pull submatrix
into memory” operator to a metadata_ondisc_matrix
or
multimodal_ondisc_matrix
. To apply the double-bracket
operator to an ondisc_matrix
stored inside a
metadata_ondisc_matrix
or
multimodal_ondisc_matrix
, simply access the desired
ondisc_matrix
first.
The metadata_ondisc_matrix
and
multimodal_ondisc_matrix
classes are inspired by analogous
classes in the SingleCellExperiment
and Seurat
packages. However, unlike SingleCellExperiment
and
Seurat
, ondisc
does not (currently) support
storing the dense matrix of normalized UMI counts. We view this as a
feature: storing the normalized expression matrix is almost never
necessary and is hugely expensive (in space and time) for large-scale
data.