ondisc_matrix class
This tutorial shows how to use ondisc_matrix, the core class implemented
by ondisc. An ondisc_matrix is an R object that represents an expression
matrix stored on disk rather than in memory. We cover initialization,
querying basic information, subsetting, and pulling submatrices into
memory. We begin by loading the ondisc package.
library(ondisc)
ondisc ships with several example datasets, stored in the “extdata”
subdirectory of the package.
raw_data_dir <- system.file("extdata", package = "ondisc")
list.files(raw_data_dir)
#> [1] "cell_barcodes.tsv" "gene_expression.mtx" "genes.tsv"
#> [4] "guides.tsv" "perturbation.mtx"
The files “gene_expression.mtx”, “cell_barcodes.tsv”, and “genes.tsv”
together define a gene-by-cell expression matrix. We save the full paths
to these files in the variables mtx_fp, barcodes_fp, and features_fp.
mtx_fp <- paste0(raw_data_dir, "/gene_expression.mtx")
barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
features_fp <- paste0(raw_data_dir, "/genes.tsv")
An ondisc_matrix consists of two parts: an HDF5 (i.e., .h5) file that
stores the expression data on disk in a novel format, and an in-memory
object that allows us to interact with the expression data from within
R. The easiest way to initialize an ondisc_matrix is by calling the
function create_ondisc_matrix_from_mtx. We pass to this function (i) a
file path to the .mtx file storing the expression data, (ii) a file path
to the .tsv file storing the cell barcodes, and (iii) a file path to the
.tsv file storing the feature IDs and human-readable feature names. We
can optionally specify the directory in which to store the initialized
.h5 file; in this tutorial we use the temporary directory.
temp_dir <- tempdir()
exp_mat_list <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
                                              barcodes_fp = barcodes_fp,
                                              features_fp = features_fp,
                                              on_disk_dir = temp_dir)
#> |=========================================================================| 100%
#> Writing CSC data.
#> Writing CSR data.
By default, create_ondisc_matrix_from_mtx returns a list of three
elements: (i) an ondisc_matrix representing the expression data, (ii) a
cell-wise covariate matrix, and (iii) a feature-wise covariate matrix.
The exact cell-wise and feature-wise covariate matrices that are
computed depend on the inputs to create_ondisc_matrix_from_mtx (see the
documentation via ?create_ondisc_matrix_from_mtx for full details). The
advantage of computing the cell-wise and feature-wise covariates at
initialization is that it obviates the need to load the entire dataset
into memory a second time.
expression_mat <- exp_mat_list$ondisc_matrix
head(expression_mat)
#> Showing 5 of 300 featuress and 6 of 900 cells:
#> Loading required package: Matrix
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 3 0 0 0 0 5
#> [2,] 0 2 0 0 0 0
#> [3,] 0 8 0 0 0 0
#> [4,] 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0
cell_covariates <- exp_mat_list$cell_covariates
head(cell_covariates)
#> n_nonzero n_umis p_mito
#> 1 43 214 0.04672897
#> 2 26 169 0.00000000
#> 3 22 116 0.05172414
#> 4 37 258 0.08139535
#> 5 36 224 0.08035714
#> 6 31 147 0.07482993
feature_covariates <- exp_mat_list$feature_covariates
head(feature_covariates)
#> mean_expression coef_of_variation n_nonzero
#> 1 0.7577778 2.981871 114
#> 2 0.5977778 3.302883 96
#> 3 0.5788889 3.539932 85
#> 4 0.6533333 3.341677 91
#> 5 0.5522222 3.578487 82
#> 6 0.5455556 3.541223 84
The initialized HDF5 file is named ondisc_matrix_1.h5 and is located in
the temporary directory.
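To confirm where the backing file landed, we can check for it directly. The following is a small sketch that continues the session above (it relies on temp_dir from the earlier call) and uses only base R; the variable names h5_fp, h5_exists, and h5_size are ours, introduced for illustration.

```r
# Continues the session above: temp_dir is the directory passed as on_disk_dir.
# h5_fp, h5_exists, and h5_size are names introduced here for illustration.
h5_fp <- file.path(temp_dir, "ondisc_matrix_1.h5")

# TRUE if the HDF5 file was written where the text says it was.
h5_exists <- file.exists(h5_fp)

# Size of the on-disk file, in bytes.
h5_size <- file.size(h5_fp)
```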
A strength of create_ondisc_matrix_from_mtx is that it does not assume
that the entire expression matrix fits into memory. The optional
argument n_lines_per_chunk can be used to specify the number of lines to
read from the .mtx file at a time. Additionally,
create_ondisc_matrix_from_mtx is fast: the novel algorithm that
underlies this function is efficient and implemented in C++. Typically,
create_ondisc_matrix_from_mtx takes about 4-8 minutes per GB to run.
Finally, create_ondisc_matrix_from_mtx needs to be run only once for a
given dataset, even across R sessions.
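As a hedged sketch of how the chunk size might be set, the call below repeats the initialization with n_lines_per_chunk supplied. It continues the session above; the output directory chunk_dir and the value 1000 are arbitrary illustrative choices rather than recommendations, and we write to a separate directory so that the file created earlier is left untouched.

```r
# Continues the session above (mtx_fp, barcodes_fp, features_fp already defined).
# chunk_dir and the value 1000 are illustrative assumptions, not recommendations.
chunk_dir <- file.path(tempdir(), "chunked_example")
dir.create(chunk_dir, showWarnings = FALSE)

# Read the .mtx file 1000 lines at a time instead of using the default chunk size.
exp_mat_list_chunked <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
                                                      barcodes_fp = barcodes_fp,
                                                      features_fp = features_fp,
                                                      on_disk_dir = chunk_dir,
                                                      n_lines_per_chunk = 1000)
```

A smaller chunk should lower peak memory use at the cost of more read passes; the right value depends on the machine and the dataset.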
We can use the functions get_feature_ids, get_feature_names, and
get_cell_barcodes to obtain the feature IDs, feature names (if
applicable), and cell barcodes, respectively, of an ondisc_matrix.
feature_ids <- get_feature_ids(expression_mat)
feature_names <- get_feature_names(expression_mat)
cell_barcodes <- get_cell_barcodes(expression_mat)
head(feature_ids)
#> [1] "ENSG00000198060" "ENSG00000237832" "ENSG00000267543" "ENSG00000103460"
#> [5] "ENSG00000229637" "ENSG00000174990"
head(feature_names)
#> [1] "MARCH5" "AL138808.1" "AC015802.3" "TOX3" "PRAC2"
#> [6] "CA5A"
head(cell_barcodes)
#> [1] "GCTTTCGTCTAGACCA-1" "ACGGTCGTCGTTAGAC-1" "TTTACGTTCACCTCGT-1"
#> [4] "TGGATCATCCTTCAGC-1" "ACAGGGAAGACGCCCT-1" "ACCTACCAGTGTTCCA-1"
Additionally, we can use dim, nrow, and ncol to obtain the dimension,
number of rows (i.e., number of features), and number of columns (i.e.,
number of cells) of an ondisc_matrix.
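A minimal sketch of these three calls, continuing the session above. The variable names on the left-hand side are ours; given the 300 features and 900 cells reported by head earlier, we would expect the dimensions to reflect those counts.

```r
# Continues the session above: expression_mat is the ondisc_matrix created earlier.
# dims, n_features, and n_cells are names introduced here for illustration.
dims <- dim(expression_mat)         # c(number of features, number of cells)
n_features <- nrow(expression_mat)  # rows correspond to features
n_cells <- ncol(expression_mat)     # columns correspond to cells
```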
We can subset an ondisc_matrix to obtain a new ondisc_matrix that is a
submatrix of the original. To subset an ondisc_matrix, apply the [
operator and pass a numeric, logical, or character vector indicating the
cells or features to keep. Character vectors are assumed to refer to
feature IDs (for rows) and cell barcodes (for columns).
# numeric vector examples
# keep genes 100-110
x <- expression_mat[100:110,]
# keep all cells except 10 and 20
x <- expression_mat[,-c(10,20)]
# keep genes 50-100 and 200-250 and cells 300-500
x <- expression_mat[c(50:100, 200:250), 300:500]
# character vector examples
# keep genes ENSG00000107581, ENSG00000286857, and ENSG00000266371
x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),]
# keep cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1
x <- expression_mat[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")]
# logical vector example
# keep all genes except ENSG00000237832 and ENSG00000229637
x <- expression_mat[!(get_feature_ids(expression_mat)
%in% c("ENSG00000237832", "ENSG00000229637")),]
Subsetting an ondisc_matrix leaves the original object unchanged. This
important property, called object persistence, makes programming with
ondisc_matrix objects intuitive. The underlying HDF5 file is not copied
upon subset; instead, information is shared across ondisc_matrix
objects, which makes subsetting fast.
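Object persistence can be checked directly. The sketch below continues the session above and only uses dim and the [ operator described earlier; the index ranges and the name dim_before are arbitrary choices for illustration.

```r
# Continues the session above: expression_mat is the ondisc_matrix created earlier.
# dim_before is a name introduced here for illustration.
dim_before <- dim(expression_mat)

# Subsetting returns a new ondisc_matrix (10 features, 20 cells) ...
x <- expression_mat[1:10, 1:20]
dim(x)

# ... while the original still has the dimensions it had before.
identical(dim(expression_mat), dim_before)
```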
We can pull a submatrix of an ondisc_matrix into memory, allowing us to
perform computations on a subset of the data. To pull a submatrix into
memory, use the [[ operator, passing a numeric, character, or logical
vector indicating the cells or features to access. The data structure
that underlies an ondisc_matrix enables fast access to both rows and
columns of the matrix.
# numeric vector examples
# pull gene 6
m <- expression_mat[[6,]]
# pull cells 200 - 250
m <- expression_mat[[,200:250]]
# pull genes 50 - 100 and cells 200 - 250
m <- expression_mat[[50:100, 200:250]]
# character vector examples
# pull genes ENSG00000107581 and ENSG00000286857
m <- expression_mat[[c("ENSG00000107581", "ENSG00000286857"),]]
# pull cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1
m <- expression_mat[[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")]]
# logical vector examples
# subset the matrix, keeping genes ENSG00000107581, ENSG00000286857, and ENSG00000266371
x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),]
# pull all genes except ENSG00000107581
m <- x[[get_feature_ids(x) != "ENSG00000107581",]]
The last example demonstrates that we can pull a submatrix of an
ondisc_matrix into memory after having subset the matrix.
One can remember the difference between [ and [[ by recalling R lists:
[ is used to subset a list, and [[ is used to access elements stored
within a list. Similarly, [ is used to subset an ondisc_matrix, and [[
is used to access a submatrix stored within an ondisc_matrix.
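For readers who want a refresher on the list behavior this analogy rests on, here is a short base-R illustration. It uses an ordinary list (not an ondisc_matrix), and the names l, sub, and el are ours.

```r
# An ordinary R list with three named elements.
l <- list(a = 1:3, b = "x", c = TRUE)

# `[` subsets the list: the result is still a list, here with two elements.
sub <- l[c("a", "b")]
class(sub)    # "list"
length(sub)   # 2

# `[[` extracts the stored element itself: here, the integer vector 1:3.
el <- l[["a"]]
class(el)     # "integer"
```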
ondisc_matrix
As discussed previously, there are two components to an ondisc_matrix:
the HDF5 file stored on disk, and the R object stored in memory. The
latter contains a file path to the former, allowing us to interact with
the expression data from within R.
To save an ondisc_matrix, simply call saveRDS on the ondisc_matrix R
object to create an .rds file.
saveRDS(object = expression_mat, file = paste0(temp_dir, "/expression_matrix.rds"))
rm(expression_mat)
We can then load the ondisc_matrix by calling readRDS on the .rds file.
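A minimal sketch of the reload step, continuing the session above and reusing the path passed to saveRDS. Because the R object stores a file path to the HDF5 file rather than the data itself, we would expect the reloaded object to work only while the .h5 file remains at that location.

```r
# Continues the session above: temp_dir and the .rds file were created earlier.
# Restore the in-memory ondisc_matrix object that was removed with rm().
expression_mat <- readRDS(file = paste0(temp_dir, "/expression_matrix.rds"))
```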
We can also use the constructor of the ondisc_matrix class to create an
ondisc_matrix from an already-initialized HDF5 file.
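The sketch below shows what such a call might look like. It assumes that the constructor is the function ondisc_matrix and that it accepts the path to the .h5 file as its first argument; please confirm the exact signature via ?ondisc_matrix. The file name and directory are those reported earlier in this tutorial, and h5_file is a name we introduce here.

```r
# Continues the session above: temp_dir holds the HDF5 file created earlier.
# Assumption: ondisc_matrix() takes the .h5 path as its first argument.
h5_file <- paste0(temp_dir, "/ondisc_matrix_1.h5")
expression_mat <- ondisc_matrix(h5_file)
```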