--- title: "Tutorial 1: Using the `ondisc_matrix` class" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Tutorial 1: Using the `ondisc_matrix` class} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This tutorial shows how to use `ondisc_matrix`, the core class implemented by `ondisc`. An `ondisc_matrix` is an R object that represents an expression matrix stored on-disk rather than in-memory. We cover the topics of initialization, querying basic information, subsetting, and pulling submatrices into memory. We begin by loading the `ondisc` package. ```{r setup} library(ondisc) ``` # Initialization `ondisc` ships with several example datasets, stored in the "extdata" subdirectory of the package. ```{r} raw_data_dir <- system.file("extdata", package = "ondisc") list.files(raw_data_dir) ``` The files "gene_expression.mtx", "cell_barcodes.tsv," and "genes.tsv" together define a gene-by-cell expression matrix. We save the full paths to these files in the variables `mtx_fp`, `barcodes_fp`, and `features_fp`. ```{r} mtx_fp <- paste0(raw_data_dir, "/gene_expression.mtx") barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv") features_fp <- paste0(raw_data_dir, "/genes.tsv") ``` An `ondisc_matrix` consists of two parts: an HDF5 (i.e., .h5) file that stores the expression data on-disk in a novel format, and an in-memory object that allows us to interact with the expression data from within R. The easiest way to initialize an `ondisc_matrix` is by calling the function `create_ondisc_matrix_from_mtx`. We pass to this function (i) a file path to the .mtx file storing the expression data, (ii) a file path to the .tsv file storing the cell barcodes, and (iii) a file path to the .tsv file storing the feature IDs and human-readable feature names. We optionally can specify the directory in which to store the initialized .h5 file, which in this tutorial we will take to be the temporary directory. ```{r} temp_dir <- tempdir() exp_mat_list <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp, barcodes_fp = barcodes_fp, features_fp = features_fp, on_disk_dir = temp_dir) ``` By default, `create_ondisc_matrix_from_mtx` returns a list of three elements: (i) an `ondisc_matrix` representing the expression data, (ii) a cell-wise covariate matrix, and (iii) a feature-wise covariate matrix. The exact cell-wise and feature-wise covariate matrices that are computed depend on the inputs to `create_ondisc_matrix_from_mtx` (see documentation via ?create_ondisc_matrix_from_mtx for full details). The advantage to computing the cell-wise and feature-wise covariates at initialization is that it obviates the need to load the entire dataset into memory a second time. ```{r} expression_mat <- exp_mat_list$ondisc_matrix head(expression_mat) cell_covariates <- exp_mat_list$cell_covariates head(cell_covariates) feature_covariates <- exp_mat_list$feature_covariates head(feature_covariates) ``` The initialized HDF5 file is named `ondisc_matrix_1.h5` and is located in the temporary directory. ```{r} "ondisc_matrix_1.h5" %in% list.files(temp_dir) ``` A strength of `create_ondisc_matrix_from_mtx` is that it does *not* assume that entire expression matrix fits into memory. The optional argument `n_lines_per_chunk` can be used to specify the number of lines to read from the .mtx file at a time. Additionally, `create_ondisc_matrix_from_mtx` is fast: the novel algorithm that underlies this function is highly efficient and implemented in C++ for maximum speed. Typically, `create_ondisc_matrix_from_mtx` takes aboout 4-8 minutes/GB to run. Finally, for a given dataset, `create_ondisc_matrix_from_mtx` only needs to be run once, even after closing and opening new R sessions. # Querying basic information We can use the functions `get_feature_ids`, `get_feature_names`, and `get_cell_barcodes` to obtain the feature IDs, feature names (if applicable), and cell barcodes, respectively, of an `ondisc_matrix`. ```{r} feature_ids <- get_feature_ids(expression_mat) feature_names <- get_feature_names(expression_mat) cell_barcodes <- get_cell_barcodes(expression_mat) head(feature_ids) head(feature_names) head(cell_barcodes) ``` Additionally, we can use `dim`, `nrow`, and `ncol` to obtain the dimension, number of rows (i.e., number of features), and number of columns (i.e., number of cells) of an `ondisc_matrix`. ```{r} dim(expression_mat) nrow(expression_mat) ncol(expression_mat) ``` # Subsetting We can subset an `ondisc_matrix` to obtain a new `ondisc_matrix` that is a submatrix of the original. To subset an `ondisc_matrix`, apply the `[` operator and pass a numeric, logical, or character vector indicating the cells or features to keep. Character vectors are assumed to refer to feature IDs (for rows) and cell barcodes (for columns). ```{r} # numeric vector examples # keep genes 100-110 x <- expression_mat[100:110,] # keep all cells except 10 and 20 x <- expression_mat[,-c(10,20)] # keep genes 50-100 and 200-250 and cells 300-500 x <- expression_mat[c(50:100, 200:250), 300:500] # character vector examples # keep genes ENSG00000107581, ENSG00000286857, and ENSG00000266371 x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),] # keep cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1 x <- expression_mat[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")] # logical vector example # keep all genes except ENSG00000237832 and ENSG00000229637 x <- expression_mat[!(get_feature_ids(expression_mat) %in% c("ENSG00000237832", "ENSG00000229637")),] ``` Subsetting an `ondisc_matrix` leaves the original object unchanged. ```{r} expression_mat ``` This important property, called *object persistence*, makes programming with `ondisc_matrices` intuitive. The underlying HDF5 file is not copied upon subset; instead, information is shared across `ondisc_matrix` objects, making subsets fast. # Pulling a submatrix into memory We can pull a submatrix of an `ondisc_matrix` into memory, allowing us to perform computations on a subset of the data. To pull a submatrix into memory, use the `[[` operator, passing a numeric, character, or logical vector indicating the cells or features to access. The data structure that underlies an `ondisc_matrix` enables fast access to both rows and columns of the matrix. ```{r} # numeric vector examples # pull gene 6 m <- expression_mat[[6,]] # pull cells 200 - 250 m <- expression_mat[[,200:250]] # pull genes 50 - 100 and cells 200 - 250 m <- expression_mat[[50:100, 200:250]] # character vector examples # pull genes ENSG00000107581 and ENSG00000286857 m <- expression_mat[[c("ENSG00000107581", "ENSG00000286857"),]] # pull cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1 m <- expression_mat[[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")]] # logical vector examples # subset the matrix, keeping genes ENSG00000107581, ENSG00000286857, and ENSG00000266371 x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),] # pull all genes except ENSG00000107581 m <- x[[get_feature_ids(x) != "ENSG00000107581",]] ``` The last example demonstrates that we can pull a submatrix of an `ondisc_matrix` into memory after having subset the matrix. One can remember the difference between `[` and `[[` by recalling R lists: `[` is used to subset a list, and `[[` is used to access elements stored within a list. Similarly, `[` is used to subset an `ondisc_matrix`, and `[[` is used to access a submatrix stored within an `ondisc_matrix`. # Saving and loading an `ondisc_matrix` As discussed previously, there are two components to an `ondisc_matrix`: the HDF5 file stored on-disk, and the R object stored in memory. The latter contains a file path to the former, allowing us to interact with the expression data from within R. To save an `ondisc_matrix`, simply call `saveRDS` on the `ondisc_matrix` R object to create an .rds file. ```{r} saveRDS(object = expression_mat, file = paste0(temp_dir, "/expression_matrix.rds")) rm(expression_mat) ``` We then can load the `ondisc_matrix` by calling `readRDS` on the .rds file. ```{r} expression_mat <- readRDS(paste0(temp_dir, "/expression_matrix.rds")) ``` We also can use the constructor of the `ondisc_matrix` class to create an `ondisc_matrix` from an already-initialized HDF5 file. ```{r} h5_file <- paste0(temp_dir, "/ondisc_matrix_1.h5") expression_mat <- ondisc_matrix(h5_file) ```