--- title: "Data importation in PLNmodels" author: "PLN team" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 4 df_print: paged bibliography: article/PLNreferences.bib link-citations: yes vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{import} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( screenshot.force = FALSE, echo = TRUE, rows.print = 5, message = FALSE, warning = FALSE) ``` ## Preliminaries This vignette documents the data format used in **PLNmodel** by `PLN` and its variants. It also shows how to create an object in the proper format for further analyses from (i) tabular data, (ii) biom-class objects and (iii) phyloseq-class objects. ## Format description We illustrate the format using trichoptera data set, a full description of which can be found in [the corresponding vignette](Trichoptera.html). ```{r data_load} library(PLNmodels) data(trichoptera) ``` The trichoptera data set is a list made of two data frames: `Abundance` (hereafter referred to as the _counts_) and `Covariate` (hereafter the _covariates_). ```{r trichoptera_structure} str(trichoptera, max.level = 1) ``` The covariates include, among others, the wind, pressure and humidity. ```{r covariates_overview} names(trichoptera$Covariate) ``` In the PLN framework, we model the counts from the covariates, let's say wind and pressure, using a Poisson Log-Normal model. Most models in R use the so-called _formula interface_ and it would thus be natural to write something like ```{r first_try, eval = FALSE} PLN(Abundance ~ Wind + Pressure, data = trichoptera) ``` Unfortunately and unlike many generalized linear models, the response in PLN is intrinsically **multivariate**: it has 17 dimensions in our example. The left hand side (LHS) must encode a multivariate response across multiple samples, using a 2D-array (e.g. a matrix or a data frame). We must therefore prepare a data structure where `Abundance` refers to a count *matrix* whereas `Wind` and `Pressure` refer to *vectors* before feeding it to `PLN`. That's the purpose of `prepare_data`. ```{r prepare_data_first_look} trichoptera2 <- prepare_data(counts = trichoptera$Abundance, covariates = trichoptera$Covariate) str(trichoptera2) ``` If you look carefully, you can notice a few difference between `trichoptera` and `trichoptera2`: - the first is a `list` whereas the second is a `data.frame`[^1]; - `Abundance` is a matrix-column of `trichoptera2` that you can extract using the usual functions `[` and `[[` to retrieve the count matrix; - `trichoptera2` has an additional `Offset` column (more on that later). ## Computing offsets It is common practice when modeling count data to introduce an offset term to control for different sampling efforts, exposures, baselines, etc. The *proper way* to compute sample-specific offsets in still debated and may vary depending on the field. There are nevertheless a few popular methods: - Total Sum Scaling (TSS), where the offset of a sample is the total count in that sample - Cumulative Sum Scaling (CSS), introduced in [@CSS], where the offset of a sample if the cumulative sum of counts in that sample, up to a quantile determined in a data driven way. - Relative Log-Expression (RLE), implemented in [@DESeq2], where all samples are used to compute a reference sample, each sample is compared to the reference sample using log-ratios and the offset is the median log-ratio. 
## Computing offsets

It is common practice when modeling count data to introduce an offset term to control for different sampling efforts, exposures, baselines, etc. The *proper way* to compute sample-specific offsets is still debated and may vary depending on the field. There are nevertheless a few popular methods:

- Total Sum Scaling (TSS), where the offset of a sample is the total count in that sample;
- Cumulative Sum Scaling (CSS), introduced in [@CSS], where the offset of a sample is the cumulative sum of counts in that sample, up to a quantile determined in a data-driven way;
- Relative Log Expression (RLE), implemented in [@DESeq2], where all samples are used to compute a reference sample, each sample is compared to the reference sample using ratios and the offset is the median of those ratios;
- Geometric Mean of Pairwise Ratios (GMPR), introduced in [@GMPR], where each sample is compared to every other sample to compute a median ratio and the offset of a sample is the geometric mean of those pairwise ratios;
- Wrench, introduced in [@Kumar2018] and fully implemented in the [Wrench package](https://bioconductor.org/packages/release/bioc/html/Wrench.html), where all samples are used to compute reference proportions and each sample is compared to the reference using ratios (and **not log-ratios**) of proportions to compute compositional correction factors. In that case, the offset is the product of (geometrically centered) compositional factors and (geometrically centered) depths.

Each of these offsets can be computed from a count matrix using the `compute_offset` function by changing its `offset` argument:

```{r compute_offset}
## same as compute_offset(trichoptera$Abundance, offset = "TSS")
compute_offset(trichoptera$Abundance)
```

In this particular example, the counts are too sparse and the more sophisticated offset methods all fail (numeric output hidden):

```{r other_offsets, warning=TRUE, error = TRUE, results='hide'}
compute_offset(trichoptera$Abundance, "CSS")
compute_offset(trichoptera$Abundance, "RLE")
compute_offset(trichoptera$Abundance, "GMPR")
```

We can mitigate this problem for the RLE offset by adding pseudocounts to the counts, although doing so has its own drawbacks.

```{r pseudocounts}
compute_offset(trichoptera$Abundance, "RLE", pseudocounts = 1)
```

A better solution consists in using only positive counts to compute the offsets:

```{r poscounts}
compute_offset(trichoptera$Abundance, "RLE", type = "poscounts")
```

Finally, we can use Wrench to compute the offsets:

```{r wrench}
compute_offset(trichoptera$Abundance, "Wrench")
```

**Note** TSS is the only method that produces offsets on the same scale as the counts; all others produce offsets that are (hopefully) *proportional* to library sizes but on a different scale. To force the offsets to be on the same scale as the counts for all methods, you can use the option `scale = "count"`.

```{r wrench_counts}
compute_offset(trichoptera$Abundance, "Wrench", scale = "count")
```
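Once computed and stored alongside the counts, offsets typically enter the model on the log scale through `offset()` in the formula. A minimal sketch, assuming the `trichoptera2` object built earlier (whose `Offset` column holds TSS offsets):

```{r pln_with_offset, eval = FALSE}
## Offsets are supplied on the log scale via offset() in the formula
myPLN <- PLN(Abundance ~ Wind + offset(log(Offset)), data = trichoptera2)
```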
## Building data frames using `prepare_data`

We already learned that `prepare_data` can join counts and covariates into a single data frame. It can also compute offsets through `compute_offset`, and does so by default with `offset = "TSS"`, hence the `Offset` column in `trichoptera2`. You can change the offset method and provide additional arguments that will be passed on to `compute_offset`.

```{r prepare_data_other_offset}
str(prepare_data(trichoptera$Abundance,
                 trichoptera$Covariate,
                 offset = "RLE", pseudocounts = 1))
```

Different communities use different standards for the count data, where samples are stored as either rows or columns of the count matrix. `prepare_data` uses heuristics to guess the orientation of the count matrix (or fails informatively when it cannot) and automatically transposes it if needed.

Finally, `prepare_data` enforces sample-consistency between the counts and the covariates and automatically trims away:

- samples for which only covariates or only counts are available;
- samples with no positive counts.

For example, if we remove the first sample from the counts and the last one from the covariates, we end up with 49 - 2 = 47 samples left, as expected.

```{r trim_down_samples}
nrow(prepare_data(trichoptera$Abundance[-1, ], ## remove first sample
                  trichoptera$Covariate[-49, ] ## remove last sample
                  ))
```

## Importing data from biom and phyloseq objects

Community composition data are quite popular in microbial ecology and usually stored in flat files using the [biom format](http://biom-format.org/) and/or imported in R as phyloseq-class objects [@phyloseq] using the Bioconductor [phyloseq](https://joey711.github.io/phyloseq/) package. We show here how to import data from a biom file (or biom-class object) and from a phyloseq-class object.

### Reading from a biom file

Reading from a biom file requires the Bioconductor package [biomformat](https://www.bioconductor.org/packages/release/bioc/html/biomformat.html). This package is **not** a standard dependency of PLNmodels and needs to be installed separately.

You can easily prepare your data from a biom file using the following steps:

- read your biom file with `biomformat::read_biom()`;
- extract the count table with `biomformat::biom_data()`;
- extract the covariates with `biomformat::sample_metadata()` (or build your own);
- feed them to `prepare_data` as illustrated below:

```{r import_biom, eval = FALSE}
## If biomformat is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
#   install.packages("BiocManager")
# }
# BiocManager::install("biomformat")
library(biomformat)

biomfile <- system.file("extdata", "rich_dense_otu_table.biom", package = "biomformat")
biom <- biomformat::read_biom(biomfile)

## extract counts
counts <- as(biomformat::biom_data(biom), "matrix")
## extract covariates (or prepare your own)
covariates <- biomformat::sample_metadata(biom)
## prepare data
my_data <- prepare_data(counts = counts, covariates = covariates)
str(my_data)
```

### Reading from a phyloseq-class object

Likewise, preparing data from a phyloseq-class object requires the Bioconductor package [phyloseq](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html). This package is **not** a standard dependency of PLNmodels and needs to be installed separately.

You can easily prepare your data from a phyloseq object using the following steps:

- extract the count table with `phyloseq::otu_table()`;
- extract the covariates with `phyloseq::sample_data()` (or build your own);
- feed them to `prepare_data` as illustrated below:

```{r import_phyloseq, eval = FALSE}
## If phyloseq is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
#   install.packages("BiocManager")
# }
# BiocManager::install("phyloseq")
library(phyloseq)

data("enterotype")

## extract counts
counts <- as(phyloseq::otu_table(enterotype), "matrix")
## extract covariates (or prepare your own)
covariates <- phyloseq::sample_data(enterotype)
## prepare data
my_data <- prepare_data(counts = counts, covariates = covariates)
str(my_data)
```

## Mathematical details about the offsets

We detail here the mathematical background behind the various offsets and the way they are computed. Denote by $\mathbf{Y} = (Y_{ij})$ the count matrix, where $Y_{ij}$ is the count of species $j$ in sample $i$. Assume that there are $p$ species and $n$ samples in total. The offset of sample $i$ is denoted $O_i$ and computed in the following way.

### Total Sum Scaling

Offsets are simply the total counts of a sample (frequently called depths in the metabarcoding literature):
$$
O_i = \sum_{j=1}^p Y_{ij}
$$
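As a quick sanity check (a sketch only, reusing the trichoptera counts loaded earlier), the TSS offsets should coincide with plain row sums of the count matrix:

```{r tss_by_hand}
## TSS offsets should match the per-sample total counts
o_tss <- compute_offset(trichoptera$Abundance, offset = "TSS")
all.equal(unname(o_tss), unname(rowSums(trichoptera$Abundance)))
```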
### Cumulative Sum Scaling

Positive counts are used to compute sample-specific quantiles $q_i^l$ and cumulative sums $s_i^l$, defined as
$$
q_i^l = \min \left\{ q \text{ such that } \sum_j 1_{Y_{ij} \leq q} \geq l \sum_j 1_{Y_{ij} > 0} \right\} \qquad s_i^l = \sum_{j: Y_{ij} \leq q_i^l} Y_{ij}
$$
The sample-specific quantiles are then used to compute reference quantiles, defined as $q^l = \text{median}_i \{q_i^l\}$, and median absolute deviations around the quantile $q^l$, defined as $d^l = \text{median}_i |q_i^l - q^l|$. The method then searches for the smallest quantile $l$ for which it detects instability, defined as a large relative increase in the $d^l$. Formally, $\hat{l}$ is the smallest $l$ satisfying $\frac{d^{l+1} - d^l}{d^l} \geq 0.1$. The sample-specific scaling offsets are then chosen as:
$$
O_i = s_i^{\hat{l}} \bigg/ \text{median}_i \{ s_i^{\hat{l}} \}
$$
Dividing by the median of the $s_i^{\hat{l}}$ ensures that the offsets are centered around $1$ and capture size differences with respect to the reference sample. Note also that the reference quantiles $q^l$ can be computed using either the median (the default, as in the original @CSS paper) or the mean, by specifying `reference = mean`, as implemented in `metagenomeSeq`.

### Relative Log Expression

A reference sample $(q_j)_j$ is first built by computing the geometric mean of each species' counts:
$$
q_j = \exp \left( \frac{1}{n} \sum_{i} \log(Y_{ij})\right)
$$
Each sample is then compared to the reference sample to compute one ratio per species, and the final offset $O_i$ is the median of those ratios:
$$
O_i = \text{median}_j \frac{Y_{ij}}{q_j}
$$
The method fails when no species is shared across all samples (as all the $q_j$ are then $0$) or when a sample shares less than 50% of its species with the reference (in which case the median of the ratios may be zero or infinite). The problem can be alleviated by adding pseudocounts to the $Y_{ij}$ with `pseudocounts = 1` or by using only positive counts in the computations (`type = "poscounts"`).
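To connect the formulas to code, here is a minimal hand-rolled sketch of the RLE offsets. The helper `rle_offset` is ours, for illustration only: it assumes samples in rows and an all-positive matrix (hence the pseudocount), and the actual `compute_offset` implementation may differ in details such as zero handling and scaling.

```{r rle_by_hand}
## Hand-rolled RLE offsets (sketch): geometric-mean reference, then median ratios
rle_offset <- function(counts) {
  q <- exp(colMeans(log(counts)))             # reference sample: per-species geometric means
  apply(counts, 1, function(y) median(y / q)) # per-sample median ratio to the reference
}
## a pseudocount avoids log(0) on sparse data, as discussed above
rle_offset(trichoptera$Abundance + 1)
```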
### Geometric Mean of Pairwise Ratios

This method is similar to RLE but does not create a reference sample. Instead, each sample is compared to every other sample to compute a median ratio (as in RLE):
$$
r_{ii'} = \text{median}_{j: Y_{ij} Y_{i'j} > 0} \frac{Y_{ij}}{Y_{i'j}}
$$
The offset of a sample is then the geometric mean of all its pairwise ratios:
$$
O_i = \left( \prod_{i' \neq i} r_{ii'} \right)^{1/(n-1)}
$$
The method fails when there is only one sample in the data set or when a sample shares no species with any other sample.
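Again as an illustration only, the sketch below (our hypothetical helper `gmpr_offset`, not part of PLNmodels, assuming samples in rows) transcribes the two formulas above:

```{r gmpr_by_hand, eval = FALSE}
## Hand-rolled GMPR offsets (sketch)
gmpr_offset <- function(counts) {
  counts <- as.matrix(counts)
  n <- nrow(counts)
  r <- matrix(NA_real_, n, n)  # pairwise median ratios over shared positive species
  for (i in seq_len(n)) {
    for (k in seq_len(n)) {
      if (i == k) next
      shared <- counts[i, ] > 0 & counts[k, ] > 0
      if (any(shared)) r[i, k] <- median(counts[i, shared] / counts[k, shared])
    }
  }
  ## geometric mean of each sample's pairwise ratios
  ## (may yield NaN on very sparse data such as trichoptera, as noted above)
  apply(r, 1, function(x) exp(mean(log(x), na.rm = TRUE)))
}
```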
### Wrench normalisation

This method is fully detailed in @Kumar2018 and we only provide a bare-bones implementation corresponding to the default parameters of `Wrench::wrench()`. Assume that samples belong to $K$ discrete groups and note $g_i$ the group of sample $i$. Wrench is based on the following (simplified) log-normal model for counts:
$$
Y_{ij} \sim \pi_{ij} \delta_0 + (1 - \pi_{ij})\log\mathcal{N}(\mu_{ij}, \sigma^2_j)
$$
where the $Y_{ij}$ are independent and the mean $\mu_{ij}$ is decomposed as:
$$
\mu_{ij} = \underbrace{\log{p_{0j}}}_{\text{log-ref. prop.}} + \underbrace{\log{d_i}}_{\text{log-depth}} + \underbrace{\log{\zeta_{0g_i}}}_{\text{log effect of group } g_i} + \underbrace{a_{i}}_{\text{fixed or mixed effect}} + \underbrace{b_{ij}}_{\text{mixed effects}}
$$
where the random effects are independent centered Gaussians and the depth is the total sum of counts:
$$
\begin{align*}
d_i & = \sum_{j=1}^p Y_{ij} \\
b_{ij} & \sim \mathcal{N}(0, \eta^2_{g_i}) \\
\end{align*}
$$
The **net** log fold change $\theta_{ij}$ of the **proportion ratio** $r_{ij} = Y_{ij} / (d_i p_{0j})$ of species $j$ relative to the reference is $\log(\theta_{ij}) \overset{\Delta}{=} \mathbb{E}[\log(r_{ij}) | a_i, b_{ij}] = \log{\zeta_{0g_i}} + a_i + b_{ij}$. We can decompose it as $\theta_{ij} = \Lambda_i^{-1} v_{ij}$, where $\Lambda_i^{-1}$ is the *compositional correction factor* and $v_{ij}$ is the fold change of **true abundances**. With the above notations, the net fold change compounds both the fold change of true abundances and the compositional correction factor. Under the assumption that the $b_{ij}$ are centered, $\log(\hat{\Lambda}_i)$ can be estimated through a robust average of the $\hat{\theta}_{ij}$, which can themselves be computed from the log-ratios of proportions.

We detail here how the different parameters and/or effects are estimated.

- The reference proportions $p_{0j}$ are constructed as averages of the sample proportions $p_{ij}$, and the ratios are derived from both quantities:
$$
p_{ij} = \frac{Y_{ij}}{\sum_{j=1}^p Y_{ij}} \qquad p_{0j} = \frac{1}{n} \sum_{i=1}^n p_{ij} \qquad r_{ij} = \frac{p_{ij}}{p_{0j}}
$$
- The probabilities of absence $\pi_{ij}$ are estimated by fitting the following Bernoulli models:
$$
1_{\{Y_{ij} = 0\}} \sim \mathcal{B}(\pi_{j}^{d_i})
$$
and setting $\hat{\pi}_{ij} = \hat{\pi}_j^{d_i}$.
- The species variances $\sigma^2_j$ are estimated by fitting the following linear model (with no zero-inflation component):
$$
\log Y_{ij} \sim \log(d_i) + \mu_{g_i} + \mathcal{N}(0, \sigma^2_j)
$$
Note that in the original `Wrench::wrench()`, the log-depth $\log(d_i)$ is used as a predictor, but we believe it makes more sense to use it as an offset.
- The group proportions $p_{gj}$ and group ratios $r_{gj}$ are set to:
$$
p_{gj} = \frac{\sum_{i : g_i = g} Y_{ij}}{\sum_{j, i : g_i = g} Y_{ij}} \qquad r_{gj} = \frac{p_{gj}}{p_{0j}}
$$
- The location and dispersion parameters are estimated as:
$$
\hat{\zeta}_{0g} = \frac{\sum_{j=1}^p r_{gj}}{p} \qquad \log{r_{g.}} = \frac{\sum_{j: r_{gj} \neq 0} \log{r_{gj}}}{\sum_{j: r_{gj} \neq 0} 1} \qquad \hat{\eta}_{g}^2 = \frac{\sum_{j: r_{gj} \neq 0} (\log{r_{gj}} - \log{r_{g.}})^2}{\sum_{j: r_{gj} \neq 0} 1}
$$
- The mixed effects are estimated as shrunken (and scaled) averages of the ratios:
$$
\hat{a}_i = \frac{\sum_{j = 1}^p \frac{1}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j} (\log{r_{ij}} - \log{\hat{\zeta}_{0g_i}})}{\sum_{j = 1}^p \frac{1}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j}} \qquad \hat{b}_{ij} = \frac{\hat{\eta}^2_{g_i}}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j} \left( \log{r_{ij}} - \log\hat{\zeta}_{0g_i} - \hat{a}_i\right)
$$
- The regularized ratios are estimated as:
$$
\hat{\theta}_{ij} = \exp\left( \log\hat{\zeta}_{0g_i} + \hat{a}_i + \hat{b}_{ij} \right)
$$
- The compositional correction factors are estimated as (weighted) means of the regularized (and possibly corrected) ratios:
$$
\hat{\Lambda}_i =
\begin{cases}
\sum_{j = 1}^p \hat{\theta}_{ij} \bigg/ p & \text{ if type = "simple"} \\
\sum_{j = 1}^p \hat{\theta}_{ij} e^{-\hat{\sigma}_j^2 / 2} / w_{ij} \bigg/ \sum_{j=1}^p 1/w_{ij} & \text{ if type = "wrench"} \\
\end{cases}
$$
where $w_{ij} = (1 - \hat{\pi}_{ij})(\hat{\pi}_{ij} + e^{\hat{\sigma}_j^2 + \hat{\eta}_{g_i}^2} - 1)$. The correction term $e^{-\hat{\sigma}_j^2 / 2}$ arises from the relation $\mathbb{E}[r_{ij} | r_{ij} > 0] = \theta_{ij} e^{\sigma_j^2/2}$ and the weights $w_{ij}$ are marginal variances: $\mathbb{V}[r_{ij}] = w_{ij}$.

The offsets are then the product of (geometrically centered) compositional correction factors and (geometrically centered) depths:
$$
O_i = \frac{\hat{\Lambda}_i}{\left(\prod_{i = 1}^n \hat{\Lambda}_i\right)^{1/n}} \times \frac{d_i}{\left(\prod_{i = 1}^n d_i\right)^{1/n}}
$$

[^1]: although a `data.frame` is technically a `list`

## References