---
title: "Relative protein quantification for DIA-MS-based proteomics"
author: "Thang V Pham"
date: "`r Sys.Date()`" 
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics}
  %\VignetteEngine{knitr::rmarkdown}
  %VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, include=FALSE, echo=FALSE}
require("knitr")
local_file_exist <- file.exists("Bruderer15-DIA-longformat-compact.txt.gz")
```

## Introduction

The **iq** package, short for ion-based protein quantification, implements the MaxLFQ maximal peptide ratio
extraction algorithm for data-independent acquisition (DIA) mass spectrometry (MS) based proteomics data. The algorithm was originally designed for data
dependent acquisition (DDA) data (Cox et al., MCP 2014).
The package also offers options for other quantitation methods including topN (using N most intense fragment ions), MeanInt (using all fragment ions), and median polish (Tukey, 1977). Finally, output from a MaxQuant experiment for DDA data can also be quantified using **iq**.


To install **iq** from CRAN (this needs to be done once)

```{r, eval=FALSE}
install.packages("iq")
```

and the package can be loaded for usage in the usual manner
```{r, eval=TRUE, include = TRUE}
library("iq")
```

This vignette presents three examples for raw data processing pipelines: Spectronaut (Bruderer et al., MCP 2015), OpenSWATH (Rost et al., Nat. Biotechnol. 2014), and MaxQuant (Cox & Mann, Nat. Biotechnol. 2008). The example data are available in the project GitHub release. To keep the file size manageable, we exclude data columns not necessary for the analysis, followed by gzip compression. The user may download the data to a local working folder to run the examples.

+ Spectronaut: [Bruderer15-DIA-longformat-compact.txt.gz](https://github.com/tvpham/iq/releases/download/v1.1/Bruderer15-DIA-longformat-compact.txt.gz)
+ OpenSWATH: [Schubert15-OpenSWATH.txt.gz](https://github.com/tvpham/iq/releases/download/v1.1/Schubert15-OpenSWATH.txt.gz)
+ MaxQuant: [evidence.txt.gz](https://github.com/tvpham/iq/releases/download/v1.1/evidence.txt.gz) and [proteinGroups.txt.gz](https://github.com/tvpham/iq/releases/download/v1.1/proteinGroups.txt.gz)

## Spectronaut output

### An example dataset

We present an analysis of a publicly available dataset which was used in a benchmark experiment for label-free DDA and DIA proteomics (Bruderer et al., MCP 2015).
The result of the MaxQuant (version 1.6.4.0) DDA search is used as a spectral library in Spectronaut version 13.0 to process DIA data. Each sample is assigned to a unique condition in Spectronaut. Subsequently, we use the conditions from C01 to C24 as sample names. We use the default Spectronaut long format export with the addition of columns: *PG.Genes*, *PG.ProteinNames*,
*F.ExcludedFromQuantification*, *F.FrgLossType*, *F.FrgIon*, *F.Charge*, and *F.PeakArea*. 

First we perform preprocessing to filter out ion fragments not used for quantification and to keep only relevant columns to keep the file size small. Note that the following code section cannot be executed here because the starting input file _Bruderer15-DIA-longformat.txt_ is not available. To process your own data, replace _Bruderer15-DIA-longformat.txt_ by the name of the Spectronaut export.

```{r, eval=FALSE, include = TRUE}
raw <- read.delim("Bruderer15-DIA-longformat.txt")

selected <- raw$F.ExcludedFromQuantification == "False" & 
            raw$F.FrgLossType == "noloss" &
            (is.na(raw$PG.Qvalue) | raw$PG.Qvalue <= 0.01) &
            (is.na(raw$EG.Qvalue) | raw$EG.Qvalue <= 0.01)

raw <- raw[selected, c("R.Condition","PG.ProteinGroups", "EG.ModifiedSequence", "FG.Charge",
                       "F.FrgIon", "F.Charge", "F.PeakArea", "PG.Genes", "PG.ProteinNames")]

write.table(raw, "Bruderer15-DIA-longformat-compact.txt", sep = "\t", row.names = FALSE)
```

Note that if the quantification filtering option is selected when exporting the report from Spectronaut, the software already filters out entries with high protein q-values and peptide q-values. We can check that by
```{r, eval=FALSE, include = TRUE}
all(raw$PG.Qvalue <= 0.01)
all(raw$EG.Qvalue <= 0.01)
```

We write the content of `raw` to a text file and gzip it to keep the file size manageable. In the following, we will continue with the compressed file  _Bruderer15-DIA-longformat-compact.txt.gz_. Note that the user does not need to write the data to disk and read it back like in this example when processing their own data.

```{r, eval=local_file_exist, include=TRUE}
raw <- read.delim(gzfile("Bruderer15-DIA-longformat-compact.txt.gz"))
```

The MaxLFQ quantification is performed in three steps
```{r eval=local_file_exist, echo=TRUE, message = FALSE, include=TRUE}
norm_data <- iq::preprocess(raw)
protein_list <- iq::create_protein_list(norm_data)
result <- iq::create_protein_table(protein_list)
```

The first step `iq::preprocess` prepares a long-format input including removing low-intensity ions and performing median normalization.
To select a threshold for removing low-intensity ions, plot a histogram of the data as follows

```{r, eval=local_file_exist, include=TRUE, fig.width=6.5, fig.height=4.5}
hist(log2(raw[, "F.PeakArea"]), 100, main = "Histogram of log2 intensities", 
     col = "steelblue", border = "steelblue", freq = FALSE)
```

Here the default threshold value of zero for `log2_intensity_cutoff` is appropriate as it removes the strange distribution of low-intensity data points on the left of the figure.

The default value for `sample_id` is `"R.Condition"`. However, in practice we find it convenient to use `"R.FileName"` because one does not need to prepare the condition column in Spectronaut. The `secondary_id` parameter enforces the uniqueness of the peptide species for the MaxLFQ algorithm. For example, when multiple spectral libraries are used, it is possible that a fragment comes from two or more libraries. In that case, we can use `secondary_id = c("EG.Library", "FG.Id", "FG.Charge", "F.FrgIon", "F.Charge")`. Note that one needs to export the necessary columns from Spectronaut.

The second step `iq::create_protein_list` converts the long-format table to a list where each element contains the underlying data of each protein. The column names are sample names and row names fragment ions. The result can be used to visualized protein quantification for manual validation as in the next section.

The last statement `iq::create_protein_table` quantifies the protein list one by one. 

The following statement produces a quantified protein table in a tab-separated text file _output-maxLFQ.txt_ containing the protein identifiers in the first column, the quantified proteins, and the annotation text in the last column. This file can be opened by a spreadsheet application such as Excel.

```{r, eval = FALSE, include=TRUE}
write.table(cbind(Protein = rownames(result$estimate),
                  result$estimate, 
                  annotation = result$annotation),
            "output-maxLFQ.txt", sep = "\t", row.names = FALSE)
```


### Protein visualization

The function _iq::plot\_protein()_ plots the underlying data for individual proteins. By default, the names of fragment ions are displayed on the right panel, which can be disabled by setting parameter _split_ to NULL. 

```{r, eval=local_file_exist, include=TRUE, fig.width=6.5, fig.height=4.5}
iq::plot_protein(protein_list$P00366, main = "Protein P00366", split = NULL)  
```

By attaching the result of the quantitation algorithm to the data table, we can demonstrate both protein quantitative values and the underlying data. Here we set the colors of fragment ions to gray.

```{r, eval=local_file_exist, include=TRUE, fig.width=6.5, fig.height=4.5}
iq::plot_protein(rbind(protein_list$P00366, 
                       MaxLFQ = iq::maxLFQ(protein_list$P00366)$estimate), 
                 main = "MaxLFQ quantification of P00366", 
                 col = c(rep("gray", nrow(protein_list$P00366)), "green"), 
                 split = NULL)  
```

The names of constituent ions are displayed on the right when `split` is not `NULL`. We can pass additional graphical parameters to fine-tune the plot. For example, in the following we make the font size smaller by setting `cex` to 0.4 due to the restricted plotting space.

```{r, eval=local_file_exist, include=TRUE, fig.width=6.5, fig.height=4.5}
iq::plot_protein(protein_list$P00366, main = "Protein P00366", cex = 0.4)  
```

We can also use the protein plotting function to display different quantitative methods for evaluation. For example, here we show the quantitative values of a spike-in protein using MaxFLQ and the spike-in ground truth values (rescaled so that the means of the two are the same).

```{r, eval=local_file_exist, include=TRUE, fig.width=6.5, fig.height=4.5}
MaxLFQ_estimate <- iq::maxLFQ(protein_list$P12799)$estimate

ground_truth <-  log2(rep(c(200, 125.99, 79.37, 50, 4, 2.52, 1.59, 1), each = 3))
ground_truth <- ground_truth - mean(ground_truth) + mean(MaxLFQ_estimate)

iq::plot_protein(rbind(MaxLFQ = MaxLFQ_estimate,
                       Groundtruth = ground_truth), 
                 main = "P12799 - MaxLFQ versus groundtruth",  
                 split = 0.75, 
                 col = c("green", "gold"))  
```


### Extracting additional protein annotation

The input data might contain extra annotations for the proteins. We provide a convenient function to extract additional annotation. The following script produces the same quantitative values as before, but with additional columns for gene names and protein names.

```{r, eval=FALSE, include = TRUE}
extra_names <- iq::extract_annotation(rownames(result$estimate), 
                                      raw, 
                                      annotation_columns = c("PG.Genes", "PG.ProteinNames"))

write.table(cbind(Protein = rownames(result$estimate),
                  extra_names[, c("PG.Genes", "PG.ProteinNames")],
                  result$estimate, 
                  annotation = result$annotation),
            "output-maxLFQ-annotation.txt", sep = "\t", row.names = FALSE)
```


## OpenSWATH output

We download OpenSWATH data from the publication of Schubert et al. (Cell Host & Microbe 2015). The data is in a long format with fragment ions and their corresponding intensities are concatenated in two entries for each peptide. We separate these entries into an extended long format. Again, we will only keep relevant columns and compress the result. Thus, the user can skip this code section and continue to the next section using the data available in the project GitHub release.


```{r, eval=FALSE, include = TRUE}
tab <- read.delim("./Mtb_feature_alignment_requant_filtered_max10_fixed_noUPS.tsv", 
                  stringsAsFactors = FALSE)

tab$Condition[tab$Condition == "d20_6h"] <- "d20_06h"

tab_list <- vector("list", nrow(tab))
for (i in 1:nrow(tab)) {
    a <- unlist(strsplit(tab[i, "aggr_Fragment_Annotation"], ";"))
    b <- unlist(strsplit(tab[i, "aggr_Peak_Area"], ";"))

    tab_list[[i]] <- NULL
    for (j in 1:length(a)) {
        tab[i, "aggr_Fragment_Annotation"] <- a[j]
        tab[i, "aggr_Peak_Area"] <- b[j]
        tab_list[[i]] <- rbind(tab_list[[i]], tab[i,])
    }
}

tab_extended <- do.call(rbind.data.frame, tab_list)

quant <- as.double(tab_extended$aggr_Peak_Area)

short_name <- paste(tab_extended$Condition, tab_extended$BioReplicate,
                    tab_extended$Run, sep = "_")

tab_small <- cbind(tab_extended[, c("ProteinName", "FullPeptideName", "Charge", 
                                    "aggr_Fragment_Annotation")], quant, short_name)

write.table(tab_small, "Schubert15-OpenSWATH.txt", sep = "\t", row.names = FALSE)
```

### Protein quantification

First we load the compressed data prepared by the previous step

```{r, eval=local_file_exist, include=TRUE}
tab_small <- read.delim(gzfile("Schubert15-OpenSWATH.txt.gz"))
```

and check the head of the data file

```{r, eval=local_file_exist, include=TRUE}
head(tab_small)
```

Then we plot a histogram of the data to examine the data distribution

```{r, eval=local_file_exist, include=TRUE, fig.width=6.5, fig.height=4.5}
hist(log2(tab_small[, "quant"]), 100, main = "Histogram of log2 intensities", 
     col = "steelblue", border = "steelblue", freq = FALSE)
```

Here we do not observe abnormal low-intensity values, and the default value intensity cutoff of zero is good.

We follow the same procedure as for Spectronaut output for protein quantification. Nevertheless, we have to adapt the column names appropriately. Unique values in the `primary_id` column form the list of proteins to be quantified. A concatenation of the columns in `secondary_id` determines the fragment ions used for quantification. Unique values in the `sample_id` column form the list of samples. Finally, column `intensity_col` specifies the quantitative values.

```{r, eval=FALSE, include = TRUE}
norm_data <- iq::preprocess(tab_small,
                            primary_id = "ProteinName", 
                            secondary_id = c("FullPeptideName", "Charge",
                                             "aggr_Fragment_Annotation"), 
                            sample_id = "short_name", 
                            intensity_col = "quant") 

protein_list <- iq::create_protein_list(norm_data)

result <- iq::create_protein_table(protein_list) 

write.table(cbind(Protein = rownames(result$estimate), 
                  result$estimate, 
                  annotation = result$annotation), 
            "Schubert-output-maxLFQ.txt", sep = "\t", row.names = FALSE)
```

The resulting output text file _Schubert-output-maxLFQ.txt_ can be loaded into a spreadsheet application.

##  MaxQuant output

The MaxLFQ algorithm is implemented in the software package MaxQuant. Nevertheless, there are cases in which MaxQuant fails in the label-free quantification step. One possible solution is to run MaxQuant without normalization and derive the MaxLFQ values using **iq**.

### Read the protein group data

First, we read the protein quantification from MaxQuant output, ignoring entries detected in the reversed fasta database. The protein group table provides links to the underlying data in the MaxQuant _evidence.txt_ file.
When LFQ values are available, we can store them in a table to compare the result of the **iq** implementation to that of MaxQuant.

```{r, eval=local_file_exist, include = TRUE}
dda <- read.delim(gzfile("proteinGroups.txt.gz"))
dda <- subset(dda, Reverse == "")    # remove reversed entries
rownames(dda) <- dda[,"Protein.IDs"] # use protein group ids as rownames
lfq <- grep("^LFQ", colnames(dda))
dda_log2 <- log2(dda[, lfq])
dda_log2[dda_log2 == -Inf] <- NA

colnames(dda_log2) <- sprintf("C%02d", 1:24)
```

### Building up a protein list from the MaxQuant evidence file

We first perform median normalization on intensities in the _evidence.txt_ file. Subsequently, we produce a list of protein as in the case of Spectronaut output in a variable _p\_list_.

```{r, eval=local_file_exist, include = TRUE}
evidence <- read.delim(gzfile("evidence.txt.gz"), stringsAsFactors = FALSE)

rownames(evidence) <- evidence[, "id"]

# median normalization
ex <- paste0(paste0("sample", rep(1:8, each=3),"_R0"), rep(1:3, 8))
ex_median <- rep(NA, length(ex))
names(ex_median) <- ex
for (i in ex) {
    tmp <- subset(evidence, Experiment == i)
    ex_median[i] <- median(tmp[,"Intensity"], na.rm = TRUE)
}
f <- mean(ex_median)/ex_median
evidence[, "Intensity"] <- evidence[, "Intensity"] * f[evidence[, "Experiment"]]

# create a protein list
p_list <- list()

for (i in 1:nrow(dda)) {
    
    tmp <- unlist(strsplit(as.character(dda[i, "Evidence.IDs"]), ";"))
    a <- evidence[tmp,]
    b <- data.frame(cn = a[, "Experiment"], 
                    rn = paste(a[, "Modified.sequence"], a[, "Charge"], sep="_"), 
                    quant = a[, "Intensity"])
    b <- b[!is.na(b$quant),]
    
    m <- matrix(NA, nrow = length(unique(b$rn)), ncol = length(ex), 
                dimnames = list(unique(b$rn), ex))
  
    if (nrow(b) > 0) {
        for (j in 1:nrow(b)) {
            rn <- as.character(b$rn[j])
            cn <- as.character(b$cn[j])
            if (is.na(m[rn, cn])) {
                m[rn, cn] <- b$quant[j]
            } else {
                m[rn, cn] <- m[rn, cn] + b$quant[j]
            }
        }
        p_list[[rownames(dda)[i]]] <- log2(m)
    } else {
        p_list[[rownames(dda)[i]]] <- NA
    }
}
```

Once the protein list is created, we can perform the MaxLFQ algorithm for DDA as for DIA data. An example for MaxLFQ by MaxQuant and **iq** is shown here (rescaled so that the means of the two quantification values are equal). 

```{r, eval=local_file_exist, include=TRUE, fig.width=6.5, fig.height=4.5}
w1 <- iq::maxLFQ(p_list$A1L0T0)$estimate
w2 <- as.numeric(dda_log2["A1L0T0", ])
w2 <- w2 - mean(w2, na.rm = TRUE) + mean(w1, na.rm = TRUE)
tmp <- rbind(p_list$A1L0T0,
             `MaxLFQ by iq` = w1,
             `MaxLFQ by MaxQuant` = w2)
colnames(tmp) <- sprintf("C%02d", 1:24)

iq::plot_protein(tmp,
                 main = "A1L0T0", cex = 0.5, split = 0.65, 
                 col = c(rep("gray", nrow(p_list$A1L0T0)), "green", "blue"))  
```

Finally, we write out the result of MaxLFQ quantification to a text file. 

```{r, eval=FALSE, include = TRUE}
output_mq <- iq::create_protein_table(p_list)

write.table(cbind(Protein = rownames(output_mq$estimate),
                  output_mq$estimate, 
                  annotation = output_mq$annotation),
            "output_IQ_LFQ.txt", sep = "\t", row.names = FALSE)
```

The tab-separated text file _output_IQ_LFQ.txt_ now contains the MaxLFQ values by the package **iq** implementation.


## Other quantitation methods: topN, MeanInt, and median polish

We also implement the topN method, the MeanInt method, and the median polish method by introducing the `method` parameter to the _iq::create\_protein\_table()_ function. We create outputs for all different methods in the following. Here, the variable _protein\_list_ is from the Spectronaut or OpenSWATH example.

```{r, eval=FALSE, include = TRUE}
# default MaxLFQ
output <- iq::create_protein_table(protein_list)

# median polish
output <- iq::create_protein_table(protein_list, method = "median_polish")

# top 3
output <- iq::create_protein_table(protein_list, method = "topN", N = 3)

# top 5
output <- iq::create_protein_table(protein_list, method = "topN", N = 5)

# MeanInt in the original intensity space
output <- iq::create_protein_table(protein_list, method = "meanInt",
                                   aggregation_in_log_space = FALSE)
```

For the topN and MeanInt method, the aggregating operation can be carried out in the log space by default or in the original intensity space by setting the parameter `aggregation_in_log_space` to `FALSE`.

## References

1. Bruderer R, Bernhardt OM, Gandhi T, et al. (2015) Extending the limits of quantitative proteome profiling with data-independent acquisition and application to
acetaminophen-treated three-dimensional liver microtissues. _Mol Cell Proteomics_ 14(5):1400–1410.
1. Cox J, Hein MY, Luber CA, et al. (2014) Accurate proteome-wide label-free quantification
by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. _Mol Cell Proteomics_ 13(9):2513–2526.
1. Cox J and Mann M (2008) MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. _Nat Biotechnol_ 26, pp 1367-72.
1. Pham TV, Henneman AA, Jimenez CR (2020) iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. _Bioinformatics_ 36(8):2611-2613.
1. Rost HL, Rosenberger G, Navarro P, et al. (2014). OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nat. Biotechnol. 32, 219–223.
1. Schubert OT, Ludwig C, Kogadeeva M, et al. (2015) Absolute proteome composition and dynamics during dormancy and resuscitation of mycobacterium tuberculosis. _Cell Host Microbe_ 18(1):96-108.
1. Tukey JW. (1977) _Exploratory Data Analysis_, Reading Massachusetts: Addison-Wesley.