Title: | Protein Quantification in Mass Spectrometry-Based Proteomics |
---|---|
Description: | An implementation of the MaxLFQ algorithm by Cox et al. (2014) <doi:10.1074/mcp.M113.031591> in a comprehensive pipeline for processing proteomics data in data-independent acquisition mode (Pham et al. 2020 <doi:10.1093/bioinformatics/btz961>). It offers additional options for protein quantification using the N most intense fragment ions, using all fragment ions, and a wrapper for the median polish algorithm by Tukey (1977, ISBN:0201076160). In general, the tool can be used to integrate multiple proportional observations into a single quantitative value. |
Authors: | Thang Pham [aut, cre, cph, ctb] , Alex Henneman [ctb] |
Maintainer: | Thang Pham <[email protected]> |
License: | BSD_3_clause + file LICENSE |
Version: | 1.9.12 |
Built: | 2024-11-01 06:28:25 UTC |
Source: | CRAN |
For each protein, a numerical matrix is formed where the columns are samples and rows are fragment ions.
create_protein_list(preprocessed_data)
create_protein_list(preprocessed_data)
preprocessed_data |
A data frame of four components as output of the |
A list where each element contains the quantitative data of a protein. The column names are sample names and the row names fragment ions.
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
data("spikeins") head(spikeins) # This example set of spike-in proteins has been 'median-normalized'. norm_data <- iq::preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL) protein_list <- iq::create_protein_list(norm_data)
data("spikeins") head(spikeins) # This example set of spike-in proteins has been 'median-normalized'. norm_data <- iq::preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL) protein_list <- iq::create_protein_list(norm_data)
Travels through the input list and quantifies all proteins one by one.
create_protein_table(protein_list, method = "maxLFQ", ...)
create_protein_table(protein_list, method = "maxLFQ", ...)
protein_list |
The input protein list |
method |
Possible values are "maxLFQ", "median_polish", "topN", and "meanInt". |
... |
Additional parameters for individual quantitation methods. |
A list of two components is returned
estimate |
A table of protein abundances for all samples in log2 space. |
annotation |
A vector of annotations, one for each protein. |
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
create_protein_list
, maxLFQ
, median_polish
, topN
, meanInt
data("spikeins") # This example set of spike-in proteins has been 'median-normalized'. norm_data <- iq::preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL) protein_list <- iq::create_protein_list(norm_data) result <- iq::create_protein_table(protein_list) head(result)
data("spikeins") # This example set of spike-in proteins has been 'median-normalized'. norm_data <- iq::preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL) protein_list <- iq::create_protein_list(norm_data) result <- iq::create_protein_table(protein_list) head(result)
Extracts annotation columns from a long-format input
extract_annotation(protein_ids, quant_table, primary_id = "PG.ProteinGroups", annotation_columns = NULL)
extract_annotation(protein_ids, quant_table, primary_id = "PG.ProteinGroups", annotation_columns = NULL)
protein_ids |
A vector of protein ids. |
quant_table |
A long-format input table. The input is typically the same as input to the |
primary_id |
The column containing protein ids. |
annotation_columns |
A vector of columns for annotation. |
A table of proteins and associated annotation extracted from the input.
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
data("spikeins") extra_names <- iq::extract_annotation(levels(spikeins$PG.ProteinGroups), spikeins, annotation_columns = c("PG.Genes", "PG.ProteinNames"))
data("spikeins") extra_names <- iq::extract_annotation(levels(spikeins$PG.ProteinGroups), spikeins, annotation_columns = c("PG.Genes", "PG.ProteinNames"))
A fast implementation of the MaxLFQ algorithm.
fast_MaxLFQ(norm_data, row_names = NULL, col_names = NULL)
fast_MaxLFQ(norm_data, row_names = NULL, col_names = NULL)
norm_data |
A list of four vectors with equal length |
row_names |
A vector of character strings for row names. If |
col_names |
A vector of character strings for column names. If |
A list is returned with two components
estimate |
A quantification result table in log2 space. |
annotation |
A vector of strings indicating membership in case of multiple connected components for each row of |
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
Filters out low intensities and performs median normalization.
fast_preprocess(quant_table, median_normalization = TRUE, log2_intensity_cutoff = 0, pdf_out = "qc-plots-fast.pdf", pdf_width = 12, pdf_height = 8, show_boxplot = TRUE)
fast_preprocess(quant_table, median_normalization = TRUE, log2_intensity_cutoff = 0, pdf_out = "qc-plots-fast.pdf", pdf_width = 12, pdf_height = 8, show_boxplot = TRUE)
quant_table |
The |
median_normalization |
A logical value. The default |
log2_intensity_cutoff |
Entries lower than this value in log2 space are ignored. Plot a histogram of all intensities to set this parameter. |
pdf_out |
A character string specifying the name of the PDF output. A |
pdf_width |
Width of the pdf output in inches. |
pdf_height |
Height of the pdf output in inches. |
show_boxplot |
A logical value. The default |
A list is returned with the same components as input data in which low intensities are filtered out and median normalization is performed if requested.
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
A highly efficient reading of a tab-separated text file for iq processing.
fast_read(filename, sample_id = "R.Condition", primary_id = "PG.ProteinGroups", secondary_id = c("EG.ModifiedSequence", "FG.Charge", "F.FrgIon", "F.Charge"), intensity_col = "F.PeakArea", annotation_col = c("PG.Genes", "PG.ProteinNames"), filter_string_equal = c("F.ExcludedFromQuantification" = "False"), filter_string_not_equal = NULL, filter_double_less = c("PG.Qvalue" = "0.01", "EG.Qvalue" = "0.01"), filter_double_greater = NULL, intensity_col_sep = NULL, intensity_col_id = NULL, na_string = "0")
fast_read(filename, sample_id = "R.Condition", primary_id = "PG.ProteinGroups", secondary_id = c("EG.ModifiedSequence", "FG.Charge", "F.FrgIon", "F.Charge"), intensity_col = "F.PeakArea", annotation_col = c("PG.Genes", "PG.ProteinNames"), filter_string_equal = c("F.ExcludedFromQuantification" = "False"), filter_string_not_equal = NULL, filter_double_less = c("PG.Qvalue" = "0.01", "EG.Qvalue" = "0.01"), filter_double_greater = NULL, intensity_col_sep = NULL, intensity_col_id = NULL, na_string = "0")
filename |
A long-format tab-separated text file with a primary column of protein identification, secondary columns of fragment ions, a column of sample names, a column for quantitative intensities, and extra columns for annotation. |
primary_id |
Unique values in this column form the list of proteins to be quantified. |
secondary_id |
A concatenation of these columns determines the fragment ions used for quantification. |
sample_id |
Unique values in this column form the list of samples. |
intensity_col |
The column for intensities. |
annotation_col |
Annotation columns |
filter_string_equal |
A named vector of strings. Only rows satisfying the condition are kept. |
filter_string_not_equal |
A named vector of strings. Only rows satisfying the condition are kept. |
filter_double_less |
A named vector of strings. Only rows satisfying the condition are kept. Default PG.Qvalue < 0.01 and EG.Qvalue < 0.01. |
filter_double_greater |
A named vector of strings. Only rows satisfying the condition are kept. |
intensity_col_sep |
A separator character when entries in the intensity column contain multiple values. |
intensity_col_id |
The column for identities of multiple quantitative values. |
na_string |
The value considered as NA. |
When entries in the intensity column contain multiple values, this function will replicate entries in other column and the secondary_id
will be appended with corresponding entries in intensity_col_id
when it is provided. Otherwise, integer values 1, 2, 3, etc... will be used.
A list is returned with following components
protein |
A table of proteins in the first column followed by annotation columns. |
sample |
A vector of samples. |
ion |
A vector of fragment ions to be used for quantification. |
quant_table |
A list of four components: protein_list (index pointing to |
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
Estimates protein abundances by aiming to maintain the fragment intensity ratios between samples.
maxLFQ(X)
maxLFQ(X)
X |
A matrix of ion intensities in log2 space. Columns are samples and rows are fragment ions. |
A list of two components is returned
estimate |
A vector with length equal to the number of columns of the input containing the protein abundances in log2 space. |
annotation |
An empty string if all quantified samples are connected. Otherwise, a string of membership of the connected components is returned. |
Thang V. Pham
Cox J, Hein MY, Luber CA, et al. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol Cell Proteomics. 2014;13(9):2513–2526.
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
Estimates protein abundances by averaging all associated ion intensities
meanInt(X, aggregation_in_log_space = TRUE)
meanInt(X, aggregation_in_log_space = TRUE)
X |
A matrix of ion intensities in log2 space. Columns are samples and rows are fragment ions. |
aggregation_in_log_space |
A logical value. If |
A list of two components is returned
estimate |
A vector with length equal to the number of columns of the input containing the protein abundances in log2 space. |
annotation |
Reserved, currently an empty string. |
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
Estimates protein abundances using the Tukey median polish algorithm.
median_polish(X)
median_polish(X)
X |
A matrix of ion intensities in log2 space. Columns are samples and rows are fragment ions. |
A list of two components is returned
estimate |
A vector with length equal to the number of columns of the input containing the protein abundances in log2 space. |
annotation |
Reserved, currently an empty string |
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
Tukey JW. Exploratory Data Analysis, Reading Massachusetts: Addison-Wesley, 1977.
Displays the underlying data for a protein.
plot_protein(X, main = "", col = NULL, split = 0.6, ...)
plot_protein(X, main = "", col = NULL, split = 0.6, ...)
X |
Protein data matrix. |
main |
Title of the plot. |
col |
Colors of the rows of the data matrix. |
split |
Fraction of the plotting area for the main figure. The remaining one is for legend. Set this parameter to |
... |
Additional parameters for plotting. |
A NULL
value is returned.
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
data("spikeins") head(spikeins) # This example set of spike-in proteins has been 'median-normalized'. norm_data <- iq::preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL) protein_list <- iq::create_protein_list(norm_data) iq::plot_protein(protein_list$P00366, main = "Protein P00366", split = NULL)
data("spikeins") head(spikeins) # This example set of spike-in proteins has been 'median-normalized'. norm_data <- iq::preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL) protein_list <- iq::create_protein_list(norm_data) iq::plot_protein(protein_list$P00366, main = "Protein P00366", split = NULL)
Prepares a long-format input including removing low-intensity ions and performing median normalization.
preprocess(quant_table, primary_id = "PG.ProteinGroups", secondary_id = c("EG.ModifiedSequence", "FG.Charge", "F.FrgIon", "F.Charge"), sample_id = "R.Condition", intensity_col = "F.PeakArea", median_normalization = TRUE, log2_intensity_cutoff = 0, pdf_out = "qc-plots.pdf", pdf_width = 12, pdf_height = 8, intensity_col_sep = NULL, intensity_col_id = NULL, na_string = "0")
preprocess(quant_table, primary_id = "PG.ProteinGroups", secondary_id = c("EG.ModifiedSequence", "FG.Charge", "F.FrgIon", "F.Charge"), sample_id = "R.Condition", intensity_col = "F.PeakArea", median_normalization = TRUE, log2_intensity_cutoff = 0, pdf_out = "qc-plots.pdf", pdf_width = 12, pdf_height = 8, intensity_col_sep = NULL, intensity_col_id = NULL, na_string = "0")
quant_table |
A long-format table with a primary column of protein identification, secondary columns of fragment ions, a column of sample names, and a column for quantitative intensities. |
primary_id |
Unique values in this column form the list of proteins to be quantified. |
secondary_id |
A concatenation of these columns determines the fragment ions used for quantification. |
sample_id |
Unique values in this column form the list of samples. |
intensity_col |
The column for intensities. |
median_normalization |
A logical value. The default |
log2_intensity_cutoff |
Entries lower than this value in log2 space are ignored. Plot a histogram of all intensities to set this parameter. |
pdf_out |
A character string specifying the name of the PDF output. A |
pdf_width |
Width of the pdf output in inches. |
pdf_height |
Height of the pdf output in inches. |
intensity_col_sep |
A separator character when entries in the intensity column contain multiple values. |
intensity_col_id |
The column for identities of multiple quantitative values. |
na_string |
The value considered as NA. |
When entries in the intensity column contain multiple values, this function will replicate entries in other column and the secondary_id
will be appended with corresponding entries in intensity_col_id
when it is provided. Otherwise, integer values 1, 2, 3, etc... will be used.
A data frame is returned with following components
protein_list |
A vector of proteins. |
sample_list |
A vector of samples. |
id |
A vector of fragment ions to be used for quantification. |
quant |
A vector of log2 intensities. |
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
data("spikeins") head(spikeins) # This example set of spike-in proteins has been 'median-normalized'. norm_data <- iq::preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL)
data("spikeins") head(spikeins) # This example set of spike-in proteins has been 'median-normalized'. norm_data <- iq::preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL)
A convenient function combining multiple steps to process a long format table using the MaxLFQ algorithm.
process_long_format(input_filename, output_filename, sample_id = "File.Name", primary_id = "Protein.Group", secondary_id = "Precursor.Id", intensity_col = "Fragment.Quant.Corrected", annotation_col = NULL, filter_string_equal = NULL, filter_string_not_equal = NULL, filter_double_less = c("Q.Value" = "0.01", "PG.Q.Value" = "0.01"), filter_double_greater = NULL, intensity_col_sep = ";", intensity_col_id = NULL, na_string = "0", normalization = "median", log2_intensity_cutoff = 0, pdf_out = "qc-plots.pdf", pdf_width = 12, pdf_height = 8, show_boxplot = TRUE, peptide_extractor = NULL)
process_long_format(input_filename, output_filename, sample_id = "File.Name", primary_id = "Protein.Group", secondary_id = "Precursor.Id", intensity_col = "Fragment.Quant.Corrected", annotation_col = NULL, filter_string_equal = NULL, filter_string_not_equal = NULL, filter_double_less = c("Q.Value" = "0.01", "PG.Q.Value" = "0.01"), filter_double_greater = NULL, intensity_col_sep = ";", intensity_col_id = NULL, na_string = "0", normalization = "median", log2_intensity_cutoff = 0, pdf_out = "qc-plots.pdf", pdf_width = 12, pdf_height = 8, show_boxplot = TRUE, peptide_extractor = NULL)
input_filename |
See |
output_filename |
Output filename. |
sample_id |
See |
primary_id |
See |
secondary_id |
See |
intensity_col |
See |
annotation_col |
See |
filter_string_equal |
See |
filter_string_not_equal |
See |
filter_double_less |
See |
filter_double_greater |
See |
intensity_col_sep |
See |
intensity_col_id |
See |
na_string |
See |
normalization |
Normalization type. Possible values are |
log2_intensity_cutoff |
See |
pdf_out |
See |
pdf_width |
See |
pdf_height |
See |
show_boxplot |
See |
peptide_extractor |
A function to parse peptides. |
After processing with fast_read
, fast_preprocess
, and fast_MaxLFQ
, the result table is written to output_filename
. The quantification values are in log2 space. A NULL
value is returned. If peptide_extractor
is not NULL
, fragment statistics for each protein will be calculated based on the result of the extractor function. Counting the number of peptides contributing to a protein is possible using an appropriate extractor function. An example value for peptide_extractor
is function(x) gsub("[0-9].*$", "", x)
, which removes the charge state and fragment descriptors in an ion descriptor to obtain unique peptide sequences. One can examine the ion
component returned by the fast_read
function to derive a regular expression to be used in the gsub
function above.
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
fast_read
, fast_preprocess
, fast_MaxLFQ
Collapses rows with identical values in a particular column in a table. When the values in each row are proportional such as intensities of multiple fragments of a protein, the MaxLFQ algorithm is recommended.
process_wide_format(input_filename, output_filename, id_column, quant_columns, data_in_log_space = FALSE, annotation_columns = NULL, method = "maxLFQ")
process_wide_format(input_filename, output_filename, id_column, quant_columns, data_in_log_space = FALSE, annotation_columns = NULL, method = "maxLFQ")
input_filename |
Input filename of a tab-separated value text file. |
output_filename |
Output filename. |
id_column |
The column where unique values will be kept. Rows with identical values in this column are merged. Rows with empty values here are removed. |
quant_columns |
Columns containing numerical data to be merged. |
data_in_log_space |
A logical value. If |
annotation_columns |
Columns in the input file apart from |
method |
Method for merging. Default value is |
Method "maxLFQ_R"
implements the MaxLFQ algorithm pure R. It is slower than "maxLFQ"
.
Method "maxInt"
selects row with maximum intensity (top 1).
Method "sum"
sum all intensities.
Method "least_na"
selects row with the least number of missing values.
The value of method
can be a function such as function(x) log2(colSums(2^x, na.rm = TRUE))
for summing all intensities in the original space.
The result table is written to output_filename
. A NULL
value is returned.
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.
A subset of the Bruderer 2015 dataset containing 12 spike-in proteins. The full dataset was exported from the Spectronaut software. The complete dataset has been median-normalized.
data("spikeins")
data("spikeins")
A data frame with 18189 observations on the following 9 variables.
R.Condition
Sample names.
PG.ProteinGroups
Protein identifiers.
EG.ModifiedSequence
Sequence of the fragment ions.
FG.Charge
Fragment group charge.
F.FrgIon
Fragment ions.
F.Charge
Fragment charges.
F.PeakArea
Quantitative values.
PG.Genes
Gene names.
PG.ProteinNames
Protein names.
data("spikeins") head(spikeins)
data("spikeins") head(spikeins)
Estimates protein abundances using the N most intense ions.
topN(X, N = 3, aggregation_in_log_space = TRUE)
topN(X, N = 3, aggregation_in_log_space = TRUE)
X |
A matrix of ion intensities in log2 space. Columns are samples and rows are fragment ions. |
N |
The number of top ions used for quantification. |
aggregation_in_log_space |
A logical value. If |
A list of two components is returned
estimate |
A vector with length equal to the number of columns of the input containing the protein abundances in log2 space. |
annotation |
Reserved, currently an empty string. |
Thang V. Pham
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020 Apr 15;36(8):2611-2613.