Title: | Automated Classification of Mass Spectra |
---|---|
Description: | Functions to classify mass spectra in known categories, and to determine discriminant mass-over-charge values. It includes easy-to-use functions for pre-processing mass spectra, functions to determine discriminant mass-over-charge values (m/z) from a library of mass spectra corresponding to different categories, and functions to predict the category (species, phenotypes, etc.) associated to a mass spectrum from a list of selected mass-over-charge values. Three vignettes illustrating how to use the functions of this package from real data sets are also available online to help users: <https://agodmer.github.io/MSclassifR_examples/Vignettes/Vignettemsclassifr_Ecrobiav3.html>, <https://agodmer.github.io/MSclassifR_examples/Vignettes/Vignettemsclassifr_Klebsiellav3.html> and <https://agodmer.github.io/MSclassifR_examples/Vignettes/Vignettemsclassifr_DAv3.html>. |
Authors: | Alexandre Godmer [aut, cre], Quentin Giai Gianetto [aut], Karen Druart [aut] |
Maintainer: | Alexandre Godmer <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.3.3 |
Built: | 2024-12-24 07:00:34 UTC |
Source: | CRAN |
Metadada of the CitrobacterRKIspectra
list of mass spectra.
data("CitrobacterRKImetadata", package = "MSclassifR")
data("CitrobacterRKImetadata", package = "MSclassifR")
A data frame with 14 rows (each corresponding to a mass spectrum), and five columns that contain (in order): the strain name, the species name, the spot, a sample number and the name of the strain associated with the spot.
The Robert Koch-Institute (RKI) database of microbial MALDI-TOF mass spectra contains raw mass spectra. Only mass spectra of the Citrobacter bacterial species were collected. Metadata were manually reported from raw data.
The raw data were downloaded from this link : https://zenodo.org/record/163517#.YIkWiNZuJCp. The dataset focuses only on mass spectra from Citrobacter.
Lasch, Peter, Stammler, Maren, & Schneider, Andy. (2018). Version 3 (20181130) of the MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly Pathogenic Microorganisms from the Robert Koch-Institute (RKI) [Data set]. Zenodo.doi:10.5281/zenodo.163517
Mass spectra of the CitrobacterRKIspectra
dataset.
data("CitrobacterRKIspectra", package = "MSclassifR") ##### #Plotting the first mass spectrum #library("MSclassifR") #PlotSpectra(SpectralData=CitrobacterRKIspectra[[1]],absx = "ALL", Peaks = NULL, # Peaks2 = NULL, col_spec = 1, col_peak = 2, shape_peak = 3, # col_peak2 = 2, shape_peak2 = 2)
data("CitrobacterRKIspectra", package = "MSclassifR") ##### #Plotting the first mass spectrum #library("MSclassifR") #PlotSpectra(SpectralData=CitrobacterRKIspectra[[1]],absx = "ALL", Peaks = NULL, # Peaks2 = NULL, col_spec = 1, col_peak = 2, shape_peak = 3, # col_peak2 = 2, shape_peak2 = 2)
A list that contains 14 objects of class S4 corresponding each to a each mass spectrum.
The Robert Koch-Institute (RKI) database of microbial MALDI-TOF mass spectra contains raw mass spectra. Only mass spectra of the Citrobacter bacterial species were collected.
The raw data were downloaded from this link : https://zenodo.org/record/163517#.YIkWiNZuJCp. The dataset focuses only on mass spectra from Citrobacter.
Lasch, Peter, Stammler, Maren, & Schneider, Andy. (2018). Version 3 (20181130) of the MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly Pathogenic Microorganisms from the Robert Koch-Institute (RKI) [Data set]. Zenodo.doi:10.5281/zenodo.163517
This function estimates a multinomial regression using cross-validation to predict the category (species, phenotypes...) to which a mass spectrum belongs from a set of shortlisted mass-over-charge values corresponding to discriminant peaks. Two main kinds of models can be estimated: linear or nonlinear (with neural networks, random forests, support vector machines with linear kernel, or eXtreme Gradient Boosting). Hyperparameters are randomly searched, except for the eXtreme Gradient Boosting where a grid search is performed.
LogReg(X, moz, Y, number = 2, repeats = 2, Metric = c("Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation"), kind="linear", Sampling = c("no", "up", "down", "smote"))
LogReg(X, moz, Y, number = 2, repeats = 2, Metric = c("Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation"), kind="linear", Sampling = c("no", "up", "down", "smote"))
X |
|
moz |
|
Y |
|
number |
|
Metric |
a |
repeats |
|
kind |
If |
Sampling |
a |
This function estimates a model from a library of mass spectra for which we already know the category to which they belong (ex.: species, etc). This model can next be used to predict the category of a new coming spectrum for which the category is unknown (see PredictLogReg
).
The estimation is performed using the train
function of the caret
R package. For each kind of model, random parameters are tested to find a model according to the best metric
. The formulas for the metric
are as follows:
The adjusted Rand index ("AdjRankIndex"
) is defined as the corrected-for-chance version of the Rand index which allows comparing two groups (see mclust
package and adjustedRandIndex()
function for more details). The Matthews correlation coefficient ("MatthewsCorrelation"
) is estimated using mcc
function in the mltools
R package.
The Sampling
methods available for imbalanced data are: "up"
to the up-sampling method which consists of random sampling (with replacement) so that the minority class is the same size as the majority class; "down"
to the down-sampling method which consists of random sampling (without replacement) of the majority class so that their class frequencies match the minority class; "smote"
to the Synthetic Minority Over sampling Technique (SMOTE) algorithm for data augmentation which consists of creating new data from minority class using the K Nearest Neighbor algorithm.
Returns a list
with four items:
train_mod |
a |
conf_mat |
a confusion matrix containing percentages classes of predicted categories in function of actual categories, resulting from repeated cross-validation. |
stats_global |
a |
boxplot |
a |
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of statistical software, 28(1), 1-26.
L. Hubert and P. Arabie (1985) Comparing Partitions, Journal of the Classification, 2, pp. 193-218.
Scrucca L, Fop M, Murphy TB, Raftery AE (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal.
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16, 1.
Matthews, B. W. (1975). "Comparison of the predicted and observed secondary structure of T4 phage lysozyme". Biochimica et Biophysica Acta (BBA) - Protein Structure. PMID 1180967.
library("MSclassifR") library("MALDIquant") ############################################################################### ## 1. Pre-processing of mass spectra # load mass spectra and their metadata data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR") # standard pre-processing of mass spectra spectra <- SignalProcessing(CitrobacterRKIspectra) # detection of peaks in pre-processed mass spectra peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE) # matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value) IntMat <- MALDIquant::intensityMatrix(peaks) rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot) # remove missing values in the matrix IntMat[is.na(IntMat)] <- 0 # normalize peaks according to the maximum intensity value for each mass spectrum IntMat <- apply(IntMat,1,function(x) x/(max(x))) # transpose the matrix for statistical analysis X <- t(IntMat) # define the known categories of mass spectra for the classification Y <- factor(CitrobacterRKImetadata$Species) ############################################################################### ## 2. Selection of discriminant mass-over-charge values using RFERF # with 5 to 10 variables, # up sampling method and # trained with the Accuracy coefficient metric a <- MSclassifR::SelectionVar(X, Y, MethodSelection = c("RFERF"), MethodValidation = c("cv"), PreProcessing = c("center","scale","nzv","corr"), NumberCV = 2, Metric = "Accuracy", Sizes = c(2:5), Sampling = "up") sel_moz=a$sel_moz ############################################################################### ## 3. Perform LogReg from shortlisted discriminant mass-over-charge values # linear multinomial regression # without sampling mehod # and trained with the Kappa coefficient metric model_lm=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, Metric = "Kappa") # Estimated model: model_lm # nonlinear multinomial regression using neural networks # with up-sampling method and # trained with the Kappa coefficient metric model_nn=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="nnet", Metric = "Kappa", Sampling = "up") # Estimated model: model_nn # nonlinear multinomial regression using random forests # without down-sampling method and # trained with the Kappa coefficient metric model_rf=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="rf", Metric = "Kappa", Sampling = "down") # Estimated model: model_rf # nonlinear multinomial regression using xgboost # with down-sampling method and # trained with the Kappa coefficient metric model_xgb=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="xgb", Metric = "Kappa", Sampling = "down") # Estimated model: model_xgb # nonlinear multinomial regression using svm # with down-sampling method and # trained with the Kappa coefficient metric model_svm=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="svm", Metric = "Kappa", Sampling = "down") # Estimated model: model_svm ########## # Of note, step 3 can be performed several times # to find optimal models # because of random hyperparameter search ############################################################################### ## 4. Select best models in term of average Kappa and saving it for reuse Kappa_model=c(model_lm$stats_global[1,2],model_nn$stats_global[1,2], model_rf$stats_global[1,2],model_xgb$stats_global[1,2],model_svm$stats_global[1,2]) names(Kappa_model)=c("lm","nn","rf","xgb","svm") #Best models in term of accuracy Kappa_model[which(Kappa_model==max(Kappa_model))] #save best models for reuse #models=list(model_lm$train_mod,model_nn$train_mod,model_rf$train_mod, #model_xgb$train_mod,model_svm$train_mod) #models_best=models[which(Kappa_model==max(Kappa_model))] #for (i in 1:length(models_best)){ #save(models_best[[i]], file = paste0("model_best_",i,".rda",collapse="") #} #load a saved model #load("model_best_1.rda") ############################################################################### ## 5. Try other metrics to select the best model # linear multinomial regression # with up-sampling method and # trained with the Adjusted Rank index metric model_lm=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=3, Metric = "AdjRankIndex", Sampling = "up")
library("MSclassifR") library("MALDIquant") ############################################################################### ## 1. Pre-processing of mass spectra # load mass spectra and their metadata data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR") # standard pre-processing of mass spectra spectra <- SignalProcessing(CitrobacterRKIspectra) # detection of peaks in pre-processed mass spectra peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE) # matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value) IntMat <- MALDIquant::intensityMatrix(peaks) rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot) # remove missing values in the matrix IntMat[is.na(IntMat)] <- 0 # normalize peaks according to the maximum intensity value for each mass spectrum IntMat <- apply(IntMat,1,function(x) x/(max(x))) # transpose the matrix for statistical analysis X <- t(IntMat) # define the known categories of mass spectra for the classification Y <- factor(CitrobacterRKImetadata$Species) ############################################################################### ## 2. Selection of discriminant mass-over-charge values using RFERF # with 5 to 10 variables, # up sampling method and # trained with the Accuracy coefficient metric a <- MSclassifR::SelectionVar(X, Y, MethodSelection = c("RFERF"), MethodValidation = c("cv"), PreProcessing = c("center","scale","nzv","corr"), NumberCV = 2, Metric = "Accuracy", Sizes = c(2:5), Sampling = "up") sel_moz=a$sel_moz ############################################################################### ## 3. Perform LogReg from shortlisted discriminant mass-over-charge values # linear multinomial regression # without sampling mehod # and trained with the Kappa coefficient metric model_lm=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, Metric = "Kappa") # Estimated model: model_lm # nonlinear multinomial regression using neural networks # with up-sampling method and # trained with the Kappa coefficient metric model_nn=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="nnet", Metric = "Kappa", Sampling = "up") # Estimated model: model_nn # nonlinear multinomial regression using random forests # without down-sampling method and # trained with the Kappa coefficient metric model_rf=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="rf", Metric = "Kappa", Sampling = "down") # Estimated model: model_rf # nonlinear multinomial regression using xgboost # with down-sampling method and # trained with the Kappa coefficient metric model_xgb=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="xgb", Metric = "Kappa", Sampling = "down") # Estimated model: model_xgb # nonlinear multinomial regression using svm # with down-sampling method and # trained with the Kappa coefficient metric model_svm=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="svm", Metric = "Kappa", Sampling = "down") # Estimated model: model_svm ########## # Of note, step 3 can be performed several times # to find optimal models # because of random hyperparameter search ############################################################################### ## 4. Select best models in term of average Kappa and saving it for reuse Kappa_model=c(model_lm$stats_global[1,2],model_nn$stats_global[1,2], model_rf$stats_global[1,2],model_xgb$stats_global[1,2],model_svm$stats_global[1,2]) names(Kappa_model)=c("lm","nn","rf","xgb","svm") #Best models in term of accuracy Kappa_model[which(Kappa_model==max(Kappa_model))] #save best models for reuse #models=list(model_lm$train_mod,model_nn$train_mod,model_rf$train_mod, #model_xgb$train_mod,model_svm$train_mod) #models_best=models[which(Kappa_model==max(Kappa_model))] #for (i in 1:length(models_best)){ #save(models_best[[i]], file = paste0("model_best_",i,".rda",collapse="") #} #load a saved model #load("model_best_1.rda") ############################################################################### ## 5. Try other metrics to select the best model # linear multinomial regression # with up-sampling method and # trained with the Adjusted Rank index metric model_lm=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=3, Metric = "AdjRankIndex", Sampling = "up")
This package provides R functions to classify mass spectra in known categories, and to determine discriminant mass-over-charge values. It was developed with the aim of identifying very similar species or phenotypes of bacteria from mass spectra obtained by Matrix Assisted Laser Desorption Ionisation - Time Of Flight Mass Spectrometry (MALDI-TOF MS). However, the different functions of this package can also be used to classify other categories associated to mass spectra; or from mass spectra obtained with other mass spectrometry techniques. It includes easy-to-use functions for pre-processing mass spectra, functions to determine discriminant mass-over-charge values (m/z) from a library of mass spectra corresponding to different categories, and functions to predict the category (species, phenotypes, etc.) associated to a mass spectrum from a list of selected mass-over-charge values.
No return value. Package description.
Alexandre Godmer, Quentin Giai Gianetto
MassSpectrum
objects.This function performs a data analysis pipeline to pre-process mass spectra. It provides average intensities and detects peaks using functions of R packages MALDIquant
and MALDIrppa
.
PeakDetection(x, averageMassSpec = TRUE, labels = NULL, averageMassSpectraMethod = "median", SNRdetection = 3, binPeaks = TRUE, PeakDetectionMethod = "MAD", halfWindowSizeDetection = 11, AlignMethod = "strict", Tolerance = 0.002, ...)
PeakDetection(x, averageMassSpec = TRUE, labels = NULL, averageMassSpectraMethod = "median", SNRdetection = 3, binPeaks = TRUE, PeakDetectionMethod = "MAD", halfWindowSizeDetection = 11, AlignMethod = "strict", Tolerance = 0.002, ...)
x |
a |
averageMassSpec |
a |
labels |
a |
averageMassSpectraMethod |
a |
PeakDetectionMethod |
a |
SNRdetection |
a |
binPeaks |
a |
halfWindowSizeDetection |
a |
AlignMethod |
a |
Tolerance |
a |
... |
other arguments from |
The PeakDetection
function provides an analysis pipeline for MassSpectrum
objects including peaks detection and binning.
All the methods used for PeakDetection
functions are selected from MALDIquant
and MALDIrppa
packages.
Returns a list of MassPeaks
objects (see MALDIquant
R package) for each mass spectrum in x
.
Gibb S, Strimmer K. MALDIquant: a versatile R package for the analysis of mass spectrometry data. Bioinformatics. 2012 Sep 1;28(17):2270-1. doi:10.1093/bioinformatics/bts447. Epub 2012 Jul 12. PMID: 22796955.
Javier Palarea-Albaladejo, Kevin Mclean, Frank Wright, David G E Smith, MALDIrppa: quality control and robust analysis for mass spectrometry data, Bioinformatics, Volume 34, Issue 3, 01 February 2018, Pages 522 - 523, doi:10.1093/bioinformatics/btx628
library("MALDIquant") library("MSclassifR") ## Load mass spectra and metadata data("CitrobacterRKIspectra", "CitrobacterRKImetadata", package = "MSclassifR") ## Pre-processing of mass spectra spectra <- SignalProcessing(CitrobacterRKIspectra) ## Detection of peaks in pre-processed mass spectra peaks <- PeakDetection(x = spectra, averageMassSpec = FALSE, labels = CitrobacterRKImetadata$Strain_name_spot, averageMassSpectraMethod = "median", SNRdetection = 3, binPeaks = TRUE, halfWindowSizeDetection = 11, AlignFrequency = 0.20, AlignMethod = "strict", Tolerance = 0.002) # Plot peaks on a pre-processed mass spectrum PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],col_spec="blue",col_peak="black")
library("MALDIquant") library("MSclassifR") ## Load mass spectra and metadata data("CitrobacterRKIspectra", "CitrobacterRKImetadata", package = "MSclassifR") ## Pre-processing of mass spectra spectra <- SignalProcessing(CitrobacterRKIspectra) ## Detection of peaks in pre-processed mass spectra peaks <- PeakDetection(x = spectra, averageMassSpec = FALSE, labels = CitrobacterRKImetadata$Strain_name_spot, averageMassSpectraMethod = "median", SNRdetection = 3, binPeaks = TRUE, halfWindowSizeDetection = 11, AlignFrequency = 0.20, AlignMethod = "strict", Tolerance = 0.002) # Plot peaks on a pre-processed mass spectrum PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],col_spec="blue",col_peak="black")
This function performs a plot of a AbstractMassObject
object (see the MALDIquant
R package). It can be used to highlight peaks in a mass spectrum.
PlotSpectra(SpectralData, absx="ALL", Peaks=NULL, Peaks2=NULL, col_spec=1, col_peak=2, shape_peak=3, col_peak2=2, shape_peak2=2)
PlotSpectra(SpectralData, absx="ALL", Peaks=NULL, Peaks2=NULL, col_spec=1, col_peak=2, shape_peak=3, col_peak2=2, shape_peak2=2)
SpectralData |
|
absx |
|
Peaks |
|
Peaks2 |
numeric |
col_spec |
color of the mass spectrum. |
col_peak |
color of the peak points corresponding to |
shape_peak |
shape of the peak points corresponding to |
col_peak2 |
color of the peak points corresponding to |
shape_peak2 |
Shape of the peak points corresponding to |
A ggplot
object (see ggplot2
R package). Mass-over-charge values are in x-axis and intensities in y-axis.
library("MSclassifR") # Load mass spectra data("CitrobacterRKIspectra", package = "MSclassifR") # Plot raw mass spectrum PlotSpectra(SpectralData = CitrobacterRKIspectra[[1]]) # standard pre-processing of mass spectra spectra <- SignalProcessing(CitrobacterRKIspectra) # Plot pre-processed mass spectrum PlotSpectra(SpectralData=spectra[[1]]) # detection of peaks in pre-processed mass spectra peaks <- PeakDetection(x = spectra, averageMassSpec=FALSE) # Plot peaks on pre-processed mass spectrum PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],col_spec="blue",col_peak="black")
library("MSclassifR") # Load mass spectra data("CitrobacterRKIspectra", package = "MSclassifR") # Plot raw mass spectrum PlotSpectra(SpectralData = CitrobacterRKIspectra[[1]]) # standard pre-processing of mass spectra spectra <- SignalProcessing(CitrobacterRKIspectra) # Plot pre-processed mass spectrum PlotSpectra(SpectralData=spectra[[1]]) # detection of peaks in pre-processed mass spectra peaks <- PeakDetection(x = spectra, averageMassSpec=FALSE) # Plot peaks on pre-processed mass spectrum PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],col_spec="blue",col_peak="black")
For each mass peak in a list of mass peaks, a linear regression is performed between the mass spectrum and mass spectra corresponding to a category. This is performed for each category and associated to an Akaike Information Criterium. Next, the AIC are used to determine the belonging of a mass spectrum to a category. It also provides a probability that the mass spectrum does not belong to any of the input categories.
PredictFastClass(peaks, mod_peaks, Y_mod_peaks, moz="ALL", tolerance = 6, toleranceStep = 2, normalizeFun = TRUE, noMatch = 0)
PredictFastClass(peaks, mod_peaks, Y_mod_peaks, moz="ALL", tolerance = 6, toleranceStep = 2, normalizeFun = TRUE, noMatch = 0)
peaks |
a list of |
mod_peaks |
an intensity matrix corresponding to mass spectra for which the category is known. Each column is a mass-over-charge value, each row corresponds to a mass spectrum. |
Y_mod_peaks |
a |
moz |
a |
tolerance |
a |
toleranceStep |
a |
normalizeFun |
a |
noMatch |
a |
Returns a dataframe
containing AIC criteria by category for each mass spectrum in peaks
. The AIC criterion should be minimal for the most probable category. The pred_cat
column is the predicted category for each mass spectrum in peaks
. The p_not_in_DB
is the minimal p-value of several Fisher tests testing if all the linear coefficients associated to mass spectra of a category are null. It can be interpreted as a p-value that the mass spectrum is not present in the input database.
library("MSclassifR") library("MALDIquant") # load mass spectra and their metadata data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR") # standard pre-processing of mass spectra spectra <- SignalProcessing(CitrobacterRKIspectra) # detection of peaks in pre-processed mass spectra peaks <- peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE) # matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value) IntMat <- MALDIquant::intensityMatrix(peaks) rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot) # remove missing values in the matrix IntMat[is.na(IntMat)] <- 0 # normalize peaks according to the maximum intensity value for each mass spectrum IntMat <- apply(IntMat,1,function(x) x/(max(x))) # transpose the matrix for statistical analysis X <- t(IntMat) # define the known categories of mass spectra for the classification Y <- factor(CitrobacterRKImetadata$Species) #Predict species without peak selection using a tolerance of 1 Da res = PredictFastClass(peaks=peaks[1:5], mod_peaks=X, Y_mod_peaks=Y, tolerance = 1) #comparing predicted categories (species) and the truth cbind(res$pred_cat,as.character(Y[1:5])) # The method can be applied after a peak selection step a <- SelectionVar(X, Y, MethodSelection = c("RFERF"), MethodValidation = c("cv"), PreProcessing = c("center","scale","nzv","corr"), NumberCV = 2, Metric = "Kappa", Sizes = c(20:40), Sampling = "up") #Predict species from selected peaks using a tolerance of 1 Da res = PredictFastClass(peaks=peaks[1:5], moz = a$sel_moz, mod_peaks=X, Y_mod_peaks=Y, tolerance = 1) #comparing predicted categories (species) and the truth cbind(res$pred_cat,as.character(Y[1:5]))
library("MSclassifR") library("MALDIquant") # load mass spectra and their metadata data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR") # standard pre-processing of mass spectra spectra <- SignalProcessing(CitrobacterRKIspectra) # detection of peaks in pre-processed mass spectra peaks <- peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE) # matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value) IntMat <- MALDIquant::intensityMatrix(peaks) rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot) # remove missing values in the matrix IntMat[is.na(IntMat)] <- 0 # normalize peaks according to the maximum intensity value for each mass spectrum IntMat <- apply(IntMat,1,function(x) x/(max(x))) # transpose the matrix for statistical analysis X <- t(IntMat) # define the known categories of mass spectra for the classification Y <- factor(CitrobacterRKImetadata$Species) #Predict species without peak selection using a tolerance of 1 Da res = PredictFastClass(peaks=peaks[1:5], mod_peaks=X, Y_mod_peaks=Y, tolerance = 1) #comparing predicted categories (species) and the truth cbind(res$pred_cat,as.character(Y[1:5])) # The method can be applied after a peak selection step a <- SelectionVar(X, Y, MethodSelection = c("RFERF"), MethodValidation = c("cv"), PreProcessing = c("center","scale","nzv","corr"), NumberCV = 2, Metric = "Kappa", Sizes = c(20:40), Sampling = "up") #Predict species from selected peaks using a tolerance of 1 Da res = PredictFastClass(peaks=peaks[1:5], moz = a$sel_moz, mod_peaks=X, Y_mod_peaks=Y, tolerance = 1) #comparing predicted categories (species) and the truth cbind(res$pred_cat,as.character(Y[1:5]))
This function predicts the category (species, phenotypes...) to which a mass spectrum belongs from a set of shortlisted mass-over-charge values of interest and a short-listed multinomial logistic regression model (see LogReg
).
PredictLogReg(peaks, model, moz, tolerance = 6, toleranceStep = 2, normalizeFun = TRUE, noMatch=0, Reference = NULL)
PredictLogReg(peaks, model, moz, tolerance = 6, toleranceStep = 2, normalizeFun = TRUE, noMatch=0, Reference = NULL)
peaks |
a list of |
model |
a model or a list of models estimated from a set of shortlisted mass-over-charge values (output of the |
moz |
a |
tolerance |
a |
toleranceStep |
a |
normalizeFun |
a |
noMatch |
a |
Reference |
a |
The PredictLogReg
function allows predicting the membership of a mass spectrum to a category from a multinomial regression model. The mass spectrum from the peaks
object will be matched to the discriminant mass-over-chage (m/z) values (sel_moz
object from the SelectionVar
or SelectionVarStat
functions) with a tolerance between 2 m/z and defined by the tolerance
parameter (by default this value is 6 Da). If a repetition of a same m/z occurs in the selection, only the m/z that is closest in mass peaks (moz
) is used. When no match, intensity values are replaced by the noMatch
argument. If no m/z values from peaks
object matched with the m/z in the object moz
, the tolerance will be increased according to a numeric value defined in the toleranceStep
parameter and a warning will be notified. Note that it is possible to not perform the SelectionVar
function prior to the PredictLogReg
function, and to replace the argument moz
by all the m/z values present in a mass spectrum.
Returns a dataframe
containing probabilities of membership by category for each mass spectrum in peaks
. The method used is provided in the method
column. The comb_fisher
method is the result of the Fisher's method when merging probabilities of membership of used prediction models.The max_vote
method is the result of the maximum voting from used prediction models.
If the Reference
parameter is not null, the function returns:
Confusion.Matrix |
a |
Gobal.stat |
a |
Details.stat |
a |
Correct.ClassificationFreq |
a |
Incorrect.ClassificationFreq |
a |
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of statistical software, 28(1), 1-26.
library("MSclassifR") library("MALDIquant") ############################################################################### ## 1. Pre-processing of mass spectra # load mass spectra and their metadata data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR") # standard pre-processing of mass spectra spectra <- SignalProcessing(CitrobacterRKIspectra) # detection of peaks in pre-processed mass spectra peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE) # matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value) IntMat <- MALDIquant::intensityMatrix(peaks) rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot) # remove missing values in the matrix IntMat[is.na(IntMat)] <- 0 # normalize peaks according to the maximum intensity value for each mass spectrum IntMat <- apply(IntMat,1,function(x) x/(max(x))) # transpose the matrix for statistical analysis X <- t(IntMat) # define the known categories of mass spectra for the classification Y <- factor(CitrobacterRKImetadata$Species) ############################################################################### ## 2. Selection of discriminant mass-over-charge values using RFERF # with 5 to 10 variables, # without sampling method and trained # with the Accuracy coefficient metric a <- MSclassifR::SelectionVar(X, Y, MethodSelection = c("RFERF"), MethodValidation = c("cv"), PreProcessing = c("center","scale","nzv","corr"), NumberCV = 2, Metric = "Kappa", Sizes = c(5:10)) sel_moz=a$sel_moz ############################################################################### ## 3. Perform LogReg from shortlisted discriminant mass-over-charge values # linear multinomial regression # without sampling mehod and # trained with the Kappa coefficient metric model_lm=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, Metric = "Kappa") # Estimated model: model_lm # nonlinear multinomial regression using neural networks # with up-sampling method and # trained with the Kappa coefficient metric model_nn=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="nnet", Metric = "Kappa", Sampling = "up") # Estimated model: model_nn # nonlinear multinomial regression using random forests # without down-sampling method and # trained with the Kappa coefficient metric model_rf=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="rf", Metric = "Kappa", Sampling = "down") # Estimated model: model_rf # nonlinear multinomial regression using xgboost # with down-sampling method and # trained with the Kappa coefficient metric model_xgb=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="xgb", Metric = "Kappa", Sampling = "down") # Estimated model: model_xgb # nonlinear multinomial regression using svm # with down-sampling method and # trained with the Kappa coefficient metric model_svm=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="svm", Metric = "Kappa", Sampling = "down") # Estimated model: model_svm # Of note, you can also load a model already saved # (see example in LogReg function) for the next step ############################################################################### ## 4. Probabilities of belonging to each category for the mass spectra ## and associated statitics # Collect all the estimated models in a list Models <- list(model_lm$train_mod, model_nn$train_mod, model_rf$train_mod, model_xgb$train_mod, model_svm$train_mod) # Predict classes of mass spectra with 6 Da of tolerance for matching peaks. prob_cat=MSclassifR::PredictLogReg(peaks = peaks[c(1:5)], model = Models, moz = sel_moz, tolerance = 6, Reference = Y[c(1:5)]) ################################################################################ ## 5. Example of meta-classifiers based on several random forest models ## to optimize a Kappa value using the SMOTE method for imbalanced datasets. ## -> a merge of the prediction probabilities using the Fisher's method ## leads generally to robust prediction models. #Selecting peaks with mda method a=SelectionVar(X,Y,MethodSelection="mda",Ntree=5*ncol(X)) sel_moz=a$sel_moz #Building 4 Random Forest models models=NULL;nbm=4; for (i in 1:nbm){ model_rf=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=5, repeats=5, kind="rf", Metric = "Kappa", Sampling = "smote") models <- c(models,list(model_rf$train_mod)) } #Combining their prediction probabilities prob_cat=MSclassifR::PredictLogReg(peaks = peaks,model = models,moz = sel_moz, tolerance = 6,Reference = Y)
library("MSclassifR") library("MALDIquant") ############################################################################### ## 1. Pre-processing of mass spectra # load mass spectra and their metadata data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR") # standard pre-processing of mass spectra spectra <- SignalProcessing(CitrobacterRKIspectra) # detection of peaks in pre-processed mass spectra peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE) # matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value) IntMat <- MALDIquant::intensityMatrix(peaks) rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot) # remove missing values in the matrix IntMat[is.na(IntMat)] <- 0 # normalize peaks according to the maximum intensity value for each mass spectrum IntMat <- apply(IntMat,1,function(x) x/(max(x))) # transpose the matrix for statistical analysis X <- t(IntMat) # define the known categories of mass spectra for the classification Y <- factor(CitrobacterRKImetadata$Species) ############################################################################### ## 2. Selection of discriminant mass-over-charge values using RFERF # with 5 to 10 variables, # without sampling method and trained # with the Accuracy coefficient metric a <- MSclassifR::SelectionVar(X, Y, MethodSelection = c("RFERF"), MethodValidation = c("cv"), PreProcessing = c("center","scale","nzv","corr"), NumberCV = 2, Metric = "Kappa", Sizes = c(5:10)) sel_moz=a$sel_moz ############################################################################### ## 3. Perform LogReg from shortlisted discriminant mass-over-charge values # linear multinomial regression # without sampling mehod and # trained with the Kappa coefficient metric model_lm=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, Metric = "Kappa") # Estimated model: model_lm # nonlinear multinomial regression using neural networks # with up-sampling method and # trained with the Kappa coefficient metric model_nn=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="nnet", Metric = "Kappa", Sampling = "up") # Estimated model: model_nn # nonlinear multinomial regression using random forests # without down-sampling method and # trained with the Kappa coefficient metric model_rf=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="rf", Metric = "Kappa", Sampling = "down") # Estimated model: model_rf # nonlinear multinomial regression using xgboost # with down-sampling method and # trained with the Kappa coefficient metric model_xgb=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="xgb", Metric = "Kappa", Sampling = "down") # Estimated model: model_xgb # nonlinear multinomial regression using svm # with down-sampling method and # trained with the Kappa coefficient metric model_svm=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=2, repeats=2, kind="svm", Metric = "Kappa", Sampling = "down") # Estimated model: model_svm # Of note, you can also load a model already saved # (see example in LogReg function) for the next step ############################################################################### ## 4. Probabilities of belonging to each category for the mass spectra ## and associated statitics # Collect all the estimated models in a list Models <- list(model_lm$train_mod, model_nn$train_mod, model_rf$train_mod, model_xgb$train_mod, model_svm$train_mod) # Predict classes of mass spectra with 6 Da of tolerance for matching peaks. prob_cat=MSclassifR::PredictLogReg(peaks = peaks[c(1:5)], model = Models, moz = sel_moz, tolerance = 6, Reference = Y[c(1:5)]) ################################################################################ ## 5. Example of meta-classifiers based on several random forest models ## to optimize a Kappa value using the SMOTE method for imbalanced datasets. ## -> a merge of the prediction probabilities using the Fisher's method ## leads generally to robust prediction models. #Selecting peaks with mda method a=SelectionVar(X,Y,MethodSelection="mda",Ntree=5*ncol(X)) sel_moz=a$sel_moz #Building 4 Random Forest models models=NULL;nbm=4; for (i in 1:nbm){ model_rf=MSclassifR::LogReg(X=X, moz=sel_moz, Y=factor(Y), number=5, repeats=5, kind="rf", Metric = "Kappa", Sampling = "smote") models <- c(models,list(model_rf$train_mod)) } #Combining their prediction probabilities prob_cat=MSclassifR::PredictLogReg(peaks = peaks,model = models,moz = sel_moz, tolerance = 6,Reference = Y)
This function performs variable selection (i.e. selection of discriminant mass-over-charge values) using either recursive feature elimination (RFE) algorithm with Random Forest, or logistic regression model, or sparse partial least squares discriminant analysis (sPLS-DA) or methods based on the distribution of variable importances of random forests.
SelectionVar(X, Y, MethodSelection = c("RFERF", "RFEGlmnet", "VSURF", "sPLSDA", "mda", "cvp"), MethodValidation = c("cv", "repeatedcv", "LOOCV"), PreProcessing = c("center", "scale", "nzv", "corr"), Metric = c("Kappa", "Accuracy"), Sampling = c("no", "up","down", "smote"), NumberCV = NULL, RepeatsCV = NULL, Sizes, Ntree = 1000, ncores = 2, threshold = 0.01, ncomp.max = 10, nbf=0)
SelectionVar(X, Y, MethodSelection = c("RFERF", "RFEGlmnet", "VSURF", "sPLSDA", "mda", "cvp"), MethodValidation = c("cv", "repeatedcv", "LOOCV"), PreProcessing = c("center", "scale", "nzv", "corr"), Metric = c("Kappa", "Accuracy"), Sampling = c("no", "up","down", "smote"), NumberCV = NULL, RepeatsCV = NULL, Sizes, Ntree = 1000, ncores = 2, threshold = 0.01, ncomp.max = 10, nbf=0)
X |
a numeric |
Y |
a |
MethodSelection |
a |
MethodValidation |
a |
NumberCV |
a |
RepeatsCV |
a |
PreProcessing |
a |
Metric |
a |
Sampling |
a |
Sizes |
a numeric |
Ntree |
a |
ncores |
a |
ncomp.max |
a |
threshold |
a |
nbf |
a |
The selection of variables can be carried out with two different objectives: either to find a minimum number of variables allowing to obtain the highest possible accuracy (or Kappa coefficient), which involves the possible elimination of variables correlated between them (i.e. not bringing any additional predictive power with respect to some other variables); or to find all the variables in the dataset with a potential predictive power ("discriminant" variables).
The VSURF
method attempts to accomplish only the first objective.
The mda
and cvp
methods attempt to accomplish the second objective, as do the methods available in the SelectionVarStat
function of our MSclassifR
R package.
The RFERF
, RFEGlmnet
and sPLSDA
methods take as input a number of variables to be selected(Sizes
argument), and can therefore be used with both objectives.
Within the framework of the second objective, either the mda
or cvp
methods can be used to estimate a number of discriminant variables from the importances of variables. The SelectionVarStat
function can also be used to estimate this number from distributions of p-values. Of note, be sure that the Ntree
argument is high enough to get a robust estimation with the mda
or cvp
methods.
The "RFEGlmnet"
and "RFERF"
methods are based on recursive feature elimination and can either optimize the kappa coefficient or the accuracy as metrics when selecting variables.
The "sPLSDA"
method selects variables from the ones kept in latent components of the sparse PLS-DA model using an automatic choice of the number of components (when the balanced classification error rate (BER) reaches a plateau - see argument threshold
).
The "mda"
and "cvp"
methods use the distribution of variable importances to estimate the number of discriminant features (mass-over-charge values). Briefly, the distribution of variable importances for useless (not discriminant) features is firstly estimated from negative importance variables by the method proposed in section 2.6 of Janitza et al.(2018). Next, the following mixture model is assumed:
where
is the empirical cumulative distribution of variable importances of all the features,
the one of the useless features,
the one of the discriminative features, and
is the proportion of useless features in the dataset.
From the estimated distribution of useless features, we can estimate quantile values
and compute
for each quantile
. The minimum of the
corresponds to the estimated proportion of useless features in the dataset, what allows estimating the number of discriminant features by
where N is the total number of features. Next, the
features with the highest variable importances are selected.
The "VSURF"
and "sPLSDA"
methods use the minimum mean out-of-bag (OOB) and balanced classification error rate (BER) metrics respectively.
For Sampling
methods available for unbalanced data: "up"
corresponds to the up-sampling method which consists of random sampling (with replacement) so that the minority class is the same size as the majority class; "down"
corresponds to the down-sampling method randomly which consists of random sampling (without replacement) of the majority class so that their class frequencies match the minority class; "smote"
corresponds to the Synthetic Minority Over sampling Technique (SMOTE) specific algorithm for data augmentation which consist of creates new data from minority class using the K Nearest Neighbor algorithm.
See rfe
in the caret
R package, VSURF
in the VSURF
R package, splsda
in the mixOmics
R package, importance
function in the randomForest
R package, and CVPVI
function in the vita
R package for more details.
A list composed of:
sel_moz |
a |
For the "RFERF"
and "RFEGlmnet"
methods, it also returns the results of the rfe
function of the caret
R package.
For the "VSURF"
method, it also returns the results of the results of the VSURF
function of the VSURF
R package.
For the "sPLSDA"
method, it also returns the following items:
Raw_data |
a horizontal bar plot and containing the contribution of features on each component. |
selected_variables |
|
For the "mda"
and "cvp"
methods, it also returns the following items:
nb_to_sel |
a numeric value corresponding to an estimated number of mass-over-chage values where the intensities are significantly different between categories (see details). |
imp_sel |
a vector containing the variable importances for the selected features. |
Kuhn, Max. (2012). The caret Package. Journal of Statistical Software. 28.
Genuer, Robin, Jean-Michel Poggi and Christine Tuleau-Malot. VSURF : An R Package for Variable Selection Using Random Forests. R J. 7 (2015): 19.
Friedman J, Hastie T, Tibshirani R (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22.
Kim-Anh Le Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois, Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics. Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16, 1 (January 2002), 321–357.
Branco P, Ribeiro R, Torgo L (2016). “UBL: an R Package for Utility-Based Learning.” CoRR, abs/1604.08079.
Janitza, S., Celik, E., Boulesteix, A. L. (2018). A computationally fast variable importance test for random forests for high-dimensional data. Advances in Data Analysis and Classification, 12, 885-915.
library("MSclassifR") library("MALDIquant") ############################################################################### ## 1. Pre-processing of mass spectra # load mass spectra and their metadata data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR") # standard pre-processing of mass spectra spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra) # detection of peaks in pre-processed mass spectra peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE) # matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value) IntMat <- MALDIquant::intensityMatrix(peaks) rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot) # remove missing values in the matrix IntMat[is.na(IntMat)] <- 0 # normalize peaks according to the maximum intensity value for each mass spectrum IntMat <- apply(IntMat,1,function(x) x/(max(x))) # transpose the matrix for statistical analysis X <- t(IntMat) # define the known categories of mass spectra for the classification Y <- factor(CitrobacterRKImetadata$Species) ############################################################################### ## 2. Perform variables selection using SelectionVar with RFE and random forest # with 5 to 10 variables, # up sampling method and trained with the Kappa coefficient metric a <- SelectionVar(X, Y, MethodSelection = c("RFERF"), MethodValidation = c("cv"), PreProcessing = c("center","scale","nzv","corr"), NumberCV = 2, Metric = "Kappa", Sizes = c(5:10), Sampling = "up") # Plotting peaks on the first pre-processed mass spectrum and highlighting the # discriminant mass-over-charge values with red lines PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]], Peaks2=a$sel_moz,col_spec="blue",col_peak="black") ############################################################################### ## 3. Perform variables selection using SelectionVar with VSURF # This function can last a few minutes b <- SelectionVar(X, Y, MethodSelection = c("VSURF")) summary(b$result) ############################################################################### ## 4. Perform variables selection using SelectionVar with "mda" or "cvp" # option 1: Using mean decrease in accuracy # with no sampling method c <- SelectionVar(X,Y,MethodSelection="mda",Ntree=10*ncol(X)) # Estimation of the number of peaks to discriminate species c$nb_to_sel # Discriminant mass-over-charge values c$sel_moz # Plotting peaks on the first pre-processed mass spectrum and highlighting the # discriminant mass-over-charge values with red lines PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]], Peaks2=c$sel_moz,col_spec="blue",col_peak="black") # option 2: Using cross-validated permutation variable importance measures (more "time-consuming") # with no sampling method d <- SelectionVar(X,Y,MethodSelection="cvp",NumberCV=2,ncores=2,Ntree=1000) # Estimation of the number of peaks to discriminate species d$nb_to_sel # Discriminant mass-over-charge values d$sel_moz # Plotting peaks on the first pre-processed mass spectrum and highlighting the # discriminant mass-over-charge values with red lines PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]], Peaks2=d$sel_moz,col_spec="blue",col_peak="black") # Mass-over charge values found with both methods ("mda" and "cvp") intersect(c$sel_moz,d$sel_moz)
library("MSclassifR") library("MALDIquant") ############################################################################### ## 1. Pre-processing of mass spectra # load mass spectra and their metadata data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR") # standard pre-processing of mass spectra spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra) # detection of peaks in pre-processed mass spectra peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE) # matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value) IntMat <- MALDIquant::intensityMatrix(peaks) rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot) # remove missing values in the matrix IntMat[is.na(IntMat)] <- 0 # normalize peaks according to the maximum intensity value for each mass spectrum IntMat <- apply(IntMat,1,function(x) x/(max(x))) # transpose the matrix for statistical analysis X <- t(IntMat) # define the known categories of mass spectra for the classification Y <- factor(CitrobacterRKImetadata$Species) ############################################################################### ## 2. Perform variables selection using SelectionVar with RFE and random forest # with 5 to 10 variables, # up sampling method and trained with the Kappa coefficient metric a <- SelectionVar(X, Y, MethodSelection = c("RFERF"), MethodValidation = c("cv"), PreProcessing = c("center","scale","nzv","corr"), NumberCV = 2, Metric = "Kappa", Sizes = c(5:10), Sampling = "up") # Plotting peaks on the first pre-processed mass spectrum and highlighting the # discriminant mass-over-charge values with red lines PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]], Peaks2=a$sel_moz,col_spec="blue",col_peak="black") ############################################################################### ## 3. Perform variables selection using SelectionVar with VSURF # This function can last a few minutes b <- SelectionVar(X, Y, MethodSelection = c("VSURF")) summary(b$result) ############################################################################### ## 4. Perform variables selection using SelectionVar with "mda" or "cvp" # option 1: Using mean decrease in accuracy # with no sampling method c <- SelectionVar(X,Y,MethodSelection="mda",Ntree=10*ncol(X)) # Estimation of the number of peaks to discriminate species c$nb_to_sel # Discriminant mass-over-charge values c$sel_moz # Plotting peaks on the first pre-processed mass spectrum and highlighting the # discriminant mass-over-charge values with red lines PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]], Peaks2=c$sel_moz,col_spec="blue",col_peak="black") # option 2: Using cross-validated permutation variable importance measures (more "time-consuming") # with no sampling method d <- SelectionVar(X,Y,MethodSelection="cvp",NumberCV=2,ncores=2,Ntree=1000) # Estimation of the number of peaks to discriminate species d$nb_to_sel # Discriminant mass-over-charge values d$sel_moz # Plotting peaks on the first pre-processed mass spectrum and highlighting the # discriminant mass-over-charge values with red lines PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]], Peaks2=d$sel_moz,col_spec="blue",col_peak="black") # Mass-over charge values found with both methods ("mda" and "cvp") intersect(c$sel_moz,d$sel_moz)
This function performs a statistical test for each mass-over-charge value to determine which are discriminants between categories. Using the distribution of resulting multiple p-values, it determines an expected number of discriminant features, and adjusted p-values that can be used to control a false discovery rate threshold.
SelectionVarStat(X, Y, stat.test = "Limma", pi0.method="abh", fdr=0.05, Sampling = c("no", "up","down", "smote"))
SelectionVarStat(X, Y, stat.test = "Limma", pi0.method="abh", fdr=0.05, Sampling = c("no", "up","down", "smote"))
X |
a |
Y |
a |
stat.test |
a |
pi0.method |
a |
fdr |
a |
Sampling |
a |
The SelectionVarStat
function allows performing "quick" classification of mass-over-charge values. It tries to find all the mass-over-charge values (or the number of mass-over-charge values) that are discriminant between categories. This can conduct to select "correlated" mass-over-charge values (i.e. associated to intensities evolving similarly between categories). By default, multiple moderated t-tests using the limma
R package (bayesian regularization of variances) are performed and the p-values are corrected using an adaptive Benjamini and Hochberg procedure to control the false discovery rate. Different ways to estimate the proportion of true null hypotheses (object pi0
returned by the function - see the cp4p
R package for details) can be used for the adaptive Benjamini-Hochberg procedure ("abh
" by defaut).
A list composed of:
nb_to_sel |
a |
sel_moz |
a |
ap |
a |
Gianetto, Quentin & Combes, Florence & Ramus, Claire & Bruley, Christophe & Coute, Yohann & Burger, Thomas. (2015). Technical Brief Calibration Plot for Proteomics (CP4P): A graphical tool to visually check the assumptions underlying FDR control in quantitative experiments. Proteomics. 16. 10.1002/pmic.201500189.
library("MSclassifR") library("MALDIquant") ############################################################################### ## 1. Pre-processing of mass spectra # load mass spectra and their metadata data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR") # standard pre-processing of mass spectra spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra) # detection of peaks in pre-processed mass spectra peaks <- MSclassifR::PeakDetection(x = spectra, labels = CitrobacterRKImetadata$Strain_name_spot) # matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value) IntMat <- MALDIquant::intensityMatrix(peaks) rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot) # remove missing values in the matrix IntMat[is.na(IntMat)] <- 0 # normalize peaks according to the maximum intensity value for each mass spectrum IntMat <- apply(IntMat,1,function(x) x/(max(x))) # transpose the matrix for statistical analysis X <- t(IntMat) # define the known categories of mass spectra for the classification Y <- factor(CitrobacterRKImetadata$Species) ############################################################################### ## 2. Estimate the optimal number of peaks to discriminate the different species OptiPeaks <- SelectionVarStat(X, Y, stat.test = "Limma", pi0.method="abh", fdr=0.05, Sampling="smote") ## Estimation of the optimal number of peaks to discriminate species (from the pi0 parameter) OptiPeaks$nb_to_sel ## discriminant mass-over-chage values estimated using a 5 per cent false discovery rate OptiPeaks$sel_moz ## p-values and adjusted p-values estimated for all the tested mass-over-charge values OptiPeaks$ap$adjp
library("MSclassifR") library("MALDIquant") ############################################################################### ## 1. Pre-processing of mass spectra # load mass spectra and their metadata data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR") # standard pre-processing of mass spectra spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra) # detection of peaks in pre-processed mass spectra peaks <- MSclassifR::PeakDetection(x = spectra, labels = CitrobacterRKImetadata$Strain_name_spot) # matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value) IntMat <- MALDIquant::intensityMatrix(peaks) rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot) # remove missing values in the matrix IntMat[is.na(IntMat)] <- 0 # normalize peaks according to the maximum intensity value for each mass spectrum IntMat <- apply(IntMat,1,function(x) x/(max(x))) # transpose the matrix for statistical analysis X <- t(IntMat) # define the known categories of mass spectra for the classification Y <- factor(CitrobacterRKImetadata$Species) ############################################################################### ## 2. Estimate the optimal number of peaks to discriminate the different species OptiPeaks <- SelectionVarStat(X, Y, stat.test = "Limma", pi0.method="abh", fdr=0.05, Sampling="smote") ## Estimation of the optimal number of peaks to discriminate species (from the pi0 parameter) OptiPeaks$nb_to_sel ## discriminant mass-over-chage values estimated using a 5 per cent false discovery rate OptiPeaks$sel_moz ## p-values and adjusted p-values estimated for all the tested mass-over-charge values OptiPeaks$ap$adjp
This function performs post acquisition signal processing for list
of MassSpectrum
objects using commonly used methods : transform intensities ("sqrt"), smoothing ("Wavelet"), remove baseline ("SNIP"), calibrate intensities ("TIC") and align spectra. Methods used are selected from the MALDIquant
and MALDIrppa
R packages.
SignalProcessing(x, transformIntensity_method = "sqrt", smoothing_method = "Wavelet", removeBaseline_method = "SNIP", removeBaseline_iterations = 25, calibrateIntensity_method = "TIC", alignSpectra_NoiseMethod = "MAD", alignSpectra_method = "lowess", alignSpectra_halfWs = 11, alignSpectra_SN = 3, tolerance_align = 0.002, referenceSpectra = NULL, minFrequency= 0.5, binPeaks_method = "strict", keepReferenceSpectra = FALSE, ...)
SignalProcessing(x, transformIntensity_method = "sqrt", smoothing_method = "Wavelet", removeBaseline_method = "SNIP", removeBaseline_iterations = 25, calibrateIntensity_method = "TIC", alignSpectra_NoiseMethod = "MAD", alignSpectra_method = "lowess", alignSpectra_halfWs = 11, alignSpectra_SN = 3, tolerance_align = 0.002, referenceSpectra = NULL, minFrequency= 0.5, binPeaks_method = "strict", keepReferenceSpectra = FALSE, ...)
x |
a |
transformIntensity_method |
a |
smoothing_method |
a |
removeBaseline_method |
a |
removeBaseline_iterations |
a |
calibrateIntensity_method |
a |
alignSpectra_NoiseMethod |
a |
alignSpectra_method |
a |
alignSpectra_halfWs |
a |
alignSpectra_SN |
a |
tolerance_align |
a |
referenceSpectra |
a |
minFrequency |
a |
binPeaks_method |
a |
keepReferenceSpectra |
a |
... |
other arguments from |
The SignalProcessing
function provides an analysis pipeline for MassSpectrum
objects including intensity transformation, smoothing, removing baseline.
The Wavelet
method relies on the wavShrink
function of the wmtsa
package and its dependencies (now archived by CRAN). The original C code by William Constantine and Keith L. Davidson, in turn including copyrighted routines by Insightful Corp., has been revised and included into MALDIrppa
for the method to work.
All the methods used for SignalProcessing
functions are selected from MALDIquant
and MALDIrppa
packages.
A list of modified MassSpectrum
objects (see MALDIquant
R package) according to chosen arguments. If the argument referenceSpectra
is not completed and the argument keepReferenceSpectra
is TRUE
, a list containing the MassSpectrum
objects modified named "spectra"
and the created reference spectrum named "RefS"
is returned.
Gibb S, Strimmer K. MALDIquant: a versatile R package for the analysis of mass spectrometry data. Bioinformatics. 2012 Sep 1;28(17):2270-1. doi:10.1093/bioinformatics/bts447. Epub 2012 Jul 12. PMID: 22796955.
Javier Palarea-Albaladejo, Kevin Mclean, Frank Wright, David G E Smith, MALDIrppa: quality control and robust analysis for mass spectrometry data, Bioinformatics, Volume 34, Issue 3, 01 February 2018, Pages 522 - 523, doi:10.1093/bioinformatics/btx628
library("MALDIquant") library("MSclassifR") ## Load mass spectra data("CitrobacterRKIspectra", package = "MSclassifR") # plot first unprocessed mass spectrum PlotSpectra(SpectralData=CitrobacterRKIspectra[[1]], col_spec="blue") ## spectral treatment spectra <- SignalProcessing(CitrobacterRKIspectra, transformIntensity_method = "sqrt", smoothing_method = "Wavelet", removeBaseline_method = "SNIP", removeBaseline_iterations = 25, calibrateIntensity_method = "TIC", alignSpectra_Method = "MAD", alignSpectra_halfWs = 11, alignSpectra_SN = 3, tolerance_align = 0.002) # plot first processed mass spectrum PlotSpectra(SpectralData=spectra[[1]], col_spec="blue")
library("MALDIquant") library("MSclassifR") ## Load mass spectra data("CitrobacterRKIspectra", package = "MSclassifR") # plot first unprocessed mass spectrum PlotSpectra(SpectralData=CitrobacterRKIspectra[[1]], col_spec="blue") ## spectral treatment spectra <- SignalProcessing(CitrobacterRKIspectra, transformIntensity_method = "sqrt", smoothing_method = "Wavelet", removeBaseline_method = "SNIP", removeBaseline_iterations = 25, calibrateIntensity_method = "TIC", alignSpectra_Method = "MAD", alignSpectra_halfWs = 11, alignSpectra_SN = 3, tolerance_align = 0.002) # plot first processed mass spectrum PlotSpectra(SpectralData=spectra[[1]], col_spec="blue")