Title: | A Collection of Methods for Left-Censored Missing Data Imputation |
---|---|
Description: | A collection of functions for left-censored missing data imputation. Left-censoring is a special case of the missing not at random (MNAR) mechanism that generates non-responses in proteomics experiments. The package also contains functions to artificially generate peptide/protein expression data (log-transformed) as random draws from a multivariate Gaussian distribution, as well as a function to generate missing data (both randomly and non-randomly). For comparison purposes, the package also contains several wrapper functions for the imputation of non-responses that are missing at random. New functionality has been added: a hybrid method that allows the imputation of missing values in a more complex scenario where the missing data are both MAR and MNAR. |
Authors: | Cosmin Lazar [aut], Thomas Burger [aut], Samuel Wieczorek [cre, ctb] |
Maintainer: | Samuel Wieczorek <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.1 |
Built: | 2024-10-30 06:50:56 UTC |
Source: | CRAN |
This function generates artificial peptide abundance data with differentially abundant (DA) proteins. Samples are drawn from a Gaussian distribution.
generate.ExpressionData( nSamples1, nSamples2, meanSamples, sdSamples, nFeatures, nFeaturesUp, nFeaturesDown, meanDynRange, sdDynRange, meanDiffAbund, sdDiffAbund )
nSamples1 |
number of samples in condition 1 |
nSamples2 |
number of samples in condition 2 |
meanSamples |
xxx |
sdSamples |
xxx |
nFeatures |
number of total features |
nFeaturesUp |
number of features up regulated |
nFeaturesDown |
number of features down regulated |
meanDynRange |
mean value of the dynamic range |
sdDynRange |
sd of the dynamic range |
meanDiffAbund |
xxx |
sdDiffAbund |
xxx |
A list containing the data, the conditions label and the regulation label (up/down/no)
This function generates a map for peptide-to-protein roll-up.
generate.RollUpMap(nProt, pep.Expr.Data)
nProt |
number of proteins to map to the peptide expression data |
pep.Expr.Data |
matrix of peptide expression data |
The peptide-to-protein map: for each entry in pep.prot.Map, the value is the index of the protein to which that peptide is mapped.
This function performs missing values imputation under the MAR/MCAR hypothesis. The imputation of MVs is performed for each protein containing MAR/MCAR missing values.
impute.MAR(dataSet.mvs, model.selector, method = "MLE")
dataSet.mvs |
expression matrix containing abundances with MVs (either peptides or proteins) |
model.selector |
binary vector; "1" indicates MAR/MCAR proteins |
method |
the method to be used for MAR/MCAR missing values. Possible values: MLE (default), SVD, KNN |
dataset containing only MNAR (assumed to be left-censored) missing data
This function performs missing values imputation under the MCAR and MNAR hypotheses.
impute.MAR.MNAR( dataSet.mvs, model.selector, method.MAR = "KNN", method.MNAR = "QRILC" )
dataSet.mvs |
expression matrix containing abundances with MVs (either peptides or proteins) |
model.selector |
binary vector; "1" indicates MCAR proteins |
method.MAR |
the method to be used for MAR missing values. Possible values: MLE, SVD, KNN (default) |
method.MNAR |
the method to be used for MNAR missing values (default: QRILC) |
dataset containing complete abundances
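A minimal end-to-end sketch of the hybrid workflow, using only the signatures documented in this manual. The parameter values are illustrative assumptions, and so is the use of positional list indices, which simply follows the order in which the return values are described. The whole block is guarded so that it does nothing when the package is not installed.

```r
# Illustrative values only; none of them are recommended settings.
if (requireNamespace("imputeLCMD", quietly = TRUE)) {
  library(imputeLCMD)
  set.seed(1)

  # Simulated log-abundances: 6 + 6 samples, 1000 features.
  sim <- generate.ExpressionData(
    nSamples1 = 6, nSamples2 = 6, meanSamples = 0, sdSamples = 0.2,
    nFeatures = 1000, nFeaturesUp = 50, nFeaturesDown = 50,
    meanDynRange = 20, sdDynRange = 1, meanDiffAbund = 1, sdDiffAbund = 0.2
  )
  complete <- sim[[1]]  # assumed: the first list element is the data matrix

  # Roughly 15% missing values, half of them MNAR.
  mvs <- insertMVs(complete,
                   mean.THR = quantile(complete, probs = 0.15),
                   sd.THR = 0.1, MNAR.rate = 50)
  with_mvs <- mvs[[2]]  # assumed: the second element is the matrix with MVs

  # Flag MAR/MCAR rows, then impute both kinds of missing values.
  selector <- model.Selector(with_mvs)
  imputed <- impute.MAR.MNAR(with_mvs, selector,
                             method.MAR = "KNN", method.MNAR = "QRILC")
}
```

Comparing `imputed` with `complete` on the positions that were set to NA gives a rough idea of imputation error on simulated data.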
This function performs missing values imputation by the minimum value observed.
impute.MinDet(dataSet.mvs, q = 0.01)
dataSet.mvs |
expression matrix with MVs (either peptides or proteins) |
q |
the q quantile used to estimate the minimum |
dataset containing complete abundances
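The idea of deterministic minimum imputation can be sketched in base R. This is not the package's code: `min_det_sketch` is a hypothetical helper, and it assumes that the minimum is estimated separately for each sample (column) as the q-th quantile of the observed values.

```r
# Hypothetical helper illustrating the idea; not the package implementation.
min_det_sketch <- function(x, q = 0.01) {
  # One low quantile per sample (column), computed on observed values only.
  low <- apply(x, 2, quantile, probs = q, na.rm = TRUE)
  for (j in seq_len(ncol(x))) {
    miss <- is.na(x[, j])
    if (any(miss)) x[miss, j] <- low[j]
  }
  x
}

set.seed(1)
m <- matrix(rnorm(40, mean = 20), nrow = 10)  # 10 features x 4 samples
m[c(1, 5), 2] <- NA
filled <- min_det_sketch(m)
```

Every missing value in a given sample receives the same replacement, so the imputed values carry no variance of their own.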
This function performs missing values imputation by random draws from a Gaussian distribution.
impute.MinProb(dataSet.mvs, q = 0.01, tune.sigma = 1)
dataSet.mvs |
expression matrix containing abundances with MVs (either peptides or proteins) |
q |
the q-th quantile used to estimate the minimum value observed for each sample |
tune.sigma |
coefficient that controls the sd of the MNAR distribution |
dataset containing complete abundances
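A base R sketch of the stochastic variant follows. It is not the package's code: `min_prob_sketch` is a hypothetical helper, and the choice of spread (the median of the row-wise standard deviations, scaled by `tune.sigma`) is an assumption made for illustration.

```r
# Hypothetical helper illustrating the idea; not the package implementation.
min_prob_sketch <- function(x, q = 0.01, tune.sigma = 1) {
  # Centre of the imputation distribution: a low quantile per sample.
  low <- apply(x, 2, quantile, probs = q, na.rm = TRUE)
  # Assumed spread: median of the row-wise standard deviations.
  sd_rows <- apply(x, 1, sd, na.rm = TRUE)
  sigma <- median(sd_rows, na.rm = TRUE) * tune.sigma
  for (j in seq_len(ncol(x))) {
    miss <- is.na(x[, j])
    if (any(miss)) {
      x[miss, j] <- rnorm(sum(miss), mean = low[j], sd = sigma)
    }
  }
  x
}

set.seed(1)
m <- matrix(rnorm(60, mean = 20), nrow = 15)  # 15 features x 4 samples
m[c(2, 7, 9), 3] <- NA
filled <- min_prob_sketch(m, q = 0.01, tune.sigma = 1)
```

Lowering `tune.sigma` concentrates the imputed values around the estimated minimum; at 0 the sketch reduces to deterministic minimum imputation.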
This function performs missing values imputation based on quantile regression.
impute.QRILC(dataSet.mvs, tune.sigma = 1)
dataSet.mvs |
expression matrix with MVs (either peptides or proteins) |
tune.sigma |
coefficient that controls the sd of the MNAR distribution |
a list containing: a matrix with the complete abundances, a list with the estimated parameters of the complete data distribution
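The general idea of quantile-based imputation of left-censored data can be sketched for a single sample in base R. This is an illustration under stated assumptions, not the package's implementation: `qrilc_sketch` is a hypothetical helper, the observed values are assumed to be the upper part of a normal distribution, its mean and sd are estimated by regressing observed quantiles on theoretical normal quantiles, and imputed values are drawn from the normal truncated above at the estimated censoring point.

```r
# Hypothetical single-sample helper; not the package implementation.
qrilc_sketch <- function(x, tune.sigma = 1) {
  miss <- is.na(x)
  p_na <- mean(miss)
  # Observed quantiles correspond to complete-data probabilities shifted
  # by the missing fraction (left-censoring assumption).
  p_obs <- seq(0.01, 0.99, by = 0.01)
  q_obs <- quantile(x, probs = p_obs, na.rm = TRUE)
  q_norm <- qnorm(p_na + (1 - p_na) * p_obs)
  fit <- lm(q_obs ~ q_norm)
  mu <- coef(fit)[[1]]     # intercept estimates the mean
  sigma <- coef(fit)[[2]]  # slope estimates the standard deviation
  upper <- qnorm(p_na, mean = mu, sd = sigma)  # estimated censoring point
  if (any(miss)) {
    # Inverse-CDF sampling from the normal truncated above at `upper`.
    u <- runif(sum(miss), 0, pnorm(upper, mu, sigma * tune.sigma))
    x[miss] <- qnorm(u, mean = mu, sd = sigma * tune.sigma)
  }
  list(imputed = x, mean = mu, sd = sigma, upper = upper)
}

set.seed(1)
full <- rnorm(2000, mean = 20, sd = 2)
cens <- full
cens[full < quantile(full, 0.2)] <- NA  # left-censor the lowest 20%
res <- qrilc_sketch(cens)
```

On this simulated sample the estimated mean and sd should be close to the values used to generate `full`, and all imputed values fall below the estimated censoring point.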
This function performs missing values imputation based on the KNN algorithm.
impute.wrapper.KNN(dataSet.mvs, K)
dataSet.mvs |
expression matrix with MVs (either peptides or proteins) |
K |
the number of neighbors |
dataset containing complete abundances
This function performs missing values imputation using the EM algorithm
impute.wrapper.MLE(dataSet.mvs)
dataSet.mvs |
expression matrix with MVs (either peptides or proteins) |
expression matrix with MVs imputed
This function performs missing values imputation based on the SVD algorithm.
impute.wrapper.SVD(dataSet.mvs, K)
dataSet.mvs |
expression matrix with MVs (either peptides or proteins) |
K |
the number of PCs |
expression matrix with MVs imputed
This function performs missing values imputation by 0.
impute.ZERO(dataSet.mvs)
dataSet.mvs |
expression matrix containing abundances with MVs (either peptides or proteins) |
dataset containing complete abundances
This function generates missing data in a complete data matrix.
insertMVs(original, mean.THR, sd.THR, MNAR.rate)
original |
complete data matrix containing all measurements |
mean.THR, sd.THR |
parameters of the threshold distribution that controls the MV rate. mean.THR should initially be set such that the result of the initial thresholding, in terms of number of NAs, equals the desired total missing data rate. Example: to generate 30% missing data, mean.THR can be set as mean.THR = quantile(pepExprsData, probs = 0.3). sd.THR is usually set to a small value (e.g. 0.1) |
MNAR.rate |
percentage of MVs which are missing not at random |
A list that contains the original complete data matrix, the data matrix with missing data and the percentage of missing data
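The advice on choosing mean.THR can be checked with a small base R sketch. The matrix `pepExprsData` below is simulated for illustration, and the noisy variant (one threshold drawn per value) is an assumption about how a threshold distribution might be applied, not a description of the package internals.

```r
set.seed(1)
# Simulated log-abundances: 1000 features x 6 samples (illustrative only).
pepExprsData <- matrix(rnorm(6000, mean = 20, sd = 2), ncol = 6)

target_rate <- 0.3
mean.THR <- quantile(pepExprsData, probs = target_rate)
sd.THR <- 0.1

# Hard thresholding at mean.THR removes about target_rate of the values.
rate_hard <- mean(pepExprsData < mean.THR)

# Assumed noisy variant: one threshold per value, drawn around mean.THR.
thr <- matrix(rnorm(length(pepExprsData), mean = mean.THR, sd = sd.THR),
              nrow = nrow(pepExprsData))
rate_noisy <- mean(pepExprsData < thr)
```

With a small sd.THR the noisy rate stays close to the target, which is why a quantile of the complete data is a reasonable starting value for mean.THR.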
This dataset was collected during a study designed to compare the protein content of the exosome-like vesicles (ELVs) released from C2C12 murine myoblasts during proliferation (ELV-MB) and after differentiation into myotubes (ELV-MT). The dataset within this package contains protein intensities processed using MaxQuant. More information can be found in the ProteomeXchange public repository (http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000022) or in the original paper (see reference).
data(intensity_PXD000022)
A data frame with 660 observations on the following 7 variables.
Protein.IDs
Peptides/Proteins names
Intensity.MB.1
a numeric vector
Intensity.MB.2
a numeric vector
Intensity.MB.3
a numeric vector
Intensity.MT.1
a numeric vector
Intensity.MT.2
a numeric vector
Intensity.MT.3
a numeric vector
Original MaxQuant data: http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000022
Forterre A, Jalabert A, Berger E, Baudet M, Chikh K, et al. (2014) Proteomic Analysis of C2C12 Myoblast and Myotube Exosome-Like Vesicles: A New Paradigm for Myoblast-Myotube Cross Talk? PLoS ONE 9(1): e84153. doi:10.1371/journal.pone.0084153
This dataset was collected during a study designed to perform the proteomic analysis of the SLP76 interactome in resting and activated primary mast cells. Four SLP76 replicates (with two analytical replicates each) were affinity-purified from both resting and activated primary mast cells. The dataset within this package contains protein intensities processed using MaxQuant. More information can be found in the ProteomeXchange public repository (http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000052) or in the original paper (see reference).
data(intensity_PXD000052)
A data frame with 1991 observations on the following 17 variables.
Protein.IDs
Peptides/Proteins names
iBAQ.stSLP_activ1
a numeric vector
iBAQ.stSLP_activ2
a numeric vector
iBAQ.stSLP_activ3
a numeric vector
iBAQ.stSLP_activ4
a numeric vector
iBAQ.stSLP_rest1
a numeric vector
iBAQ.stSLP_rest2
a numeric vector
iBAQ.stSLP_rest3
a numeric vector
iBAQ.stSLP_rest4
a numeric vector
iBAQ.WT_activ1
a numeric vector
iBAQ.WT_activ2
a numeric vector
iBAQ.WT_activ3
a numeric vector
iBAQ.WT_activ4
a numeric vector
iBAQ.WT_rest1
a numeric vector
iBAQ.WT_rest2
a numeric vector
iBAQ.WT_rest3
a numeric vector
iBAQ.WT_rest4
a numeric vector
Original MaxQuant data: http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000052
Bounab Y, Hesse AM, Iannascoli B, Grieco L, Coute Y, Niarakis A, Roncagalli R, Lie E, Lam KP, Demangel C, Thieffry D, Garin J, Malissen B, Daëron M, Proteomic analysis of the SH2 domain-containing leukocyte protein of 76 kDa (SLP76) interactome in resting and activated primary mast cells [corrected]. Mol Cell Proteomics, 12(10):2874-89(2013).
This dataset was collected during a study designed to compare human primary tumor-derived xenograft proteomes of the two major histological non-small cell lung cancer subtypes: adenocarcinoma (ADC) and squamous cell carcinoma (SCC). The dataset within this package contains protein intensities for 6 ADC and 6 SCC samples, processed using MaxQuant. More information can be found in the ProteomeXchange public repository (http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000438) or in the original paper (see reference).
data(intensity_PXD000438)
A data frame with 3709 observations on the following 13 variables.
Protein.IDs
Peptides/Proteins names
Intensity.092.1
a numeric vector
Intensity.092.2
a numeric vector
Intensity.092.3
a numeric vector
Intensity.441.1
a numeric vector
Intensity.441.2
a numeric vector
Intensity.441.3
a numeric vector
Intensity.561.1
a numeric vector
Intensity.561.2
a numeric vector
Intensity.561.3
a numeric vector
Intensity.691.1
a numeric vector
Intensity.691.2
a numeric vector
Intensity.691.3
a numeric vector
Original MaxQuant data: http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000438
Zhang W, Wei Y, Ignatchenko V, Li L, Sakashita S, Pham NA, Taylor P, Tsao MS, Kislinger T, Moran MF, Proteomic profiles of human lung adeno and squamous cell carcinoma using super-SILAC and label-free quantification approaches. Proteomics, 14(6):795-803(2014).
data(intensity_PXD000438)
This dataset contains three biological replicates with three technical replicates each for the conditioned media (CM) and the whole cell lysates (WCL) of C8-D1A cell lines. The dataset within this package contains protein iBAQ intensities processed using MaxQuant. More information can be found in the ProteomeXchange public repository (http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000501) or in the original paper (see reference).
data(intensity_PXD000501)
A data frame with 7363 observations on the following 19 variables.
Protein.IDs
Peptides/Proteins names
iBAQ.secretome_set1_tech1
a numeric vector
iBAQ.secretome_set1_tech2
a numeric vector
iBAQ.secretome_set1_tech3
a numeric vector
iBAQ.secretome_set2_tech1
a numeric vector
iBAQ.secretome_set2_tech2
a numeric vector
iBAQ.secretome_set2_tech3
a numeric vector
iBAQ.secretome_set3_tech1
a numeric vector
iBAQ.secretome_set3_tech2
a numeric vector
iBAQ.secretome_set3_tech3
a numeric vector
iBAQ.whole_set1_tech1
a numeric vector
iBAQ.whole_set1_tech2
a numeric vector
iBAQ.whole_set1_tech3
a numeric vector
iBAQ.whole_set2_tech1
a numeric vector
iBAQ.whole_set2_tech2
a numeric vector
iBAQ.whole_set2_tech3
a numeric vector
iBAQ.whole_set3_tech1
a numeric vector
iBAQ.whole_set3_tech2
a numeric vector
iBAQ.whole_set3_tech3
a numeric vector
Original MaxQuant data: http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000501
Han D, Jin J, Woo J, Min H, Kim Y, Proteomic analysis of mouse astrocytes and their secretome by a combination of FASP and StageTip-based, high pH, reversed-phase fractionation. Proteomics, ():(2014).
data(intensity_PXD000501)
This function determines the rows in the data matrix affected by an MNAR missingness mechanism. It is based on the assumption that the distribution of the mean values of proteins follows a normal distribution. The method makes use of a decision function defined as a trade-off between the empirical CDF of the proteins' means and the theoretical CDF assuming that no MVs are present.
model.Selector(dataSet.mvs)
dataSet.mvs |
expression matrix containing abundances with MVs (either peptides or proteins) |
flags vector; "1" denotes rows containing random missing values; "0" denotes rows containing left-censored missing values
This function performs peptide-to-protein roll-up.
pep2prot(pep.Expr.Data, rollup.map)
pep.Expr.Data |
matrix of peptide expression data |
rollup.map |
the peptide-to-protein map |
matrix of protein expression data
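A roll-up of this kind can be sketched in base R. The aggregation rule used here (per-protein, per-sample mean of peptide abundances) is an assumption for illustration, and `pep`, `rollup` and `prot` are hypothetical names; the package may aggregate differently.

```r
set.seed(1)
pep <- matrix(rnorm(24, mean = 20), nrow = 6)  # 6 peptides x 4 samples
rollup <- c(1, 1, 2, 2, 2, 3)                  # protein index for each peptide

# Sum peptide rows within each protein, then divide by the peptide count.
sums <- rowsum(pep, group = rollup)
prot <- sums / as.vector(table(rollup))        # 3 proteins x 4 samples
```

Row `i` of `prot` summarises all peptides whose entry in `rollup` equals `i`, which matches the map format described for generate.RollUpMap.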