Title: | Analyzing Data with Cellwise Outliers |
---|---|
Description: | Tools for detecting cellwise outliers and robust methods to analyze data which may contain them. Contains the implementation of the algorithms described in Rousseeuw and Van den Bossche (2018) <doi:10.1080/00401706.2017.1340909> (open access) Hubert et al. (2019) <doi:10.1080/00401706.2018.1562989> (open access), Raymaekers and Rousseeuw (2021) <doi:10.1080/00401706.2019.1677270> (open access), Raymaekers and Rousseeuw (2021) <doi:10.1007/s10994-021-05960-5> (open access), Raymaekers and Rousseeuw (2021) <doi:10.52933/jdssv.v1i3.18> (open access), Raymaekers and Rousseeuw (2022) <arXiv:2207.13493> (open access) Rousseeuw (2022) <doi:10.1016/j.ecosta.2023.01.007> (open access). Examples can be found in the vignettes: "DDC_examples", "MacroPCA_examples", "wrap_examples", "transfo_examples", "DI_examples", "cellMCD_examples" , "Correspondence_analysis_examples", and "cellwise_weights_examples". |
Authors: | Jakob Raymaekers [aut, cre], Peter Rousseeuw [aut], Wannes Van den Bossche [ctb], Mia Hubert [ctb] |
Maintainer: | Jakob Raymaekers <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.5.3 |
Built: | 2024-10-31 21:20:34 UTC |
Source: | CRAN |
This function flags cellwise outliers in X
and imputes them, if robust estimates of the center mu
and scatter matrix Sigma
are given. When the latter are not known, as is typically the case, one can use the function DDC
which only requires the data matrix X
. Alternatively, the unknown center mu and scatter matrix Sigma can be estimated robustly from X
by the function DI
.
cellHandler(X, mu, Sigma, quant = 0.99)
cellHandler(X, mu, Sigma, quant = 0.99)
X |
|
mu |
An estimate of the center of the data |
Sigma |
An estimate of the covariance matrix of the data |
quant |
Cutoff used in the detection of cellwise outliers. Defaults to |
A list with components:
Ximp
The imputed data matrix.
indcells
Indices of the cells which were flagged in the analysis.
indNAs
Indices of the NAs in the data.
Zres
Matrix with standardized cellwise residuals of the flagged cells. Contains zeroes in the unflagged cells.
Zres_denom
Denominator of the standardized cellwise residuals.
cellPaths
Matrix with the same dimensions as X, in which each row contains the path of least angle regression through the cells of that row, i.e. the order of the coordinates in the path (1=first, 2=second,...)
J. Raymaekers and P.J. Rousseeuw
J. Raymaekers and P.J. Rousseeuw (2020). Handling cellwise outliers by sparse regression and robust covariance. Journal of Data Science, Statistics, and Visualisation. doi:10.52933/jdssv.v1i3.18(link to open access pdf)
mu <- rep(0, 3) Sigma <- diag(3) * 0.1 + 0.9 X <- rbind(c(0.5, 1.0, 5.0), c(-3.0, 0.0, 1.0)) n <- nrow(X); d <- ncol(X) out <- cellHandler(X, mu, Sigma) Xres <- X - out$Ximp # unstandardized residual mean(abs(as.vector(Xres - out$Zres*out$Zres_denom))) # 0 W <- matrix(rep(0,n*d),nrow=n) # weight matrix W[out$Zres != 0] <- 1 # 1 indicates cells that were flagged # For more examples, we refer to the vignette: ## Not run: vignette("DI_examples") ## End(Not run)
mu <- rep(0, 3) Sigma <- diag(3) * 0.1 + 0.9 X <- rbind(c(0.5, 1.0, 5.0), c(-3.0, 0.0, 1.0)) n <- nrow(X); d <- ncol(X) out <- cellHandler(X, mu, Sigma) Xres <- X - out$Ximp # unstandardized residual mean(abs(as.vector(Xres - out$Zres*out$Zres_denom))) # 0 W <- matrix(rep(0,n*d),nrow=n) # weight matrix W[out$Zres != 0] <- 1 # 1 indicates cells that were flagged # For more examples, we refer to the vignette: ## Not run: vignette("DI_examples") ## End(Not run)
This function draws a cellmap, possibly of a subset of rows and columns of the data, and possibly combining cells into blocks. A cellmap shows which cells are missing and which ones are outlying, marking them in red for unusually large cell values and in blue for unusually low cell values. When cells are combined into blocks, the final color is the average of the colors in the individual cells.
cellMap(R, indcells = NULL, indrows = NULL, outrows = NULL, showcellvalues = NULL, D = NULL, rowlabels = NULL, columnlabels = NULL, mTitle = "cell map", rowtitle = "cases", columntitle = "variables", showrows = NULL, showcolumns = NULL, nrowsinblock = NULL, ncolumnsinblock = NULL, manualrowblocksizes = NULL, manualcolumnblocksizes = NULL, rowblocklabels = NULL, columnblocklabels = NULL, sizemain = 1.5, sizetitles = 1.2, sizerowlabels = 1, sizecolumnlabels = 1, sizecellvalues = 1, adjustrowlabels = 1, adjustcolumnlabels = 1, columnangle = 90, colContrast = 1, outlyingGrad = TRUE, darkestColor = sqrt(qchisq(0.999, 1)), drawCircles = FALSE, showVals = NULL, autolabel = TRUE)
cellMap(R, indcells = NULL, indrows = NULL, outrows = NULL, showcellvalues = NULL, D = NULL, rowlabels = NULL, columnlabels = NULL, mTitle = "cell map", rowtitle = "cases", columntitle = "variables", showrows = NULL, showcolumns = NULL, nrowsinblock = NULL, ncolumnsinblock = NULL, manualrowblocksizes = NULL, manualcolumnblocksizes = NULL, rowblocklabels = NULL, columnblocklabels = NULL, sizemain = 1.5, sizetitles = 1.2, sizerowlabels = 1, sizecolumnlabels = 1, sizecellvalues = 1, adjustrowlabels = 1, adjustcolumnlabels = 1, columnangle = 90, colContrast = 1, outlyingGrad = TRUE, darkestColor = sqrt(qchisq(0.999, 1)), drawCircles = FALSE, showVals = NULL, autolabel = TRUE)
R |
Matrix of standardized residuals of the cells (required input argument). After running |
indcells |
Indices of flagged cells. Defaults to |
indrows |
Indices of outlying rows (if available). If not |
outrows |
Outlyingness of each row (if available). If not |
showcellvalues |
Takes the values |
D |
A matrix of data values, of the same dimensions as |
rowlabels |
Labels of the rows of the matrix |
columnlabels |
Labels of the columns of the matrix |
mTitle |
Main title of the cellMap. Defaults to "cell map". |
rowtitle |
Title for the rows. Defaults to "cases". |
columntitle |
Title for the columns. Defaults to "variables". |
showrows |
Indices of the rows to be shown. Defaults to |
showcolumns |
Indices of the columns to be shown. Defaults to |
nrowsinblock |
How many rows are combined in a block. Defaults to |
ncolumnsinblock |
Defaults to |
manualrowblocksizes |
This allows the user to specify their own row blocks, unlike the argument nrowsinblock which makes all row blocks the same length. The argument takes the form |
manualcolumnblocksizes |
Analogous to manualrowblocksizes but for columns. It is allowed for one of them to be |
rowblocklabels |
This allows the user to specify labels for the row blocks, whether obtained from |
columnblocklabels |
Analogous to |
sizemain |
Size of main title. Defaults to |
sizetitles |
Size of row title and column title. Defaults to |
sizerowlabels |
Size of row labels. Defaults to |
sizecolumnlabels |
Size of column labels. Defaults to |
sizecellvalues |
Size of values in the cells, when showcellvalues = TRUE. Defaults to |
adjustrowlabels |
Adjust row labels: 0=left, 0.5=centered, 1=right. Defaults to |
adjustcolumnlabels |
Adjust column labels: 0=left, 0.5=centered, 1=right. Defaults to |
columnangle |
Angle of the column labels. Defaults to |
colContrast |
Parameter regulating the contrast of colors, should be in |
outlyingGrad |
If |
darkestColor |
Standardized residuals whose absolute value is bigger than this will get the darkest color. |
drawCircles |
Whether or not to draw circles indicating outlyingness of rows. When both |
showVals |
old name of argument |
autolabel |
obsoleted by the current machanism for creating blocks of cells. Is only in the list for backward compatibility. |
Rousseeuw P.J., Van den Bossche W.
Rousseeuw, P.J., Van den Bossche W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145. (link to open access pdf)
# For examples of the cellmap, we refer to the vignette: ## Not run: vignette("DDC_examples") ## End(Not run)
# For examples of the cellmap, we refer to the vignette: ## Not run: vignette("DDC_examples") ## End(Not run)
The cellwise minimum covariance determinant estimator
computes cellwise robust estimates of the center and covariance matrix of a data set X
. The algorithm guarantees a monotone decrease of an objective function,
which is based on observed Gaussian log-likelihood. By default, it starts by calling checkDataSet
to clean the data.
cellMCD(X, alpha = 0.75, quant = 0.99, crit = 1e-4, noCits = 100, lmin = 1e-4, checkPars = list())
cellMCD(X, alpha = 0.75, quant = 0.99, crit = 1e-4, noCits = 100, lmin = 1e-4, checkPars = list())
X |
|
alpha |
In each column, at least |
quant |
Determines the cutoff value to flag cells. Defaults to |
crit |
The iteration stops when successive covariance matrices (of the standardized data) differ by less than |
noCits |
The maximal number of C-steps used. |
lmin |
a lower bound on the eigenvalues of the estimated covariance matrix on the standardized data. Defaults to |
checkPars |
Optional list of parameters used in the call to
|
The matrix raw.S
in the output is the raw estimate of scatter produced by cellMCD. The final S
is obtained from raw.S
by rescaling such that its diagonal entries equal the squares of the univariate scales in locsca$scale
. This reduces the bias at Gaussian data, which matters mainly for large sample sizes.
A list with components:
mu
the cellMCD estimate of location.
S
the cellMCD estimate of scatter, after bias correction (see details).
W
the cellMCD estimate of W
, a binary matrix indicating all outlying cells as zero.
preds
predictions (=conditional expectations) of the flagged cells, given the clean cells in the same row.
csds
conditional standard deviations of the flagged cells, given the clean cells in the same row.
Ximp
imputed data matrix.
Zres
matrix of cellwise standardized residuals.
raw.S
the raw cellMCD estimate of scatter, without bias correction.
locsca
list containing robust locations and scales used to standardize the data before running the algorithm. The results m
, S
, preds
, Ximp
are returned in their original location/scale.
nosteps
number of steps the algorithm took to converge.
X
the data on which the algorithm was executed.
quant
the cutoff used to flag the cells.
J. Raymaekers and P.J. Rousseeuw
J. Raymaekers and P.J. Rousseeuw (2022). The cellwise MCD estimator, Journal of the American Statistical Association, to appear. doi:10.1080/01621459.2023.2267777(link to open access pdf)
mu <- rep(0, 3) Sigma <- diag(3) * 0.5 + 0.5 set.seed(123) X <- MASS::mvrnorm(1000, mu, Sigma) X[1:5, 1] <- X[1:5, 1] + 5 X[6:10, 2] <- X[6:10, 2] - 10 X[12, 1:2] <- c(-4,8) colnames(X) <- c("X1","X2","X3") cellMCD.out <- cellMCD(X) cellMCD.out$mu cov2cor(cellMCD.out$S) cellMCD.out$W[1:15,] cellMCD.out$Ximp[1:15,] cellMap(cellMCD.out$Zres[1:15,]) # For more examples, we refer to the vignette: ## Not run: vignette("cellMCD_examples") ## End(Not run)
mu <- rep(0, 3) Sigma <- diag(3) * 0.5 + 0.5 set.seed(123) X <- MASS::mvrnorm(1000, mu, Sigma) X[1:5, 1] <- X[1:5, 1] + 5 X[6:10, 2] <- X[6:10, 2] - 10 X[12, 1:2] <- c(-4,8) colnames(X) <- c("X1","X2","X3") cellMCD.out <- cellMCD(X) cellMCD.out$mu cov2cor(cellMCD.out$S) cellMCD.out$W[1:15,] cellMCD.out$Ximp[1:15,] cellMap(cellMCD.out$Zres[1:15,]) # For more examples, we refer to the vignette: ## Not run: vignette("cellMCD_examples") ## End(Not run)
This function checks the dataset X, and sets aside certain
columns and rows that do not satisfy the conditions.
It is used by the DDC
and MacroPCA
functions but can be used by itself, to clean a dataset for a different type of analysis.
checkDataSet(X, fracNA = 0.5, numDiscrete = 3, precScale = 1e-12, silent = FALSE, cleanNAfirst = "automatic")
checkDataSet(X, fracNA = 0.5, numDiscrete = 3, precScale = 1e-12, silent = FALSE, cleanNAfirst = "automatic")
X |
|
fracNA |
Only retain columns and rows with fewer NAs than this fraction.
Defaults to |
numDiscrete |
A column that takes on numDiscrete or fewer values
will be considered discrete and not retained in the cleaned data.
Defaults to |
precScale |
Only consider columns whose scale is larger than precScale.
Here scale is measured by the median absolute deviation.
Defaults to |
silent |
Whether or not the function progress messages should be printed.
Defaults to |
cleanNAfirst |
If |
A list with components:
colInAnalysis
Column indices of the columns used in the analysis.
rowInAnalysis
Row indices of the rows used in the analysis.
namesNotNumeric
Names of the variables which are not numeric.
namesCaseNumber
The name of the variable(s) which contained the case numbers and was therefore removed.
namesNAcol
Names of the columns left out due to too many NA
's.
namesNArow
Names of the rows left out due to too many NA
's.
namesDiscrete
Names of the discrete variables.
namesZeroScale
Names of the variables with zero scale.
remX
Remaining (cleaned) data after checkDataSet.
Rousseeuw P.J., Van den Bossche W.
Rousseeuw, P.J., Van den Bossche W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145. (link to open access pdf)
library(MASS) set.seed(12345) n <- 100; d = 10 A <- matrix(0.9, d, d); diag(A) = 1 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 100, FALSE)] <- NA x <- cbind(1:n, x) checkedx <- checkDataSet(x) # For more examples, we refer to the vignette: ## Not run: vignette("DDC_examples") ## End(Not run)
library(MASS) set.seed(12345) n <- 100; d = 10 A <- matrix(0.9, d, d); diag(A) = 1 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 100, FALSE)] <- NA x <- cbind(1:n, x) checkedx <- checkDataSet(x) # For more examples, we refer to the vignette: ## Not run: vignette("DDC_examples") ## End(Not run)
Computes different estimators of multivariate location and scatter for cellwise weighted data.
cwLocScat(X, W, methods = "all", lmin = 1e-3, crit = 1e-12, maxiter= 1000, initCwCov = FALSE, initEst = NULL)
cwLocScat(X, W, methods = "all", lmin = 1e-3, crit = 1e-12, maxiter= 1000, initCwCov = FALSE, initEst = NULL)
X |
An |
W |
An |
methods |
either |
lmin |
if not |
crit |
convergence criterion of successive mu and Sigma estimates in the EM algorithm. |
maxiter |
maximal number of iteration steps in EM. |
initCwCov |
if |
initEst |
if not |
A list with components:
cwMean
the explicit cellwise weighted mean.
cwCov
explicit cellwise weighted covariance matrix.
Is asymptotically normal but not necessarily
PSD (unless a nonnegative lmin
was specified).
sqrtCov
the cellwise weighted covariance matrix of Van
Aelst et al (2011). Also asymptotically normal
but not necessarily PSD (unless a nonnegative
lmin
was specified).
cwMLEmu
the location estimate obtained by the cwMLE.
cwMLEsigma
the covariance matrix obtained by the cwMLE.
Is PSD when the EM algorithm converges.
P.J. Rousseeuw
P.J. Rousseeuw (2022). Analyzing cellwise weighted data, ArXiv:2209.12697. (link to open access pdf)
data("data_personality_traits") X <- data_personality_traits$X W <- data_personality_traits$W fit <- cwLocScat(X, W) fit$cwMLEiter # number of iteration steps taken round(fit$cwMLEmu, 2) round(fit$cwMean, 2) round(fit$cwMLEsigma, 2) round(fit$cwCov, 2) # For more examples, we refer to the vignette: ## Not run: vignette("cellwise_weights_examples") ## End(Not run)
data("data_personality_traits") X <- data_personality_traits$X W <- data_personality_traits$W fit <- cwLocScat(X, W) fit$cwMLEiter # number of iteration steps taken round(fit$cwMLEmu, 2) round(fit$cwMean, 2) round(fit$cwMLEsigma, 2) round(fit$cwCov, 2) # For more examples, we refer to the vignette: ## Not run: vignette("cellwise_weights_examples") ## End(Not run)
The brands data is a contingency table summarizing the 2014 Auto Brand Perception survey by Consumer Reports (USA), which is publicly available on https://boraberan.wordpress.com/2016/09/22/. The survey questioned 1578 participants on what they considered attributes of 39 different car brands.
data("data_brands")
data("data_brands")
A matrix with 39 observations of 7 attributes. The attributes (columns) are Fuel Economy, Innovation, Performance, Quality, Safety, Style and Value.
https://boraberan.wordpress.com/2016/09/22/.
Riani, M., Atkinson, A. C., Torti, F., Corbellini, A. (2022). Robust correspondence analysis. Journal of the Royal Statistical Society Series C: Applied Statistics, 71(5), 1381–1401.
Raymaekers and Rousseeuw (2022), Challenges of cellwise outliers.
data(data_brands)
data(data_brands)
The clothes dataset contains a contingency table of trade flows from outside the European Union into each of its 28 member states. The columns in the contingency table in Riani et al. (2022) are five different price brackets, from lowest to highest.
data("data_clothes")
data("data_clothes")
A matrix with 28 observations of 5 price brackets.
Riani, M., Atkinson, A. C., Torti, F., Corbellini, A. (2022). Robust correspondence analysis. Journal of the Royal Statistical Society Series C: Applied Statistics, 71(5), 1381–1401.
Raymaekers and Rousseeuw (2022), Challenges of cellwise outliers.
data(data_clothes)
data(data_clothes)
A dataset containing the image sequence of a video. The sequence consists of 54 frames of 144 by 180 pixels pixels in Red/Geen/Blue (RGB) format.
data("data_dogWalker")
data("data_dogWalker")
An array of dimensions .
http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html
data("data_dogWalker") # For more examples, we refer to the vignette: ## Not run: vignette("Wrap_examples") ## End(Not run)
data("data_dogWalker") # For more examples, we refer to the vignette: ## Not run: vignette("Wrap_examples") ## End(Not run)
This is a random subset of 20'000 stars from the Digitized Palomar Sky Survey (DPOSS) described by Odewahn et al. (1998).
data("data_dposs")
data("data_dposs")
A matrix of dimensions .
Odewahn, S., S. Djorgovski, R. Brunner, and R. Gal (1998). Data From the Digitized Palomar Sky Survey. Technical report, California Institute of Technology.
data("data_dposs") # For more examples, we refer to the vignette: ## Not run: vignette("MacroPCA_examples") ## End(Not run)
data("data_dposs") # For more examples, we refer to the vignette: ## Not run: vignette("MacroPCA_examples") ## End(Not run)
A dataset containing spectra with wavelengths collected on
archeological glass samples.
data("data_glass")
data("data_glass")
A data frame with 180 observations of 750 wavelengths.
Lemberge, P., De Raedt, I., Janssens, K.H., Wei, F., and Van Espen, P.J. (2000).
Quantitative Z-analysis of 16th-17th century archaeological glass vessels using
PLS regression of EPXMA and -XRF data. Journal of Chemometrics, 14,
751–763.
data("data_glass")
data("data_glass")
This dataset contains the mortality by age for males in France, from 1816 to 2013 as obtained from the Human Mortality Database.
data("data_mortality")
data("data_mortality")
A data frame with 198 calendar years (rows) and 91 age brackets (columns).
Human Mortality Database. University of California, Berkeley (USA), and Max Planck Institute for Demographic Research (Germany). Available at https://www.mortality.org (data downloaded in November 2015).
Hyndman, R.J., and Shang, H.L. (2010), Rainbow plots, bagplots, and boxplots for functional data, Journal of Computational and Graphical Statistics, 19, 29–45.
data("data_mortality")
data("data_mortality")
This dataset describes personality traits of 10 persons. The variables are the 6 traits Anxiety, Agoraphobia, Arachnophobia, Adventurous, Extraversion, and Sociability.
data("data_personality_traits")
data("data_personality_traits")
The data contains a list with two elements:
X
a by
matrix of values describing
personality traits for each of the
participants.
W
a by
matrix of cellwise weights. Each weight is the inverse of the length of the support of the membership function of the fuzzy number in the original data set.
G. Hesamian, and Akbari, M. G. (2019), Principal component analysis based on intuitionistic fuzzy random variables, Computational and Applied Mathematics, 38(158), 1–14.
P.J. Rousseeuw (2022). Analyzing cellwise weighted data, ArXiv:2209.12697. (link to open access pdf)
data(data_personality_traits) # For the examples in Rousseeuw (2022), see: ## Not run: vignette("cellwise_weights_examples") ## End(Not run)
data(data_personality_traits) # For the examples in Rousseeuw (2022), see: ## Not run: vignette("cellwise_weights_examples") ## End(Not run)
A dataset containing measurements of characteristics
of
diaphragm parts, used in the production of TV sets.
data("data_philips")
data("data_philips")
A matrix with rows and
columns.
The data were provided in 1997 by Gertjan Otten and permission to analyze them was given by Herman Veraa and Frans Van Dommelen at Philips Mecoma in The Netherlands.
Rousseeuw, P.J., and Van Driessen, K. (1999). A fast algorithm for the Minimum Covariance Determinant estimator. Technometrics, 41, 212–223.
data("data_philips")
data("data_philips")
This dataset contains the data on volatile organic components (VOCs) in urine of children between 3 and 10 years old. It is composed of pubicly available data from the National Health and Nutrition Examination Survey (NHANES) and was analyzed in Raymaekers and Rousseeuw (2020). See below for details and references.
data("data_VOC")
data("data_VOC")
A matrix of dimensions .
The first 16 variables are the VOC, the last 3 are:
SMD460
: number of smokers that live in the same home as the subject
SMD470
: number of people that smoke inside the home of the subject
RIDAGEYR
: age of the subject
Note that the original variable names are kept.
All of the data was collected from the NHANES website, and was part of the NHANES 2015-2016 survey. This was the most recent epoch with complete data at the time of extraction. Three datasets were matched in order to assemble this data:
UVOC_I: contains the information on the Volative organic components in urine
DEMO_I: contains the demographical information such as age
SMQFAM_I: contains the data on the smoking habits of family members
The dataset was constructed as follows:
Select the relevant VOCs from the UVOC_I data (see column names) and transform by taking the logarithm
Match the subjects in the UVOC_I data with their age in the DEMO_I data
Select all subjects with age at most 10
Match the data on smoking habits with the selected subjects.
https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Laboratory&CycleBeginYear=2015
https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2015
https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Questionnaire&CycleBeginYear=2015
J. Raymaekers and P.J. Rousseeuw (2020). Handling cellwise outliers by sparse regression and robust covariance. Journal of Data Science, Statistics, and Visualisation. doi:10.52933/jdssv.v1i3.18(link to open access pdf)
data("data_VOC") # For an analysis of this data, we refer to the vignette: ## Not run: vignette("DI_examples") ## End(Not run)
data("data_VOC") # For an analysis of this data, we refer to the vignette: ## Not run: vignette("DI_examples") ## End(Not run)
This function aims to detect cellwise outliers in the data. These are entries in the data matrix which are substantially higher or lower than what could be expected based on the other cells in its column as well as the other cells in its row, taking the relations between the columns into account. Note that this function first calls checkDataSet
and analyzes the remaining cleaned data.
DDC(X, DDCpars = list())
DDC(X, DDCpars = list())
X |
|
DDCpars |
A list of available options:
|
A list with components:
DDCpars
The list of options used.
colInAnalysis
The column indices of the columns used in the analysis.
rowInAnalysis
The row indices of the rows used in the analysis.
namesNotNumeric
The names of the variables which are not numeric.
namesCaseNumber
The name of the variable(s) which contained the case numbers and was therefore removed.
namesNAcol
Names of the columns left out due to too many NA
's.
namesNArow
Names of the rows left out due to too many NA
's.
namesDiscrete
Names of the discrete variables.
namesZeroScale
Names of the variables with zero scale.
remX
Cleaned data after checkDataSet
.
locX
Estimated location of X
.
scaleX
Estimated scales of X
.
Z
Standardized remX
.
nbngbrs
Number of neighbors used in estimation.
ngbrs
Indicates neighbors of each column, i.e. the columns most correlated with it.
robcors
Robust correlations.
robslopes
Robust slopes.
deshrinkage
The deshrinkage factor used for every connected (i.e. non-standalone) column of X
.
Xest
Predicted X
.
scalestres
Scale estimate of the residuals X - Xest
.
stdResid
Residuals of orginal X
minus the estimated Xest
, standardized by column.
indcells
Indices of the cells which were flagged in the analysis.
Ti
Outlyingness value of each row.
medTi
Median of the Ti values.
madTi
Mad of the Ti values.
indrows
Indices of the rows which were flagged in the analysis.
indNAs
Indices of all NA cells.
indall
Indices of all cells which were flagged in the analysis plus all cells in flagged rows plus the indices of the NA cells.
Ximp
Imputed X
.
Raymaekers J., Rousseeuw P.J., Van den Bossche W.
Rousseeuw, P.J., Van den Bossche W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145. (link to open access pdf)
Raymaekers, J., Rousseeuw P.J. (2019). Fast robust correlation for high dimensional data. Technometrics, 63(2), 184-198. (link to open access pdf)
library(MASS); set.seed(12345) n <- 50; d <- 20 A <- matrix(0.9, d, d); diag(A) = 1 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 50, FALSE)] <- NA x[sample(1:(n * d), 50, FALSE)] <- 10 x[sample(1:(n * d), 50, FALSE)] <- -10 x <- cbind(1:n, x) DDCx <- DDC(x) cellMap(DDCx$stdResid) # For more examples, we refer to the vignette: ## Not run: vignette("DDC_examples") ## End(Not run)
library(MASS); set.seed(12345) n <- 50; d <- 20 A <- matrix(0.9, d, d); diag(A) = 1 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 50, FALSE)] <- NA x[sample(1:(n * d), 50, FALSE)] <- 10 x[sample(1:(n * d), 50, FALSE)] <- -10 x <- cbind(1:n, x) DDCx <- DDC(x) cellMap(DDCx$stdResid) # For more examples, we refer to the vignette: ## Not run: vignette("DDC_examples") ## End(Not run)
Based on a DDC
fit on an initial (training) data set X
, this function
analyzes a new (test) data set Xnew
.
DDCpredict(Xnew, InitialDDC, DDCpars = NULL)
DDCpredict(Xnew, InitialDDC, DDCpars = NULL)
Xnew |
The new data (test data), which must be a matrix or a data frame. It must always be provided. Its columns (variables) should correspond to those of |
InitialDDC |
The output of the |
DDCpars |
The input options to be used for the prediction. By default the options of InitialDDC are used. |
A list with components:
DDCpars |
the options used in the call, see |
locX |
the locations of the columns, from |
scaleX |
the scales of the columns, from |
Z |
|
nbngbrs |
predictions use a combination of |
ngbrs |
for each column, the list of its neighbors, from |
robcors |
for each column, the correlations with its neighbors, from |
robslopes |
slopes to predict each column by its neighbors, from |
deshrinkage |
for each connected column, its deshrinkage factor used in |
Xest |
predicted values for every cell of |
scalestres |
scale estimate of the residuals ( |
stdResid |
columnwise standardized residuals of |
indcells |
positions of cellwise outliers in |
Ti |
outlyingness of rows in |
medTi |
median of the |
madTi |
mad of the |
indrows |
row numbers of the outlying rows in |
indNAs |
positions of the |
indall |
positions of |
Ximp |
|
Rousseeuw P.J., Van den Bossche W.
Hubert, M., Rousseeuw, P.J., Van den Bossche W. (2019). MacroPCA: An all-in-one PCA method allowing for missing values as well as cellwise and rowwise outliers. Technometrics, 61(4), 459-473. (link to open access pdf)
library(MASS) set.seed(12345) n <- 100; d <- 10 A <- matrix(0.9, d, d); diag(A) = 1 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 50, FALSE)] <- NA x[sample(1:(n * d), 50, FALSE)] <- 10 x <- cbind(1:n, x) DDCx <- DDC(x) xnew <- mvrnorm(50, rep(0,d), A) xnew[sample(1:(50 * d), 50, FALSE)] <- 10 predict.out <- DDCpredict(xnew, DDCx) cellMap(D = xnew, R = predict.out$stdResid, columnlabels = 1:d, rowlabels = 1:50) # For more examples, we refer to the vignette: ## Not run: vignette("DDC_examples") ## End(Not run)
library(MASS) set.seed(12345) n <- 100; d <- 10 A <- matrix(0.9, d, d); diag(A) = 1 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 50, FALSE)] <- NA x[sample(1:(n * d), 50, FALSE)] <- 10 x <- cbind(1:n, x) DDCx <- DDC(x) xnew <- mvrnorm(50, rep(0,d), A) xnew[sample(1:(50 * d), 50, FALSE)] <- 10 predict.out <- DDCpredict(xnew, DDCx) cellMap(D = xnew, R = predict.out$stdResid, columnlabels = 1:d, rowlabels = 1:50) # For more examples, we refer to the vignette: ## Not run: vignette("DDC_examples") ## End(Not run)
The Detection-Imputation algorithm computes cellwise robust estimates of the center and covariance matrix of a data set X
. The algorithm alternates between the detection of cellwise outliers and their imputation combined with re-estimation of the center and covariance matrix. By default, it starts by calling checkDataSet
to clean the data.
DI(X, initEst = "DDCWcov", crit = 0.01, maxits = 10, quant = 0.99, maxCol = 0.25, checkPars = list())
DI(X, initEst = "DDCWcov", crit = 0.01, maxits = 10, quant = 0.99, maxCol = 0.25, checkPars = list())
X |
|
initEst |
An initial estimator for the center and covariance matrix. Should be one of |
crit |
The algorithm converges when the subsequent estimates of the center and covariance matrix do not differ more than |
maxits |
Maximum number of DI-iterations. |
quant |
The cutoff used to detect cellwise outliers. |
maxCol |
The maximum number of cellwise outliers allowed in a column. |
checkPars |
Optional list of parameters used in the call to
|
A list with components:
center
The final estimate of the center of the data.
cov
The final estimate of the covariance matrix.
nits
Number of DI-iterations executed to reach convergence.
Ximp
The imputed data.
indcells
Indices of the cells which were flagged in the analysis.
indNAs
Indices of the NAs in the data.
Zres
Matrix with standardized cellwise residuals of the flagged cells. Contains zeroes in the unflagged cells.
Zres_denom
Denominator of the standardized cellwise residuals.
cellPaths
Matrix with the same dimensions as X, in which each row contains the path of least angle regression through the cells of that row, i.e. the order of the coordinates in the path (1=first, 2=second,...)
checkDataSet_out
Output of the call to checkDataSet
which is used to clean the data.
J. Raymaekers and P.J. Rousseeuw
J. Raymaekers and P.J. Rousseeuw (2020). Handling cellwise outliers by sparse regression and robust covariance. Journal of Data Science, Statistics, and Visualisation. doi:10.52933/jdssv.v1i3.18(link to open access pdf)
mu <- rep(0, 3) Sigma <- diag(3) * 0.1 + 0.9 X <- MASS::mvrnorm(100, mu, Sigma) DI.out <- DI(X) DI.out$cov # For more examples, we refer to the vignette: ## Not run: vignette("DI_examples") ## End(Not run)
mu <- rep(0, 3) Sigma <- diag(3) * 0.1 + 0.9 X <- MASS::mvrnorm(100, mu, Sigma) DI.out <- DI(X) DI.out$cov # For more examples, we refer to the vignette: ## Not run: vignette("DI_examples") ## End(Not run)
Estimate a robust location estimate and scale estimate of every column in X
.
estLocScale(X, type = "wrap", precScale = 1e-12, center = TRUE, alpha = 0.5, nLocScale = 25000, silent = FALSE)
estLocScale(X, type = "wrap", precScale = 1e-12, center = TRUE, alpha = 0.5, nLocScale = 25000, silent = FALSE)
X |
The input data. It must be an |
type |
The type of estimators used. One of:
Defaults to "wrap". |
precScale |
The precision scale used throughout the algorithm. Defaults to |
center |
Whether or not the data has to be centered before calculating the scale. Not in use for |
alpha |
The value of |
nLocScale |
If |
silent |
Whether or not a warning message should be printed when very small scales are found. Defauts to |
A list with components:
loc
A vector with the estimated locations.
scale
A vector with the estimated scales.
Raymaekers, J. and Rousseeuw P.J.
Raymaekers, J., Rousseeuw P.J. (2019). Fast robust correlation for high dimensional data. Technometrics, 63(2), 184-198. (link to open access pdf)
library(MASS) set.seed(12345) n = 100; d = 10 X = mvrnorm(n, rep(0, 10), diag(10)) locScale = estLocScale(X) # For more examples, we refer to the vignette: ## Not run: vignette("wrap_examples") ## End(Not run)
library(MASS) set.seed(12345) n = 100; d = 10 X = mvrnorm(n, rep(0, 10), diag(10)) locScale = estLocScale(X) # For more examples, we refer to the vignette: ## Not run: vignette("wrap_examples") ## End(Not run)
This function generates correlation matrices frequently used in simulation studies.
generateCorMat(d, corrType = "ALYZ", CN = 100, seed = NULL)
generateCorMat(d, corrType = "ALYZ", CN = 100, seed = NULL)
d |
The dimension of the correlation matrix. The resulting matrix is |
corrType |
The type of correlation matrix to be generated. Should be one of:
Note that the option |
CN |
Condition number of the correlation matrix. Only used for |
seed |
Seed used in |
A correlation matrix of the given type.
J. Raymaekers and P.J. Rousseeuw
C. Agostinelli, Leung, A., Yohai, V. J., and Zamar, R. H. (2015). Robust Estimation of Multivariate Location and Scatter in the Presence of Cellwise and Casewise Contamination. Test, 24, 441-461.
Rousseeuw, P.J., Van den Bossche W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145. (link to open access pdf)
J. Raymaekers and P.J. Rousseeuw (2020). Handling cellwise outliers by sparse regression and robust covariance. Arxiv: 1912.12446. (link to open access pdf)
d <- 5 Sigma <- generateCorMat(d, corrType = "ALYZ", seed = 1) Sigma
d <- 5 Sigma <- generateCorMat(d, corrType = "ALYZ", seed = 1) Sigma
This function generates multivariate normal datasets with several possible types of outliers. It is used in several simulation studies. For a detailed description, see the referenced papers.
generateData(n, d, mu, Sigma, perout, gamma, outlierType = "casewise", seed = NULL)
generateData(n, d, mu, Sigma, perout, gamma, outlierType = "casewise", seed = NULL)
n |
The number of observations |
d |
The dimension of the data. |
mu |
The center of the clean data. |
Sigma |
The covariance matrix of the clean data. Could be obtained from |
outlierType |
The type of contamination to be generated. Should be one of:
|
perout |
The percentage of generated outliers. For |
gamma |
How far outliers are from the center of the distribution. |
seed |
Seed used to generate the data. |
A list with components:
X
The generated data matrix of size .
indcells
A vector with the indices of the contaminated cells.
indrows
A vector with the indices of the rowwise outliers.
J. Raymaekers and P.J. Rousseeuw
C. Agostinelli, Leung, A., Yohai, V. J., and Zamar, R. H. (2015). Robust Estimation of Multivariate Location and Scatter in the Presence of Cellwise and Casewise Contamination. Test, 24, 441-461.
Rousseeuw, P.J., Van den Bossche W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145. (link to open access pdf)
J. Raymaekers and P.J. Rousseeuw (2020). Handling cellwise outliers by sparse regression and robust covariance. Arxiv: 1912.12446. (link to open access pdf)
n <- 100 d <- 5 mu <- rep(0, d) Sigma <- diag(d) perout <- 0.1 gamma <- 10 data <- generateData(n, d, mu, Sigma, perout, gamma, outlierType = "cellwisePlain", seed = 1) pairs(data$X) data$indcells
n <- 100 d <- 5 mu <- rep(0, d) Sigma <- diag(d) perout <- 0.1 gamma <- 10 data <- generateData(n, d, mu, Sigma, perout, gamma, outlierType = "cellwisePlain", seed = 1) pairs(data$X) data$indcells
This function carries out classical PCA when the data may contain missing values, by an iterative algorithm. It is based on a Matlab function from the Missing Data Imputation Toolbox v1.0 by A. Folch-Fortuny, F. Arteaga and A. Ferrer.
ICPCA(X, k, scale = FALSE, maxiter = 20, tol = 0.005, tolProb = 0.99, distprob = 0.99)
ICPCA(X, k, scale = FALSE, maxiter = 20, tol = 0.005, tolProb = 0.99, distprob = 0.99)
X |
the input data, which must be a matrix or a data frame. It may contain NA's. It must always be provided. |
k |
the desired number of principal components |
scale |
a value indicating whether and how the original
variables should be scaled. If |
maxiter |
maximum number of iterations. Default is 20. |
tol |
tolerance for iterations. Default is 0.005. |
tolProb |
tolerance probability for residuals. Defaults to 0.99. |
distprob |
probability determining the cutoff values for orthogonal and score distances. Default is 0.99. |
A list with components:
scaleX |
the scales of the columns of X. |
k |
the number of principal components. |
loadings |
the columns are the k loading vectors. |
eigenvalues |
the k eigenvalues. |
center |
vector with the fitted center. |
covmatrix |
estimated covariance matrix. |
It |
number of iteration steps. |
diff |
convergence criterion. |
X.NAimp |
data with all NA's imputed. |
scores |
scores of X.NAimp. |
OD |
orthogonal distances of the rows of X.NAimp. |
cutoffOD |
cutoff value for the OD. |
SD |
score distances of the rows of X.NAimp. |
cutoffSD |
cutoff value for the SD. |
highOD |
row numbers of cases whose |
highSD |
row numbers of cases whose |
residScale |
scale of the residuals. |
stdResid |
standardized residuals. Note that these are NA
for all missing values of |
indcells |
indices of cellwise outliers. |
Wannes Van Den Bossche
Folch-Fortuny, A., Arteaga, F., Ferrer, A. (2016). Missing Data Imputation Toolbox for MATLAB. Chemometrics and Intelligent Laboratory Systems, 154, 93-100.
library(MASS) set.seed(12345) n <- 100; d <- 10 A <- diag(d) * 0.1 + 0.9 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 100, FALSE)] <- NA ICPCA.out <- ICPCA(x, k = 2) plot(ICPCA.out$scores)
library(MASS) set.seed(12345) n <- 100; d <- 10 A <- diag(d) * 0.1 + 0.9 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 100, FALSE)] <- NA ICPCA.out <- ICPCA(x, k = 2) plot(ICPCA.out$scores)
This function performs the MacroPCA algorithm, which can deal with Missing values and Cellwise
and Rowwise Outliers. Note that this function first calls checkDataSet
and analyzes the remaining cleaned data.
MacroPCA(X, k = 0, MacroPCApars = NULL)
MacroPCA(X, k = 0, MacroPCApars = NULL)
X |
|
k |
|
MacroPCApars |
A list of available options detailed below. If MacroPCApars = NULL the defaults below are used.
|
A list with components:
MacroPCApars |
the options used in the call. |
remX |
Cleaned data after |
DDC |
results of the first step of MacroPCA. These are needed to run MacroPCApredict on new data. |
scaleX |
the scales of the columns of |
k |
the number of principal components. |
loadings |
the columns are the |
eigenvalues |
the |
center |
vector with the center. |
alpha |
|
h |
|
It |
number of iteration steps. |
diff |
convergence criterion. |
X.NAimp |
data with all |
scores |
scores of |
OD |
orthogonal distances of the rows of |
cutoffOD |
cutoff value for the OD. |
SD |
score distances of the rows of |
cutoffSD |
cutoff value for the SD. |
highOD |
row numbers of cases whose |
highSD |
row numbers of cases whose |
residScale |
scale of the residuals. |
stdResid |
standardized residuals. Note that these are |
indcells |
indices of cellwise outliers. |
NAimp |
various results for the NA-imputed data. |
Cellimp |
various results for the cell-imputed data. |
Fullimp |
various result for the fully imputed data. |
Rousseeuw P.J., Van den Bossche W.
Hubert, M., Rousseeuw, P.J., Van den Bossche W. (2019). MacroPCA: An all-in-one PCA method allowing for missing values as well as cellwise and rowwise outliers. Technometrics, 61(4), 459-473. (link to open access pdf)
library(MASS) set.seed(12345) n <- 50; d <- 10 A <- matrix(0.9, d, d); diag(A) = 1 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 50, FALSE)] <- NA x[sample(1:(n * d), 50, FALSE)] <- 10 MacroPCA.out <- MacroPCA(x, 2) cellMap(MacroPCA.out$stdResid) # For more examples, we refer to the vignette: ## Not run: vignette("MacroPCA_examples") ## End(Not run)
library(MASS) set.seed(12345) n <- 50; d <- 10 A <- matrix(0.9, d, d); diag(A) = 1 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 50, FALSE)] <- NA x[sample(1:(n * d), 50, FALSE)] <- 10 MacroPCA.out <- MacroPCA(x, 2) cellMap(MacroPCA.out$stdResid) # For more examples, we refer to the vignette: ## Not run: vignette("MacroPCA_examples") ## End(Not run)
Based on a MacroPCA
fit of an initial (training) data set X
, this function analyzes a
new (test) data set Xnew
.
MacroPCApredict(Xnew, InitialMacroPCA, MacroPCApars = NULL)
MacroPCApredict(Xnew, InitialMacroPCA, MacroPCApars = NULL)
Xnew |
The new data (test data), which must be a matrix or a data frame.
It must always be provided. Its columns (variables) should correspond to those of |
InitialMacroPCA |
The output of the MacroPCA function on the initial (training) dataset. Must be provided. |
MacroPCApars |
The input options to be used for the prediction.
By default the options of InitialMacroPCA are used. For the complete list of
options see the function |
A list with components:
MacroPCApars |
the options used in the call. |
DDC |
result of DDCpredict which is the first step of MacroPCApredict.
See the function |
scaleX |
the scales of the columns of |
k |
the number of principal components. |
loadings |
the columns are the |
eigenvalues |
the |
center |
vector with the fitted center. |
It |
number of iteration steps. |
diff |
convergence criterion. |
Xnew.NAimp |
|
scores |
scores of |
OD |
orthogonal distances of the rows of |
cutoffOD |
cutoff value for the OD. |
SD |
score distances of the rows of |
cutoffSD |
cutoff value for the SD. |
highOD |
row numbers of cases in |
highSD |
row numbers of cases in |
residScale |
scale of the residuals. |
stdResid |
standardized residuals. Note that these are |
indcells |
indices of cellwise outliers. |
NAimp |
various results for the NA-imputed Xnew. |
Cellimp |
various results for the cell-imputed Xnew. |
Fullimp |
various result for the fully imputed Xnew. |
Rousseeuw P.J., Van den Bossche W.
Hubert, M., Rousseeuw, P.J., Van den Bossche W. (2019). MacroPCA: An all-in-one PCA method allowing for missing values as well as cellwise and rowwise outliers. Technometrics, 61(4), 459-473. (link to open access pdf)
checkDataSet
, cellMap
,
DDC
, DDCpredict
,
MacroPCA
library(MASS) set.seed(12345) n <- 50; d <- 10 A <- matrix(0.9, d, d); diag(A) = 1 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 50, FALSE)] <- NA x[sample(1:(n * d), 50, FALSE)] <- 10 MacroPCA.out <- MacroPCA(x, 2) xnew <- mvrnorm(25, rep(0,d), A) xnew[sample(1:(25 * d), 12, FALSE)] <- 10 predict.out <- MacroPCApredict(xnew, MacroPCA.out) cellMap(predict.out$stdResid) # For more examples, we refer to the vignette: ## Not run: vignette("MacroPCA_examples") ## End(Not run)
library(MASS) set.seed(12345) n <- 50; d <- 10 A <- matrix(0.9, d, d); diag(A) = 1 x <- mvrnorm(n, rep(0,d), A) x[sample(1:(n * d), 50, FALSE)] <- NA x[sample(1:(n * d), 50, FALSE)] <- 10 MacroPCA.out <- MacroPCA(x, 2) xnew <- mvrnorm(25, rep(0,d), A) xnew[sample(1:(25 * d), 12, FALSE)] <- 10 predict.out <- MacroPCApredict(xnew, MacroPCA.out) cellMap(predict.out$stdResid) # For more examples, we refer to the vignette: ## Not run: vignette("MacroPCA_examples") ## End(Not run)
The outlier map is a diagnostic plot for the output of MacroPCA
.
outlierMap(res,title="Robust PCA",col="black", pch=16,labelOut=TRUE,id=3, xlim = NULL, ylim = NULL, cex = 1, cex.main=1.2, cex.lab=NULL, cex.axis=NULL)
outlierMap(res,title="Robust PCA",col="black", pch=16,labelOut=TRUE,id=3, xlim = NULL, ylim = NULL, cex = 1, cex.main=1.2, cex.lab=NULL, cex.axis=NULL)
res |
A list containing the orthogonal distances ( |
title |
Title of the plot, default is "Robust PCA". |
col |
Colour of the points in the plot, this can be a single colour for all points or a vector or list specifying the colour for each point. The default is "black". |
pch |
Plotting characters or symbol used in the plot, see points for more details. The default is 16 which corresponds to filled circles. |
labelOut |
Logical indicating if outliers should be labelled on the plot, default is |
id |
Number of OD outliers and number of SD outliers to label on the plot, default is 3. |
xlim |
Optional argument to set the limits of the |
ylim |
Optional argument to set the limits of the |
cex |
Optional argument determining the size of the plotted points. See |
cex.main |
Optional argument determining the size of the main title. See |
cex.lab |
Optional argument determining the size of the labels. See |
cex.axis |
Optional argument determining the size of the axes. See |
The outlier map contains the score distances on the x-axis and the orthogonal distances on the y-axis. To detect outliers, cut-offs for both distances are shown, see Hubert et al. (2005).
P.J. Rousseeuw
Hubert, M., Rousseeuw, P. J., and Vanden Branden, K. (2005). ROBPCA: A New Approach to Robust Principal Component Analysis. Technometrics, 47, 64-79.
# empty for now
# empty for now
Function for making plots based on the output of cellMCD
.
plot_cellMCD(cellout, type = "Zres/X", whichvar = NULL, horizvar = NULL, vertivar = NULL, hband = NULL, vband = NULL, drawellipse = T, opacity = 0.5, identify = FALSE, ids = NULL, labelpoints = T, vlines = FALSE, clines = TRUE, main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, cex = 1, cex.main = 1.2, cex.txt = 0.8, cex.lab = 1, line = 2.0)
plot_cellMCD(cellout, type = "Zres/X", whichvar = NULL, horizvar = NULL, vertivar = NULL, hband = NULL, vband = NULL, drawellipse = T, opacity = 0.5, identify = FALSE, ids = NULL, labelpoints = T, vlines = FALSE, clines = TRUE, main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, cex = 1, cex.main = 1.2, cex.txt = 0.8, cex.lab = 1, line = 2.0)
cellout |
output of function |
type |
type of diagnostic plot. Should be one of |
whichvar |
number or name of the variable to be plotted. Not applicable when |
horizvar |
number or name of the variable to be plotted on the horizontal axis. Only when |
vertivar |
number or name of the variable to be plotted on the vertical axis. Only when |
hband |
whether to draw a horizontal tolerance band. |
vband |
whether to draw a vertical tolerance band. |
drawellipse |
whether to draw a |
opacity |
opacity of the plotted points: 1 is fully opaque, less is more transparent. |
identify |
if |
ids |
vector of case numbers to be emphasized (colored red) in the plot. If |
labelpoints |
if |
vlines |
for the points in |
clines |
only for type == "bivariate". If TRUE, draws a red connecting line from each point in ids to its imputed point, shown in blue. |
main |
main title of the plot. If |
xlab |
overriding label for x-axis, unless |
ylab |
overriding label for y-axis, unless |
xlim |
overriding limits of horizontal axis. |
ylim |
overriding limits of vertical axis. |
cex |
size of plotted points. |
cex.main |
size of the main title. |
cex.lab |
size of the axis labels. |
cex.txt |
size of the point labels. |
line |
distance of axis labels to their axis. |
NULL
, unless identify = TRUE
. Then a list with components:
ids
the case number(s) that were identified
coords
coordinates of all points in the plot.
J. Raymaekers and P.J. Rousseeuw
J. Raymaekers and P.J. Rousseeuw (2022). The cellwise MCD estimator, Journal of the American Statistical Association, to appear. doi:10.1080/01621459.2023.2267777(link to open access pdf)
mu <- rep(0, 3) Sigma <- diag(3) * 0.5 + 0.5 set.seed(123) X <- MASS::mvrnorm(1000, mu, Sigma) X[1:5, 1] <- X[1:5, 1] + 5 X[6:10, 2] <- X[6:10, 2] - 10 X[12, 1:2] <- c(-4,8) cellMCD.out <- cellMCD(X) plot_cellMCD(cellMCD.out, type="bivariate", horizvar=1, vertivar=2, ids=c(1:10,12)) # For more examples, we refer to the vignette: ## Not run: vignette("cellMCD_examples") ## End(Not run)
mu <- rep(0, 3) Sigma <- diag(3) * 0.5 + 0.5 set.seed(123) X <- MASS::mvrnorm(1000, mu, Sigma) X[1:5, 1] <- X[1:5, 1] + 5 X[6:10, 2] <- X[6:10, 2] - 10 X[12, 1:2] <- c(-4,8) cellMCD.out <- cellMCD(X) plot_cellMCD(cellMCD.out, type="bivariate", horizvar=1, vertivar=2, ids=c(1:10,12)) # For more examples, we refer to the vignette: ## Not run: vignette("cellMCD_examples") ## End(Not run)
This function uses reweighted maximum likelihood to robustly fit the
Box-Cox or Yeo-Johnson transformation to each variable in a dataset.
Note that this function first calls checkDataSet
to ensure that the variables to be transformed are not too discrete.
transfo(X, type = "YJ", robust = TRUE, standardize = TRUE, quant = 0.99, nbsteps = 2, checkPars = list())
transfo(X, type = "YJ", robust = TRUE, standardize = TRUE, quant = 0.99, nbsteps = 2, checkPars = list())
X |
A data matrix of dimensions n x d. Its columns are the variables to be transformed. |
type |
The type of transformation to be fit. Should be one of:
|
robust |
if |
standardize |
whether to standardize the variables before and after the power transformation. See Details below. |
quant |
quantile for determining the weights in the
reweighting step (ignored when |
nbsteps |
number of reweighting steps (ignored when
|
checkPars |
Optional list of parameters used in the call to
|
In case standardize = TRUE
, the variables is standardized before and after transformation.
For BC the variable is divided by its median before transformation.
For YJ and robust = TRUE
this subtracts its median and divides by its mad (median absolute deviation) before transformation. For YJ and robust = FALSE
this subtracts the mean and divides by the standard deviation before transformation. For the standardization after the transformation, the classical mean and standard deviation are used in case robust = FALSE
. If robust = TRUE
, the mean and standard deviation are calculated robustly on a subset of inliers.
A list with components:
lambdahats
the estimated transformation parameter for each column of X
.
Y
A matrix in which each column is the transformed version of the
corresponding column of X
.
The transformed version includes pre- and post-standardization if standardize=TRUE
.
muhat
The estimated location of each column of Y
.
sigmahat
The estimated scale of each column of Y
.
weights
The final weights from the reweighting.
ttypes
The type of transform used in each column.
objective
Value of the (reweighted) maximum likelihood objective function.
values of checkDataSet
, unless coreOnly
is TRUE
.
J. Raymaekers and P.J. Rousseeuw
J. Raymaekers and P.J. Rousseeuw (2021). Transforming variables to central normality. Machine Learning. doi:10.1007/s10994-021-05960-5(link to open access pdf)
transfo_newdata
, transfo_transformback
# find Box-Cox transformation parameter for lognormal data: set.seed(123) x <- exp(rnorm(1000)) transfo.out <- transfo(x, type = "BC") # estimated parameter: transfo.out$lambdahat # value of the objective function: transfo.out$objective # the transformed variable: transfo.out$Y # the type of transformation used: transfo.out$ttypes # qqplot of the transformed variable: qqnorm(transfo.out$Y); abline(0,1) # For more examples, we refer to the vignette: ## Not run: vignette("transfo_examples") ## End(Not run)
# find Box-Cox transformation parameter for lognormal data: set.seed(123) x <- exp(rnorm(1000)) transfo.out <- transfo(x, type = "BC") # estimated parameter: transfo.out$lambdahat # value of the objective function: transfo.out$objective # the transformed variable: transfo.out$Y # the type of transformation used: transfo.out$ttypes # qqplot of the transformed variable: qqnorm(transfo.out$Y); abline(0,1) # For more examples, we refer to the vignette: ## Not run: vignette("transfo_examples") ## End(Not run)
transfo
.
Based on the output of transfo
, transform the variables using Yeo-Johnson and/or Box-Cox transformations with the previously estimated parameters and standardization.
transfo_newdata(Xnew, transfo.out)
transfo_newdata(Xnew, transfo.out)
Xnew |
A data matrix with d columns, which contain the variables to be transformed. The number of columns and their names must be the same as those of the original data on which |
transfo.out |
The output of a call to |
Returns a matrix with transformed variables.
J. Raymaekers and P.J. Rousseeuw
J. Raymaekers and P.J. Rousseeuw (2021). Transforming variables to central normality. Machine Learning. doi:10.1007/s10994-021-05960-5(link to open access pdf)
set.seed(123); tempraw <- matrix(rnorm(2000), ncol = 2) tempx <- cbind(tempraw[, 1],exp(tempraw[, 2])) tempy <- 0.5 * tempraw[, 1] + 0.5 * tempraw[, 2] + 1 x <- tempx[1:900, ] y <- tempy[1:900] tx.out <- transfo(x, type = "bestObj") tx.out$ttypes tx.out$lambdahats tx <- tx.out$Y lm.out <- lm(y ~ tx) summary(lm.out) xnew <- tempx[901:1000, ] xtnew <- transfo_newdata(xnew, tx.out) yhatnew <- tcrossprod(lm.out$coefficients, cbind(1, xtnew)) plot(tempy[901:1000], yhatnew); abline(0, 1)
set.seed(123); tempraw <- matrix(rnorm(2000), ncol = 2) tempx <- cbind(tempraw[, 1],exp(tempraw[, 2])) tempy <- 0.5 * tempraw[, 1] + 0.5 * tempraw[, 2] + 1 x <- tempx[1:900, ] y <- tempy[1:900] tx.out <- transfo(x, type = "bestObj") tx.out$ttypes tx.out$lambdahats tx <- tx.out$Y lm.out <- lm(y ~ tx) summary(lm.out) xnew <- tempx[901:1000, ] xtnew <- transfo_newdata(xnew, tx.out) yhatnew <- tcrossprod(lm.out$coefficients, cbind(1, xtnew)) plot(tempy[901:1000], yhatnew); abline(0, 1)
transfo
.
Based on the output of transfo
, backtransform the variables to their original shape through the inverse Yeo-Johnson and/or Box-Cox transformations with the previusly estimated parameters and standardization.
transfo_transformback(Ynew, transfo.out)
transfo_transformback(Ynew, transfo.out)
Ynew |
A data matrix with d columns, which contain the variables to be backtransformed. The number of columns must be the same as the output |
transfo.out |
The output of a call to |
Returns a matrix with backtransformed variables.
J. Raymaekers and P.J. Rousseeuw
J. Raymaekers and P.J. Rousseeuw (2021). Transforming variables to central normality. Machine Learning. doi:10.1007/s10994-021-05960-5(link to open access pdf)
set.seed(123); x <- matrix(rnorm(2000), ncol = 2) y <- sqrt(abs(0.3 * x[, 1] + 0.5 * x[, 2] + 4)) ty.out <- transfo(y, type = "BC") ty.out$lambdahats ty <- ty.out$Y lm.out <- lm(ty ~ x) yhat <- transfo_transformback(lm.out$fitted.values, ty.out) plot(y, yhat); abline(0, 1)
set.seed(123); x <- matrix(rnorm(2000), ncol = 2) y <- sqrt(abs(0.3 * x[, 1] + 0.5 * x[, 2] + 4)) ty.out <- transfo(y, type = "BC") ty.out$lambdahats ty <- ty.out$Y lm.out <- lm(ty ~ x) yhat <- transfo_transformback(lm.out$fitted.values, ty.out) plot(y, yhat); abline(0, 1)
Similar usage to robustbase::classPC except for the new argument ncomb
which is the desired number of components. Only this many PC's are computed in order to save computation time. Makes use of propack.svd
of package svd.
truncPC(X, ncomp = NULL, scale = FALSE, center = TRUE, signflip = TRUE, via.svd = NULL, scores = FALSE)
truncPC(X, ncomp = NULL, scale = FALSE, center = TRUE, signflip = TRUE, via.svd = NULL, scores = FALSE)
X |
a numeric matrix. |
ncomp |
the desired number of components (if not specified, all components are computed). |
scale |
logical, or numeric vector for scaling the columns. |
center |
logical or numeric vector for centering the matrix. |
signflip |
logical indicating if the signs of the loadings should be flipped such that the absolutely largest value is always positive. |
via.svd |
dummy argument for compatibility with classPC calls, will be ignored. |
scores |
logical indicating whether or not scores should be returned. |
A list with components:
rank |
the (numerical) matrix rank of |
eigenvalues |
the |
loadings |
the loadings, a |
scores |
if the |
center |
a vector of means, unless the center argument was |
scale |
a vector of column scales, unless the scale argument was false. |
P.J. Rousseeuw
library(MASS) set.seed(12345) n <- 100; d <- 10 A <- diag(d) * 0.1 + 0.9 x <- mvrnorm(n, rep(0,d), A) truncPCA.out <- truncPC(x, ncomp = 2, scores = TRUE) plot(truncPCA.out$scores)
library(MASS) set.seed(12345) n <- 100; d <- 10 A <- diag(d) * 0.1 + 0.9 x <- mvrnorm(n, rep(0,d), A) truncPCA.out <- truncPC(x, ncomp = 2, scores = TRUE) plot(truncPCA.out$scores)
This function transforms a dataset X with cellwise weights W to an extended data matrix U with the same number of columns but more rows, and containing more NA's. Its rows have the case weights v.
unpack(X,W)
unpack(X,W)
X |
An |
W |
An |
A list with components:
U
unpacked data matrix, with the same columns as X
but typically more rows.
V
vector with the rowwise (=casewise) weights of U
.
P.J. Rousseeuw
P.J. Rousseeuw (2023). Analyzing cellwise weighted data. Econometrics and Statistics, appeared online. doi:10.1016/j.ecosta.2023.01.007(link to open access pdf)
X <- matrix(c(2.8, 5.3, 4.9, 7.4, 2.3, 5.7, 4.3, 7.2, 2.5, 5.1, 4.4, 7.6), nrow = 3, byrow = TRUE) W <- matrix(c(0.8, 1.0, 0.3, 0.4, 0.3, 0.5, 0.9, 0.5, 1.0, 0.6, 0, 0.7), nrow = 3, byrow = TRUE) rownames(X) <- rownames(W) <- c("A", "B", "C") colnames(X) <- colnames(W) <- c("V1", "V2", "V3", "V4") X W out <- unpack(X, W) cbind(out$U, out$v) # For more examples, we refer to the vignette: ## Not run: vignette("cellwise_weights_examples") ## End(Not run)
X <- matrix(c(2.8, 5.3, 4.9, 7.4, 2.3, 5.7, 4.3, 7.2, 2.5, 5.1, 4.4, 7.6), nrow = 3, byrow = TRUE) W <- matrix(c(0.8, 1.0, 0.3, 0.4, 0.3, 0.5, 0.9, 0.5, 1.0, 0.6, 0, 0.7), nrow = 3, byrow = TRUE) rownames(X) <- rownames(W) <- c("A", "B", "C") colnames(X) <- colnames(W) <- c("V1", "V2", "V3", "V4") X W out <- unpack(X, W) cbind(out$U, out$v) # For more examples, we refer to the vignette: ## Not run: vignette("cellwise_weights_examples") ## End(Not run)
Carries out a rowwise weighted EM algorithm to estimate mu and Sigma of incomplete Gaussian data.
weightedEM(X, w=NULL, lmin=NULL, crit=1e-4, maxiter=1000, initEst=NULL, computeloglik=F)
weightedEM(X, w=NULL, lmin=NULL, crit=1e-4, maxiter=1000, initEst=NULL, computeloglik=F)
X |
n by d data matrix or data frame. |
w |
vector with n nonnegative rowwise (casewise)
weights. If |
lmin |
if not |
crit |
convergence criterion of successive mu and Sigma estimates. |
maxiter |
maximal number of iteration steps. |
initEst |
if not |
computeloglik |
if |
A list with components:
mu
the estimated location vector.
Sigma
the estimated covariance matrix.
impX
the imputed data matrix.
niter
the number of iteration steps taken.
loglikhd
vector with the total log(likelihood) at every
iteration step. When computeloglik = FALSE
this
array contains NA's.
P.J. Rousseeuw
P.J. Rousseeuw (2023). Analyzing cellwise weighted data. Econometrics and Statistics, appeared online. doi:10.1016/j.ecosta.2023.01.007(link to open access pdf)
Sigma <- matrix(0.7, 3, 3); diag(Sigma) <- 1 set.seed(12345); X <- MASS::mvrnorm(1000, rep(0, 3), Sigma) X[1, 3] <- X[2, 2] <- X[3, 1] <- X[4, 1] <- X[5, 2] <- NA w <- runif(1000, 0, 1) # rowwise weights out <- weightedEM(X, w, crit = 1e-12, computeloglik = TRUE) out$niter # number of iteration steps taken plot(1:out$niter, out$loglikhd[1:out$niter], type = 'l', lty = 1, col = 4, xlab = 'step', ylab = 'log(likelihood)', main = 'log(likelihood) of weighted EM iterations') out$mu # estimated center round(out$Sigma, 6) # estimated covariance matrix head(X) # the data has NA's head(out$impX) # imputed data, has no NA's # For more examples, we refer to the vignette: ## Not run: vignette("cellwise_weights_examples") ## End(Not run)
Sigma <- matrix(0.7, 3, 3); diag(Sigma) <- 1 set.seed(12345); X <- MASS::mvrnorm(1000, rep(0, 3), Sigma) X[1, 3] <- X[2, 2] <- X[3, 1] <- X[4, 1] <- X[5, 2] <- NA w <- runif(1000, 0, 1) # rowwise weights out <- weightedEM(X, w, crit = 1e-12, computeloglik = TRUE) out$niter # number of iteration steps taken plot(1:out$niter, out$loglikhd[1:out$niter], type = 'l', lty = 1, col = 4, xlab = 'step', ylab = 'log(likelihood)', main = 'log(likelihood) of weighted EM iterations') out$mu # estimated center round(out$Sigma, 6) # estimated covariance matrix head(X) # the data has NA's head(out$impX) # imputed data, has no NA's # For more examples, we refer to the vignette: ## Not run: vignette("cellwise_weights_examples") ## End(Not run)
Transforms multivariate data X
using the wrapping function with b = 1.5
and c = 4
. By default, it starts by calling checkDataSet
to clean the data and estLocScale
to estimate the location and scale of the variables in the cleaned data, yielding the vectors and
where
is the number of variables. Alternatively, the user can specify such vectors in the arguments
locX
and scaleX
. In either case, the data cell containing variable
of case
is transformed to
in which and
are such that for any fixed
the average of
equals
and the standard deviation of
equals
.
wrap(X, locX = NULL, scaleX = NULL, precScale = 1e-12, imputeNA = TRUE, checkPars = list())
wrap(X, locX = NULL, scaleX = NULL, precScale = 1e-12, imputeNA = TRUE, checkPars = list())
X |
the input data. It must be an |
locX |
The location estimates of the columns of the input data |
scaleX |
The scale estimates of the columns of the input data |
precScale |
The precision scale used throughout the algorithm. Defaults to |
imputeNA |
Whether or not to impute the |
checkPars |
Optional list of parameters used in the call to
|
A list with components:
Xw
The wrapped data.
colInWrap
The column numbers of the variables which were wrapped. Variables which were filtered out by checkDataSet
(because of a (near) zero scale for example), will not appear in this output.
loc
The location estimates for all variables used for wrapping.
scale
The scale estimates for all variables used for wrapping.
Raymaekers, J. and Rousseeuw P.J.
Raymaekers, J., Rousseeuw P.J. (2019). Fast robust correlation for high dimensional data. Technometrics, 63(2), 184-198. (link to open access pdf)
library(MASS) set.seed(12345) n <- 100; d <- 10 X <- mvrnorm(n, rep(0, 10), diag(10)) locScale <- estLocScale(X) Xw <- wrap(X, locScale$loc, locScale$scale)$Xw # For more examples, we refer to the vignette: ## Not run: vignette("wrap_examples") ## End(Not run)
library(MASS) set.seed(12345) n <- 100; d <- 10 X <- mvrnorm(n, rep(0, 10), diag(10)) locScale <- estLocScale(X) Xw <- wrap(X, locScale$loc, locScale$scale)$Xw # For more examples, we refer to the vignette: ## Not run: vignette("wrap_examples") ## End(Not run)