Title: | Robust Methods for High-Dimensional Data |
---|---|
Description: | Robust methods for high-dimensional data, in particular linear model selection techniques based on least angle regression and sparse regression. Specifically, the package implements robust least angle regression (Khan, Van Aelst & Zamar, 2007; <doi:10.1198/016214507000000950>), (robust) groupwise least angle regression (Alfons, Croux & Gelper, 2016; <doi:10.1016/j.csda.2015.02.007>), and sparse least trimmed squares regression (Alfons, Croux & Gelper, 2013; <doi:10.1214/12-AOAS575>). |
Authors: | Andreas Alfons [aut, cre] , Dirk Eddelbuettel [ctb] |
Maintainer: | Andreas Alfons <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.8.1 |
Built: | 2024-11-28 06:53:38 UTC |
Source: | CRAN |
Robust methods for high-dimensional data, in particular linear model selection techniques based on least angle regression and sparse regression. Specifically, the package implements robust least angle regression (Khan, Van Aelst & Zamar, 2007; <doi:10.1198/016214507000000950>), (robust) groupwise least angle regression (Alfons, Croux & Gelper, 2016; <doi:10.1016/j.csda.2015.02.007>), and sparse least trimmed squares regression (Alfons, Croux & Gelper, 2013; <doi:10.1214/12-AOAS575>).
The DESCRIPTION file:
Package: | robustHD |
Type: | Package |
Title: | Robust Methods for High-Dimensional Data |
Version: | 0.8.1 |
Date: | 2024-06-30 |
Depends: | R (>= 3.5.0), ggplot2 (>= 0.9.2), perry (>= 0.3.0), robustbase (>= 0.9-5) |
Imports: | MASS, Rcpp (>= 0.9.10), grDevices, parallel, stats, utils |
LinkingTo: | Rcpp (>= 0.9.10), RcppArmadillo (>= 0.3.0) |
Suggests: | lars, mvtnorm, testthat |
Description: | Robust methods for high-dimensional data, in particular linear model selection techniques based on least angle regression and sparse regression. Specifically, the package implements robust least angle regression (Khan, Van Aelst & Zamar, 2007; <doi:10.1198/016214507000000950>), (robust) groupwise least angle regression (Alfons, Croux & Gelper, 2016; <doi:10.1016/j.csda.2015.02.007>), and sparse least trimmed squares regression (Alfons, Croux & Gelper, 2013; <doi:10.1214/12-AOAS575>). |
License: | GPL (>= 2) |
URL: | https://github.com/aalfons/robustHD |
BugReports: | https://github.com/aalfons/robustHD/issues |
LazyLoad: | yes |
Authors@R: | c(person("Andreas", "Alfons", email = "[email protected]", role = c("aut", "cre"), comment = c(ORCID = "0000-0002-2513-3788")), person("Dirk", "Eddelbuettel", role = "ctb")) |
Author: | Andreas Alfons [aut, cre] (<https://orcid.org/0000-0002-2513-3788>), Dirk Eddelbuettel [ctb] |
Maintainer: | Andreas Alfons <[email protected]> |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.1 |
NeedsCompilation: | yes |
Packaged: | 2024-06-30 19:56:06 UTC; alfons |
Repository: | CRAN |
Date/Publication: | 2024-06-30 20:40:02 UTC |
Index of help topics:
AIC.seqModel Information criteria for a sequence of regression models TopGear Top Gear car data coef.seqModel Extract coefficients from a sequence of regression models coefPlot Coefficient plot of a sequence of regression models corHuber Robust correlation based on winsorization critPlot Optimality criterion plot of a sequence of regression models diagnosticPlot Diagnostic plots for a sequence of regression models fitted.seqModel Extract fitted values from a sequence of regression models getScale Extract the residual scale of a robust regression model grplars (Robust) groupwise least angle regression lambda0 Penalty parameter for sparse LTS regression nci60 NCI-60 cancer cell panel partialOrder Find partial order of smallest or largest values perry.seqModel Resampling-based prediction error for a sequential regression model plot.seqModel Plot a sequence of regression models predict.seqModel Predict from a sequence of regression models residuals.seqModel Extract residuals from a sequence of regression models rlars Robust least angle regression robustHD-package Robust Methods for High-Dimensional Data rstandard.seqModel Extract standardized residuals from a sequence of regression models setupCoefPlot Set up a coefficient plot of a sequence of regression models setupCritPlot Set up an optimality criterion plot of a sequence of regression models setupDiagnosticPlot Set up a diagnostic plot for a sequence of regression models sparseLTS Sparse least trimmed squares regression standardize Data standardization tsBlocks Construct predictor blocks for time series models tslars (Robust) least angle regression for time series data tslarsP (Robust) least angle regression for time series data with fixed lag length weights.sparseLTS Extract outlier weights from sparse LTS regression models winsorize Data cleaning by winsorization
Andreas Alfons [aut, cre] (<https://orcid.org/0000-0002-2513-3788>), Dirk Eddelbuettel [ctb]
Maintainer: Andreas Alfons <[email protected]>
Alfons (2021) robustHD: An R package for robust regression with high-dimensional data. Journal of Open Source Software, 6(67), 3786. doi:10.21105/joss.03786.
Useful links:
Compute the Akaike or Bayes information criterion for for a sequence of regression models, such as submodels along a robust least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter.
## S3 method for class 'seqModel' AIC(object, ..., k = 2) ## S3 method for class 'sparseLTS' AIC(object, ..., fit = c("reweighted", "raw", "both"), k = 2) ## S3 method for class 'seqModel' BIC(object, ...) ## S3 method for class 'sparseLTS' BIC(object, ...)
## S3 method for class 'seqModel' AIC(object, ..., k = 2) ## S3 method for class 'sparseLTS' AIC(object, ..., fit = c("reweighted", "raw", "both"), k = 2) ## S3 method for class 'seqModel' BIC(object, ...) ## S3 method for class 'sparseLTS' BIC(object, ...)
object |
the model fit for which to compute the information criterion. |
... |
for the |
k |
a numeric value giving the penalty per parameter to be used. The
default is to use |
fit |
a character string specifying for which fit to compute the
information criterion. Possible values are |
The information criteria are computed as
,
where
denotes the number of observations,
is the robust residual scale estimate,
is the number of nonzero
coefficient estimates, and
is penalty per parameter. The usual
definition of the AIC uses
, whereas the BIC uses
. Consequently, the former is used as the
default penalty of the
AIC
method, whereas the BIC
method calls
the AIC
method with the latter penalty.
A numeric vector or matrix giving the information criteria for the requested model fits.
Computing information criteria for several objects supplied via the
...
argument (as for the default methods of AIC
and BIC
) is currently not implemented.
Andreas Alfons
Akaike, H. (1970) Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22(2), 203–217.
Schwarz, G. (1978) Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # compute AIC and BIC AIC(fitRlars) BIC(fitRlars) ## fit sparse LTS model over a grid of values for lambda frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # compute AIC and BIC AIC(fitSparseLTS) BIC(fitSparseLTS)
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # compute AIC and BIC AIC(fitRlars) BIC(fitRlars) ## fit sparse LTS model over a grid of values for lambda frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # compute AIC and BIC AIC(fitSparseLTS) BIC(fitSparseLTS)
Extract coefficients from a sequence of regression models, such as submodels along a robust or groupwise least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter.
## S3 method for class 'seqModel' coef(object, s = NA, zeros = TRUE, drop = !is.null(s), ...) ## S3 method for class 'tslars' coef(object, p, ...) ## S3 method for class 'perrySeqModel' coef(object, ...) ## S3 method for class 'sparseLTS' coef( object, s = NA, fit = c("reweighted", "raw", "both"), zeros = TRUE, drop = !is.null(s), ... )
## S3 method for class 'seqModel' coef(object, s = NA, zeros = TRUE, drop = !is.null(s), ...) ## S3 method for class 'tslars' coef(object, p, ...) ## S3 method for class 'perrySeqModel' coef(object, ...) ## S3 method for class 'sparseLTS' coef( object, s = NA, fit = c("reweighted", "raw", "both"), zeros = TRUE, drop = !is.null(s), ... )
object |
the model fit from which to extract coefficients. |
s |
for the |
zeros |
a logical indicating whether to keep zero coefficients
( |
drop |
a logical indicating whether to reduce the dimension to a vector in case of only one submodel. |
... |
for the |
p |
an integer giving the lag length for which to extract coefficients (the default is to use the optimal lag length). |
fit |
a character string specifying which coefficients to extract.
Possible values are |
A numeric vector or matrix containing the requested regression coefficients.
Andreas Alfons
coef
, rlars
,
grplars
, rgrplars
, tslarsP
,
rtslarsP
, tslars
, rtslars
,
sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract coefficients coef(fitRlars, zeros = FALSE) coef(fitRlars, s = 1:5, zeros = FALSE) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract coefficients coef(fitSparseLTS, zeros = FALSE) coef(fitSparseLTS, fit = "both", zeros = FALSE) coef(fitSparseLTS, s = NULL, zeros = FALSE) coef(fitSparseLTS, fit = "both", s = NULL, zeros = FALSE)
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract coefficients coef(fitRlars, zeros = FALSE) coef(fitRlars, s = 1:5, zeros = FALSE) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract coefficients coef(fitSparseLTS, zeros = FALSE) coef(fitSparseLTS, fit = "both", zeros = FALSE) coef(fitSparseLTS, s = NULL, zeros = FALSE) coef(fitSparseLTS, fit = "both", s = NULL, zeros = FALSE)
Produce a plot of the coefficients from a sequence of regression models, such as submodels along a robust or groupwise least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter.
coefPlot(object, ...) ## S3 method for class 'seqModel' coefPlot(object, zeros = FALSE, labels = NULL, ...) ## S3 method for class 'tslars' coefPlot(object, p, zeros = FALSE, labels = NULL, ...) ## S3 method for class 'sparseLTS' coefPlot( object, fit = c("reweighted", "raw", "both"), zeros = FALSE, labels = NULL, ... ) ## S3 method for class 'setupCoefPlot' coefPlot( object, abscissa = NULL, size = c(0.5, 2, 4), offset = 1, facets = object$facets, ... )
coefPlot(object, ...) ## S3 method for class 'seqModel' coefPlot(object, zeros = FALSE, labels = NULL, ...) ## S3 method for class 'tslars' coefPlot(object, p, zeros = FALSE, labels = NULL, ...) ## S3 method for class 'sparseLTS' coefPlot( object, fit = c("reweighted", "raw", "both"), zeros = FALSE, labels = NULL, ... ) ## S3 method for class 'setupCoefPlot' coefPlot( object, abscissa = NULL, size = c(0.5, 2, 4), offset = 1, facets = object$facets, ... )
object |
the model fit to be plotted. |
... |
additional arguments to be passed down, eventually to
|
zeros |
a logical indicating whether predictors that never enter the
model and thus have zero coefficients should be included in the plot
( |
labels |
an optional character vector containing labels for the
predictors. Plotting labels can be suppressed by setting this to
|
p |
an integer giving the lag length for which to produce the plot (the default is to use the optimal lag length). |
fit |
a character string specifying for which estimator to produce the
plot. Possible values are |
abscissa |
a character string specifying what to plot on the
|
size |
a numeric vector of length three giving the line width, the point size and the label size, respectively. |
offset |
an integer giving the offset of the labels from the corresponding coefficient values from the last step (i.e., the number of blank characters to be prepended to the label). |
facets |
a faceting formula to override the default behavior. If
supplied, |
An object of class "ggplot"
(see ggplot
).
Andreas Alfons
ggplot
, rlars
,
grplars
, rgrplars
, tslarsP
,
rtslarsP
, tslars
, rtslars
,
sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # create plot coefPlot(fitRlars) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # create plot coefPlot(fitSparseLTS) coefPlot(fitSparseLTS, fit = "both")
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # create plot coefPlot(fitRlars) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # create plot coefPlot(fitSparseLTS) coefPlot(fitSparseLTS, fit = "both")
Compute a robust correlation estimate based on winsorization, i.e., by shrinking outlying observations to the border of the main part of the data.
corHuber( x, y, type = c("bivariate", "adjusted", "univariate"), standardized = FALSE, centerFun = median, scaleFun = mad, const = 2, prob = 0.95, tol = .Machine$double.eps^0.5, ... )
corHuber( x, y, type = c("bivariate", "adjusted", "univariate"), standardized = FALSE, centerFun = median, scaleFun = mad, const = 2, prob = 0.95, tol = .Machine$double.eps^0.5, ... )
x |
a numeric vector. |
y |
a numeric vector. |
type |
a character string specifying the type of winsorization to be
used. Possible values are |
standardized |
a logical indicating whether the data are already robustly standardized. |
centerFun |
a function to compute a robust estimate for the center to
be used for robust standardization (defaults to
|
scaleFun |
a function to compute a robust estimate for the scale to
be used for robust standardization (defaults to |
const |
numeric; tuning constant to be used in univariate or adjusted univariate winsorization (defaults to 2). |
prob |
numeric; probability for the quantile of the
|
tol |
a small positive numeric value. This is used in bivariate winsorization to determine whether the initial estimate from adjusted univariate winsorization is close to 1 in absolute value. In this case, bivariate winsorization would fail since the points form almost a straight line, and the initial estimate is returned. |
... |
additional arguments to be passed to
|
The borders of the main part of the data are defined on the scale of the
robustly standardized data. In univariate winsorization, the borders for
each variable are given by const
, thus a symmetric
distribution is assumed. In adjusted univariate winsorization, the borders
for the two diagonally opposing quadrants containing the minority of the
data are shrunken by a factor that depends on the ratio between the number of
observations in the major and minor quadrants. It is thus possible to
better account for the bivariate structure of the data while maintaining
fast computation. In bivariate winsorization, a bivariate normal
distribution is assumed and the data are shrunken towards the boundary of a
tolerance ellipse with coverage probability prob
. The boundary of
this ellipse is thereby given by all points that have a squared Mahalanobis
distance equal to the quantile of the
distribution given by
prob
. Furthermore, the initial correlation
matrix required for the Mahalanobis distances is computed based on adjusted
univariate winsorization.
The robust correlation estimate.
Andreas Alfons, based on code by Jafar A. Khan, Stefan Van Aelst and Ruben H. Zamar
Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299. doi:10.1198/016214507000000950
## generate data library("mvtnorm") set.seed(1234) # for reproducibility Sigma <- matrix(c(1, 0.6, 0.6, 1), 2, 2) xy <- rmvnorm(100, sigma=Sigma) x <- xy[, 1] y <- xy[, 2] ## introduce outlier x[1] <- x[1] * 10 y[1] <- y[1] * (-5) ## compute correlation cor(x, y) corHuber(x, y)
## generate data library("mvtnorm") set.seed(1234) # for reproducibility Sigma <- matrix(c(1, 0.6, 0.6, 1), 2, 2) xy <- rmvnorm(100, sigma=Sigma) x <- xy[, 1] y <- xy[, 2] ## introduce outlier x[1] <- x[1] * 10 y[1] <- y[1] * (-5) ## compute correlation cor(x, y) corHuber(x, y)
Produce a plot of the values of the optimality criterion for a sequence of regression models, such as submodels along a robust or groupwise least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter.
critPlot(object, ...) ## S3 method for class 'seqModel' critPlot(object, which = c("line", "dot"), ...) ## S3 method for class 'tslars' critPlot(object, p, which = c("line", "dot"), ...) ## S3 method for class 'sparseLTS' critPlot( object, which = c("line", "dot"), fit = c("reweighted", "raw", "both"), ... ) ## S3 method for class 'perrySeqModel' critPlot(object, which = c("line", "dot", "box", "density"), ...) ## S3 method for class 'perrySparseLTS' critPlot( object, which = c("line", "dot", "box", "density"), fit = c("reweighted", "raw", "both"), ... ) ## S3 method for class 'setupCritPlot' critPlot(object, ...)
critPlot(object, ...) ## S3 method for class 'seqModel' critPlot(object, which = c("line", "dot"), ...) ## S3 method for class 'tslars' critPlot(object, p, which = c("line", "dot"), ...) ## S3 method for class 'sparseLTS' critPlot( object, which = c("line", "dot"), fit = c("reweighted", "raw", "both"), ... ) ## S3 method for class 'perrySeqModel' critPlot(object, which = c("line", "dot", "box", "density"), ...) ## S3 method for class 'perrySparseLTS' critPlot( object, which = c("line", "dot", "box", "density"), fit = c("reweighted", "raw", "both"), ... ) ## S3 method for class 'setupCritPlot' critPlot(object, ...)
object |
the model fit to be plotted, , or an object containing
all necessary information for plotting (as generated by
|
... |
additional arguments to be passed down, eventually to
|
which |
a character string specifying the type of plot. Possible
values are |
p |
an integer giving the lag length for which to produce the plot (the default is to use the optimal lag length). |
fit |
a character string specifying for which estimator to produce the
plot. Possible values are |
An object of class "ggplot"
(see ggplot
).
Function perryPlot
is used to create the plot,
even if the optimality criterion does not correspond to resampling-based p
rediction error estimation. While this can be seen as as a misuse of its
functionality, it ensures that all optimality criteria are displayed in the
same way.
Andreas Alfons
ggplot
, perryPlot
,
rlars
, grplars
, rgrplars
,
tslarsP
, rtslarsP
, tslars
,
rtslars
, sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # create plot critPlot(fitRlars) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # create plot critPlot(fitSparseLTS) critPlot(fitSparseLTS, fit = "both")
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # create plot critPlot(fitRlars) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # create plot critPlot(fitSparseLTS) critPlot(fitSparseLTS, fit = "both")
Produce diagnostic plots for a sequence of regression models, such as submodels along a robust least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter. Four plots are currently implemented.
diagnosticPlot(object, ...) ## S3 method for class 'seqModel' diagnosticPlot(object, s = NA, covArgs = list(), ...) ## S3 method for class 'perrySeqModel' diagnosticPlot(object, covArgs = list(), ...) ## S3 method for class 'tslars' diagnosticPlot(object, p, s = NA, covArgs = list(), ...) ## S3 method for class 'sparseLTS' diagnosticPlot( object, s = NA, fit = c("reweighted", "raw", "both"), covArgs = list(), ... ) ## S3 method for class 'perrySparseLTS' diagnosticPlot( object, fit = c("reweighted", "raw", "both"), covArgs = list(), ... ) ## S3 method for class 'setupDiagnosticPlot' diagnosticPlot( object, which = c("all", "rqq", "rindex", "rfit", "rdiag"), ask = (which == "all"), facets = object$facets, size = c(2, 4), id.n = NULL, ... )
diagnosticPlot(object, ...) ## S3 method for class 'seqModel' diagnosticPlot(object, s = NA, covArgs = list(), ...) ## S3 method for class 'perrySeqModel' diagnosticPlot(object, covArgs = list(), ...) ## S3 method for class 'tslars' diagnosticPlot(object, p, s = NA, covArgs = list(), ...) ## S3 method for class 'sparseLTS' diagnosticPlot( object, s = NA, fit = c("reweighted", "raw", "both"), covArgs = list(), ... ) ## S3 method for class 'perrySparseLTS' diagnosticPlot( object, fit = c("reweighted", "raw", "both"), covArgs = list(), ... ) ## S3 method for class 'setupDiagnosticPlot' diagnosticPlot( object, which = c("all", "rqq", "rindex", "rfit", "rdiag"), ask = (which == "all"), facets = object$facets, size = c(2, 4), id.n = NULL, ... )
object |
the model fit for which to produce diagnostic plots, or an
object containing all necessary information for plotting (as generated
by |
... |
additional arguments to be passed down, eventually to
|
s |
for the |
covArgs |
a list of arguments to be passed to
|
p |
an integer giving the lag length for which to produce the plot (the default is to use the optimal lag length). |
fit |
a character string specifying for which fit to produce
diagnostic plots. Possible values are |
which |
a character string indicating which plot to show. Possible
values are |
ask |
a logical indicating whether the user should be asked before
each plot (see |
facets |
a faceting formula to override the default behavior. If
supplied, |
size |
a numeric vector of length two giving the point and label size, respectively. |
id.n |
an integer giving the number of the most extreme observations to be identified by a label. The default is to use the number of identified outliers, which can be different for the different plots. See “Details” for more information. |
In the normal Q-Q plot of the standardized residuals, a reference line is
drawn through the first and third quartile. The id.n
observations
with the largest distances from that line are identified by a label (the
observation number). The default for id.n
is the number of
regression outliers, i.e., the number of observations whose residuals are
too large (cf. weights
).
In the plots of the standardized residuals versus their index or the fitted
values, horizontal reference lines are drawn at 0 and +/-2.5. The
id.n
observations with the largest absolute values of the
standardized residuals are identified by a label (the observation
number). The default for id.n
is the number of regression outliers,
i.e., the number of observations whose absolute residuals are too large (cf.
weights
).
For the regression diagnostic plot, the robust Mahalanobis distances of the
predictor variables are computed via the minimum covariance determinant
(MCD) estimator based on only those predictors with non-zero coefficients
(see covMcd
). Horizontal reference lines are
drawn at +/-2.5 and a vertical reference line is drawn at the upper 97.5%
quantile of the distribution with
degrees of freedom, where
denotes the number of predictors with
non-zero coefficients. The
id.n
observations with the largest
absolute values of the standardized residuals and/or largest robust
Mahalanobis distances are identified by a label (the observation number).
The default for id.n
is the number of all outliers: regression
outliers (i.e., observations whose absolute residuals are too large, cf.
weights
) and leverage points (i.e.,
observations with robust Mahalanobis distance larger than the 97.5%
quantile of the distribution with
degrees of freedom).
Note that the argument alpha
for controlling the subset size
behaves differently for sparseLTS
than for
covMcd
. For sparseLTS
, the subset
size is determined by the fraction
alpha
of the number of
observations . For
covMcd
, on the other
hand, the subset size also depends on the number of variables (see
h.alpha.n
). However, the "sparseLTS"
and
"perrySparseLTS"
methods attempt to compute the MCD using the same
subset size that is used to compute the sparse least trimmed squares
regressions. This may not be possible if the number of selected variables
is large compared to the number of observations. In such cases,
setupDiagnosticPlot
returns NA
s for the robust
Mahalanobis distances, and the regression diagnostic plot fails.
If only one plot is requested, an object of class "ggplot"
(see
ggplot
), otherwise a list of such objects.
Andreas Alfons
ggplot
, rlars
,
grplars
, rgrplars
, tslarsP
,
rtslarsP
, tslars
, rtslars
,
sparseLTS
, plot.lts
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # create plot diagnosticPlot(fitRlars) ## sparse LTS # fit model fitSparseLTS <- sparseLTS(x, y, lambda = 0.05, mode = "fraction") # create plot diagnosticPlot(fitSparseLTS) diagnosticPlot(fitSparseLTS, fit = "both")
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # create plot diagnosticPlot(fitRlars) ## sparse LTS # fit model fitSparseLTS <- sparseLTS(x, y, lambda = 0.05, mode = "fraction") # create plot diagnosticPlot(fitSparseLTS) diagnosticPlot(fitSparseLTS, fit = "both")
Extract fitted values from a sequence of regression models, such as submodels along a robust or groupwise least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter.
## S3 method for class 'seqModel' fitted(object, s = NA, drop = !is.null(s), ...) ## S3 method for class 'tslars' fitted(object, p, ...) ## S3 method for class 'perrySeqModel' fitted(object, ...) ## S3 method for class 'sparseLTS' fitted( object, s = NA, fit = c("reweighted", "raw", "both"), drop = !is.null(s), ... )
## S3 method for class 'seqModel' fitted(object, s = NA, drop = !is.null(s), ...) ## S3 method for class 'tslars' fitted(object, p, ...) ## S3 method for class 'perrySeqModel' fitted(object, ...) ## S3 method for class 'sparseLTS' fitted( object, s = NA, fit = c("reweighted", "raw", "both"), drop = !is.null(s), ... )
object |
the model fit from which to extract fitted values. |
s |
for the |
drop |
a logical indicating whether to reduce the dimension to a vector in case of only one step. |
... |
for the |
p |
an integer giving the lag length for which to extract fitted values (the default is to use the optimal lag length). |
fit |
a character string specifying which fitted values to extract.
Possible values are |
A numeric vector or matrix containing the requested fitted values.
Andreas Alfons
fitted
, rlars
,
grplars
, rgrplars
, tslarsP
,
rtslarsP
, tslars
, rtslars
,
sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract fitted values fitted(fitRlars) head(fitted(fitRlars, s = 1:5)) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract fitted values fitted(fitSparseLTS) head(fitted(fitSparseLTS, fit = "both")) head(fitted(fitSparseLTS, s = NULL)) head(fitted(fitSparseLTS, fit = "both", s = NULL))
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract fitted values fitted(fitRlars) head(fitted(fitRlars, s = 1:5)) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract fitted values fitted(fitSparseLTS) head(fitted(fitSparseLTS, fit = "both")) head(fitted(fitSparseLTS, s = NULL)) head(fitted(fitSparseLTS, fit = "both", s = NULL))
Extract the robust scale estimate of the residuals from a robust regression model.
getScale(x, ...) ## S3 method for class 'seqModel' getScale(x, s = NA, ...) ## S3 method for class 'sparseLTS' getScale(x, s = NA, fit = c("reweighted", "raw", "both"), ...)
getScale(x, ...) ## S3 method for class 'seqModel' getScale(x, s = NA, ...) ## S3 method for class 'sparseLTS' getScale(x, s = NA, fit = c("reweighted", "raw", "both"), ...)
x |
the model fit from which to extract the robust residual scale estimate. |
... |
additional arguments to be passed down to methods. |
s |
for the |
fit |
a character string specifying from which fit to extract the
robust residual scale estimate. Possible values are |
Methods are implemented for models of class "lmrob"
(see
lmrob
), "lts"
(see
ltsReg
), "rlm"
(see
rlm
), "seqModel"
(see rlars
) and
"sparseLTS"
(see sparseLTS
). The default method
computes the MAD of the residuals.
A numeric vector or matrix giving the robust residual scale estimates for the requested model fits.
Andreas Alfons
AIC
, lmrob
,
ltsReg
, rlm
,
rlars
, sparseLTS
data("coleman") fit <- lmrob(Y ~ ., data=coleman) getScale(fit)
data("coleman") fit <- lmrob(Y ~ ., data=coleman) getScale(fit)
(Robustly) sequence groups of candidate predictors according to their predictive content and find the optimal model along the sequence.
grplars(x, ...) ## S3 method for class 'formula' grplars(formula, data, ...) ## S3 method for class 'data.frame' grplars(x, y, ...) ## Default S3 method: grplars( x, y, sMax = NA, assign, fit = TRUE, s = c(0, sMax), crit = c("BIC", "PE"), splits = foldControl(), cost = rmspe, costArgs = list(), selectBest = c("hastie", "min"), seFactor = 1, ncores = 1, cl = NULL, seed = NULL, model = TRUE, ... ) rgrplars(x, ...) ## S3 method for class 'formula' rgrplars(formula, data, ...) ## S3 method for class 'data.frame' rgrplars(x, y, ...) ## Default S3 method: rgrplars( x, y, sMax = NA, assign, centerFun = median, scaleFun = mad, regFun = lmrob, regArgs = list(), combine = c("min", "euclidean", "mahalanobis"), const = 2, prob = 0.95, fit = TRUE, s = c(0, sMax), crit = c("BIC", "PE"), splits = foldControl(), cost = rtmspe, costArgs = list(), selectBest = c("hastie", "min"), seFactor = 1, ncores = 1, cl = NULL, seed = NULL, model = TRUE, ... )
grplars(x, ...) ## S3 method for class 'formula' grplars(formula, data, ...) ## S3 method for class 'data.frame' grplars(x, y, ...) ## Default S3 method: grplars( x, y, sMax = NA, assign, fit = TRUE, s = c(0, sMax), crit = c("BIC", "PE"), splits = foldControl(), cost = rmspe, costArgs = list(), selectBest = c("hastie", "min"), seFactor = 1, ncores = 1, cl = NULL, seed = NULL, model = TRUE, ... ) rgrplars(x, ...) ## S3 method for class 'formula' rgrplars(formula, data, ...) ## S3 method for class 'data.frame' rgrplars(x, y, ...) ## Default S3 method: rgrplars( x, y, sMax = NA, assign, centerFun = median, scaleFun = mad, regFun = lmrob, regArgs = list(), combine = c("min", "euclidean", "mahalanobis"), const = 2, prob = 0.95, fit = TRUE, s = c(0, sMax), crit = c("BIC", "PE"), splits = foldControl(), cost = rtmspe, costArgs = list(), selectBest = c("hastie", "min"), seFactor = 1, ncores = 1, cl = NULL, seed = NULL, model = TRUE, ... )
x |
a matrix or data frame containing the candidate predictors. |
... |
additional arguments to be passed down. |
formula |
a formula describing the full model. |
data |
an optional data frame, list or environment (or object coercible
to a data frame by |
y |
a numeric vector containing the response. |
sMax |
an integer giving the number of predictor groups to be
sequenced. If it is |
assign |
an integer vector giving the predictor group to which each predictor variable belongs. |
fit |
a logical indicating whether to fit submodels along the sequence
( |
s |
an integer vector of length two giving the first and last
step along the sequence for which to compute submodels. The default
is to start with a model containing only an intercept (step 0) and
iteratively add all groups along the sequence (step |
crit |
a character string specifying the optimality criterion to be
used for selecting the final model. Possible values are |
splits |
an object giving data splits to be used for prediction error
estimation (see |
cost |
a cost function measuring prediction loss (see
|
costArgs |
a list of additional arguments to be passed to the
prediction loss function |
selectBest , seFactor
|
arguments specifying a criterion for selecting
the best model (see |
ncores |
a positive integer giving the number of processor cores to be
used for parallel computing (the default is 1 for no parallelization). If
this is set to |
cl |
a parallel cluster for parallel computing as generated by
|
seed |
optional initial seed for the random number generator (see
|
model |
a logical indicating whether the model data should be included in the returned object. |
centerFun |
a function to compute a robust estimate for the center
(defaults to |
scaleFun |
a function to compute a robust estimate for the scale
(defaults to |
regFun |
a function to compute robust linear regressions that can be
interpreted as weighted least squares (defaults to
|
regArgs |
a list of arguments to be passed to |
combine |
a character string specifying how to combine the data
cleaning weights from the robust regressions with each predictor group.
Possible values are |
const |
numeric; tuning constant for multivariate winsorization to be used in the initial corralation estimates based on adjusted univariate winsorization (defaults to 2). |
prob |
numeric; probability for the quantile of the
|
If fit
is FALSE
, an integer vector containing the indices of
the sequenced predictor groups.
Else if crit
is "PE"
, an object of class
"perrySeqModel"
(inheriting from classes "perryTuning"
,
see perryTuning
). It contains information on the
prediction error criterion, and includes the final model as component
finalModel
.
Otherwise an object of class "grplars"
(inheriting from class
"seqModel"
) with the following components:
active
an integer vector containing the sequence of predictor groups.
s
an integer vector containing the steps for which submodels along the sequence have been computed.
coefficients
a numeric matrix in which each column contains the regression coefficients of the corresponding submodel along the sequence.
fitted.values
a numeric matrix in which each column contains the fitted values of the corresponding submodel along the sequence.
residuals
a numeric matrix in which each column contains the residuals of the corresponding submodel along the sequence.
df
an integer vector containing the degrees of freedom of the submodels along the sequence (i.e., the number of estimated coefficients).
robust
a logical indicating whether a robust fit was computed.
scale
a numeric vector giving the robust residual scale estimates for the submodels along the sequence (only returned for a robust fit).
crit
an object of class "bicSelect"
containing the
BIC values and indicating the final model (only returned if argument
crit
is "BIC"
and argument s
indicates more than one
step along the sequence).
muX
a numeric vector containing the center estimates of the predictor variables.
sigmaX
a numeric vector containing the scale estimates of the predictor variables.
muY
numeric; the center estimate of the response.
sigmaY
numeric; the scale estimate of the response.
x
the matrix of candidate predictors (if model
is
TRUE
).
y
the response (if model
is TRUE
).
assign
an integer vector giving the predictor group to which each predictor variable belongs.
w
a numeric vector giving the data cleaning weights (only returned for a robust fit).
call
the matched function call.
Andreas Alfons
Alfons, A., Croux, C. and Gelper, S. (2016) Robust groupwise least angle regression. Computational Statistics & Data Analysis, 93, 421–435. doi:10.1016/j.csda.2015.02.007
coef
,
fitted
,
plot
,
predict
,
residuals
,
rstandard
,
lmrob
data("TopGear") # keep complete observations keep <- complete.cases(TopGear) TopGear <- TopGear[keep, ] # remove information on car model info <- TopGear[, 1:3] TopGear <- TopGear[, -(1:3)] # log-transform price TopGear$Price <- log(TopGear$Price) # robust groupwise LARS rgrplars(MPG ~ ., data = TopGear, sMax = 15)
data("TopGear") # keep complete observations keep <- complete.cases(TopGear) TopGear <- TopGear[keep, ] # remove information on car model info <- TopGear[, 1:3] TopGear <- TopGear[, -(1:3)] # log-transform price TopGear$Price <- log(TopGear$Price) # robust groupwise LARS rgrplars(MPG ~ ., data = TopGear, sMax = 15)
Use bivariate winsorization to estimate the smallest value of the penalty parameter for sparse least trimmed squares regression that sets all coefficients to zero.
lambda0( x, y, normalize = TRUE, intercept = TRUE, const = 2, prob = 0.95, tol = .Machine$double.eps^0.5, eps = .Machine$double.eps, ... )
lambda0( x, y, normalize = TRUE, intercept = TRUE, const = 2, prob = 0.95, tol = .Machine$double.eps^0.5, eps = .Machine$double.eps, ... )
x |
a numeric matrix containing the predictor variables. |
y |
a numeric vector containing the response variable. |
normalize |
a logical indicating whether the winsorized predictor
variables should be normalized to have unit |
intercept |
a logical indicating whether a constant term should be
included in the model (the default is |
const |
numeric; tuning constant to be used in univariate winsorization (defaults to 2). |
prob |
numeric; probability for the quantile of the
|
tol |
a small positive numeric value used to determine singularity
issues in the computation of correlation estimates for bivariate
winsorization (see |
eps |
a small positive numeric value used to determine whether the robust scale estimate of a variable is too small (an effective zero). |
... |
additional arguments to be passed to
|
The estimation procedure is inspired by the calculation of the respective penalty parameter in the first step of the classical LARS algorithm. First, two-dimensional data blocks consisting of the response with each predictor variable are cleaned via bivariate winsorization. For each block, the following computations are then performed. If an intercept is included in the model, the cleaned response is centered and the corresponding cleaned predictor is centered and scaled to have unit norm. Otherwise the variables are not centered, but the predictor is scaled to have unit norm. Finally, the dot product of the response and the corresponding predictor is computed. The largest absolute value of those dot products, rescaled to fit the parametrization of the sparse LTS definition, yields the estimate of the smallest penalty parameter that sets all coefficients to zero.
A robust estimate of the smallest value of the penalty parameter for sparse LTS regression that sets all coefficients to zero.
Andreas Alfons
Alfons, A., Croux, C. and Gelper, S. (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1), 226–248. doi:10.1214/12-AOAS575
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. The Annals of Statistics, 32(2), 407–499. doi:10.1214/009053604000000067
Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299. doi:10.1198/016214507000000950
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## estimate smallest value of the penalty parameter ## that sets all coefficients to 0 lambda0(x, y)
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## estimate smallest value of the penalty parameter ## that sets all coefficients to 0 lambda0(x, y)
The data set is a pre-processed version of the NCI-60 cancer cell panel as used in Alfons, Croux & Gelper (2013). One observation was removed since all values in the gene expression data were missing.
data("nci60")
data("nci60")
Protein and gene expression data on 59 observations are stored in two separate matrices:
protein
a matrix containing protein expressions based on antibodies (162 columns), acquired via reverse-phase protein lysate arrays and log2 transformed.
gene
a matrix containing gene expression data (22283 columns), obtained with an Affymetrix HG-U133A chip and normalized with the GCRMA method.
In addition, meta information on the proteins, genes, and cancer cell lines is stored in three separate data frames:
proteinInfo
a data frame with 162 rows and the following 4
columns: Experiment
(the name of the experiment for collecting
the data), Probe
(the name of the individual probe), Symbol
(the symbol of the protein in Human Genome Organisation (HUGO)
nomenclature), and ID
(identifier of the protein per the National
Center for Biotechnology Information (NCBI) Entrez database). The rows of
this data frame correspond to the columns of the matrix protein
.
geneInfo
a data frame with 22283 rows and the following 4
columns: Experiment
(the name of the experiment for collecting
the data), Probe
(the name of the individual probe), Symbol
(the symbol of the gene in Human Genome Organisation (HUGO) nomenclature),
and ID
(identifier of the gene per the National Center for
Biotechnology Information (NCBI) Entrez database). The rows of this
data frame correspond to the columns of the matrix gene
.
cellLineInfo
a data frame with 59 rows and 15 columns
containing various information on the cancer cell lines, such as tissue of
origin and histology, or age and sex of the patient. The rows of this data
frame correspond to the rows of the matrices protein
and
gene
.
The original data were downloaded from https://discover.nci.nih.gov/cellminer/ on 2012-01-27.
The exact version of the data used in Alfons, Croux & Gelper (2013) can be
obtained from https://github.com/aalfons/nci60, together with the
script for pre-processing. The data in package robustHD differ in
that the matrix of the gene expressions is called gene
and that they
include the three data frames with meta information on proteins, genes, and
cancer cell lines.
Reinhold, W.C., Sunshine, M., Liu, H., Varma, S., Kohn, K.W., Morris, J., Doroshow, J. and Pommier, Y. (2012) CellMiner: A Web-Based Suite of Genomic and Pharmacologic Tools to Explore Transcript and Drug Patterns in the NCI-60 Cell Line Set. Cancer Research, 72(14), 3499–3511. doi:10.1158/0008-5472.CAN-12-1370
Alfons, A., Croux, C. and Gelper, S. (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1), 226–248. doi:10.1214/12-AOAS575
# load data data("nci60") # define response variable y <- protein[, 92] # screen most correlated predictor variables correlations <- apply(gene, 2, corHuber, y) keep <- partialOrder(abs(correlations), 100, decreasing = TRUE) X <- gene[, keep]
# load data data("nci60") # define response variable y <- protein[, 92] # screen most correlated predictor variables correlations <- apply(gene, 2, corHuber, y) keep <- partialOrder(abs(correlations), 100, decreasing = TRUE) X <- gene[, keep]
Obtain a partial permutation that rearranges the smallest (largest) elements of a vector into ascending (descending) order.
partialOrder(x, h, decreasing = FALSE)
partialOrder(x, h, decreasing = FALSE)
x |
a numeric vector of which to find the order of the smallest or largest elements. |
h |
an integer specifying how many (smallest or largest) elements to order. |
decreasing |
a logical indicating whether the sort order should be
increasing ( |
An integer vector containing the indices of the h
smallest or
largest elements of x
.
Andreas Alfons
# randomly draw some values values <- rnorm(10) values # find largest observations partialOrder(values, 5, decreasing = TRUE)
# randomly draw some values values <- rnorm(10) values # find largest observations partialOrder(values, 5, decreasing = TRUE)
Estimate the prediction error of a previously fit sequential regression model such as a robust least angle regression model or a sparse least trimmed squares regression model.
## S3 method for class 'seqModel' perry( object, splits = foldControl(), cost, ncores = 1, cl = NULL, seed = NULL, ... ) ## S3 method for class 'sparseLTS' perry( object, splits = foldControl(), fit = c("reweighted", "raw", "both"), cost = rtmspe, ncores = 1, cl = NULL, seed = NULL, ... )
## S3 method for class 'seqModel' perry( object, splits = foldControl(), cost, ncores = 1, cl = NULL, seed = NULL, ... ) ## S3 method for class 'sparseLTS' perry( object, splits = foldControl(), fit = c("reweighted", "raw", "both"), cost = rtmspe, ncores = 1, cl = NULL, seed = NULL, ... )
object |
the model fit for which to estimate the prediction error. |
splits |
an object of class |
cost |
a cost function measuring prediction loss. It should expect
vectors to be passed as its first two arguments, the first corresponding to
the observed values of the response and the second to the predicted values,
and must return a non-negative scalar value. The default is to use the root
mean squared prediction error for non-robust models and the root trimmed
mean squared prediction error for robust models (see
|
ncores |
a positive integer giving the number of processor cores to be
used for parallel computing (the default is 1 for no parallelization). If
this is set to |
cl |
a parallel cluster for parallel computing as generated by
|
seed |
optional initial seed for the random number generator (see
|
... |
additional arguments to be passed to the prediction loss
function |
fit |
a character string specifying for which fit to estimate the
prediction error. Possible values are |
The prediction error can be estimated via (repeated) -fold
cross-validation, (repeated) random splitting (also known as random
subsampling or Monte Carlo cross-validation), or the bootstrap. In
each iteration, the optimal model is thereby selected from the training
data and used to make predictions for the test data.
An object of class "perry"
with the following components:
pe
a numeric vector containing the estimated prediction errors for the requested model fits. In case of more than one replication, this gives the average value over all replications.
se
a numeric vector containing the estimated standard errors of the prediction loss for the requested model fits.
reps
a numeric matrix in which each column contains the estimated prediction errors from all replications for the requested model fits. This is only returned in case of more than one replication.
splits
an object giving the data splits used to estimate the prediction error.
y
the response.
yHat
a list containing the predicted values from all replications.
call
the matched function call.
Andreas Alfons
rlars
, sparseLTS
,
predict
, perry
,
cost
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## fit and evaluate robust LARS model fitRlars <- rlars(x, y, sMax = 10) perry(fitRlars) ## fit and evaluate sparse LTS model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") perry(fitSparseLTS)
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## fit and evaluate robust LARS model fitRlars <- rlars(x, y, sMax = 10) perry(fitRlars) ## fit and evaluate sparse LTS model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") perry(fitSparseLTS)
Produce a plot of the coefficients, the values of the optimality criterion, or diagnostic plots for a sequence of regression models, such as submodels along a robust or groupwise least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter.
## S3 method for class 'seqModel' plot(x, method = c("coefficients", "crit", "diagnostic"), ...) ## S3 method for class 'perrySeqModel' plot(x, method = c("crit", "diagnostic"), ...) ## S3 method for class 'tslars' plot(x, p, method = c("coefficients", "crit", "diagnostic"), ...) ## S3 method for class 'sparseLTS' plot(x, method = c("coefficients", "crit", "diagnostic"), ...) ## S3 method for class 'perrySparseLTS' plot(x, method = c("crit", "diagnostic"), ...)
## S3 method for class 'seqModel' plot(x, method = c("coefficients", "crit", "diagnostic"), ...) ## S3 method for class 'perrySeqModel' plot(x, method = c("crit", "diagnostic"), ...) ## S3 method for class 'tslars' plot(x, p, method = c("coefficients", "crit", "diagnostic"), ...) ## S3 method for class 'sparseLTS' plot(x, method = c("coefficients", "crit", "diagnostic"), ...) ## S3 method for class 'perrySparseLTS' plot(x, method = c("crit", "diagnostic"), ...)
x |
the model fit to be plotted. |
method |
a character string specifying the type of plot. Possible
values are |
... |
additional arguments to be passed down. |
p |
an integer giving the lag length for which to produce the plot (the default is to use the optimal lag length). |
An object of class "ggplot"
(see ggplot
).
Andreas Alfons
coefPlot
, critPlot
,
diagnosticPlot
, rlars
, grplars
,
rgrplars
, tslarsP
, rtslarsP
,
tslars
, rtslars
, sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # create plots plot(fitRlars, method = "coef") plot(fitRlars, method = "crit") plot(fitRlars, method = "diagnostic") ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # create plots plot(fitSparseLTS, method = "coef") plot(fitSparseLTS, method = "crit") plot(fitSparseLTS, method = "diagnostic")
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # create plots plot(fitRlars, method = "coef") plot(fitRlars, method = "crit") plot(fitRlars, method = "diagnostic") ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # create plots plot(fitSparseLTS, method = "coef") plot(fitSparseLTS, method = "crit") plot(fitSparseLTS, method = "diagnostic")
Make predictions from a sequence of regression models, such as submodels
along a robust or groupwise least angle regression sequence, or sparse least
trimmed squares regression models for a grid of values for the penalty
parameter. For autoregressive time series models with exogenous inputs,
-step ahead forecasts are performed.
## S3 method for class 'seqModel' predict(object, newdata, s = NA, ...) ## S3 method for class 'tslarsP' predict(object, newdata, ...) ## S3 method for class 'tslars' predict(object, newdata, p, ...) ## S3 method for class 'sparseLTS' predict(object, newdata, s = NA, fit = c("reweighted", "raw", "both"), ...)
## S3 method for class 'seqModel' predict(object, newdata, s = NA, ...) ## S3 method for class 'tslarsP' predict(object, newdata, ...) ## S3 method for class 'tslars' predict(object, newdata, p, ...) ## S3 method for class 'sparseLTS' predict(object, newdata, s = NA, fit = c("reweighted", "raw", "both"), ...)
object |
the model fit from which to make predictions. |
newdata |
new data for the predictors. If the model fit was computed with the formula method, this should be a data frame from which to extract the predictor variables. Otherwise this should be a matrix containing the same variables as the predictor matrix used to fit the model (including a column of ones to account for the intercept). |
s |
for the |
... |
for the |
p |
an integer giving the lag length for which to make predictions (the default is to use the optimal lag length). |
fit |
a character string specifying for which fit to make
predictions. Possible values are |
The newdata
argument defaults to the matrix of predictors used to fit
the model such that the fitted values are computed.
For autoregressive time series models with exogenous inputs with forecast
horizon , the
most recent observations of the predictors are
omitted from fitting the model since there are no corresponding values for
the response. Hence the
newdata
argument for predict.tslarsP
and predict.tslars
defaults to those observations of the
predictors.
A numeric vector or matrix containing the requested predicted values.
Andreas Alfons
predict
, rlars
,
grplars
, rgrplars
, tslarsP
,
rtslarsP
, tslars
, rtslars
,
sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # compute fitted values via predict method predict(fitRlars) head(predict(fitRlars, s = 1:5)) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # compute fitted values via predict method predict(fitSparseLTS) head(predict(fitSparseLTS, fit = "both")) head(predict(fitSparseLTS, s = NULL)) head(predict(fitSparseLTS, fit = "both", s = NULL))
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # compute fitted values via predict method predict(fitRlars) head(predict(fitRlars, s = 1:5)) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # compute fitted values via predict method predict(fitSparseLTS) head(predict(fitSparseLTS, fit = "both")) head(predict(fitSparseLTS, s = NULL)) head(predict(fitSparseLTS, fit = "both", s = NULL))
Extract residuals from a sequence of regression models, such as submodels along a robust or groupwise least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter.
## S3 method for class 'seqModel' residuals(object, s = NA, standardized = FALSE, drop = !is.null(s), ...) ## S3 method for class 'tslars' residuals(object, p, ...) ## S3 method for class 'perrySeqModel' residuals(object, ...) ## S3 method for class 'sparseLTS' residuals( object, s = NA, fit = c("reweighted", "raw", "both"), standardized = FALSE, drop = !is.null(s), ... )
## S3 method for class 'seqModel' residuals(object, s = NA, standardized = FALSE, drop = !is.null(s), ...) ## S3 method for class 'tslars' residuals(object, p, ...) ## S3 method for class 'perrySeqModel' residuals(object, ...) ## S3 method for class 'sparseLTS' residuals( object, s = NA, fit = c("reweighted", "raw", "both"), standardized = FALSE, drop = !is.null(s), ... )
object |
the model fit from which to extract residuals. |
s |
for the |
standardized |
a logical indicating whether the residuals should be
standardized (the default is |
drop |
a logical indicating whether to reduce the dimension to a vector in case of only one step. |
... |
for the |
p |
an integer giving the lag length for which to extract residuals (the default is to use the optimal lag length). |
fit |
a character string specifying which residuals to extract.
Possible values are |
A numeric vector or matrix containing the requested residuals.
Andreas Alfons
rlars
, grplars
, rgrplars
,
tslarsP
, rtslarsP
, tslars
,
rtslars
, sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract residuals residuals(fitRlars) head(residuals(fitRlars, s = 1:5)) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract residuals residuals(fitSparseLTS) head(residuals(fitSparseLTS, fit = "both")) head(residuals(fitSparseLTS, s = NULL)) head(residuals(fitSparseLTS, fit = "both", s = NULL))
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract residuals residuals(fitRlars) head(residuals(fitRlars, s = 1:5)) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract residuals residuals(fitSparseLTS) head(residuals(fitSparseLTS, fit = "both")) head(residuals(fitSparseLTS, s = NULL)) head(residuals(fitSparseLTS, fit = "both", s = NULL))
Robustly sequence candidate predictors according to their predictive content and find the optimal model along the sequence.
rlars(x, ...) ## S3 method for class 'formula' rlars(formula, data, ...) ## Default S3 method: rlars( x, y, sMax = NA, centerFun = median, scaleFun = mad, winsorize = FALSE, const = 2, prob = 0.95, fit = TRUE, s = c(0, sMax), regFun = lmrob, regArgs = list(), crit = c("BIC", "PE"), splits = foldControl(), cost = rtmspe, costArgs = list(), selectBest = c("hastie", "min"), seFactor = 1, ncores = 1, cl = NULL, seed = NULL, model = TRUE, tol = .Machine$double.eps^0.5, ... )
rlars(x, ...) ## S3 method for class 'formula' rlars(formula, data, ...) ## Default S3 method: rlars( x, y, sMax = NA, centerFun = median, scaleFun = mad, winsorize = FALSE, const = 2, prob = 0.95, fit = TRUE, s = c(0, sMax), regFun = lmrob, regArgs = list(), crit = c("BIC", "PE"), splits = foldControl(), cost = rtmspe, costArgs = list(), selectBest = c("hastie", "min"), seFactor = 1, ncores = 1, cl = NULL, seed = NULL, model = TRUE, tol = .Machine$double.eps^0.5, ... )
x |
a matrix or data frame containing the candidate predictors. |
... |
additional arguments to be passed down. For the default
method, additional arguments to be passed down to
|
formula |
a formula describing the full model. |
data |
an optional data frame, list or environment (or object coercible
to a data frame by |
y |
a numeric vector containing the response. |
sMax |
an integer giving the number of predictors to be sequenced. If
it is |
centerFun |
a function to compute a robust estimate for the center
(defaults to |
scaleFun |
a function to compute a robust estimate for the scale
(defaults to |
winsorize |
a logical indicating whether to clean the full data set by
multivariate winsorization, i.e., to perform data cleaning RLARS instead of
plug-in RLARS (defaults to |
const |
numeric; tuning constant to be used in the initial corralation estimates based on adjusted univariate winsorization (defaults to 2). |
prob |
numeric; probability for the quantile of the
|
fit |
a logical indicating whether to fit submodels along the sequence
( |
s |
an integer vector of length two giving the first and last step
along the sequence for which to compute submodels. The default is to start
with a model containing only an intercept (step 0) and iteratively add all
variables along the sequence (step |
regFun |
a function to compute robust linear regressions along the
sequence (defaults to |
regArgs |
a list of arguments to be passed to |
crit |
a character string specifying the optimality criterion to be
used for selecting the final model. Possible values are |
splits |
an object giving data splits to be used for prediction error
estimation (see |
cost |
a cost function measuring prediction loss (see
|
costArgs |
a list of additional arguments to be passed to the
prediction loss function |
selectBest , seFactor
|
arguments specifying a criterion for selecting
the best model (see |
ncores |
a positive integer giving the number of processor cores to be
used for parallel computing (the default is 1 for no parallelization). If
this is set to |
cl |
a parallel cluster for parallel computing as generated by
|
seed |
optional initial seed for the random number generator (see
|
model |
a logical indicating whether the model data should be included in the returned object. |
tol |
a small positive numeric value. This is used in bivariate winsorization to determine whether the initial estimate from adjusted univariate winsorization is close to 1 in absolute value. In this case, bivariate winsorization would fail since the points form almost a straight line, and the initial estimate is returned. |
If fit
is FALSE
, an integer vector containing the indices of
the sequenced predictors.
Else if crit
is "PE"
, an object of class
"perrySeqModel"
(inheriting from class "perrySelect"
,
see perrySelect
). It contains information on the
prediction error criterion, and includes the final model as component
finalModel
.
Otherwise an object of class "rlars"
(inheriting from class
"seqModel"
) with the following components:
active
an integer vector containing the indices of the sequenced predictors.
s
an integer vector containing the steps for which submodels along the sequence have been computed.
coefficients
a numeric matrix in which each column contains the regression coefficients of the corresponding submodel along the sequence.
fitted.values
a numeric matrix in which each column contains the fitted values of the corresponding submodel along the sequence.
residuals
a numeric matrix in which each column contains the residuals of the corresponding submodel along the sequence.
df
an integer vector containing the degrees of freedom of the submodels along the sequence (i.e., the number of estimated coefficients).
robust
a logical indicating whether a robust fit was
computed (TRUE
for "rlars"
models).
scale
a numeric vector giving the robust residual scale estimates for the submodels along the sequence.
crit
an object of class "bicSelect"
containing the
BIC values and indicating the final model (only returned if argument
crit
is "BIC"
and argument s
indicates more than one
step along the sequence).
muX
a numeric vector containing the center estimates of the predictors.
sigmaX
a numeric vector containing the scale estimates of the predictors.
muY
numeric; the center estimate of the response.
sigmaY
numeric; the scale estimate of the response.
x
the matrix of candidate predictors (if model
is
TRUE
).
y
the response (if model
is TRUE
).
w
a numeric vector giving the data cleaning weights (if
winsorize
is TRUE
).
call
the matched function call.
Andreas Alfons, based on code by Jafar A. Khan, Stefan Van Aelst and Ruben H. Zamar
Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299. doi:10.1198/016214507000000950
coef
,
fitted
,
plot
,
predict
,
residuals
,
rstandard
,
lmrob
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## fit robust LARS model rlars(x, y, sMax = 10)
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## fit robust LARS model rlars(x, y, sMax = 10)
Extract standardized residuals from a sequence of regression models, such as submodels along a robust or groupwise least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter.
## S3 method for class 'seqModel' rstandard(model, s = NA, drop = !is.null(s), ...) ## S3 method for class 'tslars' rstandard(model, p, ...) ## S3 method for class 'perrySeqModel' rstandard(model, ...) ## S3 method for class 'sparseLTS' rstandard( model, s = NA, fit = c("reweighted", "raw", "both"), drop = !is.null(s), ... )
## S3 method for class 'seqModel' rstandard(model, s = NA, drop = !is.null(s), ...) ## S3 method for class 'tslars' rstandard(model, p, ...) ## S3 method for class 'perrySeqModel' rstandard(model, ...) ## S3 method for class 'sparseLTS' rstandard( model, s = NA, fit = c("reweighted", "raw", "both"), drop = !is.null(s), ... )
model |
the model fit from which to extract standardize residuals. |
s |
for the |
drop |
a logical indicating whether to reduce the dimension to a vector in case of only one step. |
... |
for the |
p |
an integer giving the lag length for which to extract standardized residuals (the default is to use the optimal lag length). |
fit |
a character string specifying which standardized residuals to
extract. Possible values are |
A numeric vector or matrix containing the requested standardized residuals.
Andreas Alfons
rlars
, grplars
, rgrplars
,
tslarsP
, rtslarsP
, tslars
,
rtslars
, sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract standardized residuals rstandard(fitRlars) head(rstandard(fitRlars, s = 1:5)) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract standardized residuals rstandard(fitSparseLTS) head(rstandard(fitSparseLTS, fit = "both")) head(rstandard(fitSparseLTS, s = NULL)) head(rstandard(fitSparseLTS, fit = "both", s = NULL))
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract standardized residuals rstandard(fitRlars) head(rstandard(fitRlars, s = 1:5)) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract standardized residuals rstandard(fitSparseLTS) head(rstandard(fitSparseLTS, fit = "both")) head(rstandard(fitSparseLTS, s = NULL)) head(rstandard(fitSparseLTS, fit = "both", s = NULL))
Extract the relevent information for a plot of the coefficients for a sequence of regression models, such as submodels along a robust or groupwise least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter.
setupCoefPlot(object, ...) ## S3 method for class 'seqModel' setupCoefPlot(object, zeros = FALSE, labels = NULL, ...) ## S3 method for class 'tslars' setupCoefPlot(object, p, ...) ## S3 method for class 'sparseLTS' setupCoefPlot( object, fit = c("reweighted", "raw", "both"), zeros = FALSE, labels = NULL, ... )
setupCoefPlot(object, ...) ## S3 method for class 'seqModel' setupCoefPlot(object, zeros = FALSE, labels = NULL, ...) ## S3 method for class 'tslars' setupCoefPlot(object, p, ...) ## S3 method for class 'sparseLTS' setupCoefPlot( object, fit = c("reweighted", "raw", "both"), zeros = FALSE, labels = NULL, ... )
object |
the model fit from which to extract information. |
... |
additional arguments to be passed down. |
zeros |
a logical indicating whether predictors that never enter the
model and thus have zero coefficients should be included in the plot
( |
labels |
an optional character vector containing labels for the
predictors. Information on labels can be suppressed by setting this to
|
p |
an integer giving the lag length for which to extract information (the default is to use the optimal lag length). |
fit |
a character string specifying for which estimator to extract
information. Possible values are |
An object inheriting from class "setupCoefPlot"
with the
following components:
coefficients
a data frame containing the following columns:
fit
the model fit for which the coefficient is computed
(only returned if both the reweighted and raw fit are requested in the
"sparseLTS"
method).
lambda
the value of the penalty parameter for which the
coefficient is computed (only returned for the "sparseLTS"
method).
step
the step along the sequence for which the coefficient is computed.
df
the degrees of freedom of the submodel along the sequence for which the coefficient is computed.
coefficient
the value of the coefficient.
variable
a character string specifying to which variable the coefficient belongs.
abscissa
a character string specifying available options for
what to plot on the -axis
lambda
a numeric vector giving the values of the penalty
parameter. (only returned for the "sparseLTS"
method).
step
an integer vector containing the steps for which submodels along the sequence have been computed.
df
an integer vector containing the degrees of freedom of
the submodels along the sequence (i.e., the number of estimated
coefficients; only returned for the "seqModel"
method).
includeLabels
a logical indicating whether information on labels for the variables should be included in the plot.
labels
a data frame containing the following columns (not returned if information on labels is suppressed):
fit
the model fit for which the coefficient is computed
(only returned if both the reweighted and raw fit are requested in the
"sparseLTS"
method).
lambda
the smallest value of the penalty parameter
(only returned for the "sparseLTS"
method).
step
the last step along the sequence.
df
the degrees of freedom of the last submodel along the sequence.
coefficient
the value of the coefficient.
label
the label of the corresponding variable to be displayed in the plot.
facets
default faceting formula for the plots (only
returned if both estimators are requested in the "sparseLTS"
method).
Andreas Alfons
coefPlot
, rlars
,
grplars
, rgrplars
, tslarsP
,
rtslarsP
, tslars
, rtslars
,
sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract information for plotting setup <- setupCoefPlot(fitRlars) coefPlot(setup) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract information for plotting setup1 <- setupCoefPlot(fitSparseLTS) coefPlot(setup1) setup2 <- setupCoefPlot(fitSparseLTS, fit = "both") coefPlot(setup2)
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract information for plotting setup <- setupCoefPlot(fitRlars) coefPlot(setup) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract information for plotting setup1 <- setupCoefPlot(fitSparseLTS) coefPlot(setup1) setup2 <- setupCoefPlot(fitSparseLTS, fit = "both") coefPlot(setup2)
Extract the relevent information for a plot of the values of the optimality criterion for a sequence of regression models, such as submodels along a robust or groupwise least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter.
setupCritPlot(object, ...) ## S3 method for class 'seqModel' setupCritPlot(object, which = c("line", "dot"), ...) ## S3 method for class 'tslars' setupCritPlot(object, p, ...) ## S3 method for class 'sparseLTS' setupCritPlot( object, which = c("line", "dot"), fit = c("reweighted", "raw", "both"), ... ) ## S3 method for class 'perrySeqModel' setupCritPlot(object, which = c("line", "dot", "box", "density"), ...) ## S3 method for class 'perrySparseLTS' setupCritPlot( object, which = c("line", "dot", "box", "density"), fit = c("reweighted", "raw", "both"), ... )
setupCritPlot(object, ...) ## S3 method for class 'seqModel' setupCritPlot(object, which = c("line", "dot"), ...) ## S3 method for class 'tslars' setupCritPlot(object, p, ...) ## S3 method for class 'sparseLTS' setupCritPlot( object, which = c("line", "dot"), fit = c("reweighted", "raw", "both"), ... ) ## S3 method for class 'perrySeqModel' setupCritPlot(object, which = c("line", "dot", "box", "density"), ...) ## S3 method for class 'perrySparseLTS' setupCritPlot( object, which = c("line", "dot", "box", "density"), fit = c("reweighted", "raw", "both"), ... )
object |
the model fit from which to extract information. |
... |
additional arguments to be passed down. |
which |
a character string specifying the type of plot. Possible
values are |
p |
an integer giving the lag length for which to extract information (the default is to use the optimal lag length). |
fit |
a character string specifying for which estimator to extract
information. Possible values are |
An object inheriting from class "setupCritPlot"
with the
following components:
data
a data frame containing the following columns:
Fit
a vector or factor containing the identifiers of the models along the sequence.
Name
a factor specifying the estimator for which the
optimality criterion was estimated ("reweighted"
or "raw"
;
only returned if both are requested in the "sparseLTS"
or
"perrySparseLTS"
methods).
PE
the estimated prediction errors (only returned if applicable).
BIC
the estimated values of the Bayesian information criterion (only returned if applicable).
Lower
the lower end points of the error bars (only returned if possible to compute).
Upper
the upper end points of the error bars (only returned if possible to compute).
which
a character string specifying the type of plot.
grouped
a logical indicating whether density plots should
be grouped due to multiple model fits along the sequence (only returned
in case of density plots for the "perrySeqModel"
and
"perrySparseLTS"
methods).
includeSE
a logical indicating whether error bars based on standard errors are available (only returned in case of line plots or dot plots).
mapping
default aesthetic mapping for the plots.
facets
default faceting formula for the plots (only
returned if both estimators are requested in the "sparseLTS"
or "perrySparseLTS"
methods).
tuning
a data frame containing the grid of tuning parameter
values for which the optimality criterion was estimated (only returned for
the "sparseLTS"
and "perrySparseLTS"
methods).
Andreas Alfons
critPlot
, rlars
,
grplars
, rgrplars
, tslarsP
,
rtslarsP
, tslars
, rtslars
,
sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract information for plotting setup <- setupCritPlot(fitRlars) critPlot(setup) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract information for plotting setup1 <- setupCritPlot(fitSparseLTS) critPlot(setup1) setup2 <- setupCritPlot(fitSparseLTS, fit = "both") critPlot(setup2)
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract information for plotting setup <- setupCritPlot(fitRlars) critPlot(setup) ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitSparseLTS <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract information for plotting setup1 <- setupCritPlot(fitSparseLTS) critPlot(setup1) setup2 <- setupCritPlot(fitSparseLTS, fit = "both") critPlot(setup2)
Extract the fitted values and residuals of a sequence of regression models (such as robust least angle regression models or sparse least trimmed squares regression models) and other useful information for diagnostic plots.
setupDiagnosticPlot(object, ...) ## S3 method for class 'seqModel' setupDiagnosticPlot(object, s = NA, covArgs = list(...), ...) ## S3 method for class 'perrySeqModel' setupDiagnosticPlot(object, ...) ## S3 method for class 'tslars' setupDiagnosticPlot(object, p, ...) ## S3 method for class 'sparseLTS' setupDiagnosticPlot( object, s = NA, fit = c("reweighted", "raw", "both"), covArgs = list(...), ... ) ## S3 method for class 'perrySparseLTS' setupDiagnosticPlot(object, ...)
setupDiagnosticPlot(object, ...) ## S3 method for class 'seqModel' setupDiagnosticPlot(object, s = NA, covArgs = list(...), ...) ## S3 method for class 'perrySeqModel' setupDiagnosticPlot(object, ...) ## S3 method for class 'tslars' setupDiagnosticPlot(object, p, ...) ## S3 method for class 'sparseLTS' setupDiagnosticPlot( object, s = NA, fit = c("reweighted", "raw", "both"), covArgs = list(...), ... ) ## S3 method for class 'perrySparseLTS' setupDiagnosticPlot(object, ...)
object |
the model fit from which to extract information. |
... |
additional arguments to be passed to
|
s |
for the |
covArgs |
a list of arguments to be passed to
|
p |
an integer giving the lag length for which to extract information (the default is to use the optimal lag length). |
fit |
a character string specifying from which fit to extract
information. Possible values are |
Note that the argument alpha
for controlling the subset size
behaves differently for sparseLTS
than for
covMcd
. For sparseLTS
, the subset
size is determined by the fraction
alpha
of the number of
observations . For
covMcd
, on the other
hand, the subset size also depends on the number of variables (see
h.alpha.n
). However, the "sparseLTS"
and
"perrySparseLTS"
methods attempt to compute the MCD using the same
subset size that is used to compute the sparse least trimmed squares
regressions. This may not be possible if the number of selected variables
is large compared to the number of observations, in which case a warning is
given and NA
s are returned for the robust Mahalanobis distances.
An object of class "setupDiagnosticPlot"
with the following
components:
data
a data frame containing the columns listed below.
step
the steps (for the "seqModel"
method) or
indices (for the "sparseLTS"
method) of the models (only returned
if more than one model is requested).
fit
the model fits (only returned if both the reweighted
and raw fit are requested in the "sparseLTS"
method).
index
the indices of the observations.
fitted
the fitted values.
residual
the standardized residuals.
theoretical
the corresponding theoretical quantiles from the standard normal distribution.
qqd
the absolute distances from a reference line through the first and third sample and theoretical quartiles.
rd
the robust Mahalanobis distances computed via the
minimum covariance determinant (MCD) estimator (see
covMcd
).
xyd
the pairwise maxima of the absolute values of the standardized residuals and the robust Mahalanobis distances, divided by the respective other outlier detection cutoff point.
weight
the weights indicating regression outliers.
leverage
logicals indicating leverage points (i.e., outliers in the predictor space).
Diagnostics
a factor with levels "Potential outlier"
(potential regression outliers) and "Regular observation"
(data
points following the model).
qqLine
a data frame containing the intercepts and slopes of the respective reference lines to be displayed in residual Q-Q plots.
q
a data frame containing the quantiles of the Mahalanobis distribution used as cutoff points for detecting leverage points.
facets
default faceting formula for the diagnostic plots (only returned where applicable).
Andreas Alfons
diagnosticPlot
, rlars
,
grplars
, rgrplars
, tslarsP
,
rtslarsP
, tslars
, rtslars
,
sparseLTS
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract information for plotting setup <- setupDiagnosticPlot(fitRlars) diagnosticPlot(setup) ## sparse LTS # fit model fitSparseLTS <- sparseLTS(x, y, lambda = 0.05, mode = "fraction") # extract information for plotting setup1 <- setupDiagnosticPlot(fitSparseLTS) diagnosticPlot(setup1) setup2 <- setupDiagnosticPlot(fitSparseLTS, fit = "both") diagnosticPlot(setup2)
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## robust LARS # fit model fitRlars <- rlars(x, y, sMax = 10) # extract information for plotting setup <- setupDiagnosticPlot(fitRlars) diagnosticPlot(setup) ## sparse LTS # fit model fitSparseLTS <- sparseLTS(x, y, lambda = 0.05, mode = "fraction") # extract information for plotting setup1 <- setupDiagnosticPlot(fitSparseLTS) diagnosticPlot(setup1) setup2 <- setupDiagnosticPlot(fitSparseLTS, fit = "both") diagnosticPlot(setup2)
Compute least trimmed squares regression with an penalty on
the regression coefficients, which allows for sparse model estimates.
sparseLTS(x, ...) ## S3 method for class 'formula' sparseLTS(formula, data, ...) ## Default S3 method: sparseLTS( x, y, lambda, mode = c("lambda", "fraction"), alpha = 0.75, normalize = TRUE, intercept = TRUE, nsamp = c(500, 10), initial = c("sparse", "hyperplane", "random"), ncstep = 2, use.correction = TRUE, tol = .Machine$double.eps^0.5, eps = .Machine$double.eps, use.Gram, crit = c("BIC", "PE"), splits = foldControl(), cost = rtmspe, costArgs = list(), selectBest = c("hastie", "min"), seFactor = 1, ncores = 1, cl = NULL, seed = NULL, model = TRUE, ... )
sparseLTS(x, ...) ## S3 method for class 'formula' sparseLTS(formula, data, ...) ## Default S3 method: sparseLTS( x, y, lambda, mode = c("lambda", "fraction"), alpha = 0.75, normalize = TRUE, intercept = TRUE, nsamp = c(500, 10), initial = c("sparse", "hyperplane", "random"), ncstep = 2, use.correction = TRUE, tol = .Machine$double.eps^0.5, eps = .Machine$double.eps, use.Gram, crit = c("BIC", "PE"), splits = foldControl(), cost = rtmspe, costArgs = list(), selectBest = c("hastie", "min"), seFactor = 1, ncores = 1, cl = NULL, seed = NULL, model = TRUE, ... )
x |
a numeric matrix containing the predictor variables. |
... |
additional arguments to be passed down. |
formula |
a formula describing the model. |
data |
an optional data frame, list or environment (or object coercible
to a data frame by |
y |
a numeric vector containing the response variable. |
lambda |
a numeric vector of non-negative values to be used as penalty parameter. |
mode |
a character string specifying the type of penalty parameter. If
|
alpha |
a numeric value giving the percentage of the residuals for
which the |
normalize |
a logical indicating whether the predictor variables
should be normalized to have unit |
intercept |
a logical indicating whether a constant term should be
included in the model (the default is |
nsamp |
a numeric vector giving the number of subsamples to be used in
the two phases of the algorithm. The first element gives the number of
initial subsamples to be used. The second element gives the number of
subsamples to keep after the first phase of |
initial |
a character string specifying the type of initial subsamples
to be used. If |
ncstep |
a positive integer giving the number of C-steps to perform on all subsamples in the first phase of the algorithm (the default is to perform two C-steps). |
use.correction |
currently ignored. Small sample correction factors may be added in the future. |
tol |
a small positive numeric value giving the tolerance for convergence. |
eps |
a small positive numeric value used to determine whether the variability within a variable is too small (an effective zero). |
use.Gram |
a logical indicating whether the Gram matrix of the
explanatory variables should be precomputed in the lasso fits on the
subsamples. If the number of variables is large, computation may be faster
when this is set to |
crit |
a character string specifying the optimality criterion to be
used for selecting the final model. Possible values are |
splits |
an object giving data splits to be used for prediction error
estimation (see |
cost |
a cost function measuring prediction loss (see
|
costArgs |
a list of additional arguments to be passed to the
prediction loss function |
selectBest , seFactor
|
arguments specifying a criterion for selecting
the best model (see |
ncores |
a positive integer giving the number of processor cores to be
used for parallel computing (the default is 1 for no parallelization). If
this is set to |
cl |
a parallel cluster for parallel computing as generated by
|
seed |
optional initial seed for the random number generator (see
|
model |
a logical indicating whether the data |
If crit
is "PE"
and lambda
contains more than one
value of the penalty parameter, an object of class "perrySparseLTS"
(inheriting from class "perryTuning"
, see
perryTuning
). It contains information on the
prediction error criterion, and includes the final model with the optimal
tuning paramter as component finalModel
.
Otherwise an object of class "sparseLTS"
with the following
components:
lambda
a numeric vector giving the values of the penalty parameter.
best
an integer vector or matrix containing the respective
best subsets of observations found and used for computing the raw
estimates.
objective
a numeric vector giving the respective values of
the sparse LTS objective function, i.e., the penalized
sums of the
smallest squared residuals from the raw fits.
coefficients
a numeric vector or matrix containing the respective coefficient estimates from the reweighted fits.
fitted.values
a numeric vector or matrix containing the respective fitted values of the response from the reweighted fits.
residuals
a numeric vector or matrix containing the respective residuals from the reweighted fits.
center
a numeric vector giving the robust center estimates of the corresponding reweighted residuals.
scale
a numeric vector giving the robust scale estimates of the corresponding reweighted residuals.
cnp2
a numeric vector giving the respective consistency factors applied to the scale estimates of the reweighted residuals.
wt
an integer vector or matrix containing binary weights
that indicate outliers from the respective reweighted fits, i.e., the
weights are for observations with reasonably small reweighted
residuals and
for observations with large reweighted residuals.
df
an integer vector giving the respective degrees of freedom of the obtained reweighted model fits, i.e., the number of nonzero coefficient estimates.
intercept
a logical indicating whether the model includes a constant term.
alpha
a numeric value giving the percentage of the residuals
for which the penalized sum of squares was minimized.
quan
the number of observations used to compute the
raw estimates.
raw.coefficients
a numeric vector or matrix containing the respective coefficient estimates from the raw fits.
raw.fitted.values
a numeric vector or matrix containing the respective fitted values of the response from the raw fits.
raw.residuals
a numeric vector or matrix containing the respective residuals from the raw fits.
raw.center
a numeric vector giving the robust center estimates of the corresponding raw residuals.
raw.scale
a numeric vector giving the robust scale estimates of the corresponding raw residuals.
raw.cnp2
a numeric value giving the consistency factor applied to the scale estimate of the raw residuals.
raw.wt
an integer vector or matrix containing binary weights that indicate outliers from the respective raw fits, i.e., the weights used for the reweighted fits.
crit
an object of class "bicSelect"
containing the
BIC values and indicating the final model (only returned if argument
crit
is "BIC"
and argument lambda
contains more
than one value for the penalty parameter).
x
the predictor matrix (if model
is TRUE
).
y
the response variable (if model
is TRUE
).
call
the matched function call.
The underlying C++ code uses the C++ library Armadillo. From package version 0.6.0, the back end for sparse least trimmed squares from package sparseLTSEigen, which uses the C++ library Eigen, is no longer supported and can no longer be used.
Parallel computing is implemented via OpenMP (https://www.openmp.org/).
Andreas Alfons
Alfons, A., Croux, C. and Gelper, S. (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1), 226–248. doi:10.1214/12-AOAS575
coef
,
fitted
,
plot
,
predict
,
residuals
,
rstandard
,
weights
,
ltsReg
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## fit sparse LTS model for one value of lambda sparseLTS(x, y, lambda = 0.05, mode = "fraction") ## fit sparse LTS models over a grid of values for lambda frac <- seq(0.2, 0.05, by = -0.05) sparseLTS(x, y, lambda = frac, mode = "fraction")
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## fit sparse LTS model for one value of lambda sparseLTS(x, y, lambda = 0.05, mode = "fraction") ## fit sparse LTS models over a grid of values for lambda frac <- seq(0.2, 0.05, by = -0.05) sparseLTS(x, y, lambda = frac, mode = "fraction")
Standardize data with given functions for computing center and scale.
standardize(x, centerFun = mean, scaleFun = sd) robStandardize( x, centerFun = median, scaleFun = mad, fallback = FALSE, eps = .Machine$double.eps, ... )
standardize(x, centerFun = mean, scaleFun = sd) robStandardize( x, centerFun = median, scaleFun = mad, fallback = FALSE, eps = .Machine$double.eps, ... )
x |
a numeric vector, matrix or data frame to be standardized. |
centerFun |
a function to compute an estimate of the center of a
variable (defaults to |
scaleFun |
a function to compute an estimate of the scale of a
variable (defaults to |
fallback |
a logical indicating whether standardization with
|
eps |
a small positive numeric value used to determine whether the robust scale estimate of a variable is too small (an effective zero). |
... |
currently ignored. |
robStandardize
is a wrapper function for robust standardization,
hence the default is to use median
and
mad
.
An object of the same type as the original data x
containing
the centered and scaled data. The center and scale estimates of the
original data are returned as attributes "center"
and "scale"
,
respectively.
The implementation contains special cases for the typically used
combinations mean
/sd
and
median
/mad
in order to reduce
computation time.
Andreas Alfons
## generate data set.seed(1234) # for reproducibility x <- rnorm(10) # standard normal x[1] <- x[1] * 10 # introduce outlier ## standardize data x standardize(x) # mean and sd robStandardize(x) # median and MAD
## generate data set.seed(1234) # for reproducibility x <- rnorm(10) # standard normal x[1] <- x[1] * 10 # introduce outlier ## standardize data x standardize(x) # mean and sd robStandardize(x) # median and MAD
The data set contains information on cars featured on the website of the popular BBC television show Top Gear.
data("TopGear")
data("TopGear")
A data frame with 297 observations on the following 32 variables.
Maker
factor; the car maker.
Model
factor; the car model.
Type
factor; the exact model type.
Fuel
factor; the type of fuel ("Diesel"
or
"Petrol"
).
Price
numeric; the list price (in UK pounds)
Cylinders
numeric; the number of cylinders in the engine.
Displacement
numeric; the displacement of the engine (in cc).
DriveWheel
factor; the type of drive wheel ("4WD"
,
"Front"
or "Rear"
).
BHP
numeric; the power of the engine (in bhp).
Torque
numeric; the torque of the engine (in lb/ft).
Acceleration
numeric; the time it takes the car to get from 0 to 62 mph (in seconds).
TopSpeed
numeric; the car's top speed (in mph).
MPG
numeric; the combined fuel consuption (urban + extra urban; in miles per gallon).
Weight
numeric; the car's curb weight (in kg).
Length
numeric; the car's length (in mm).
Width
numeric; the car's width (in mm).
Height
numeric; the car's height (in mm).
AdaptiveHeadlights
factor; whether the car has adaptive
headlights ("no"
, "optional"
or "standard"
).
AdjustableSteering
factor; whether the car has adjustable
steering ("no"
or "standard"
).
AlarmSystem
factor; whether the car has an alarm system
("no/optional"
or "standard"
).
Automatic
factor; whether the car has an automatic
transmission ("no"
, "optional"
or "standard"
).
Bluetooth
factor; whether the car has bluetooth
("no"
, "optional"
or "standard"
).
ClimateControl
factor; whether the car has climate control
("no"
, "optional"
or "standard"
).
CruiseControl
factor; whether the car has cruise control
("no"
, "optional"
or "standard"
).
ElectricSeats
factor; whether the car has electric seats
("no"
, "optional"
or "standard"
).
Leather
factor; whether the car has a leather interior
("no"
, "optional"
or "standard"
).
ParkingSensors
factor; whether the car has parking sensors
("no"
, "optional"
or "standard"
).
PowerSteering
factor; whether the car has power steering
("no"
or "standard"
).
SatNav
factor; whether the car has a satellite navigation
system ("no"
, "optional"
or "standard"
).
ESP
factor; whether the car has ESP ("no"
,
"optional"
or "standard"
).
Verdict
numeric; review score between 1 (lowest) and 10 (highest).
Origin
factor; the origin of the car maker ("Asia"
,
"Europe"
or "USA"
).
The data were scraped from http://www.topgear.com/uk/
on 2014-02-24.
Variable Origin
was added based on the car maker information.
data("TopGear") summary(TopGear)
data("TopGear") summary(TopGear)
Construct blocks of original and lagged values for autoregressive time
series models with exogenous inputs. The typical use case is to supply the
output as newdata
argument to the
predict
method of robust groupwise least
angle regression models.
tsBlocks(x, y, p = 2, subset = NULL, intercept = TRUE)
tsBlocks(x, y, p = 2, subset = NULL, intercept = TRUE)
x |
a numeric matrix or data frame containing the exogenous predictor series. |
y |
a numeric vector containing the response series. |
p |
an integer giving the number of lags to include (defaults to 2). |
subset |
a logical or integer vector defining a subset of observations from which to construct the matrix of predictor blocks. |
intercept |
a logical indicating whether a column of ones should be added to the matrix of predictor blocks to account for the intercept. |
A matrix containing blocks of original and lagged values of the
time series y
and x
.
Andreas Alfons
predict.tslars
, tslars
,
predict.tslarsP
, tslarsP
(Robustly) sequence groups of candidate predictors and their respective lagged values according to their predictive content and find the optimal model along the sequence. Note that lagged values of the response are included as a predictor group as well.
tslars(x, ...) ## S3 method for class 'formula' tslars(formula, data, ...) ## Default S3 method: tslars( x, y, h = 1, pMax = 3, sMax = NA, fit = TRUE, s = c(0, sMax), crit = "BIC", ncores = 1, cl = NULL, model = TRUE, ... ) rtslars(x, ...) ## S3 method for class 'formula' rtslars(formula, data, ...) ## Default S3 method: rtslars( x, y, h = 1, pMax = 3, sMax = NA, centerFun = median, scaleFun = mad, regFun = lmrob, regArgs = list(), combine = c("min", "euclidean", "mahalanobis"), winsorize = FALSE, const = 2, prob = 0.95, fit = TRUE, s = c(0, sMax), crit = "BIC", ncores = 1, cl = NULL, seed = NULL, model = TRUE, ... )
tslars(x, ...) ## S3 method for class 'formula' tslars(formula, data, ...) ## Default S3 method: tslars( x, y, h = 1, pMax = 3, sMax = NA, fit = TRUE, s = c(0, sMax), crit = "BIC", ncores = 1, cl = NULL, model = TRUE, ... ) rtslars(x, ...) ## S3 method for class 'formula' rtslars(formula, data, ...) ## Default S3 method: rtslars( x, y, h = 1, pMax = 3, sMax = NA, centerFun = median, scaleFun = mad, regFun = lmrob, regArgs = list(), combine = c("min", "euclidean", "mahalanobis"), winsorize = FALSE, const = 2, prob = 0.95, fit = TRUE, s = c(0, sMax), crit = "BIC", ncores = 1, cl = NULL, seed = NULL, model = TRUE, ... )
x |
a numeric matrix or data frame containing the candidate predictor series. |
... |
additional arguments to be passed down. |
formula |
a formula describing the full model. |
data |
an optional data frame, list or environment (or object coercible
to a data frame by |
y |
a numeric vector containing the response series. |
h |
an integer giving the forecast horizon (defaults to 1). |
pMax |
an integer giving the maximum number of lags in the model (defaults to 3). |
sMax |
an integer giving the number of predictor series to be
sequenced. If it is |
fit |
a logical indicating whether to fit submodels along the sequence
( |
s |
an integer vector of length two giving the first and last
step along the sequence for which to compute submodels. The default
is to start with a model containing only an intercept (step 0) and
iteratively add all series along the sequence (step |
crit |
a character string specifying the optimality criterion to be
used for selecting the final model. Currently, only |
ncores |
a positive integer giving the number of processor cores to be
used for parallel computing (the default is 1 for no parallelization). If
this is set to |
cl |
a parallel cluster for parallel computing as generated by
|
model |
a logical indicating whether the model data should be included in the returned object. |
centerFun |
a function to compute a robust estimate for the center
(defaults to |
scaleFun |
a function to compute a robust estimate for the scale
(defaults to |
regFun |
a function to compute robust linear regressions that can be
interpreted as weighted least squares (defaults to
|
regArgs |
a list of arguments to be passed to |
combine |
a character string specifying how to combine the data
cleaning weights from the robust regressions with each predictor group.
Possible values are |
winsorize |
a logical indicating whether to clean the data by multivariate winsorization. |
const |
numeric; tuning constant for multivariate winsorization to be used in the initial corralation estimates based on adjusted univariate winsorization (defaults to 2). |
prob |
numeric; probability for the quantile of the
|
seed |
optional initial seed for the random number generator
(see |
If fit
is FALSE
, an integer matrix in which each column
contains the indices of the sequenced predictor series for the corresponding
lag length.
Otherwise an object of class "tslars"
with the following components:
pFit
a list containing the fits for the respective lag
lengths (see tslarsP
).
pOpt
an integer giving the optimal number of lags.
pMax
the maximum number of lags considered.
x
the matrix of candidate predictor series (if model
is TRUE
).
y
the response series (if model
is TRUE
).
call
the matched function call.
The predictor group of lagged values of the response is indicated by the index 0.
Andreas Alfons, based on code by Sarah Gelper
Alfons, A., Croux, C. and Gelper, S. (2016) Robust groupwise least angle regression. Computational Statistics & Data Analysis, 93, 421–435. doi:10.1016/j.csda.2015.02.007
coef
,
fitted
,
plot
,
predict
,
residuals
,
tslarsP
, lmrob
(Robustly) sequence groups of candidate predictors and their respective lagged values according to their predictive content and find the optimal model along the sequence. Note that lagged values of the response are included as a predictor group as well.
tslarsP(x, ...) ## S3 method for class 'formula' tslarsP(formula, data, ...) ## Default S3 method: tslarsP( x, y, h = 1, p = 2, sMax = NA, fit = TRUE, s = c(0, sMax), crit = "BIC", ncores = 1, cl = NULL, model = TRUE, ... ) rtslarsP(x, ...) ## S3 method for class 'formula' rtslarsP(formula, data, ...) ## Default S3 method: rtslarsP( x, y, h = 1, p = 2, sMax = NA, centerFun = median, scaleFun = mad, regFun = lmrob, regArgs = list(), combine = c("min", "euclidean", "mahalanobis"), winsorize = FALSE, const = 2, prob = 0.95, fit = TRUE, s = c(0, sMax), crit = "BIC", ncores = 1, cl = NULL, seed = NULL, model = TRUE, ... )
tslarsP(x, ...) ## S3 method for class 'formula' tslarsP(formula, data, ...) ## Default S3 method: tslarsP( x, y, h = 1, p = 2, sMax = NA, fit = TRUE, s = c(0, sMax), crit = "BIC", ncores = 1, cl = NULL, model = TRUE, ... ) rtslarsP(x, ...) ## S3 method for class 'formula' rtslarsP(formula, data, ...) ## Default S3 method: rtslarsP( x, y, h = 1, p = 2, sMax = NA, centerFun = median, scaleFun = mad, regFun = lmrob, regArgs = list(), combine = c("min", "euclidean", "mahalanobis"), winsorize = FALSE, const = 2, prob = 0.95, fit = TRUE, s = c(0, sMax), crit = "BIC", ncores = 1, cl = NULL, seed = NULL, model = TRUE, ... )
x |
a numeric matrix or data frame containing the candidate predictor series. |
... |
additional arguments to be passed down. |
formula |
a formula describing the full model. |
data |
an optional data frame, list or environment (or object coercible
to a data frame by |
y |
a numeric vector containing the response series. |
h |
an integer giving the forecast horizon (defaults to 1). |
p |
an integer giving the number of lags in the model (defaults to 2). |
sMax |
an integer giving the number of predictor series to be
sequenced. If it is |
fit |
a logical indicating whether to fit submodels along the sequence
( |
s |
an integer vector of length two giving the first and last
step along the sequence for which to compute submodels. The default
is to start with a model containing only an intercept (step 0) and
iteratively add all series along the sequence (step |
crit |
a character string specifying the optimality criterion to be
used for selecting the final model. Currently, only |
ncores |
a positive integer giving the number of processor cores to be
used for parallel computing (the default is 1 for no parallelization). If
this is set to |
cl |
a parallel cluster for parallel computing as generated by
|
model |
a logical indicating whether the model data should be included in the returned object. |
centerFun |
a function to compute a robust estimate for the center
(defaults to |
scaleFun |
a function to compute a robust estimate for the scale
(defaults to |
regFun |
a function to compute robust linear regressions that can be
interpreted as weighted least squares (defaults to
|
regArgs |
a list of arguments to be passed to |
combine |
a character string specifying how to combine the data
cleaning weights from the robust regressions with each predictor group.
Possible values are |
winsorize |
a logical indicating whether to clean the data by multivariate winsorization. |
const |
numeric; tuning constant for multivariate winsorization to be used in the initial corralation estimates based on adjusted univariate winsorization (defaults to 2). |
prob |
numeric; probability for the quantile of the
|
seed |
optional initial seed for the random number generator
(see |
If fit
is FALSE
, an integer vector containing the indices of
the sequenced predictor series.
Otherwise an object of class "tslarsP"
(inheriting from classes
"grplars"
and "seqModel"
) with the following components:
active
an integer vector containing the sequence of predictor series.
s
an integer vector containing the steps for which submodels along the sequence have been computed.
coefficients
a numeric matrix in which each column contains the regression coefficients of the corresponding submodel along the sequence.
fitted.values
a numeric matrix in which each column contains the fitted values of the corresponding submodel along the sequence.
residuals
a numeric matrix in which each column contains the residuals of the corresponding submodel along the sequence.
df
an integer vector containing the degrees of freedom of the submodels along the sequence (i.e., the number of estimated coefficients).
robust
a logical indicating whether a robust fit was computed.
scale
a numeric vector giving the robust residual scale estimates for the submodels along the sequence (only returned for a robust fit).
crit
an object of class "bicSelect"
containing the BIC
values and indicating the final model (only returned if argument crit
is "BIC"
and argument s
indicates more than one step along the
sequence).
muX
a numeric vector containing the center estimates of the predictor variables.
sigmaX
a numeric vector containing the scale estimates of the predictor variables.
muY
numeric; the center estimate of the response.
sigmaY
numeric; the scale estimate of the response.
x
the matrix of candidate predictor series (if model
is
TRUE
).
y
the response series (if model
is TRUE
).
assign
an integer vector giving the predictor group to which each predictor variable belongs.
w
a numeric vector giving the data cleaning weights (only returned for a robust fit).
h
the forecast horizon.
p
the number of lags in the model.
call
the matched function call.
The predictor group of lagged values of the response is indicated by the index 0.
Andreas Alfons, based on code by Sarah Gelper
Alfons, A., Croux, C. and Gelper, S. (2016) Robust groupwise least angle regression. Computational Statistics & Data Analysis, 93, 421–435. doi:10.1016/j.csda.2015.02.007
coef
,
fitted
,
plot
,
predict
,
residuals
,
rstandard
,
tslars
, lmrob
Extract binary weights that indicate outliers from sparse least trimmed squares regression models.
## S3 method for class 'sparseLTS' weights( object, type = "robustness", s = NA, fit = c("reweighted", "raw", "both"), drop = !is.null(s), ... )
## S3 method for class 'sparseLTS' weights( object, type = "robustness", s = NA, fit = c("reweighted", "raw", "both"), drop = !is.null(s), ... )
object |
the model fit from which to extract outlier weights. |
type |
the type of weights to be returned. Currently only robustness
weights are implemented ( |
s |
an integer vector giving the indices of the models for which to
extract outlier weights. If |
fit |
a character string specifying for which estimator to extract
outlier weights. Possible values are |
drop |
a logical indicating whether to reduce the dimension to a vector in case of only one model. |
... |
currently ignored. |
A numeric vector or matrix containing the requested outlier weights.
The weights are for observations with reasonably small
residuals and
for observations with large residuals.
Andreas Alfons
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitGrid <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract outlier weights weights(fitGrid) head(weights(fitGrid, fit = "both")) head(weights(fitGrid, s = NULL)) head(weights(fitGrid, fit = "both", s = NULL))
## generate data # example is not high-dimensional to keep computation time low library("mvtnorm") set.seed(1234) # for reproducibility n <- 100 # number of observations p <- 25 # number of variables beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients sigma <- 0.5 # controls signal-to-noise ratio epsilon <- 0.1 # contamination level Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p)) x <- rmvnorm(n, sigma=Sigma) # predictor matrix e <- rnorm(n) # error terms i <- 1:ceiling(epsilon*n) # observations to be contaminated e[i] <- e[i] + 5 # vertical outliers y <- c(x %*% beta + sigma * e) # response x[i,] <- x[i,] + 5 # bad leverage points ## sparse LTS over a grid of values for lambda # fit model frac <- seq(0.2, 0.05, by = -0.05) fitGrid <- sparseLTS(x, y, lambda = frac, mode = "fraction") # extract outlier weights weights(fitGrid) head(weights(fitGrid, fit = "both")) head(weights(fitGrid, s = NULL)) head(weights(fitGrid, fit = "both", s = NULL))
Clean data by means of winsorization, i.e., by shrinking outlying observations to the border of the main part of the data.
winsorize(x, ...) ## Default S3 method: winsorize( x, standardized = FALSE, centerFun = median, scaleFun = mad, const = 2, return = c("data", "weights"), ... ) ## S3 method for class 'matrix' winsorize( x, standardized = FALSE, centerFun = median, scaleFun = mad, const = 2, prob = 0.95, tol = .Machine$double.eps^0.5, return = c("data", "weights"), ... ) ## S3 method for class 'data.frame' winsorize(x, ...)
winsorize(x, ...) ## Default S3 method: winsorize( x, standardized = FALSE, centerFun = median, scaleFun = mad, const = 2, return = c("data", "weights"), ... ) ## S3 method for class 'matrix' winsorize( x, standardized = FALSE, centerFun = median, scaleFun = mad, const = 2, prob = 0.95, tol = .Machine$double.eps^0.5, return = c("data", "weights"), ... ) ## S3 method for class 'data.frame' winsorize(x, ...)
x |
a numeric vector, matrix or data frame to be cleaned. |
... |
for the generic function, additional arguments to be passed
down to methods. For the |
standardized |
a logical indicating whether the data are already robustly standardized. |
centerFun |
a function to compute a robust estimate for the center to
be used for robust standardization (defaults to
|
scaleFun |
a function to compute a robust estimate for the scale to
be used for robust standardization (defaults to |
const |
numeric; tuning constant to be used in univariate winsorization (defaults to 2). |
return |
character string; if |
prob |
numeric; probability for the quantile of the
|
tol |
a small positive numeric value used to determine singularity
issues in the computation of correlation estimates based on bivariate
winsorization (see |
The borders of the main part of the data are defined on the scale of the
robustly standardized data. In the univariate case, the borders are given
by const
, thus a symmetric distribution is assumed. In the
multivariate case, a normal distribution is assumed and the data are
shrunken towards the boundary of a tolerance ellipse with coverage
probability prob
. The boundary of this ellipse is thereby given by
all points that have a squared Mahalanobis distance equal to the quantile of
the distribution given by
prob
.
If standardize
is TRUE
and return
is "weights"
,
a set of data cleaning weights. Multiplying each observation of the
standardized data by the corresponding weight yields the cleaned
standardized data.
Otherwise an object of the same type as the original data x
containing the cleaned data is returned.
Data cleaning weights are only meaningful for standardized data. In the general case, the data need to be standardized first, then the data cleaning weights can be computed and applied to the standardized data, after which the cleaned standardized data need to be backtransformed to the original scale.
Andreas Alfons, based on code by Jafar A. Khan, Stefan Van Aelst and Ruben H. Zamar
Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299. doi:10.1198/016214507000000950
## generate data set.seed(1234) # for reproducibility x <- rnorm(10) # standard normal x[1] <- x[1] * 10 # introduce outlier ## winsorize data x winsorize(x)
## generate data set.seed(1234) # for reproducibility x <- rnorm(10) # standard normal x[1] <- x[1] * 10 # introduce outlier ## winsorize data x winsorize(x)