Title: | Weighted BACON Algorithms |
---|---|
Description: | The BACON algorithms are methods for multivariate outlier nomination (detection) and robust linear regression by Billor, Hadi, and Velleman (2000) <doi:10.1016/S0167-9473(99)00101-2>. The extension to weighted problems is due to Beguin and Hulliger (2008) <https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X200800110616>; see also <doi:10.21105/joss.03238>. |
Authors: | Tobias Schoch [aut, cre], R-core [cph] (plot.wbaconlm derives from plot.lm) |
Maintainer: | Tobias Schoch |
License: | GPL (>= 2) |
Version: | 0.6-2 |
Built: | 2024-12-07 06:54:57 UTC |
Source: | CRAN |
The package wbacon implements the BACON algorithms of Billor et al. (2000) and some of the extensions proposed by Béguin and Hulliger (2008).
See wBACON
to learn more on the BACON method for multivariate
outlier nomination (detection).
See wBACON_reg
to learn more on the BACON method for robust
linear regression.
Tobias Schoch
Billor N., Hadi A.S. and Velleman P.F. (2000). BACON: Blocked Adaptive Computationally efficient Outlier Nominators. Computational Statistics and Data Analysis 34, pp. 279–298. doi:10.1016/S0167-9473(99)00101-2
Béguin C. and Hulliger B. (2008). The BACON-EEM Algorithm for Multivariate Outlier Detection in Incomplete Survey Data. Survey Methodology 34, pp. 91–103. https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X200800110616
Schoch, T. (2021). wbacon: Weighted BACON algorithms for multivariate outlier nomination (detection) and robust linear regression, Journal of Open Source Software 6 (62), 3238 doi:10.21105/joss.03238
Returns a logical vector that indicates which observations were nominated as outliers by the method.
is_outlier(object, ...)

## S3 method for class 'wbaconlm'
is_outlier(object, ...)

## S3 method for class 'wbaconmv'
is_outlier(object, ...)
object |
object of class |
... |
additional arguments passed to the method. |
A logical vector.
wBACON_reg
and wBACON
data(swiss)
m <- wBACON(swiss)
is_outlier(m)
median_w
computes the weighted population median.
median_w(x, w, na.rm = FALSE)
x |
|
w |
|
na.rm |
|
Weighted sample median; see quantile_w
for more
information.
Weighted estimate of the population median.
The data set consists of 677 observations on 9 variables/characteristics of diaphragm parts for television sets.
data(philips)
A data.frame
with 677 observations on the following variables:
X1
[double]
, characteristic 1.
X2
[double]
, characteristic 2.
X3
[double]
, characteristic 3.
X4
[double]
, characteristic 4.
X5
[double]
, characteristic 5.
X6
[double]
, characteristic 6.
X7
[double]
, characteristic 7.
X8
[double]
, characteristic 8.
X9
[double]
, characteristic 9.
The data have been studied in Rousseeuw and van Driessen (1999) and Billor et al. (2000). They have been published in Raymaekers and Rousseeuw (2023).
Billor, N., A. S. Hadi, and P. F. Velleman (2000). BACON: Blocked Adaptive Computationally-efficient Outlier Nominators. Computational Statistics and Data Analysis 34, 279–298. doi:10.1016/S0167-9473(99)00101-2
Raymaekers, J. and P. Rousseeuw (2023). cellWise: Analyzing Data with Cellwise Outliers. R package version 2.5.3, https://CRAN.R-project.org/package=cellWise
Rousseeuw, P. J. and K. van Driessen (1999). A fast algorithm for the Minimum Covariance Determinant estimator. Technometrics 41, 212–223. doi:10.2307/1270566
head(philips)
wbaconlm
Four plots (selectable by which
) are available for an object of
class wbaconlm
(see wBACON_reg
): A plot
of residuals against fitted values, a scale-location plot of
sqrt(|residuals|) against fitted values,
a Normal Q-Q plot, and a plot of the standardized residuals versus the
robust Mahalanobis distances.
## S3 method for class 'wbaconlm'
plot(x, which = c(1, 2, 3, 4), hex = FALSE,
     caption = c("Residuals vs Fitted", "Normal Q-Q", "Scale-Location",
                 "Standardized Residuals vs Robust Mahalanobis Distance"),
     panel = if (add.smooth) function(x, y, ...)
         panel.smooth(x, y, iter = iter.smooth, ...) else points,
     sub.caption = NULL, main = "",
     ask = prod(par("mfcol")) < length(which) && dev.interactive(), ...,
     id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75, qqline = TRUE,
     add.smooth = getOption("add.smooth"), iter.smooth = 3,
     label.pos = c(4, 2), cex.caption = 1, cex.oma.main = 1.25)
x |
object of class |
which |
if a subset of the plots is required, specify a subset of
the numbers |
hex |
toggle a hexagonally binned plot, |
caption |
captions to appear above the plots;
|
panel |
panel function. The useful alternative to
|
sub.caption |
common title |
main |
title to each plot |
ask |
|
... |
other parameters to be passed through to plotting functions. |
id.n |
number of points to be labelled in each plot, starting
with the most extreme, |
labels.id |
vector of labels |
cex.id |
magnification of point labels, |
qqline |
|
add.smooth |
|
iter.smooth |
the number of robustness iterations |
label.pos |
positioning of labels |
cex.caption |
controls the size of |
cex.oma.main |
controls the size of the |
The plots for which %in% 1:3
are identical with the
plot method for linear models (see plot.lm
).
There you can find details on the implementation and references.
The standardized residuals vs. robust Mahalanobis distance plot
(which = 4
) has been proposed by Rousseeuw and van Zomeren (1990).
[no return value]
Rousseeuw, P.J. and B.C. van Zomeren (1990). Unmasking Multivariate Outliers and Leverage Points, Journal of the American Statistical Association 85 (411), 633–639. doi:10.2307/2289995
wbaconmv
Two plots (selectable by which
) are available for an object of class
wbaconmv
: (1) Robust distance vs. Index and (2) Robust distance
vs. Univariate projection.
## S3 method for class 'wbaconmv'
plot(x, which = 1:2,
     caption = c("Robust distance vs. Index",
                 "Robust distance vs. Univariate projection"),
     hex = FALSE, col = 2, pch = 19,
     ask = prod(par("mfcol")) < length(which) && dev.interactive(),
     alpha = 0.05, maxiter = 20, tol = 1e-5, ...)

SeparationIndex(object, alpha = 0.05, tol = 1e-5, maxiter = 20)
x |
object of class |
which |
if a subset of the plots is required, specify a subset of
the numbers |
caption |
captions to appear above the plots;
|
hex |
toggle the hexagonal bin plot on/off |
col |
color of outliers, |
pch |
plot character of outliers, |
ask |
|
alpha |
|
maxiter |
|
tol |
numerical termination criterion, |
object |
object of class |
... |
additional arguments passed to the method. |
The first plot (which = 1
) is a standard diagnostic tool which plots
the observations' index (1:n
) against the robust (Mahalanobis)
distances; see, e.g., Rousseeuw and van Driessen (1999).
The second plot (which = 2
) plots the univariate projection of
the data which maximizes the separation criterion for clusters of
Qiu and Joe (2006) against the robust (Mahalanobis) distances. This plot
is due to Willems et al. (2009).
For large data sets, it is recommended to specify the argument
hex = TRUE
. This option shows a hexagonally binned scatterplot
in place of the classical scatterplot.
[no return value]
Rousseeuw, P.J. and K. van Driessen (1999). A Fast Algorithm for the Minimum Covariance Determinant, Technometrics 41, 212–223. doi:10.2307/1270566
Qiu, W. and H. Joe (2006). Separation index and partial membership for clustering, Computational Statistics and Data Analysis 50, 585–603. doi:10.1016/j.csda.2004.09.009
Willems, G., H. Joe, and R. Zamar (2009). Diagnosing Multivariate Outliers Detected by Robust Estimators, Journal of Computational and Graphical Statistics 18, 73–91. doi:10.1198/jcgs.2009.0005
This function does exactly what predict
does for
the linear model lm
; see predict.lm
for
more details.
## S3 method for class 'wbaconlm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
        interval = c("none", "confidence", "prediction"), level = 0.95,
        type = c("response", "terms"), terms = NULL, na.action = na.pass,
        ...)
object |
Object of class inheriting from |
newdata |
An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used. |
se.fit |
A switch |
scale |
Scale parameter for std.err. calculation, |
df |
Degrees of freedom for scale, |
interval |
Type of interval calculation, |
level |
Tolerance/confidence level, |
type |
Type of prediction (response or model term),
|
terms |
If |
na.action |
function determining what should be done with missing
values in |
... |
further arguments passed to
|
predict.wbaconlm
produces a vector of predictions or a matrix of
predictions and bounds with column names fit
, lwr
, and
upr
if interval
is set. For type = "terms"
this
is a matrix with a column per term and may have an attribute
"constant"
.
If se.fit
is
TRUE
, a list with the following components is returned:
fit |
vector or matrix as above |
se.fit |
standard error of predicted means |
residual.scale |
residual standard deviations |
df |
degrees of freedom for residual |
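The following sketch requests prediction intervals for two new observations. It relies only on the signature shown above; the values in `new_obs` are made up for illustration, and the call is guarded so that nothing runs if the wbacon package is not installed.

```r
if (requireNamespace("wbacon", quietly = TRUE)) {
    library(wbacon)
    data(iris)
    m <- wBACON_reg(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                    data = iris)
    # Hypothetical new observations (values chosen for illustration only)
    new_obs <- data.frame(Sepal.Width = c(3.0, 3.5),
                          Petal.Length = c(1.5, 5.0),
                          Petal.Width = c(0.2, 1.8))
    # One row per new observation; columns fit, lwr, and upr
    pred <- predict(m, newdata = new_obs, interval = "prediction",
                    level = 0.95)
    print(pred)
}
```

Setting `interval = "confidence"` instead gives bounds for the mean response rather than for a single new observation, as with `predict.lm`.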
data(iris)
m <- wBACON_reg(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                data = iris)
predict(m, newdata = data.frame(Sepal.Width = 1, Petal.Length = 1,
                                Petal.Width = 1))
quantile_w
computes the weighted population quantiles.
quantile_w(x, w, probs, na.rm = FALSE)
x |
|
w |
|
probs |
|
na.rm |
|
quantile_w
computes the weighted sample
quantiles; argument probs
allows vector inputs.
The function is based on a weighted version of the quickselect algorithm with the Bentley and McIlroy (1993) 3-way partitioning scheme. For very small arrays, we use insertion sort.
For equal weighting, i.e. when all elements in
w
are equal, quantile_w
computes quantiles that
are identical with type = 2
in stats::quantile
; see
also Hyndman and Fan (1996).
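To make the link to `type = 2` concrete, here is a minimal sort-based sketch of a weighted quantile in base R. It is not the package's implementation (which, as stated above, uses a weighted quickselect); the function name `weighted_quantile_sketch` and the averaging rule at exact ties are assumptions chosen so that equal weights reproduce `stats::quantile(..., type = 2)`.

```r
# Hypothetical helper, not part of wbacon: sort-based weighted quantile.
weighted_quantile_sketch <- function(x, w, probs) {
    stopifnot(length(x) == length(w), all(w > 0), !anyNA(x))
    ord <- order(x)
    x <- x[ord]
    cw <- cumsum(w[ord])               # cumulative weights of the sorted data
    n <- length(x)
    tol <- sqrt(.Machine$double.eps)
    vapply(probs, function(p) {
        target <- p * cw[n]            # weight mass to the left of the quantile
        k <- which(cw >= target - tol)[1L]
        if (abs(cw[k] - target) <= tol && k < n) {
            (x[k] + x[k + 1L]) / 2     # exact hit: average adjacent values
        } else {
            x[k]
        }
    }, numeric(1))
}

x <- c(2.1, 7.4, 3.3, 9.8, 5.0, 1.2, 6.6, 4.9)
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
res_sketch <- weighted_quantile_sketch(x, rep(1, length(x)), probs)
res_base <- unname(quantile(x, probs, type = 2))
all.equal(res_sketch, res_base)
```

Sorting costs O(n log n); a selection-based approach such as quickselect avoids the full sort, which matters for large inputs.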
Weighted estimate of the population quantiles.
Bentley, J.L. and D.M. McIlroy (1993). Engineering a Sort Function, Software - Practice and Experience 23, 1249–1265. doi:10.1002/spe.4380231105
Hyndman, R.J. and Y. Fan (1996). Sample Quantiles in Statistical Packages, The American Statistician 50, 361–365. doi:10.2307/2684934
wBACON
is an iterative method for the computation of multivariate
location and scatter (under the assumption of a Gaussian distribution).
wBACON(x, weights = NULL, alpha = 0.05, collect = 4,
       version = c("V2", "V1"), na.rm = FALSE, maxiter = 50,
       verbose = FALSE, n_threads = 2)

distance(x)

## S3 method for class 'wbaconmv'
print(x, digits = max(3L, getOption("digits") - 3L), ...)

## S3 method for class 'wbaconmv'
summary(object, ...)

center(object)

## S3 method for class 'wbaconmv'
vcov(object, ...)
x |
|
weights |
|
alpha |
|
collect |
determines the size |
version |
|
na.rm |
|
maxiter |
|
verbose |
|
n_threads |
|
digits |
|
... |
additional arguments passed to the method. |
object |
object of class |
The algorithm is initialized from a set of uncontaminated data. Then the subset is iteratively refined; i.e., additional observations are included in the subset if their Mahalanobis distance is below some threshold (likewise, observations are removed from the subset if their distance is larger than the threshold). This process iterates until the set of good data remains stable. Observations not among the good data are nominated as outliers; see Billor et al. (2000). The weighted BACON algorithm is due to Béguin and Hulliger (2008).
The threshold for the (squared) Mahalanobis distances is defined as
the standardized chi-square quantile. All
observations whose squared Mahalanobis distance is larger than
the threshold are regarded as outliers.
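The refinement loop described above can be illustrated with a toy version in base R. This is a simplified sketch and not the wbacon implementation: the initial subset (the 4 * p points closest to the coordinate-wise median), the plain chi-square cutoff without any standardization, and the name `bacon_sketch` are all assumptions made for illustration.

```r
# Toy illustration only; this is NOT the wbacon implementation.
bacon_sketch <- function(x, w = rep(1, nrow(x)), alpha = 0.05, maxiter = 50) {
    x <- as.matrix(x)
    p <- ncol(x)
    # Assumption: initial subset = the 4 * p points closest to the
    # coordinate-wise median (each column scaled by its MAD)
    z <- scale(x, center = apply(x, 2, median), scale = apply(x, 2, mad))
    subset <- rank(rowSums(z^2), ties.method = "first") <= 4 * p
    # Assumption: plain chi-square quantile as cutoff for squared distances
    cutoff <- qchisq(1 - alpha, df = p)
    converged <- FALSE
    for (iter in seq_len(maxiter)) {
        ws <- w[subset]
        # Weighted center and covariance computed from the current subset
        fit <- cov.wt(x[subset, , drop = FALSE], wt = ws / sum(ws))
        d2 <- mahalanobis(x, center = fit$center, cov = fit$cov)
        new_subset <- d2 < cutoff
        converged <- all(new_subset == subset)
        subset <- new_subset
        if (converged) break
    }
    list(subset = subset, dist = sqrt(d2), converged = converged)
}

set.seed(1)
good <- matrix(rnorm(200), ncol = 2)           # 100 regular observations
bad <- matrix(rnorm(10, mean = 10), ncol = 2)  # 5 planted outliers
res <- bacon_sketch(rbind(good, bad))
which(!res$subset)                             # nominated observations
```

Because the covariance is estimated from a truncated subset and no correction is applied here, this toy version tends to nominate more regular observations than the nominal level suggests.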
If the sampling weights weights
are not explicitly specified (i.e.,
weights = NULL
), they are taken to be 1.0.
The function wBACON
cannot deal with missing values. In contrast,
function BEM
in package modi implements
the BACON-EEM algorithm of Béguin and Hulliger (2008), which
is tailored to work with outlying and missing values.
If the argument na.rm
is set to TRUE
the method behaves
like na.omit
.
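What "behaves like na.omit" means can be seen in base R with a small made-up data set: only rows without any missing value are kept. That the weights are subset in the same way is an assumption of this sketch, not a statement about the package internals.

```r
# Made-up data with missing values in rows 2 and 3
dt <- data.frame(a = c(1, NA, 3, 4), b = c(10, 20, NA, 40))
w <- c(1, 2, 1, 3)

keep <- complete.cases(dt)   # TRUE for rows without any NA
dt_complete <- dt[keep, ]    # same rows that na.omit(dt) retains
w_complete <- w[keep]        # assumption: weights are dropped alongside

nrow(dt_complete)
nrow(na.omit(dt))
```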
The BACON algorithm assumes that the non-outlying data have (roughly) an elliptically contoured distribution (this includes the Gaussian distribution as a special case). "Although the algorithms will often do something reasonable even when these assumptions are violated, it is hard to say what the results mean." (Billor et al., 2000, p. 289)
In line with Billor et al. (2000, p. 290), we use the term outlier "nomination" rather than "detection" to highlight that algorithms should not go beyond nominating observations as potential outliers; see also Béguin and Hulliger (2008). It is left to the analyst to finally label outlying observations as such.
Diagnostic plots are available by the plot
method.
The method center
and vcov
return, respectively, the
estimated center/location and covariance matrix.
The distance
method returns the robust Mahalanobis distances.
The function is_outlier returns a vector of logicals that flags the nominated outliers.
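A short usage sketch of these accessors, based on the signatures listed above and the swiss data used elsewhere in this manual. The call is guarded so that nothing runs if the wbacon package is not installed.

```r
if (requireNamespace("wbacon", quietly = TRUE)) {
    library(wbacon)
    data(swiss)
    dt <- swiss[, c("Fertility", "Agriculture", "Examination", "Education",
                    "Infant.Mortality")]
    m <- wBACON(dt)
    print(center(m))           # estimated center/location
    print(vcov(m))             # estimated covariance matrix
    d <- distance(m)           # robust Mahalanobis distances
    flagged <- is_outlier(m)   # logical vector of nominated outliers
    print(which(flagged))
}
```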
An object of class wbaconmv
with slots
x |
see function arguments |
weights |
see function arguments |
center |
estimated center of the data |
dist |
Mahalanobis distances |
n |
number of observations |
p |
number of variables |
alpha |
see function arguments |
subset |
final subset of outlier-free data |
cutoff |
see function arguments |
maxiter |
number of iterations until convergence |
version |
see function arguments |
collect |
see function arguments |
cov |
covariance matrix |
converged |
logical that indicates whether the algorithm converged |
call |
the matched call |
Billor N., Hadi A.S. and Velleman P.F. (2000). BACON: Blocked Adaptive Computationally efficient Outlier Nominators. Computational Statistics and Data Analysis 34, pp. 279–298. doi:10.1016/S0167-9473(99)00101-2
Béguin C. and Hulliger B. (2008). The BACON-EEM Algorithm for Multivariate Outlier Detection in Incomplete Survey Data. Survey Methodology 34, pp. 91–103. https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X200800110616
Schoch, T. (2021). wbacon: Weighted BACON algorithms for multivariate outlier nomination (detection) and robust linear regression, Journal of Open Source Software 6 (62), 3238 doi:10.21105/joss.03238
plot
and
is_outlier
data(swiss)
dt <- swiss[, c("Fertility", "Agriculture", "Examination", "Education",
                "Infant.Mortality")]
m <- wBACON(dt)
m
which(is_outlier(m))
The weighted BACON algorithm is a robust method to fit weighted linear regression models. The method is robust against outliers in the response variable and in the design matrix (leverage observations).
wBACON_reg(formula, weights = NULL, data, collect = 4, na.rm = FALSE,
           alpha = 0.05, version = c("V2", "V1"), maxiter = 50,
           verbose = FALSE, original = FALSE, n_threads = 2)

## S3 method for class 'wbaconlm'
print(x, digits = max(3L, getOption("digits") - 3L), ...)

## S3 method for class 'wbaconlm'
summary(object, ...)

## S3 method for class 'wbaconlm'
fitted(object, ...)

## S3 method for class 'wbaconlm'
residuals(object, ...)

## S3 method for class 'wbaconlm'
coef(object, ...)

## S3 method for class 'wbaconlm'
vcov(object, ...)
formula |
an object of class |
weights |
|
data |
a |
collect |
determines the size |
na.rm |
|
alpha |
|
version |
method to initialize the basic subset, |
maxiter |
|
verbose |
|
original |
|
n_threads |
|
digits |
|
object |
object of class |
x |
object of class |
... |
additional arguments passed to the method. |
First, the wBACON
method is applied to the model's design
matrix (having removed the regression intercept/constant, if there is
a constant) to establish a subset of observations which is supposed to
be free of outliers. Second, the response variable is regressed on the
design matrix using only the observations in this subset. The subset is iteratively
enlarged to include as many “good” observations as possible.
The original approach of Billor et al. (2000) is obtained by specifying
the argument original = TRUE
.
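The two steps can be mimicked in base R as follows. This is only a sketch under a strong simplification: classical (non-robust) Mahalanobis distances with a chi-square cutoff stand in for the wBACON step, and the subset is fixed rather than iteratively enlarged. The names `in_subset` and `res_all` are chosen for illustration.

```r
data(iris)
f <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width

# Step 1: design matrix without the intercept column
X <- model.matrix(f, data = iris)
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

# Stand-in for wBACON(): classical Mahalanobis distances and a
# chi-square cutoff define the (hypothetical) outlier-free subset
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
in_subset <- d2 < qchisq(0.95, df = ncol(X))

# Step 2: least squares using only the observations in the subset
fit <- lm(f, data = iris[in_subset, ])

# Residuals for all observations, not only those in the subset
res_all <- iris$Sepal.Length - predict(fit, newdata = iris)
summary(res_all)
```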
Models for wBACON_reg
are specified symbolically. A typical model
has the form response ~ terms
, where response
is the
(numeric) response vector and terms
is a series of terms
which specifies a linear predictor for response.
A formula
has an implied intercept term. To remove this use
either y ~ x - 1
or y ~ 0 + x
. See formula
or lm
for more details.
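The effect of the implied intercept on the design matrix can be checked with `model.matrix` in base R. The small data frame `d` is made up for illustration.

```r
d <- data.frame(y = c(1.2, 2.3, 2.9, 4.1), x = c(1, 2, 3, 4))

with_int <- colnames(model.matrix(y ~ x, data = d))      # "(Intercept)" "x"
minus_one <- colnames(model.matrix(y ~ x - 1, data = d)) # "x"
zero_plus <- colnames(model.matrix(y ~ 0 + x, data = d)) # "x"
```

Both `- 1` and `0 +` yield the same design matrix, so the choice between them is purely a matter of style.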
The weights
argument can be used to specify sampling weights or
case weights.
It is not possible to fit multiple response variables (on the l.h.s. of the formula, i.e., multivariate models) in one call.
The method cannot deal with missing values. If the argument
na.rm
is set to TRUE
the method behaves like
na.omit
.
The algorithm assumes that the non-outlying data follow a linear (homoscedastic) regression model and that the independent variables have (roughly) an elliptically contoured distribution. “Although the algorithms will often do something reasonable even when these assumptions are violated, it is hard to say what the results mean.” (Billor et al., 2000, p. 289)
In line with Billor et al. (2000, p. 290), we use the term outlier “nomination” rather than “detection” to highlight that algorithms should not go beyond nominating observations as potential outliers. It is left to the analyst to finally label outlying observations as such.
The generic functions coef
, fitted
, residuals
,
and vcov
extract the estimated coefficients, fitted values,
residuals, and the covariance matrix of the estimated coefficients.
The function summary
summarizes the estimated model.
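A brief usage sketch of these extractor functions, using only the methods listed in the usage section and the iris model from this manual. The call is guarded so that nothing runs if the wbacon package is not installed.

```r
if (requireNamespace("wbacon", quietly = TRUE)) {
    library(wbacon)
    data(iris)
    m <- wBACON_reg(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                    data = iris)
    beta <- coef(m)       # estimated coefficients (intercept + 3 slopes)
    V <- vcov(m)          # covariance matrix of the estimated coefficients
    print(beta)
    print(sqrt(diag(V)))  # standard errors
    print(head(fitted(m)))
    print(head(residuals(m)))
}
```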
An object of class wbaconlm
with slots
coefficients |
a named vector of coefficients |
residuals |
the residuals (for all observations in the data.frame, not only the ones in the final subset) |
rank |
the numeric rank of the fitted linear model (i.e., number of variables in the design matrix) |
fitted.values |
fitted values |
df.residual |
the residual degrees of freedom (computed for the observations in the final subset) |
call |
the matched call |
terms |
the |
model |
the |
weights |
weights |
qr |
the |
subset |
the subset |
reg |
a list with additional details on |
mv |
a list with details on the results of |
Billor N., Hadi A.S. and Velleman P.F. (2000). BACON: Blocked Adaptive Computationally efficient Outlier Nominators. Computational Statistics and Data Analysis 34, pp. 279–298. doi:10.1016/S0167-9473(99)00101-2
Schoch, T. (2021). wbacon: Weighted BACON algorithms for multivariate outlier nomination (detection) and robust linear regression, Journal of Open Source Software 6 (62), 3238 doi:10.21105/joss.03238
plot
gives diagnostic plots for a
wbaconlm
object.
predict
is used for prediction (incl.
confidence and prediction intervals).
data(iris)
m <- wBACON_reg(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                data = iris)
m
summary(m)