Title: | Detection of Univariate Outliers |
---|---|
Description: | Well known outlier detection techniques in the univariate case. Methods to deal with skewed distribution are included too. The Hidiroglou-Berthelot (1986) method to search for outliers in ratios of historical data is implemented as well. When available, survey weights can be used in outliers detection. |
Authors: | Marcello D'Orazio |
Maintainer: | Marcello D'Orazio <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.4 |
Built: | 2024-12-16 07:03:00 UTC |
Source: | CRAN |
Well known outlier detection techniques in the univariate case. Ratios of two variables are covered too. When available, survey weights can be considered.
The package provides few simple functions implementing well known outlier detection techniques in the univariate case. Methods to deal with skewed distributions are included. The Hidiroglou-Berthelot (1986) method to search for outliers in ratios of historical data is implemented as well. When available, the survey weights can be used in outliers' detection.
Author and Maintainer: Marcello D'Orazio [email protected]
Hidiroglou, M.A. and Berthelot, J.-M. (1986) ‘Statistical editing and Imputation for Periodic Business Surveys’. Survey Methodology, Vol 12, pp. 73-83.
McGill, R., Tukey, J. W. and Larsen, W. A. (1978) ‘Variations of box plots’. The American Statistician, 32, pp. 12-16.
Rousseeuw, P.J. and Croux, C. (1993) ‘Alternatives to the Median Absolute Deviation’, Journal of the American Statistical Association 88, pp. 1273-1283.
Hubert, M., and Vandervieren, E. (2008) ‘An Adjusted Boxplot for Skewed Distributions’, Computational Statistics & Data Analysis, 52, pp. 5186-5201
Identifies univariate outliers by using methods based on BoxPlots
boxB(x, k=1.5, method='asymmetric', weights=NULL, id=NULL, exclude=NA, logt=FALSE)
boxB(x, k=1.5, method='asymmetric', weights=NULL, id=NULL, exclude=NA, logt=FALSE)
x |
Numeric vector that will be searched for outliers. |
k |
Nonnegative constant that determines the extension of the 'whiskers'. Commonly used values are 1.5 (default), 2, or 3.
Note that when |
method |
Character, identifies the method to be used: |
weights |
Optional numeric vector with units' weights associated to the observations in |
id |
Optional vector with identifiers of units in |
exclude |
Values of |
logt |
Logical, if |
When method="resistant"
the outlying observations are those outside the interval:
where and
are respectively the 1st and the 3rd quartile of
x
, while is the Inter-Quartile Range. The value
(said 'inner fences') is commonly used when drawing a boxplot. Values
and
provide middle and outer fences, respectively.
When method="asymmetric"
the outlying observations are those outside the interval:
being the median; such a modification allows to account for slight skewness of the distribution.
Finally, when method="adjbox"
the outlying observations are identified using the method proposed by Hubert and Vandervieren (2008) and based on the Medcouple measure of skewness; in practice the bounds are:
Where M is the medcouple; when (positive skewness) then
and
; on the contrary
and
for negative skewness (
). This adjustment of the boxplot, according to Hubert and Vandervieren (2008), works with moderate skewness (
). The bounds of the adjusted boxplot are derived by applying the function
adjboxStats
in the package robustbase.
When weights are available (passed via the argument weights
) then they are used in the computation of the quartiles. In particular, the quartiles are derived using the function wtd.quantile
in the package Hmisc.
Remember that when asking a log transformation (argument logt=TRUE
) all the estimates (quartiles, etc.) will refer to .
The output is a list containing the following components:
quartiles |
The quartiles of |
fences |
The bounds of the interval, values outside the interval are detected as outliers. |
excluded |
The identifiers or positions (when |
outliers |
The identifiers or positions (when |
lowOutl |
The identifiers or positions (when |
upOutl |
The identifiers or positions (when |
Marcello D'Orazio [email protected]
McGill, R., Tukey, J. W. and Larsen, W. A. (1978) ‘Variations of box plots’. The American Statistician, 32, pp. 12-16.
Hubert, M., and Vandervieren, E. (2008) ‘An Adjusted Boxplot for Skewed Distributions’, Computational Statistics and Data Analysis, 52, pp. 5186-5201.
set.seed(321) x <- rnorm(30, 50, 10) x[10] <- 1 x[20] <- 100 out <- boxB(x = x, k = 1.5, method = 'asymmetric') out$fences out$outliers x[out$outliers] out <- boxB(x = x, k = 1.5, method = 'adjbox') out$fences out$outliers x[out$outliers] x[24] <- NA x.ids <- paste0('obs',1:30) out <- boxB(x = x, k = 1.5, method = 'adjbox', id = x.ids) out$excluded out$fences out$outliers set.seed(111) w <- round(runif(n = 30, min=1, max=10)) out <- boxB(x = x, k = 1.5, method = 'adjbox', id = x.ids, weights = w) out$excluded out$fences out$outliers
set.seed(321) x <- rnorm(30, 50, 10) x[10] <- 1 x[20] <- 100 out <- boxB(x = x, k = 1.5, method = 'asymmetric') out$fences out$outliers x[out$outliers] out <- boxB(x = x, k = 1.5, method = 'adjbox') out$fences out$outliers x[out$outliers] x[24] <- NA x.ids <- paste0('obs',1:30) out <- boxB(x = x, k = 1.5, method = 'adjbox', id = x.ids) out$excluded out$fences out$outliers set.seed(111) w <- round(runif(n = 30, min=1, max=10)) out <- boxB(x = x, k = 1.5, method = 'adjbox', id = x.ids, weights = w) out$excluded out$fences out$outliers
This function implements the method proposed by Hidiroglou and Berthelot (1986) to identify outliers in periodic data, i.e. when the same variable is measured at two time points.
HBmethod(yt1, yt2, U=0.5, A=0.05, C=4, pct=0.25, id=NULL, std.score=FALSE, return.dataframe=FALSE, adjboxE=FALSE)
HBmethod(yt1, yt2, U=0.5, A=0.05, C=4, pct=0.25, id=NULL, std.score=FALSE, return.dataframe=FALSE, adjboxE=FALSE)
yt1 |
Numeric vector providing the values observed at time t1. |
yt2 |
Numeric vector providing the values observed at time t2 (t2 > t1). |
U |
Numeric, parameter needed to determine the ‘importance’ of a ratio. The value should lie in [0, 1] interval; commonly used values are 0.3, 0.4, or 0.5 (default) (see Details for further information). |
A |
Numeric, parameter needed when computing the scale measure used to derive the bounds. Hidiroglou and Berthelot (1986) suggest setting |
C |
Numeric, parameter determining the extension of the interval; greater values will provide larger intervals, i.e. fewer expected outliers. Values commonly used are 4 (default) or 7, but also values close or grater than 40 can be used in some particular cases. Note that two C values can be provided instead of one, the first one will be used to determine the left tail bound, while the second determines the right tail bound; this setting can help in improving outlier detection in skewed distributions (see Details for further information). |
pct |
Numeric, the percentage point of the scores that will be used to calculate the lower and upper bounds. By default, |
id |
Optional numeric or character vector, with identifiers of units. If |
std.score |
Logical, if |
return.dataframe |
Logical, if |
adjboxE |
Logical (default |
The method proposed by Hidiroglou and Berthelot (1986) to identify outliers in periodic data consists in deriving a score variable based on the ratios (
yt2/yt1
) with being
the number of observations after discarding NAs and 0s in both
yt1
and yt2
.
At first the ratios are centered around their median :
Then, in order to account for the magnitude of data, the following score is derived:
Finally, the interval is calculated as:
where
and
being ,
and
the quartiles of the E scores (when
pct = 0.25
, default)).
Recently Hidiroglou and Emond (2018) suggest using percentiles P10 and P90 of the E scores in replacement of respectively Q1 and Q3 to avoid the drawback of many units identified as outliers; this is likely to occur when a large proportion of units (>1/4) has the same ratio. P10 and P90 are achieved by setting pct = 0.10
when running the function.
In practice, all the units with an E score outside the interval are considered as outliers. Notice that when two C
values are provided, then the first is used to derive the left bound while the second determines the right bound.
When std.score=TRUE
a standardized score is derived in the following manner:
The constant g is set equal to qnorm(1-pct)
and makes and
approximately unbiased estimators when the E scores follow the normal distribution.
When adjboxE = TRUE
outliers on the E scores will all be searched using the boxplot adjusted for skewness as implemented in the function boxB
when run with with argument method = "adjbox"
.
A list whose components depend on the return.dataframe
argument. When return.dataframe=FALSE
just the following components are provided:
median.r |
the median of the ratios |
quartiles.E |
Quartiles of the E score |
bounds.E |
Bounds of the interval of the E score, values outside are considered outliers. |
excluded |
The identifiers or positions (when |
outliers |
The identifiers or positions (when |
outliersBB |
The identifiers or positions (when |
When return.dataframe=TRUE
, the first three components remain the same with, in addition, two dataframes:
excluded |
A dataframe with the subset of observations excluded. The data frame has the following columns: 'id' (units' identifiers), 'yt1' columns 'yt2' |
data |
A dataframe with the the not excluded observations and the following columns: ‘id’ (units' identifiers), ‘yt1’, ‘yt2’, ‘ratio’ (= yt1/yt2), ‘sizeU’ (=max(yt1, yt2)^U),‘Escore’ (the E scores, see Details), ‘std.Escore’ (the standardized E scores when |
Marcello D'Orazio [email protected]
Hidiroglou, M.A. and Berthelot, J.-M. (1986) ‘Statistical editing and Imputation for Periodic Business Surveys’. Survey Methodology, Vol 12, pp. 73-83.
Hidiroglou, M.A. and Emond, N. (2018) ‘Modifying the Hidiroglou-Berthelot (HB) method’. Unpublished note, Business Survey Methods Division, Statistics Canada, May 18 2018.
set.seed(222) x0 <- rnorm(30, 50, 5) x0[1] <- NA set.seed(333) rr <- runif(30, 0.9, 1.2) rr[10] <- 2 x1 <- x0 * rr x1[20] <- 0 out <- HBmethod(yt1 = x0, yt2 = x1) out$excluded out$median.r out$bounds.E out$outliers cbind(x0[out$outliers], x1[out$outliers]) out <- HBmethod(yt1 = x0, yt2 = x1, return.dataframe = TRUE) out$excluded head(out$data)
set.seed(222) x0 <- rnorm(30, 50, 5) x0[1] <- NA set.seed(333) rr <- runif(30, 0.9, 1.2) rr[10] <- 2 x1 <- x0 * rr x1[20] <- 0 out <- HBmethod(yt1 = x0, yt2 = x1) out$excluded out$median.r out$bounds.E out$outliers cbind(x0[out$outliers], x1[out$outliers]) out <- HBmethod(yt1 = x0, yt2 = x1, return.dataframe = TRUE) out$excluded head(out$data)
This function identifies outliers in the tails of a distribution by detecting the observations outside the bounds built using a robust estimate of both location and scale parameters.
LocScaleB(x, k=3, method='MAD', weights=NULL, id=NULL, exclude=NA, logt=FALSE, return.dataframe=FALSE)
LocScaleB(x, k=3, method='MAD', weights=NULL, id=NULL, exclude=NA, logt=FALSE, return.dataframe=FALSE)
x |
Numeric vector that will be searched for outliers. |
k |
Nonnegative constant that determines the extension of bounds. Commonly used values are 2, 2.5 and 3 (default). |
method |
character identifying how to estimate the scale of the distribution. Available choices are:
When When Finally, when |
weights |
Optional numeric vector that provides weights associated to observations. Only nonnegative weights are allowed. Note that weights can only be used when |
id |
Optional numeric or character vector, with identifiers of units in |
exclude |
Values of |
logt |
Logical, if |
return.dataframe |
Logical, if |
The intervals are derived by considering the median as a robust location estimate while different robust scale estimators are considered:
where and
are robust scale estimates.
With most of the methods
with exception of
method='dQ'
and method='dD'
where respectively:
and
Note that when method='dQ'
or method='dD'
the function calculates and prints a the Bowley's coefficient of skewness, that uses Q1, Q2 and Q3 (they are replaced by respectively P10, P50 and P90 when method='dD'
).
With method='AdjOut'
the following estimates are considered:
being and
derived starting from the fences of the adjusted boxplot (Hubert and Vandervieren, 2008; see
adjboxStats
). In addition the medcouple (mc
) measure of skewness is calculated and printed on the screen.
When weights are available (passed via the argument weights
) then they are used in the computation of the quartiles. In particular, the quartiles are derived using the function wtd.quantile
in the package Hmisc. Note that their use is allowed just with method='IQR'
, method='IDR'
, method='dQ'
, method='dD'
or method='AdjOut'
.
The ‘score’ variable reported in the the data
dataframe when return.dataframe=TRUE
is the standardized score derived as (x - Median)/scale.
A list whose components depend on the return.dataframe
argument. When return.dataframe = FALSE
just the following components are provided:
pars |
Vector with estimated median and scale parameters |
bounds |
The bounds of the interval, values outside the interval are considered outliers. |
excluded |
The position or identifiers of |
outliers |
The position or identifiers of |
lowOutl |
The identifiers or positions (when |
upOutl |
The identifiers or positions (when |
When return.dataframe=TRUE
the latter two components are substituted with two dataframes:
excluded |
A dataframe with the subset of observations excluded. |
data |
A dataframe with the the not excluded observations and the following columns: ‘id’ (units' identifiers), ‘x’, ‘log.x’ (only if |
Marcello D'Orazio [email protected]
Hubert, M. and Van der Veeken, S. (2008) ‘Outlier Detection for Skewed Data’. Journal of Chemometrics, 22, pp. 235-246.
Maronna, R.A. and Zamar, R.H. (2002) ‘Robust estimates of location and dispersion of high-dimensional datasets’ Technometrics, 44, pp. 307-317.
Rousseeuw, P.J. and Croux, C. (1993) ‘Alternatives to the Median Absolute Deviation’, Journal of the American Statistical Association 88, pp. 1273-1283.
Vanderviere, E. and Huber, M. (2008) ‘An Adjusted Boxplot for Skewed Distributions’, Computational Statistics & Data Analysis, 52, pp. 5186-5201
mad
, scaleTau2
, Qn
, Sn
, GiniMd
set.seed(333) x <- rnorm(30, 50, 1) x[10] <- 1 x[20] <- 100 out <- LocScaleB(x = x, k = 3, method='MAD') out$pars out$bounds out$outliers x[out$outliers] out <- LocScaleB(x = x, k = 3, method='MAD', return.dataframe = TRUE) head(out$data) out <- LocScaleB(x = x, k = 3, method='AdjOut') out$outliers
set.seed(333) x <- rnorm(30, 50, 1) x[10] <- 1 x[20] <- 100 out <- LocScaleB(x = x, k = 3, method='MAD') out$pars out$bounds out$outliers x[out$outliers] out <- LocScaleB(x = x, k = 3, method='MAD', return.dataframe = TRUE) head(out$data) out <- LocScaleB(x = x, k = 3, method='AdjOut') out$outliers
The function gets the output of the function of HBmethod
or ratioSize
when they are ran with the argument return.dataframe = TRUE
) to draw a scatter-plot of ratios vs. the corresponding importance measures.
plot4ratios(out)
plot4ratios(out)
out |
Is the output of |
This function draws a scatter-plot. With the output of HBmethod
the ratios (=yt2/yt1) are on Y axis while their importance measure ( max(yt1, yt2)^U) are represented on the X axis. With the output of ratioSize
on the Y axis the centered ratios are reported. In addition the acceptance bounds are drawn (blue lines); the dots (in red color) outside the bounds are the outliers. This is considered a useful diagnostic plot to understand how the procedure identifies the outliers.
A scatter-plot is drawn and, in addition, the output includes a data.frame with the data used to derive the plot.
Marcello D'Orazio [email protected]
Hidiroglou, M.A. and Berthelot, J.-M. (1986) ‘Statistical editing and Imputation for Periodic Business Surveys’. Survey Methodology, Vol 12, pp. 73-83.
Hidiroglou, M.A. and Emond, N. (2018) ‘Modifying the Hidiroglou-Berthelot (HB) method’. Unpublished note, Business Survey Methods Division, Statistics Canada, May 18 2018.
# generate some data set.seed(222) x0 <- rnorm(30, 50, 5) set.seed(333) rr <- runif(30, 0.9, 1.2) rr[10] <- 2 x1 <- x0 * rr # run HBmethod with argument return.dataframe = TRUE out <- HBmethod(yt1 = x0, yt2 = x1, return.dataframe = TRUE) # draw the scatterplot plot4ratios(out)
# generate some data set.seed(222) x0 <- rnorm(30, 50, 5) set.seed(333) rr <- runif(30, 0.9, 1.2) rr[10] <- 2 x1 <- x0 * rr # run HBmethod with argument return.dataframe = TRUE out <- HBmethod(yt1 = x0, yt2 = x1, return.dataframe = TRUE) # draw the scatterplot plot4ratios(out)
Identifies outliers on transformed ratios (centering with respect to their median) using the adjusted boxplot for skewed distributions. Outliers can be sorted/filtered according to a size measure.
ratioSize(numerator, denominator, id=NULL, size=NULL, U=1, size.th=NULL, return.dataframe=FALSE)
ratioSize(numerator, denominator, id=NULL, size=NULL, U=1, size.th=NULL, return.dataframe=FALSE)
numerator |
Numeric vector with the values that go at numerator of the ratio |
denominator |
Numeric vector with the values that go at denominator of the ratio |
id |
Optional numeric or character vector, with identifiers of units. If |
size |
Optional numeric vector providing a measure of the importance of a ratio. If |
U |
Numeric, constant with |
size.th |
Numeric, size threshold. Can be specified when a size measure is used. In such a case just outliers with a size greater than the threshold will be returned. Note that when argument |
return.dataframe |
Logical, if |
This function searches for outliers starting from ratios r=numerator/denominator
. At first the ratios are centered around their median, as in Hidiroglou Berthelot (1986) procedure (see HBmethod
), then the outlier identification is based on the adjusted boxplot for skewed distribution (Hubert and Vandervieren 2008) (see adjboxStats
).
The subset of outliers is sorted in decreasing order according the size measure. If a size threshold is provided then just outliers with (size^U) > (size.th^U) will be returned.
A list whose components depend on the return.dataframe
argument. When return.dataframe = FALSE
just the following components are returned:
median.r |
the median of the ratios |
bounds |
The bounds of the interval for centered ratios |
excluded |
The position or the identifiers of the units with values excluded by the computations because of 0s or NAs. |
outliers |
The position or the identifiers of the units detected as outliers. Remember that when |
When return.dataframe=TRUE
the latter two components are substituted with two dataframes:
excluded |
A dataframe with the subset of observations excluded |
data |
A dataframe with the not excluded observations with the following columns: ‘id’ (units' identifiers), ‘numerator’, ‘denominator’, ‘ratio’ (= numerator/denominator), ‘c.ratio’ (centered ratios, see Details), ‘sizeU’ (size^U values) and finally ‘outliers’, where value 1 indicates observations detected as an outlier and 0 otherwise. The data frame will be sorted in decreasing manner according to size^U. Note that when a size threshold is provided then ONLY outliers with (size^U) > (size.th^U) will be returned. |
Marcello D'Orazio [email protected]
Hidiroglou, M.A. and Berthelot, J.-M. (1986) ‘Statistical editing and Imputation for Periodic Business Surveys’. Survey Methodology, Vol 12, pp. 73-83.
Hubert, M., and Vandervieren, E. (2008) ‘An Adjusted Boxplot for Skewed Distributions’, Computational Statistics and Data Analysis, 52, pp. 5186-5201.
HBmethod
, plot4ratios
, boxB
,adjboxStats
set.seed(444) x1 <- rnorm(30, 50, 5) set.seed(555) rr <- runif(30, 0.9, 1.2) rr[10] <- 2 x2 <- x1 * rr out <- ratioSize(numerator = x2, denominator = x1) out out <- ratioSize(numerator = x2, denominator = x1, return.dataframe = TRUE) head(out$data) out <- ratioSize(numerator = x2, denominator = x1, size.th = 65, return.dataframe = TRUE) head(out$data)
set.seed(444) x1 <- rnorm(30, 50, 5) set.seed(555) rr <- runif(30, 0.9, 1.2) rr[10] <- 2 x2 <- x1 * rr out <- ratioSize(numerator = x2, denominator = x1) out out <- ratioSize(numerator = x2, denominator = x1, return.dataframe = TRUE) head(out$data) out <- ratioSize(numerator = x2, denominator = x1, size.th = 65, return.dataframe = TRUE) head(out$data)
The function calculates some skewness measures for the input vector data.
skew.misc(x, weights=NULL)
skew.misc(x, weights=NULL)
x |
Input vector containing data for which skewness will be calculated. |
weights |
Optional vector with eventual non-negative weights associated to the units in |
This function calculates Pearson's skewness coefficient, the MedCouple measure of skewness and the non-parametric Bowley's measure of symmetry. The Bowley's skewness measure uses quartiles:
It ranges between -1 and +1, where positive (negative) values denote right (left) skewness. A value equal to 0 indicates symmetry. A crude measure of skewness can be obtained with a monotonic increasing function of b:
It ranges from 0 to Inf, g=1 indicates symmetry.
A measure of skewness similar to the Bowley's one is achieved by replacing Q3 and Q1 with respectively P90 and P10 percentiles:
Similarly
For major details see Kotz at al. (2006, vol. 12, pp. 7771-7772).
The medCouple measure of skewness, M, ranges from -1 to +1 and is equal to 0 in case of symmetry, while indicates positive skewness. For major details see
mc
.
Note that eventual weights, passed through the argument weights
, are used ONLY in the calculation of the Bowley's type measures.
A vector with the estimated measures of skewness.
Marcello D'Orazio [email protected]
Kotz S. et al. (2006) Encyclopedia of Statistical Sciences, Volume 12. John Wiley and Sons.
set.seed(112233) y <- rnorm(n = 30, mean = 50, sd = 10) y[20] <- 100 skew.misc(x = y, weights=NULL) # use weights ww <- runif(n = 30, min = 1, max = 10) skew.misc(x = y, weights=ww)
set.seed(112233) y <- rnorm(n = 30, mean = 50, sd = 10) y[20] <- 100 skew.misc(x = y, weights=NULL) # use weights ww <- runif(n = 30, min = 1, max = 10) skew.misc(x = y, weights=ww)