Title: | Leland Wilkinson's Algorithm for Detecting Multidimensional Outliers |
---|---|
Description: | An implementation of an algorithm for outlier detection that can handle a) data with a mix of categorical and continuous variables, b) many columns of data, c) many rows of data, d) outliers that mask other outliers, and e) both unidimensional and multidimensional datasets. Unlike ad hoc methods found in many machine learning papers, HDoutliers is based on a distributional model that uses probabilities to determine outliers. |
Authors: | Chris Fraley [aut, cre], Leland Wilkinson [ctb] |
Maintainer: | Chris Fraley <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.4 |
Built: | 2024-11-11 07:21:10 UTC |
Source: | CRAN |
Transforms the data according to the specifications in Wilkinson's hdoutliers algorithm.
dataTrans(data)
data |
A vector, matrix, or data frame consisting of numeric and/or categorical variables. |
Replaces each categorical variable with a numeric variable corresponding
to its first component in multiple correspondence analysis, then maps the
data to the unit square. There is no provision for handling missing data.
Functions HDoutliers and getHDoutliers apply this transformation to their input data.
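The rescaling step can be illustrated in base R. This is a minimal sketch, assuming that mapping to the unit square means min-max scaling of each numeric column; the helper name unitize is hypothetical and is not part of the package's interface. The multiple correspondence analysis step for categorical variables is not shown.

```r
# Hypothetical helper (an assumption, not the package's code):
# min-max scale one numeric column onto [0, 1].
unitize <- function(z) {
  rng <- range(z)
  span <- diff(rng)
  # A constant column has no spread; map it to 0 to avoid dividing by zero.
  if (span == 0) return(rep(0, length(z)))
  (z - rng[1]) / span
}

# Apply the scaling column by column to a small numeric matrix.
x <- cbind(a = c(2, 4, 6, 10), b = c(-1, 0, 3, 1))
x01 <- apply(x, 2, unitize)
```

After this step every column of x01 lies in [0, 1], so distances between rows are not dominated by whichever variable has the largest raw scale.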
The transformed data, according to Wilkinson's specifications for the hdoutliers algorithm.
Wilkinson, L. (2016). Visualizing Outliers.
require(FactoMineR)
data(tea)
head(tea)
dataTrans(tea[,-1])
A matrix whose columns are the Z and W dots datasets from Wilkinson (2016).
data(dots)
Wilkinson, L. (2016). Visualizing Outliers. <https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf>.
A dataset with 510 rows and 2 columns, consisting of 500 normally distributed samples and 10 uniformly distributed outliers.
data(ex2D)
Implements the first stage of the hdoutliers algorithm, in which the data is partitioned according to exemplars and their associated lists of members.
getHDmembers(data, maxrows = 10000, radius = NULL)
data |
A vector, matrix, or data frame consisting of numeric and/or categorical variables. |
maxrows |
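If the number of observations is greater than maxrows, the data is partitioned into exemplars and their associated members before outlier detection. |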
radius |
Threshold for determining membership in the exemplars' lists
(used only when the number of observations is greater than maxrows). |
If the number of observations exceeds maxrows, the data is partitioned into lists corresponding to exemplars and their members within radius of each exemplar, to reduce the number of nearest-neighbor computations required for outlier detection.
When there are fewer observations, the result is a list whose elements are
the individual observations (each observation is an exemplar, with no
other members).
A list in which each component is a vector of observation indexes. The first index in each list is the index of the exemplar defining that list, and any remaining indexes are the associated members, within radius of the exemplar.
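The structure of the returned list can be illustrated with a single-pass "leader" style partition in base R. This is a simplified sketch under stated assumptions (Euclidean distance, a strict less-than comparison against radius, observations visited in row order); leader_members is a hypothetical helper and is not the package's implementation.

```r
# Hypothetical helper (an assumption, not the package's getHDmembers):
# each observation joins the first exemplar within 'radius',
# otherwise it becomes a new exemplar.
leader_members <- function(x, radius) {
  x <- as.matrix(x)
  members <- list()  # each element: c(exemplar index, member indexes ...)
  for (i in seq_len(nrow(x))) {
    assigned <- FALSE
    for (j in seq_along(members)) {
      ex <- members[[j]][1]  # the first index is the exemplar
      if (sqrt(sum((x[i, ] - x[ex, ])^2)) < radius) {
        members[[j]] <- c(members[[j]], i)
        assigned <- TRUE
        break
      }
    }
    # No exemplar is close enough: observation i starts its own list.
    if (!assigned) members[[length(members) + 1]] <- i
  }
  members
}

x <- rbind(c(0, 0), c(0.05, 0), c(1, 1), c(0.95, 1), c(0.5, 0.5))
mem <- leader_members(x, radius = 0.1)
```

Here rows 1 and 3 become exemplars with one member each, and row 5 is an exemplar with no members, which matches the list layout described above.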
Wilkinson, L. (2016). Visualizing Outliers. <https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf>.
data(dots)
mem.W <- getHDmembers(dots$W)
out.W <- getHDoutliers(dots$W, mem.W)
data(ex2D)
mem.ex2D <- getHDmembers(ex2D)
out.ex2D <- getHDoutliers(ex2D, mem.ex2D)
## Not run:
n <- 100000 # number of observations
set.seed(3)
x <- matrix(rnorm(2*n), n, 2)
nout <- 10 # number of outliers
x[sample(1:n, size=nout), ] <- 10*runif(2*nout, min=-1, max=1)
mem.x <- getHDmembers(x)
out.x <- getHDoutliers(x, mem.x)
## End(Not run)
Detects outliers based on a probability model.
getHDoutliers(data, memberLists, alpha = 0.05, transform = TRUE)
data |
A vector, matrix, or data frame consisting of numeric and/or categorical variables. |
memberLists |
A list following the structure of the output of getHDmembers. |
alpha |
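Threshold for determining the cutoff for outliers.
Observations are considered outliers if they fall in the tail of the
distribution fitted to the nearest-neighbor distances between exemplars, with the cutoff set by alpha. |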
transform |
A logical variable indicating whether or not the data needs to be
transformed to conform to Wilkinson's specifications before outlier
detection. The default is to transform the data using function
dataTrans. |
An exponential distribution is fitted to the upper tail of the nearest-neighbor distances between exemplars (the observations considered representatives of each component of memberLists). Observations are considered outliers if they fall in the tail of the fitted CDF.
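The idea of an exponential tail cutoff can be illustrated in base R. This is a deliberately simplified stand-in and not Wilkinson's exact procedure or the package's code: it assumes the upper tail starts at the median distance, fits the exponential rate by maximum likelihood to the exceedances, and flags distances whose fitted tail probability is below alpha. The helper name exp_tail_outliers is hypothetical.

```r
# Hypothetical helper (a simplified illustration, not the package's method):
# flag distances that are improbably large under an exponential
# distribution fitted to the upper half of the distances.
exp_tail_outliers <- function(d, alpha = 0.05) {
  u <- median(d)              # assumed start of the upper tail
  excess <- d[d > u] - u      # exceedances over that threshold
  rate <- 1 / mean(excess)    # maximum-likelihood exponential rate
  # Fitted probability of an exceedance at least as large as the observed one.
  p_tail <- pexp(pmax(d - u, 0), rate = rate, lower.tail = FALSE)
  which(d > u & p_tail < alpha)
}

# Ten small nearest-neighbor distances and one very large one.
d <- c(seq(0.01, 0.10, by = 0.01), 2)
flagged <- exp_tail_outliers(d, alpha = 0.05)
```

In this toy input only the last distance is flagged. Because the large value is included in the fit, this naive version is more vulnerable to masking than a procedure designed around gaps in the sorted distances.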
The indexes of the observations determined to be outliers.
Wilkinson, L. (2016). Visualizing Outliers. <https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf>.
A call to getHDoutliers in which memberLists results from a call to getHDmembers is equivalent to calling HDoutliers.
HDoutliers, getHDmembers, dataTrans
data(dots)
mem.W <- getHDmembers(dots$W)
out.W <- getHDoutliers(dots$W, mem.W)
## Not run:
plotHDoutliers(dots$W, out.W)
## End(Not run)
data(ex2D)
mem.ex2D <- getHDmembers(ex2D)
out.ex2D <- getHDoutliers(ex2D, mem.ex2D)
## Not run:
plotHDoutliers(ex2D, out.ex2D)
## End(Not run)
## Not run:
n <- 100000 # number of observations
set.seed(3)
x <- matrix(rnorm(2*n), n, 2)
nout <- 10 # number of outliers
x[sample(1:n, size=nout), ] <- 10*runif(2*nout, min=-1, max=1)
mem.x <- getHDmembers(x)
out.x <- getHDoutliers(x, mem.x)
## End(Not run)
Detects outliers based on a probability model.
HDoutliers(data, maxrows=10000, radius=NULL, alpha=0.05, transform=TRUE)
data |
A vector, matrix, or data frame consisting of numeric and/or categorical variables. |
maxrows |
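If the number of observations is greater than maxrows, the data is partitioned into exemplars and their associated members before outlier detection. |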
radius |
Threshold for determining membership in the exemplars' lists
(used only when the number of observations is greater than maxrows). |
alpha |
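Threshold for determining the cutoff for outliers.
Observations are considered outliers if they fall in the tail of the
distribution fitted to the nearest-neighbor distances between exemplars, with the cutoff set by alpha. |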
transform |
A logical variable indicating whether or not the data needs to be
transformed to conform to Wilkinson's specifications before outlier
detection. The default is to transform the data using function
dataTrans. |
Wilkinson replaces categorical variables with the leading component from correspondence analysis, and maps the data to the unit square. This is done as a preprocessing step if transform = TRUE (the default).
If the number of observations exceeds maxrows, the data is first partitioned into lists associated with exemplars and their members within radius of each exemplar, to reduce the number of nearest-neighbor computations required for outlier detection.
An exponential distribution is then fitted to the upper tail of the nearest-neighbor distances between exemplars. Observations are considered outliers if they fall in the tail of the fitted CDF.
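The nearest-neighbor distances referred to above can be computed for a small set of points with base R. This is a brute-force sketch, assuming Euclidean distance; nn_dist is a hypothetical helper, and the package may compute these distances differently.

```r
# Hypothetical helper (an assumption, not the package's code):
# distance from each row to its nearest other row.
nn_dist <- function(x) {
  D <- as.matrix(dist(x))  # full pairwise Euclidean distance matrix
  diag(D) <- Inf           # ignore each point's zero distance to itself
  apply(D, 1, min)
}

exemplars <- rbind(c(0, 0), c(0, 0.1), c(0.2, 0), c(1, 1))
d <- nn_dist(exemplars)
```

The full distance matrix grows with the square of the number of rows, which shows why restricting the computation to exemplars matters for large datasets.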
The indexes of the observations determined to be outliers.
Wilkinson, L. (2016). Visualizing Outliers.
getHDmembers, getHDoutliers, dataTrans
data(dots)
out.W <- HDoutliers(dots$W)
## Not run:
plotHDoutliers(dots$W, out.W)
## End(Not run)
data(ex2D)
out.ex2D <- HDoutliers(ex2D)
## Not run:
plotHDoutliers(ex2D, out.ex2D)
## End(Not run)
## Not run:
n <- 100000 # number of observations
set.seed(3)
x <- matrix(rnorm(2*n), n, 2)
nout <- 10 # number of outliers
x[sample(1:n, size=nout), ] <- 10*runif(2*nout, min=-1, max=1)
out.x <- HDoutliers(x)
## End(Not run)
Plotting function showing observations determined to be outliers.
plotHDoutliers(data, indexes = NULL, transform = TRUE, ...)
data |
A vector, matrix, or data frame consisting of numeric and/or categorical variables. |
indexes |
The (row) indexes of the outliers in data. |
transform |
A logical variable indicating whether or not the data needs to be
transformed to conform to Wilkinson's specifications before outlier
detection. The default is to transform the data using function
dataTrans. |
... |
Additional plotting arguments. |
Produces a plot of the data (transformed according to Wilkinson's specifications) showing the outliers. If the data has more than two dimensions, it is plotted onto the principal components of the data that remain after removing outliers.
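The projection for data with more than two dimensions can be sketched with base R's prcomp. This is an illustration under stated assumptions, not the package's plotting code: the components are fitted on the non-outlying rows only, every row is then projected onto the first two components, and the outlier index out is supplied by hand here rather than by a detector.

```r
set.seed(1)
x <- matrix(rnorm(200 * 3), ncol = 3)
x[1, ] <- c(8, 8, 8)  # plant one obvious outlier in row 1
out <- 1              # assumed outlier indexes, supplied by hand

# Fit principal components on the rows that remain after removing outliers.
pc <- prcomp(x[-out, , drop = FALSE])

# Project every row, outliers included, onto the first two components,
# using the centering learned from the non-outlying rows.
scores <- scale(x, center = pc$center, scale = FALSE) %*% pc$rotation[, 1:2]
```

The two columns of scores can be passed to plot(), with the rows listed in out drawn in a different color or symbol.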
The indexes of the observations determined to be outliers.
Wilkinson, L. (2016). Visualizing Outliers.
data(dots)
out.W <- HDoutliers(dots$W)
## Not run:
plotHDoutliers(dots$W, out.W)
## End(Not run)
data(ex2D)
out.ex2D <- HDoutliers(ex2D)
## Not run:
plotHDoutliers(ex2D, out.ex2D)
## End(Not run)