Title: | Unsupervised Multivariate Outlier Probabilities for Large Datasets |
---|---|
Description: | Estimates unsupervised outlier probabilities for multivariate numeric data with many observations from a nonparametric outlier statistic. |
Authors: | Chris Fraley [aut, cre] |
Maintainer: | Chris Fraley <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.2 |
Built: | 2024-12-04 07:25:25 UTC |
Source: | CRAN |
Outlier probabilities for all of the data, obtained by assigning to each observation the probability of its associated leader partition.
allProb( leaderInstance, partprob)
leaderInstance |
A single component from a call to |
partprob |
A vector of probabilities for each partition in |
A vector of probabilities for each observation in the data underlying leaderInstance. Each observation inherits the probability of its associated partition.
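The inheritance rule can be illustrated with a minimal base R sketch. The toy `partitions` list and `partprob_toy` vector below are hypothetical stand-ins, not output from the package, and the loop is only an assumption about how such a mapping could be written.

```r
# Hypothetical toy input: three partitions over six observations,
# each partition listing the indexes of its members.
partitions <- list(c(1, 3, 4), c(2, 5), c(6))
partprob_toy <- c(0.10, 0.40, 0.97)   # one probability per partition

# Each observation inherits the probability of its partition.
obsprob <- numeric(sum(lengths(partitions)))
for (k in seq_along(partitions)) {
  obsprob[partitions[[k]]] <- partprob_toy[k]
}
obsprob
```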
set.seed(0)
lead <- leader(faithful)
nlead <- length(lead[[1]]$partitions)

# repeat multiple times to account for randomness
ntimes <- 100
probs <- matrix( NA, nlead, ntimes)
for (i in 1:ntimes) {
  probs[,i] <- partProb( simData(lead[[1]]), method = "distance")
}

# median probability for each partition
partprobs <- apply( probs, 1, median)
quantile(partprobs)

# plot leaders with outlier probability > .95
plot( faithful[,1], faithful[,2], pch = 16, cex = .5,
      main = "red : instances with outlier probability > .95")
allprobs <- allProb( lead[[1]], partprobs)
out <- allprobs > .95
points( faithful[out,1], faithful[out,2], pch = 8, cex = 1, col = "red")
Partitions the data according to Hartigan's leader algorithm, and provides ranges, centroids, and variances for the partitions.
leader(data, radius = NULL, scale = T)
data |
A numeric vector or matrix of observations. If a matrix, rows correspond to observations and columns correspond to variables. |
radius |
A vector of values for the partitioning radius. Wilkinson's default
radius is used if |
scale |
A logical variable indicating whether or not the data should be mapped
to the unit hypercube. The default is to scale the data. Values of the
radius will not be scaled; they should be specified relative to the unit
hypercube unless |
Given a partitioning radius r, the leader algorithm makes one pass through the data, designating an observation as a new leader if it is not within r of an existing leader, and otherwise assigning it to the partition associated with the nearest existing leader. The set of leaders typically depends on the order of the data observations.
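The one-pass rule described above can be sketched in base R as follows. This is an illustrative sketch only, not the package's implementation: the function name `leader_sketch` is hypothetical, Euclidean distance is assumed, and the optional mapping to the unit hypercube is omitted.

```r
# Hypothetical helper: one pass of the leader rule with Euclidean distance.
leader_sketch <- function(x, r) {
  x <- as.matrix(x)
  leaders <- 1L            # the first observation is always a leader
  partitions <- list(1L)   # each partition starts with its leader
  for (i in seq_len(nrow(x))[-1]) {
    # distances from observation i to every existing leader
    d <- sqrt(colSums((t(x[leaders, , drop = FALSE]) - x[i, ])^2))
    k <- which.min(d)
    if (d[k] > r) {
      # not within r of any existing leader: start a new partition
      leaders <- c(leaders, i)
      partitions[[length(partitions) + 1]] <- i
    } else {
      # otherwise join the partition of the nearest leader
      partitions[[k]] <- c(partitions[[k]], i)
    }
  }
  list(leaders = leaders, partitions = partitions)
}

# Two tight groups of points, far apart relative to r = 0.1
x_toy <- matrix(c(0, 0,  0.05, 0,  1, 1,  1.02, 1), ncol = 2, byrow = TRUE)
res <- leader_sketch(x_toy, r = 0.1)
res$leaders
res$partitions
```

Reordering the rows of `x_toy` can change which observations become leaders, which is the order dependence noted above.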
If radius = 0, then all of the data observations are leaders, and only radius and leaders are returned as output components.
This implementation does a completely new nearest-neighbor search for
each observation and for each radius. A more efficient approach would be to
maintain, for each radius, a data structure (such as a kd-tree) allowing
fast nearest-neighbor search. These data structures could then be updated
to account for new observations. Currently, there doesn't seem to be a way
to do this in R.
A list with one component for each value of radius, each having the following sub-components:
radius |
The value of the radius associated with the partitioning. |
partitions |
A list with one component for each partition, giving the indexes (as observations in the data) of the members of the partition. The first index is that of the associated leader (sometimes called exemplar). |
leaders |
The indexes of the leaders for each partition. |
centroids |
The centroids for each partition, as a matrix with rows corresponding to
the partitions and columns corresponding to variables if multidimensional.
These will be the data if |
variances |
The variances for each partition, as a matrix with rows corresponding to the partitions and columns corresponding to variables if multidimensional. |
ranges |
A list with two components: |
maxdist |
A vector with one value for each partition, giving the largest distance from each leader to any member of its partition. |
J. A. Hartigan, Clustering Algorithms, Wiley, 1975.
L. Wilkinson, Visualizing Outliers, Technical Report, University of
Illinois at Chicago, 2016.
https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf
radius.default <- LWradius(nrow(faithful),ncol(faithful))
lead <- leader(faithful, radius = c(0,radius.default))

# number of partitions for each radius
sapply(lead, function(x) length(x$partitions))

# plot the leaders for the non-zero radius
plot( faithful[,1], faithful[,2],
      main = "blue indicates leaders (default radius)",
      pch = 16, cex = .5)
ldrs <- lead[[2]]$leaders
points( faithful[ldrs,1], faithful[ldrs,2],
        pch = 8, col = "dodgerblue", cex = .5)
Computes the log density for observations in a univariate or multivariate Gaussian mixture model with spherical or diagonal (co)variance that varies across components.
logdens( x, simData, shrink = 1)
x |
A numeric vector or matrix for which the log density is to be computed. |
simData |
Observations from a call to |
shrink |
Shrinkage parameter for the mixture model variance. To be
consistent with the shrinkage as described in |
If either radius = 0, or simData returns only centroids (nsim = 0), then no density estimate is attempted.
A vector giving the log density of x in the model as specified by simData, with optional shrinkage applied to the variance.
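For orientation, the log density of a Gaussian mixture with diagonal covariance can be sketched in base R as below. This is not the package's code: `logdens_diag` and its arguments (`centroids`, `variances`, `weights`, with one row per component) are hypothetical, and the shrinkage parameter is left out because its exact form is not given here.

```r
# Hypothetical helper: log density of a diagonal-covariance Gaussian mixture.
logdens_diag <- function(x, centroids, variances, weights) {
  x <- as.matrix(x)
  n <- nrow(x); p <- ncol(x)
  # log(weight_k) + log N(x | centroid_k, diag(variance_k)) for each component k
  comp <- lapply(seq_len(nrow(centroids)), function(k) {
    mu <- matrix(centroids[k, ], n, p, byrow = TRUE)
    s  <- matrix(sqrt(variances[k, ]), n, p, byrow = TRUE)
    ll <- matrix(dnorm(x, mu, s, log = TRUE), n, p)
    log(weights[k]) + rowSums(ll)
  })
  comp <- matrix(unlist(comp), nrow = n)
  # log-sum-exp over components, for numerical stability
  m <- apply(comp, 1, max)
  m + log(rowSums(exp(comp - m)))
}

# Single-component check against a direct calculation
x_toy <- matrix(c(0, 0, 1, 2), 2, 2, byrow = TRUE)
ld <- logdens_diag(x_toy,
                   centroids = matrix(c(0, 0), 1, 2),
                   variances = matrix(c(1, 4), 1, 2),
                   weights = 1)
direct <- dnorm(x_toy[, 1], 0, 1, log = TRUE) + dnorm(x_toy[, 2], 0, 2, log = TRUE)
```

Working on the log scale throughout avoids underflow for observations far from every component.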
G. Celeux and G. Govaert, Gaussian Parsimonious Mixture Models, Pattern Recognition, 1995.
G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley, 2000.
C. Fraley and A. E. Raftery, Model-based clustering, discriminant analysis and density estimation, Journal of the American Statistical Association, 2002.
lead <- leader(faithful)
# simData expects a single component of the leader output
sim <- simData( lead[[1]])
logdens( faithful, sim)
Wilkinson's default leader-partitioning radius.
LWradius( n, p)
n |
The number of observations (rows) in the data. |
p |
The number of variables (columns) in the data; |
Wilkinson's default leader partitioning radius 0.1/(log(n)^(1/p)).
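The formula above is simple enough to evaluate directly in base R. The helper name `lw_radius` is hypothetical and is used here only to show the arithmetic; it is not the package function.

```r
# Hypothetical helper mirroring the stated formula 0.1/(log(n)^(1/p))
lw_radius <- function(n, p) 0.1 / (log(n)^(1 / p))

r1 <- lw_radius(10000, 1)   # univariate data with 10000 observations
r2 <- lw_radius(272, 2)     # dimensions of the faithful data (272 x 2)
c(r1, r2)
```

For a fixed number of observations with log(n) > 1, the radius grows with the number of variables p, because the p-th root pulls log(n) toward 1.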
L. Wilkinson (2016), Visualizing Outliers, Technical Report, University of Illinois at Chicago, https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf.
x1 <- rnorm(10000)
LWradius(length(x1),1)

LWradius(nrow(faithful),ncol(faithful))
Robust nonparametric outlier statistic for univariate or multivariate data.
OutlierStatistic( x, nproj=1000, prior=NULL, seed=NULL)
x |
A numeric vector or matrix for which the outlier statistic is to be determined. |
nproj |
If |
prior |
If |
seed |
An optional integer argument to |
A vector giving the maximum value of the outlier statistic for each observation over all projections.
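As a rough illustration of a projection-based statistic in the spirit of the Stahel and Donoho references, the base R sketch below takes, for each observation, the maximum robust z-score over random unit directions. The function `sd_outlyingness`, the use of the median and MAD, and the way directions are drawn are all assumptions made for illustration; the package's statistic may differ in its details.

```r
# Hypothetical helper: maximum |projection - median| / MAD over random directions.
sd_outlyingness <- function(x, nproj = 1000) {
  x <- as.matrix(x)
  p <- ncol(x)
  # random directions, normalized to unit length (one per column)
  dirs <- matrix(rnorm(p * nproj), p, nproj)
  dirs <- sweep(dirs, 2, sqrt(colSums(dirs^2)), "/")
  proj <- x %*% dirs                      # n x nproj matrix of projections
  med  <- apply(proj, 2, median)
  madv <- apply(proj, 2, mad)
  z <- sweep(abs(sweep(proj, 2, med, "-")), 2, madv, "/")
  apply(z, 1, max)                        # maximum over all projections
}

set.seed(1)
x_toy <- rbind(matrix(rnorm(200), 100, 2), c(20, 20))  # one planted outlier
s <- sd_outlyingness(x_toy, nproj = 200)
which.max(s)
```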
W. A. Stahel, Breakdown of Covariance Estimators, doctoral thesis, Fachgruppe für Statistik, Eidgenössische Technische Hochschule (ETH), 1981.
D. L. Donoho, Breakdown Properties of Multivariate Location Estimators, doctoral thesis, Department of Statistics, Harvard University, 1982.
Note that partition probabilities are computed from an exponential distribution fit to the outlier statistic, rather than from the empirical distribution of the outlier statistic.
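The distinction in this note can be made concrete with a small base R sketch. It assumes the exponential rate is estimated by maximum likelihood (1 / mean), which may not match the package's fitting procedure, and it uses simulated values in `stat_toy` as a stand-in for an outlier statistic.

```r
set.seed(1)
stat_toy <- rexp(500, rate = 2)     # stand-in for an outlier statistic

# Probabilities from a fitted exponential distribution (assumed MLE rate)
rate_hat <- 1 / mean(stat_toy)
prob_exp <- pexp(stat_toy, rate = rate_hat)

# Probabilities from the empirical distribution, for comparison
prob_emp <- ecdf(stat_toy)(stat_toy)

# The two agree closely when the statistic really is exponential
summary(prob_exp - prob_emp)
```

The empirical version always assigns probability 1 to the largest observed value, whereas the fitted version depends on how far that value lies in the tail.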
stat <- OutlierStatistic(faithful)
q.99 <- quantile(stat,.99)
out <- stat > q.99
plot( faithful[,1], faithful[,2],
      main="red : .99 quantile for outlier statistic", cex=.5)
points( faithful[out,1], faithful[out,2],
        pch = 4, col = "red", lwd = 1, cex = .5)

require(mvtnorm)
set.seed(0)
Sigma <- crossprod(matrix(rnorm(2*2),2,2))
x <- rmvt( 10000, sigma = Sigma, df = 2)
stat <- OutlierStatistic(x)
q.95 <- quantile(stat,.95)
hist(x, main = "gray : .95 quantile for outlier statistic", col = "black")
abline( v = x[stat > q.95], col = "gray")
hist(x, col = "black", add = TRUE)
Assigns outlier probabilities to the partitions by fitting an exponential distribution to a nonparametric outlier statistic for simulated data or partition centroids.
partProb( simData, method = c("intrinsic","distance","logdensity","distdens", "density"), shrink = 1, nproj = 1000, seed = NULL)
simData |
Observations from a call to |
method |
One of the following options:
The default is to use the |
shrink |
Shrinkage parameter for outlier detection data. The offsets from
|
nproj |
If the data is multivariate or |
seed |
An optional integer argument to |
"logdensity" is generally preferred over "density", because log-density values that are negative and large in magnitude are not numerically distinguishable once they are converted to density values.
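The numerical point can be checked directly in base R. The two log-density values below are arbitrary illustrative numbers, not output from the package.

```r
# Two clearly different (hypothetical) log-density values
logd <- c(-800, -900)

# Converted to densities, both underflow to zero in double precision
dens <- exp(logd)
dens

logd[1] > logd[2]   # the log scale still separates them
dens[1] > dens[2]   # the density scale does not
```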
A vector of probabilities for each partition, obtained by fitting an exponential distribution to the outlier statistic.
C. Fraley, Estimating Outlier Probabilities for Large Datasets, 2017.
simData, OutlierStatistic, allProb
set.seed(0)
lead <- leader(faithful)
nlead <- length(lead[[1]]$partitions)

# repeat multiple times to account for randomness
ntimes <- 100
probs <- matrix( NA, nlead, ntimes)
for (i in 1:ntimes) {
  probs[,i] <- partProb( simData(lead[[1]]), method = "distance")
}

# median probability for each partition
partprobs <- apply( probs, 1, median)
quantile(probs)

# plot leaders with outlier probability > .95
plot( faithful[,1], faithful[,2], pch = 16, cex = .5,
      main = "red : leaders with outlier probability > .95")
out <- partprobs > .95
l <- lead[[1]]$leaders
points( faithful[l[out],1], faithful[l[out],2], pch = 8, cex = 1, col = "red")
Simulates observations from a mixture model based on information on partitions from the leader function.
simData( leaderInstance, nsim=NULL, model=c("diagonal","spherical"), seed=NULL)
leaderInstance |
A single component from a call to |
nsim |
The number of observations to be simulated. Only the radius and centroids
are returned if |
model |
For multivariate data, a vector of character strings indicating the type of
Gaussian mixture model covariance to be used in generating the simulated
observations (see |
seed |
An optional integer argument to |
The following models are available for multivariate data:
"spherical" | spherical, varying volume |
"diagonal" | diagonal, varying volume and shape |
An ellipsoidal model is also possible, but has not yet been implemented.
If nsim = 0 or leaderInstance$radius == 0, no observations are simulated, and only the radius and partition centroids are returned.
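To make the diagonal model concrete, the base R sketch below draws observations from a Gaussian mixture whose components are centered at partition centroids with per-variable variances. It is a sketch under stated assumptions, not the package's simulation code: `sim_diag` and its arguments (`centroids`, `variances`, `weights`, with one row per partition) are hypothetical names.

```r
# Hypothetical helper: simulate from a diagonal-covariance Gaussian mixture.
sim_diag <- function(centroids, variances, weights, nsim) {
  p <- ncol(centroids)
  # choose a component for each simulated observation
  index <- sample(nrow(centroids), nsim, replace = TRUE, prob = weights)
  # Gaussian offsets scaled by the per-variable standard deviations
  offset <- matrix(rnorm(nsim * p), nsim, p) *
            sqrt(variances[index, , drop = FALSE])
  list(index = index,
       offset = offset,
       x = centroids[index, , drop = FALSE] + offset)
}

set.seed(1)
cen <- matrix(c(0, 0,  5, 5), 2, 2, byrow = TRUE)     # two centroids
vrs <- matrix(c(1, 0.25,  0.5, 2), 2, 2, byrow = TRUE)
sim_toy <- sim_diag(cen, vrs, weights = c(0.7, 0.3), nsim = 1000)
table(sim_toy$index)
```

A spherical model would correspond to using a single variance per component in place of each row of `variances`.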
A list with the following components:
radius |
The value of the radius associated with |
location |
The vector or matrix of centroids of the partitions. If a matrix, rows correspond to the partitions and columns to the variables. |
index |
A vector of integer values giving the index of the partition associated with each simulated observation. |
offset |
A vector of numeric values giving the offsets of the simulated observations from their associated centroids. |
weight |
A vector of numeric values between 0 and 1 giving the proportion of data observations in each partition. |
scale |
The scale (variance) of the mixture components in a univariate or spherical model. Set to 1 for each component in the diagonal model. |
shape |
A matrix giving the variances of the mixture component in a diagonal model. The rows correspond to the dimensions of the data, while the columns correspond to the mixture components (partitions). |
C. Fraley, Estimating Outlier Probabilities for Large Datasets, 2017.
radius.default <- LWradius(nrow(faithful),ncol(faithful))
lead <- leader(faithful, radius = c(0,radius.default))

# (simulated) data for outlier statistic (no simulation for radius = 0)
sim <- lapply( lead, simData)

# components of simData output
lapply( sim, names)