Title: | Depth Measures in Multivariate, Regression and Functional Settings |
---|---|
Description: | Tools to compute depth measures and implementations of related tasks such as outlier detection, data exploration and classification of multivariate, regression and functional data. |
Authors: | Pieter Segaert [aut], Mia Hubert [aut], Peter Rousseeuw [aut], Jakob Raymaekers [aut, cre], Kaveh Vakili [ctb] |
Maintainer: | Jakob Raymaekers <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.17 |
Built: | 2024-11-01 06:46:50 UTC |
Source: | CRAN |
Computes the skew-adjusted outlyingness of -dimensional points
z
relative to a -dimensional dataset
x
. For each multivariate point , its adjusted outlyingness relative to
x
is defined as its maximal univariate adjusted outlyingness measured over all directions. To obtain the univariate adjusted outlyingness in the direction , the dataset
x
is projected on , and the robustly skew-adjusted standardized distance of
to the median of the projected data points
x
is computed.
adjOutl(x, z = NULL, options = list())
adjOutl(x, z = NULL, options = list())
x |
An |
z |
An optional |
options |
A list of available options:
|
The adjusted outlyingness (AO) of multivariate data was introduced in Brys et al. (2005) and studied in more detail in Hubert and Van der Veeken (2008). It extends the Stahel-Donoho outlyingness towards skewed distributions.
Depending on the dimension , different approximate algorithms are implemented. The affine invariant algorithm can only be used when
. It draws
ndir
times at random observations from
x
and considers the direction orthogonal to the hyperplane spanned by these observations. At most
out of
directions can be considered. The orthogonal invariant version can be applied to high-dimensional data. It draws
ndir
times at random observations from
x
and considers the direction through these two observations. Here, at most 2 out of directions can be considered. Finally, the shift invariant version randomly draws
ndir
vectors from the unit sphere.
The resulting AO values are invariant to affine transformations, rotations and shifts respectively provided that the seed
is kept fixed at different runs of the algorithm. Note that the AO values are guaranteed to increase when more directions are considered provided the seed is kept fixed, as this ensures that the random directions are generated in a fixed order.
An observation from x
and z
is flagged as an outlier if its AO exceeds a cutoff value. This cutoff value is determined using the procedure in Rousseeuw et al. (2018). First, the logarithm of the AO values is taken to render their distribution more symmetric, after which a normal approximation yields a cutoff on these values. The cutoff is then transformed back by applying the exponential function.
It is first checked whether the data lie in a subspace of dimension smaller than . If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it. Furthermore, the univariate adjusted outlyingness of the projected points
x
is ill-defined when the scale in its denominator becomes zero. This can happen when many observations collapse. In these cases the algorithm will stop and give a warning. The returned values then include the direction
as well as an indicator specifying which of the observations of
x
belong to the hyperplane orthogonal to .
This function extends the adjOutlyingness
function in the package robustbase
. It has more options for choosing the directions, it allows to compute the adjusted outlyingness of points not belonging to the data matrix x
and it is faster as it is fully implemented in C++. On the other hand, the constants (3 and -4) used in the definition of the adjusted outlyingness can not be modified in this implementation.
A list with components:
outlyingnessX |
Vector of length |
outlyingnessZ |
Vector of length |
cutoff |
Points whose adjusted outlyingness exceeds this cutoff can be considered as outliers with respect to |
flagX |
Observations of |
flagZ |
Points of |
singularSubsets |
When the input parameter type is equal to |
dimension |
When the data |
hyperplane |
When the data |
inSubspace |
When a direction |
P. Segaert using C++
code by K. Vakili, P. Segaert, G. Brys and M. Maechler.
Brys G., Hubert M., Rousseeuw P.J. (2005). A robustification of Independent Component Analysis. Journal of Chemometrics, 19, 364–375.
Hubert M., Van der Veeken S. (2008). Outlier detection for skewed data. Journal of Chemometrics, 22, 235–246.
Hubert M., Vandervieren E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 52, 5186–5201.
Rousseeuw P.J., Raymaekers J., Hubert M., (2018). A measure of directional outlyingness with applications to image data and video. Journal of Computational and Graphical Statistics, 27, 345–359.
sprojdepth
, sprojmedian
, dirOutl
, outlyingness
adjbox
, adjOutlyingness
from package robustbase.
# Compute the adjusted outlyingness of a simple # two-dimensional dataset. Outliers are plotted # in red. data(geological) BivData <- geological[c("MnO","MgO")] Result <- adjOutl(x = BivData) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # The number of directions may be specified through # the option list. The resulting adjusted outlyingness # is monotone increasing in the number of directions. Result1 <- adjOutl(x = BivData, options = list(ndir = 50) ) Result2 <- adjOutl(x = BivData, options = list(ndir = 100) ) which(Result2$outlyingnessX - Result1$outlyingnessX < 0) # This is however not the case when the seed is changed Result1 <- adjOutl(x = BivData, options = list(ndir = 50) ) Result2 <- adjOutl(x = BivData, options = list(ndir = 100, seed = 950) ) plot(Result2$outlyingnessX - Result1$outlyingnessX, xlab = "Index", ylab = "Difference in AO") # We can also consider directions through two data # points. If the sample is small enough one may opt # to search over all choose(n,2) directions. # Note that the computational load increases dramatically # as n becomes larger. data(bloodfat) BivData <- bloodfat[1:100,] # Consider a small toy example. Result <- adjOutl(x = BivData, options = list(type = "Rotation", ndir = "all") ) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # Alternatively one may consider randomly generated directions. data(bloodfat) Result <- adjOutl(x = bloodfat, options = list(type = "Shift", ndir = 1000) ) IndOutliers <- which(!Result$flagX) plot(bloodfat) points(bloodfat[IndOutliers,], col = "red")
# Compute the adjusted outlyingness of a simple # two-dimensional dataset. Outliers are plotted # in red. data(geological) BivData <- geological[c("MnO","MgO")] Result <- adjOutl(x = BivData) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # The number of directions may be specified through # the option list. The resulting adjusted outlyingness # is monotone increasing in the number of directions. Result1 <- adjOutl(x = BivData, options = list(ndir = 50) ) Result2 <- adjOutl(x = BivData, options = list(ndir = 100) ) which(Result2$outlyingnessX - Result1$outlyingnessX < 0) # This is however not the case when the seed is changed Result1 <- adjOutl(x = BivData, options = list(ndir = 50) ) Result2 <- adjOutl(x = BivData, options = list(ndir = 100, seed = 950) ) plot(Result2$outlyingnessX - Result1$outlyingnessX, xlab = "Index", ylab = "Difference in AO") # We can also consider directions through two data # points. If the sample is small enough one may opt # to search over all choose(n,2) directions. # Note that the computational load increases dramatically # as n becomes larger. data(bloodfat) BivData <- bloodfat[1:100,] # Consider a small toy example. Result <- adjOutl(x = BivData, options = list(type = "Rotation", ndir = "all") ) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # Alternatively one may consider randomly generated directions. data(bloodfat) Result <- adjOutl(x = bloodfat, options = list(type = "Shift", ndir = 1000) ) IndOutliers <- which(!Result$flagX) plot(bloodfat) points(bloodfat[IndOutliers,], col = "red")
Computes the bagdistance of -dimensional points
z
relative to a -dimensional dataset
x
. To compute the bagdistance of a point first the bag of
x
is computed as the depth region containing the 50% observations (of x
) with largest halfspace depth. Next, the ray from the halfspace median through
is considered and
is defined as the intersection of this ray and the boundary of the bag. The bagdistance of
to
x
is then given by the ratio between the Euclidean distance of to the halfspace median and the Euclidean distance of
to the halfspace median.
bagdistance(x, z = NULL, options = list())
bagdistance(x, z = NULL, options = list())
x |
An |
z |
An optional |
options |
A list of available options:
|
The bagdistance has been introduced in Hubert et al. (2015) and studied in Hubert et al. (2017). It does not assume symmetry and is affine invariant. Note that when the halfspace is not computed in an affine invariant way, the bagdistance cannot be affine invariant either.
The function first computes the halfspace depth and the halfspace median of x
. Additional options may be passed to the hdepth
routine by specifying them in the option
list argument.
It is first checked whether the data lie in a subspace of dimension smaller than . If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it.
Depending on the dimensions different algorithms are used. For the bagdistance is computed exactly. For
the default setting (
options$approx=TRUE
) uses an approximated algorithm. Exact computation, based on the exact algoritm to compute the contours of the bag (see the depthContour
function), is obtained by setting options$approx
to FALSE. Note that this may lead to an increase in computation time.
For the approximated algorithm, the intersection point is approximated by searching on each ray the point whose depth is equal to the median of the depth values of
x
. As the halfspace depth is monotone decreasing along the ray, a bisection algorithm is used. Starting limits are obtained by projecting the data on the direction and considering the data point with univariate depth corresponding to the median of the halfspace depths of x
. By definition the multivariate depth of this point has to be lower or equal than its univariate depth. A second limit is obtained by considering the deepest location estimate. The maximum number of iterations bisecting the current search interval can be specified through the options argument max.iter
.
An observation from z
is flagged as an outlier if its bagdistance exceeds a cutoff value. This cutoff is equal to the squareroot of the 0.99 quantile of the chi-squared distribution with degrees of freedom.
A list with components:
bagdistance |
The bagdistance of the points of |
cutoff |
Points of |
flag |
Points of |
converged |
Vector of length |
dimension |
When the data |
hyperplane |
When the data |
P. Segaert.
Hubert M., Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection. Statistical Methods & Applications, 24, 177–202.
Hubert M., Rousseeuw P.J., Segaert P. (2017). Multivariate and functional classification using depth and distance. Advances in Data Analysis and Classification, 11, 445–466.
# Generate some bivariate data set.seed(5) nObs <- 500 XS <- matrix(rnorm(nObs * 2), nrow = nObs, ncol = 2) A <- matrix(c(1,1,.5,.1), ncol = 2, nrow = 2) X <- XS %*% A # In two dimensions we may either use the approximate # or the exact algorithm to compute the bag. respons.exact <- bagdistance(x = X, options = list(approx = FALSE)) respons.approx <- bagdistance(x = X, options = list(approx = TRUE)) # Both algorithms yield fairly similar results. plot(respons.exact$bagdistance, respons.approx$bagdistance) abline(a = 0, b = 1) # In Hubert et al. (2015) it was shown that for elliptical # distributions the squared bagdistance relates to the # squared Mahalanobis distances. This may be easily illustrated. mahDist <- mahalanobis(x = X, colMeans(X), cov(X)) plot(respons.exact$bagdistance^2, mahDist) # Computation of the bagdistance relies on the computation # of halfspace depth using the hdepth function. Options for # the hdepth routine can be passed down using the options # arguments. Note that the bagdistance is only affine invariant # if the halfspace depth is computed in an affine invariant way. options <-list(type = "Rotation", ndir = 375, approx = TRUE, seed = 78341) respons.approx.rot <- bagdistance(x = X, options = options) plot(respons.exact$bagdistance, respons.approx.rot$bagdistance) abline(a = 0, b = 1)
# Generate some bivariate data set.seed(5) nObs <- 500 XS <- matrix(rnorm(nObs * 2), nrow = nObs, ncol = 2) A <- matrix(c(1,1,.5,.1), ncol = 2, nrow = 2) X <- XS %*% A # In two dimensions we may either use the approximate # or the exact algorithm to compute the bag. respons.exact <- bagdistance(x = X, options = list(approx = FALSE)) respons.approx <- bagdistance(x = X, options = list(approx = TRUE)) # Both algorithms yield fairly similar results. plot(respons.exact$bagdistance, respons.approx$bagdistance) abline(a = 0, b = 1) # In Hubert et al. (2015) it was shown that for elliptical # distributions the squared bagdistance relates to the # squared Mahalanobis distances. This may be easily illustrated. mahDist <- mahalanobis(x = X, colMeans(X), cov(X)) plot(respons.exact$bagdistance^2, mahDist) # Computation of the bagdistance relies on the computation # of halfspace depth using the hdepth function. Options for # the hdepth routine can be passed down using the options # arguments. Note that the bagdistance is only affine invariant # if the halfspace depth is computed in an affine invariant way. options <-list(type = "Rotation", ndir = 375, approx = TRUE, seed = 78341) respons.approx.rot <- bagdistance(x = X, options = options) plot(respons.exact$bagdistance, respons.approx.rot$bagdistance) abline(a = 0, b = 1)
This function draws a bagplot of bivariate data, based on the result of a call to compBagplot
. The bagplot is a generalisation of the univariate boxplot to bivariate data. It aims to visualize the location, spread, skewness and outliers of the data set.
bagplot(compbag.result, colorbag = NULL, colorloop = NULL, colorchull = NULL, databag = TRUE, dataloop = TRUE, plot.fence = FALSE)
bagplot(compbag.result, colorbag = NULL, colorloop = NULL, colorchull = NULL, databag = TRUE, dataloop = TRUE, plot.fence = FALSE)
compbag.result |
The return of a call to |
colorbag |
The color of the bag (which contains the 50% observations with largest depth). |
colorloop |
The color of the loop (which contains the regular observations). |
colorchull |
When the bagplot is based on halfspace depth, the depth region with maximal depth is plotted. This argument controls its color. |
databag |
Logical indicating whether data points inside the bag need to be plotted. |
dataloop |
Logical indicating whether data points inside the fence need to be plotted. |
plot.fence |
Logical indicating whether the fence should be plotted. |
The bagplot has been proposed by Rousseeuw et al. (1999) as a generalisation of the boxplot to bivariate data. It is constructed based on halfspace depth and as such is invariant under affine transformations. Similar graphical representations can be obtained by means of other depth functions, as illustrated in Hubert and Van der Veeken (2008) and in Hubert et al. (2015). See compBagplot
for more details.
The deepest point is indicated by a red diamond symbol, the outlying observations by red stars.
The plot is made using ggplot2
. The plot itself is returned by the function and is fully customisable using standard ggplot2
commands.
P. Segaert
Rousseeuw P.J., Ruts I., Tukey J.W. (1999). The bagplot: a bivariate boxplot. The American Statistician, 53, 382–387.
Hubert M., Van der Veeken S. (2008). Outlier detection for skewed data. Journal of Chemometrics, 22, 235–246.
Hubert M., Rousseeuw P.J., Segaert P. (2015). Rejoinder of 'Multivariate functional outlier detection'. Statistical Methods & Applications, 24, 269–277.
compBagplot
, hdepth
, projdepth
, sprojdepth
,
dprojdepth
.
data(bloodfat) # The bagplot can be plotted based on halfspace depth, projection depth, # skewness-adjusted projection depth or directional projection depth. # Note that projection depth is not appropiate for skewed data. # bagplot(compBagplot(bloodfat)) bagplot(compBagplot(bloodfat, type = "projdepth")) bagplot(compBagplot(bloodfat, type = "dprojdepth")) # The main features of the bagplot can easily be adjusted. result <- compBagplot(bloodfat, type = "projdepth") bagplot(result, databag = FALSE, dataloop = FALSE) bagplot(result, colorbag = rgb(0.2, 0.2, 0.2), colorloop = "lightgreen") data(cardata90) result <- compBagplot(cardata90, type = "projdepth") bagplot(result) # Compared to the original paper on the bagplot, # an additional outlier is identified. However this # point lies very close to the fence and this may be # attributed to differences in numerical rounding. # This may be illustrated by plotting the fence. plot <- bagplot(result, plot.fence = TRUE) plot # The returned object is a ggplot2 object and may be # edited using standard ggplot2 commands. library("ggplot2") plot + ylab("Engine displacement") + xlab("Weight in pounds")
data(bloodfat) # The bagplot can be plotted based on halfspace depth, projection depth, # skewness-adjusted projection depth or directional projection depth. # Note that projection depth is not appropiate for skewed data. # bagplot(compBagplot(bloodfat)) bagplot(compBagplot(bloodfat, type = "projdepth")) bagplot(compBagplot(bloodfat, type = "dprojdepth")) # The main features of the bagplot can easily be adjusted. result <- compBagplot(bloodfat, type = "projdepth") bagplot(result, databag = FALSE, dataloop = FALSE) bagplot(result, colorbag = rgb(0.2, 0.2, 0.2), colorloop = "lightgreen") data(cardata90) result <- compBagplot(cardata90, type = "projdepth") bagplot(result) # Compared to the original paper on the bagplot, # an additional outlier is identified. However this # point lies very close to the fence and this may be # attributed to differences in numerical rounding. # This may be illustrated by plotting the fence. plot <- bagplot(result, plot.fence = TRUE) plot # The returned object is a ggplot2 object and may be # edited using standard ggplot2 commands. library("ggplot2") plot + ylab("Engine displacement") + xlab("Weight in pounds")
Data were collected on the concentration of plasma cholesterol and plasma triglycerides (mg/dl) for 371 male patients evaluated for chest pain. For 51 of those patients, no evidence of heart disease was found. This subset corresponds to the remaining 320 patients for which there was evidence of narrowing arteries.
data(bloodfat)
data(bloodfat)
A data frame containing the following variables:
Cholesterol
Concentration of plasma cholesterol [mg/dl].
Triglycerides
Concentration of plasma triglycerides [mg/dl].
Hand D.J., Daly F., Lunn A., McConway A. (1994). A Handbook of Small Data Sets. Londen: Chapman and Hall, dataset 277.
Scott D.W., Gotto A.M., Cole J.S., Gorry G.A. (1978). Plasma lipids as collateral risk factors in coronary artery disease: a study of 371 males with chest pain. Journal of Chronic Diseases, 31, 337–345.
data(bloodfat) plot(bloodfat)
data(bloodfat) plot(bloodfat)
Subset from data on cars taken from pages 235-255, 281-285 and 287-288 of the April 1990 Consumer Reports Magazine.
data(cardata90)
data(cardata90)
A data frame containing the following variables:
Weight
Weight of the car (in pounds).
Disp
Engine displacement (in cubic inches).
Consumer Reports, April 1990, 235–288.
Chambers J.M., Hastie T.J. (1993). Statistical Models in S. Londen: Chapman and Hall, 46–47.
Rousseeuw P.J., Ruts I., Tukey J.W. (1999). The bagplot: a bivariate boxplot. The American Statistician, 53, 382–387.
data(cardata90) plot(cardata90)
data(cardata90) plot(cardata90)
Subset of the 'Character Trajectories Data Set' from the UCI Machine Learning Repository. The data set consists of trajectories of the tip of a pen whilst writing the letter 'a'. All samples are from the same writer. Original data has been processed.
data(characterA)
data(characterA)
Three dimensional array. The first dimension represents time. The second dimension corresponds to the observation number. The third dimension contains the X and Y coordinates.
Bache K., Lichman M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Williams B.H., Toussaint M., Storkey A.J. (2006). Extracting motion primitives from natural handwriting data. In ICANN, volume 2, pages 634–643.
Hubert M., Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection (with rejoinder). Statistical Methods & Applications, 24, 177–202.
data(characterA) par(mfrow = c(1,2)) matplot(y = characterA[,,1], type = "l", col = "black", lty = 1, xlab = "Time", ylab = "X position of the pen") matplot(y = characterA[,,2], type = "l", col = "black", lty = 1, xlab = "Time", ylab = "Y position of the pen") par(mfrow = c(1,1))
data(characterA) par(mfrow = c(1,2)) matplot(y = characterA[,,1], type = "l", col = "black", lty = 1, xlab = "Time", ylab = "X position of the pen") matplot(y = characterA[,,2], type = "l", col = "black", lty = 1, xlab = "Time", ylab = "Y position of the pen") par(mfrow = c(1,1))
Subset of the 'Character Trajectories Data Set' from the UCI Machine Learning Repository. The data set consists of trajectories of the tip of a pen whilst writing the letter 'i'. All samples are from the same writer. Original data has been processed.
data(characterI)
data(characterI)
Three dimensional array. The first dimension represents time. The second dimension corresponds to the observation number. The third dimension contains the X and Y coordinates.
Bache K., Lichman M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Williams B.H., Toussaint M., Storkey A.J. (2006). Extracting motion primitives from natural handwriting data. In ICANN, volume 2, pages 634–643.
Hubert M., Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection (with rejoinder). Statistical Methods & Applications, 24, 177–202.
data(characterI) par(mfrow = c(1,2)) matplot(y = characterI[,,1], type = "l", col = "black", lty = 1, xlab = "Time", ylab = "X position of the pen") matplot(y = characterI[,,2], type = "l", col = "black", lty = 1, xlab = "Time", ylab = "Y position of the pen") par(mfrow = c(1,1))
data(characterI) par(mfrow = c(1,2)) matplot(y = characterI[,,1], type = "l", col = "black", lty = 1, xlab = "Time", ylab = "X position of the pen") matplot(y = characterI[,,2], type = "l", col = "black", lty = 1, xlab = "Time", ylab = "Y position of the pen") par(mfrow = c(1,1))
A test based on regression depth for linearity of the
conditional median z
given the simple regression dataset x
. The computation of the regression depth of z
with respect to x
is done by the function rdepth
.
The test is only valid when x
contains no duplicates.
cmltest(x, z)
cmltest(x, z)
x |
An |
z |
A matrix with one row containing an intercept and slope. |
The following hypothesis test is performed:: The data come from a model with:
The test statistic being used is the regression depth of z
with respect to x
.
pval |
The |
P. Segaert
Van Aelst S., Rousseeuw P.J., Hubert M., Struyf A. (2002). The deepest regression method. Journal of Multivariate Analysis, 81, 138–166.
Rousseeuw P.J., Struyf A. (2002). A depth test for symmetry. In: Goodness-of-Fit Tests and Model Validity, Birkhäuser Boston, pages 401–412.
data(stars) # Compute the least squares fit. Due to outliers # this fit will be bad and thus H0 should be rejected. temp <- lsfit(x = stars[,1], y = stars[,2])$coefficients intercept <- temp[1] slope <- temp[2] z <- matrix(c(intercept, slope), nrow = 1) pvalue1 <- cmltest(x = stars[!duplicated(stars), ], z = z) pvalue1 # Let's now test the deepest regression line. result <- rdepthmedian(x = stars) pvalue2 <- cmltest(x = stars[!duplicated(stars), ], z = matrix(result$deepest, nrow = 1)) pvalue2 plot(stars) abline(a = intercept, b = slope) abline(result$deepest, col = "red") text(x = 3.8, y = 5.3, labels = paste("p-value", round(pvalue1, digits = 3))) text(x = 4.45, y = 4.8, labels = paste("p-value", round(pvalue2, digits = 3)), col = "red")
data(stars) # Compute the least squares fit. Due to outliers # this fit will be bad and thus H0 should be rejected. temp <- lsfit(x = stars[,1], y = stars[,2])$coefficients intercept <- temp[1] slope <- temp[2] z <- matrix(c(intercept, slope), nrow = 1) pvalue1 <- cmltest(x = stars[!duplicated(stars), ], z = z) pvalue1 # Let's now test the deepest regression line. result <- rdepthmedian(x = stars) pvalue2 <- cmltest(x = stars[!duplicated(stars), ], z = matrix(result$deepest, nrow = 1)) pvalue2 plot(stars) abline(a = intercept, b = slope) abline(result$deepest, col = "red") text(x = 3.8, y = 5.3, labels = paste("p-value", round(pvalue1, digits = 3))) text(x = 4.45, y = 4.8, labels = paste("p-value", round(pvalue2, digits = 3)), col = "red")
Computes all elements of the bagplot, a generalisation
of the univariate boxplot to bivariate data. The bagplot can be
computed based on halfspace depth, projection depth, skewness-adjusted projection
depth and directional projection depth. To draw the actual plot, the function
bagplot
needs to be called on the result of compBagplot
.
compBagplot(x, type = "hdepth", sizesubset = 500, extra.directions = FALSE, options = NULL)
compBagplot(x, type = "hdepth", sizesubset = 500, extra.directions = FALSE, options = NULL)
x |
An |
type |
Determines the depth function used to construct the bagplot:
|
sizesubset |
When computing the bagplot based on halfspace depth,
the size of the subset used to perform the main
computations. See Details for more information. |
extra.directions |
Logical indicating whether additional directions should
be considered in the computation of the fence for the
bagplot based on projection depth or skewness-adjusted
projection depth. If set to |
options |
A list of options to pass to the
|
The bagplot has been proposed by Rousseeuw et al. (1999) as a generalisation of the boxplot to bivariate data. It is constructed based on halfspace depth. In the original format the deepest point is indicated by a "+" and is contained in the bag which is defined as the depth region containing the 50% observations with largest depth. The fence is obtained by inflating the bag (relative to the deepest point) by a factor of three. The loop is the convex hull of the observations of x
inside the fence. Observations outside the fence are flagged as outliers and plotted with a red star. This function only computes all the components constituting the bagplot. The bagplot itself can be drawn using the bagplot
function.
The bagplot may also be defined using other depth functions. When using projection depth, skewness-adjusted projection depth or directional projection depth, the bagplot is build as follows. The center corresponds to the observation with largest depth. The bag is constructed as the convex hull of the fifty percent points with largest depth. Outliers are identified as points with a depth smaller than a cutoff value, see projdepth
, sprojdepth
and dprojdepth
for the precise definition.
The loop is computed as the convex hull of the non-outlying points. The fence is approximated by the convex hull of those points that lie on rays from the center through the vertices of the bag and have a depth that equals the cutoff depth. For a better approximation the user can set the input parameter extraDirections
to TRUE
such that an additional 250 equally spaced directions on the circle are considered.
The computation of the bagplot based on halfspace depth can be time
consuming. Therefore it is possible to limit the bulk of the computations
to a random subset of the data. Computations of the halfspace median and
the bag are then based on this random subset. The number of points in this
subset can be controlled by the optional argument sizesubset
.
It is first checked whether the data is found to lie on a line. If so, the routine will give a warning, giving back the dimension of the subspace (being 1) together with the normal vector to that line.
A list with components:
center |
Center of the data. |
chull |
When |
bag |
The coordinates of the vertices of the bag. |
fence |
The coordinates of the vertices of the fence. |
datatype |
An |
flag |
A vector of length |
depth |
The depth of the observations of |
dimension |
If the data are lying in a lower dimensional subspace, the dimension of this subspace. |
hyperplane |
If the data are lying in a lower dimensional subspace, a direction orthogonal to this subspace. |
type |
Same as the input parameter |
P. Segaert based on Fortran code by P.J. Rousseeuw, I. Ruts and A. Struyf.
Rousseeuw P.J., Ruts I., Tukey J.W. (1999). The bagplot: A bivariate boxplot. The American Statistician, 53, 382–387.
Hubert M., Van der Veeken S. (2008). Outlier detection for skewed data. Journal of Chemometrics, 22, 235–246.
Hubert M., Rousseeuw P.J., Segaert, P. (2015). Rejoinder to 'Multivariate functional outlier detection'. Statistical Methods & Applications, 24, 269–277.
bagplot
, hdepth
, projdepth
, sprojdepth
,
dprojdepth
.
data(bloodfat) # Result <- compBagplot(bloodfat) # bagplot(Result) # The sizesubset argument may be used to control the # computation time when computing the bagplot based on # halfspace depth. However results may be unreliable when # choosing a small subset for the main computations. # system.time(Result1 <- compBagplot(bloodfat)) # system.time(Result2 <- compBagplot(bloodfat, sizesubset = 100)) # bagplot(Result1) # bagplot(Result2) # When using any of the projection depth functions, # a list of options may be passed down to the corresponding # outlyingness routines. options <- list(type = "Rotation", ndir = 50, stand = "unimcd", h = floor(nrow(bloodfat)*3/4)) Result <- compBagplot(bloodfat, type = "projdepth", options = options) bagplot(Result) # The fence is computed using the depthContour function. # To get a smoother fence, one may opt to consider extra # directions. options <- list(ndir = 500, seed = 36) Result <- compBagplot(bloodfat, type = "dprojdepth", options = options) bagplot(Result, plot.fence = TRUE) options <- list(ndir = 500, seed = 36) Result <- compBagplot(bloodfat, type = "dprojdepth", options = options, extra.directions = TRUE) bagplot(Result, plot.fence = TRUE)
data(bloodfat) # Result <- compBagplot(bloodfat) # bagplot(Result) # The sizesubset argument may be used to control the # computation time when computing the bagplot based on # halfspace depth. However results may be unreliable when # choosing a small subset for the main computations. # system.time(Result1 <- compBagplot(bloodfat)) # system.time(Result2 <- compBagplot(bloodfat, sizesubset = 100)) # bagplot(Result1) # bagplot(Result2) # When using any of the projection depth functions, # a list of options may be passed down to the corresponding # outlyingness routines. options <- list(type = "Rotation", ndir = 50, stand = "unimcd", h = floor(nrow(bloodfat)*3/4)) Result <- compBagplot(bloodfat, type = "projdepth", options = options) bagplot(Result) # The fence is computed using the depthContour function. # To get a smoother fence, one may opt to consider extra # directions. options <- list(ndir = 500, seed = 36) Result <- compBagplot(bloodfat, type = "dprojdepth", options = options) bagplot(Result, plot.fence = TRUE) options <- list(ndir = 500, seed = 36) Result <- compBagplot(bloodfat, type = "dprojdepth", options = options, extra.directions = TRUE) bagplot(Result, plot.fence = TRUE)
Computes the vertices of depth contours of multivariate data. The contours can be
computed based on halfspace depth, projection depth, skewness-adjusted projection
depth or directional projection depth. To make the actual plot for bivariate data,
the function plotContours
needs to be called on the result of depthContour
.
depthContour(x, alpha = NULL, type = "hdepth", directions = NULL, options = NULL)
depthContour(x, alpha = NULL, type = "hdepth", directions = NULL, options = NULL)
x |
An |
alpha |
A vector containing the depth values of which the depth contours have to be computed. |
type |
The depth used in the computation of the contours:
|
directions |
An |
options |
A list of options to pass to
|
Depth contours of level (or
-depth contours) are the boundaries of depth regions. Depth regions of level
are defined as regions in space containing the multivariate points whose depth value is at least
.
For bivariate data halfspace depth contours can be computed exactly following the algorithm in Ruts and Rousseeuw (1996). When the data are not in general position (i.e. when there is a line containing more than two observations) dithering is performed by adding random Gaussian noise to the data.
In all other cases an approximated method is performed using a bisection algorithm. Intersections with the depth contours are searched on lines originating from the depth median. The user can specify a set of directions corresponding to these lines. By default a random set of directions is considered. On each direction a point is searched having depth
. Starting limits are obtained by projecting the data on the direction and considering the data point with univariate depth
. By definition the multivariate depth of this point has to be lower or equal to
. A second limit is obtained by considering the deepest location estimate. The maximum number of iterations bisecting the current search interval can be specified through the options argument
max.iter
. Note that this method is only affine or rotation equivariant if the chosen directions are affine or rotation equivariant.
It is first checked whether the data is found to lie in a subspace of
dimension lower than . If so, the routine will give a warning, giving
back the dimension of the subspace together with a direction describing a
hyperplane containing this subspace.
The output consists of a list. Each element of the list contains the following elements for each value specified in the argument
alpha
.
depth |
The depth of the depth contour of level |
vertices |
The coordinates of the vertices of the depth contour. |
empty |
Logical indicating whether the corresponding depth region is empty. |
dithered |
Logical indicating whether dithering has been applied in the exact bivariate algorithm based on halfspace depth. |
converged |
A vector of length |
type |
Same as input parameter type. |
dimension |
If the data are lying in a lower dimensional subspace, the dimension of this subspace. |
hyperplane |
If the data are lying in a lower dimensional subspace, a direction orthogonal to this subspace. |
P. Segaert based on Fortran code by P.J. Rousseeuw, I. Ruts and A. Struyf
Ruts I., Rousseeuw P.J. (1996). Computing depth contours of bivariate point clouds. Computational Statistics & Data Analysis, 23, 153–168.
# Compute and plot some halfspace depth contours of a two-dimensional dataset. # The returned object is a ggplot2 object that may be edited # using standard ggplot2 commands. # One may consider different depth functions such as projection depth # by changing the input parameter 'type'. # By default the halfspace depth is used. data(bloodfat) Result <- depthContour(x = bloodfat, alpha = c(0.03, 0.125, 0.25)) plotContours(x = bloodfat, depthContour = Result) # Other options are projection depth, skewness-adjusted projection depth # and directional projection depth # they can be used by specifying type to be # "projdepth", "sprojdepth" or "dprojdepth" respectively. # When there is skewness in the data projection depth # is less appropriate. Result <- depthContour(x = bloodfat, alpha = c(0.25, 0.35, 0.45), type = "projdepth") plotContours(x = bloodfat, depthContour = Result) # The skewness-adjusted projection depth and directional projection depth # better reflect the skewness in the data. Result <- depthContour(x = bloodfat, alpha = c(0.35, 0.45, 0.55), type = "sprojdepth") plotContours(x = bloodfat, depthContour = Result) Result <- depthContour(x = bloodfat, alpha = c(0.25, 0.35, 0.45), type = "dprojdepth") plotContours(x = bloodfat, depthContour = Result)
# Compute and plot some halfspace depth contours of a two-dimensional dataset. # The returned object is a ggplot2 object that may be edited # using standard ggplot2 commands. # One may consider different depth functions such as projection depth # by changing the input parameter 'type'. # By default the halfspace depth is used. data(bloodfat) Result <- depthContour(x = bloodfat, alpha = c(0.03, 0.125, 0.25)) plotContours(x = bloodfat, depthContour = Result) # Other options are projection depth, skewness-adjusted projection depth # and directional projection depth # they can be used by specifying type to be # "projdepth", "sprojdepth" or "dprojdepth" respectively. # When there is skewness in the data projection depth # is less appropriate. Result <- depthContour(x = bloodfat, alpha = c(0.25, 0.35, 0.45), type = "projdepth") plotContours(x = bloodfat, depthContour = Result) # The skewness-adjusted projection depth and directional projection depth # better reflect the skewness in the data. Result <- depthContour(x = bloodfat, alpha = c(0.35, 0.45, 0.55), type = "sprojdepth") plotContours(x = bloodfat, depthContour = Result) Result <- depthContour(x = bloodfat, alpha = c(0.25, 0.35, 0.45), type = "dprojdepth") plotContours(x = bloodfat, depthContour = Result)
Computes the directional outlyingness of -dimensional points
z
relative to a -dimensional dataset
x
. For each multivariate point , its directional outlyingness relative to
x
is defined as its maximal univariate directional outlyingness measured over all directions. To obtain the univariate directional outlyingness in the direction , the dataset
x
is projected on , and the robustly skew-adjusted standardized distance of
to the median of the projected data points
x
is computed. This is done through the estimation of 2 scales, one on each side of the median, using a 1-step M-estimator of scale.
dirOutl(x, z = NULL, options = list())
dirOutl(x, z = NULL, options = list())
x |
An |
z |
An optional |
options |
A list of available options:
|
The directional outlyingness (DO) of multivariate data was introduced in Rousseeuw et al. (2018). It extends the Stahel-Donoho outlyingness towards skewed distributions.
Depending on the dimension , different approximate algorithms are implemented. The affine invariant algorithm can only be used when
. It draws
ndir
times at random observations from
x
and considers the direction orthogonal to the hyperplane spanned by these observations. At most
out of
directions can be considered. The orthogonal invariant version can be applied to high-dimensional data. It draws
ndir
times at random observations from
x
and considers the direction through these two observations. Here, at most 2 out of directions can be considered. Finally, the shift invariant version randomly draws
ndir
vectors from the unit sphere.
The resulting DO values are invariant to affine transformations, rotations and shifts respectively provided that the seed
is kept fixed at different runs of the algorithm. Note that the DO values are guaranteed to increase when more directions are considered provided the seed is kept fixed, as this ensures that the random directions are generated in a fixed order.
An observation from x
and z
is flagged as an outlier if its DO exceeds a cutoff value. This cutoff value is determined using the procedure in Rousseeuw et al. (2018). First, the logarithm of the DO values is taken to render their distribution more symmetric, after which a normal approximation yields a cutoff on these values. The cutoff is then transformed back by applying the exponential function.
It is first checked whether the data lie in a subspace of dimension smaller than . If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it. Furthermore, the univariate directional outlyingness of the projected points
x
is ill-defined when the scale in its denominator becomes zero. This can happen when many observations collapse. In these cases the algorithm will stop and give a warning. The returned values then include the direction
as well as an indicator specifying which of the observations of
x
belong to the hyperplane orthogonal to .
A list with components:
outlyingnessX |
Vector of length |
outlyingnessZ |
Vector of length |
cutoff |
Points whose directional outlyingness exceeds this cutoff can be considered as outliers with respect to |
flagX |
Observations of |
flagZ |
Points of |
singularSubsets |
When the input parameter type is equal to |
dimension |
When the data |
hyperplane |
When the data |
inSubspace |
When a direction |
J. Raymaekers and P. Rousseeuw
Rousseeuw P.J., Raymaekers J., Hubert M. (2018). A measure of directional outlyingness with applications to image data and video. Journal of Computational and Graphical Statistics, 27, 345–359.
dprojdepth
, dprojmedian
, adjOutl
, outlyingness
# Compute the directional outlyingness of a two-dimensional dataset. # Outliers are plotted in red. data(geological) BivData <- geological[c("MnO", "MgO")] Result <- dirOutl(x = BivData) IndOutliers <- which(!Result$flagX) plot(BivData, pch = 16, col = "grey60") points(BivData[IndOutliers, ], pch = 16, col = "red") # The number of directions may be specified through # the option list. The resulting directional outlyingness # is monotone increasing in the number of directions. Result1 <- dirOutl(x = BivData, options = list(ndir = 50)) Result2 <- dirOutl(x = BivData, options = list(ndir = 100)) which(Result2$outlyingnessX - Result1$outlyingnessX < 0) # This is however not the case when the seed is changed Result1 <- dirOutl(x = BivData, options = list(ndir = 50)) Result2 <- dirOutl(x = BivData, options = list(ndir = 100, seed = 950)) plot(Result2$outlyingnessX - Result1$outlyingnessX, xlab = "Index", ylab = "Difference in DO")
# Compute the directional outlyingness of a two-dimensional dataset. # Outliers are plotted in red. data(geological) BivData <- geological[c("MnO", "MgO")] Result <- dirOutl(x = BivData) IndOutliers <- which(!Result$flagX) plot(BivData, pch = 16, col = "grey60") points(BivData[IndOutliers, ], pch = 16, col = "red") # The number of directions may be specified through # the option list. The resulting directional outlyingness # is monotone increasing in the number of directions. Result1 <- dirOutl(x = BivData, options = list(ndir = 50)) Result2 <- dirOutl(x = BivData, options = list(ndir = 100)) which(Result2$outlyingnessX - Result1$outlyingnessX < 0) # This is however not the case when the seed is changed Result1 <- dirOutl(x = BivData, options = list(ndir = 50)) Result2 <- dirOutl(x = BivData, options = list(ndir = 100, seed = 950)) plot(Result2$outlyingnessX - Result1$outlyingnessX, xlab = "Index", ylab = "Difference in DO")
Computation of distance space representation.
distSpace(trainingData, testData = NULL, type = "bagdistance", options = NULL)
distSpace(trainingData, testData = NULL, type = "bagdistance", options = NULL)
trainingData |
A list of |
testData |
An |
type |
The distance used in the computations.
For multivariate data one of the following options: |
options |
A list of options to pass to the function
computing the underlying distance. |
The distance space representation is a tool in supervised classification and was introduced in Hubert et al. (2016) as a generalisation of the depth-depth representation of a multivariate sample. Based on a distance transform, an observation (be it multivariate or functional) is mapped to its representation in distance space. The distance transformation consists of mapping the observation to a vector containing at coordinate the distance to the training group
. After transformation, any multivariate classifier may be used to classify new observations in distance space. Typically the
-nearest neighbour algorithm is used.
Different options are available to compute the distance to each of the training groups. For multivariate data, the user may choose between the bagdistance or any of the projection type distances including the Stahel-Donoho outlyingness, the adjusted outlyingness or the directional outlyingness. For functional data, the user may opt to employ the functional bagdistance (fbd), the functional Stahel-Donoho outlyingness (fSDO), the functional skweness-adjusted outlyingness (fAO) or the functional directional outlyingness (fDO). Options available in each of the underlying distance routines may be passed down using the options
argument.
A by
matrix composed of two blocks. The first block contains the observations in the training set (rows) with in each column the distance to each of the groups. The last column contains a label indicating the original group membership of the observation. The second block contains the observations in the test set, if any, with in each column the distance to the different training groups. The last column contains an indicator signaling the observation was part of the test set.
P. Segaert
Hubert M., Rousseeuw P.J., Segaert P. (2017). Multivariate and functional classification using depth and distance. Advances in Data Analysis and Classification, 11, 445–466.
data(plane) # Build the training data Mirage <- plane$plane1[, 1:25, 1, drop = FALSE] Eurofighter <- plane$plane3[, 1:25, 1, drop = FALSE] trainingData <- list(group1 = Mirage, group2 = Eurofighter) # Build the test data Mirage.t <- plane$plane1[, 26:30, 1, drop = FALSE] Eurofighter.t <- plane$plane3[, 26:30, 1, drop = FALSE] testData <- abind::abind(Mirage.t, Eurofighter.t, along = 2) # Transform the data into distSpace Result <- distSpace(trainingData = trainingData, testData = testData, type="fbd") # Plot the results plotColors <- c(rep("orange", dim(Mirage)[2]), rep("blue", dim(Eurofighter)[2]), rep("green3", dim(testData)[2])) plot(Result[, 1:2, ], col = plotColors, pch=16, xlab = "distance to Mirage", ylab = "distance to Eurofighter", main = "distSpace representation of Mirage and Eurofighter") legend("bottomleft", legend = c("Mirage","Eurofighter", "test data"), pch = 16, col = c("orange","blue", "green3"))
data(plane) # Build the training data Mirage <- plane$plane1[, 1:25, 1, drop = FALSE] Eurofighter <- plane$plane3[, 1:25, 1, drop = FALSE] trainingData <- list(group1 = Mirage, group2 = Eurofighter) # Build the test data Mirage.t <- plane$plane1[, 26:30, 1, drop = FALSE] Eurofighter.t <- plane$plane3[, 26:30, 1, drop = FALSE] testData <- abind::abind(Mirage.t, Eurofighter.t, along = 2) # Transform the data into distSpace Result <- distSpace(trainingData = trainingData, testData = testData, type="fbd") # Plot the results plotColors <- c(rep("orange", dim(Mirage)[2]), rep("blue", dim(Eurofighter)[2]), rep("green3", dim(testData)[2])) plot(Result[, 1:2, ], col = plotColors, pch=16, xlab = "distance to Mirage", ylab = "distance to Eurofighter", main = "distSpace representation of Mirage and Eurofighter") legend("bottomleft", legend = c("Mirage","Eurofighter", "test data"), pch = 16, col = c("orange","blue", "green3"))
Computes the directional projection depth of -dimensional
points
z
relative to a -dimensional dataset
x
.
dprojdepth(x, z = NULL, options = NULL)
dprojdepth(x, z = NULL, options = NULL)
x |
An |
z |
An optional |
options |
A list of options to pass to the underlying See |
Directional projection depth is based on the directional
outlyingness and is computed as . As directional
outlyingness extends the Stahel-Donoho outlyingness towards
skewed distributions, the directional projection depth
is suited for both elliptical distributions and skewed
multivariate data.
It is first checked whether the data is found to lie in a subspace of
dimension lower than . If so, a warning is given, as well as the
dimension of the subspace and a direction which is orthogonal to it.
See dirOutl
for more details on the computation of the DO.
To visualize the depth of bivariate data one can apply the
mrainbowplot
function. It plots the data colored according to
their depth.
The output values of this function are based on the output of the
dirOutl
function. More details can be found there.
A list with components:
depthX |
Vector of length |
depthZ |
Vector of length |
cutoff |
Points whose directional projection depth is smaller than this cutoff can be considered as outliers. |
flagX |
Observations of |
flagZ |
Points of |
singularSubsets |
When the input parameter type is equal to |
dimension |
When the data |
hyperplane |
When the data |
inSubspace |
When a direction |
J. Raymaekers
Rousseeuw, P.J., Raymaekers, J., Hubert, M. (2018). A measure of directional outlyingness with applications to image Data and video. Journal of Computational and Graphical Statistics, 27, 345–359.
dirOutl
, dprojmedian
, mrainbowplot
, adjOutl
, outlyingness
# Compute the directional projection depth # of a simple two-dimensional dataset. # Outliers are plotted in red. data(bloodfat) Result <- dprojdepth(x = bloodfat) IndOutliers <- which(!Result$flagX) plot(bloodfat) points(bloodfat[IndOutliers,], col = "red") # A multivariate rainbowplot may be obtained using mrainbowplot. plot.options = list(legend.title = "DPD") mrainbowplot(x = bloodfat, depths = Result$depthX, plot.options = plot.options) # Options for the underlying outlyingness routine may be passed # using the options argument. Result <- dprojdepth(x = bloodfat, options = list(type = "Affine", ndir=100))
# Compute the directional projection depth # of a simple two-dimensional dataset. # Outliers are plotted in red. data(bloodfat) Result <- dprojdepth(x = bloodfat) IndOutliers <- which(!Result$flagX) plot(bloodfat) points(bloodfat[IndOutliers,], col = "red") # A multivariate rainbowplot may be obtained using mrainbowplot. plot.options = list(legend.title = "DPD") mrainbowplot(x = bloodfat, depths = Result$depthX, plot.options = plot.options) # Options for the underlying outlyingness routine may be passed # using the options argument. Result <- dprojdepth(x = bloodfat, options = list(type = "Affine", ndir=100))
Computes a directional projection depth based location estimate of a
-dimensional dataset
x
.
dprojmedian(x, dprojection.depths = NULL, options = NULL)
dprojmedian(x, dprojection.depths = NULL, options = NULL)
x |
An |
dprojection.depths |
Vector containing the directional projection
depth of the observations in |
options |
A list of options to pass to the |
The algorithm depends on the function dprojdepth
to compute the
directional projection depth of the observations in x
. If these depth values have already been computed they can be passed as an optional argument to save computing time. If not, directional projection depth values will be computed and the user
can pass a list with options to the dprojdepth
function.
It is first checked whether the data lie in a subspace of dimension smaller
than . If so, a warning is given, as well as the dimension of the
subspace and a direction which is orthogonal to it.
A list with component:
max |
The observation of |
J. Raymaekers
Rousseeuw P.J., Raymaekers J., Hubert M. (2018). A measure of directional outlyingness with applications to image data and video. Journal of Computational and Graphical Statistics, 27, 345–359.
dirOutl
, dprojdepth
, adjOutl
, outlyingness
# Compute a location estimate of a two-dimensional dataset. data(bloodfat) result <- dprojmedian(x = bloodfat) plot(bloodfat, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) # Options for the underlying dprojdepth routine may be passed # using the options argument. result <- dprojmedian(x = bloodfat, options = list(type = "Rotation", ndir = 100)) plot(bloodfat, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) # One may also compute the depth values of the observations in the data # separately. This avoids having to recompute them when conmputing the median. depth.result <- dprojdepth(x = bloodfat) result <- dprojmedian(x = bloodfat, dprojection.depths = depth.result$depthX)
# Compute a location estimate of a two-dimensional dataset. data(bloodfat) result <- dprojmedian(x = bloodfat) plot(bloodfat, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) # Options for the underlying dprojdepth routine may be passed # using the options argument. result <- dprojmedian(x = bloodfat, options = list(type = "Rotation", ndir = 100)) plot(bloodfat, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) # One may also compute the depth values of the observations in the data # separately. This avoids having to recompute them when conmputing the median. depth.result <- dprojdepth(x = bloodfat) result <- dprojmedian(x = bloodfat, dprojection.depths = depth.result$depthX)
Draws a heatmap of depth values or distances of functional data.
fHeatmap(rowValues, cellValues, type, legend.title = "")
fHeatmap(rowValues, cellValues, type, legend.title = "")
rowValues |
Vector of functional depth or distance values. Each value should correspond with the functional depth or distance of an observation from a (multivariate) functional data set. |
cellValues |
Matrix of multivariate depth or distance values. The value in row |
type |
One of |
legend.title |
Title of the legend. |
On the vertical axis the functional data are sorted from top to bottom according to their functional depth or distance value as provided in rowValues
. When type = "depth"
, the deepest observation is put at the bottom. When type = "distance"
, the rows are sorted in decreasing order of their functional distance.
Each cell of the map is colored according to the multivariate depth provided in cellValues
. For a depth-based heatmap, the smallest
depth value is white and the overall highest depth value is colored dark green. A distance-based heatmap colors the cells from white to dark red.
P. Segaert
Hubert M., Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection (with rejoinder). Statistical Methods & Applications, 24, 177–202.
data(octane) Result <- mfd(octane, diagnostic = TRUE, type = "sprojdepth") Plot <- fHeatmap(rowValues = Result$MFDdepthZ, cellValues = Result$crossdepthZ, type = "depth", legend.title = "SPD") Plot Result <- fOutl(octane, diagnostic = TRUE, type = "fAO") Plot <- fHeatmap(rowValues = Result$fOutlyingnessZ, cellValues = Result$crossDistsZ, type = "distance", legend.title = "AO") Plot
data(octane) Result <- mfd(octane, diagnostic = TRUE, type = "sprojdepth") Plot <- fHeatmap(rowValues = Result$MFDdepthZ, cellValues = Result$crossdepthZ, type = "depth", legend.title = "SPD") Plot Result <- fOutl(octane, diagnostic = TRUE, type = "fAO") Plot <- fHeatmap(rowValues = Result$fOutlyingnessZ, cellValues = Result$crossDistsZ, type = "distance", legend.title = "AO") Plot
Creates the Functional Outlier Map, a graphical tool to detect outliers in multivariate functional data. Depending on the position of a multivariate curve in the FOM, different types of outliers may be distinguished.
fom(fOutlResult, cutoff = FALSE)
fom(fOutlResult, cutoff = FALSE)
fOutlResult |
The return of a call to |
cutoff |
Boolean indicating whether the cutoff should be drawn. |
The fom is only applicable when fOutl
is called with the following options: diagnostic = TRUE
and type
equaling either fAO
or fDO
.
The functional outlier map was proposed by Hubert et al. (2015) and subsequently improved by Rousseeuw et al. (2018). It consists of a graphical tool to detect outliers in multivariate functional data based on functional outlyingness measures such as the or
(see
fOutl
).
The coordinates of the points in the FOM correspond to two outlyingness indicators. On the horizontal axis, the functional outlyingness as obtained by the routine fOutl
is plotted. On the vertical axis the scaled standard deviation of the cross-sectional outlyingness measures across time are plotted. The FOM thus consists of the following points representing the curve :
. The scaling of the standard deviation is added to ensure that curves with a different location, measured from the center, but with the same relative variability have a similar
-coordinate.
For some underlying multivariate outlyingness measures, the output of fOutl
contains an additional argument signaling whether curve is outlying as measured in the multivariate space at time point
. This information is incorporated in the FOM by plotting the points with a different degree of outlyingness by a different symbol. Curves that are outlying in the multivariate space determined by at least
of the total number of time points in the domain are plotted using filled diamonds. Similarly curves that are outlying in at least
or
of the time points are plotted in filled squares and filled triangles respectively. Curves that are outlying in fewer than
of the time points are plotted using filled circles.
The user may opt to draw a cutoff curve for the detection of outliers on the FOM.
This cutoff was introduced in Rousseeuw et al. (2018) and is based on the euclidean
distance between the points on the FOM and the origin, after scaling with the median.
More specifically, let and
let
.
Finally, with
,an observation lies outside of the cutoff when
.
This FOM may be read in a way similar to the outlier map of robust regression (Rousseeuw and van Zomeren 1990) and the outlier map of
robust principal components (Hubert et al. 2005). Points in the lower left part of the FOM represent regular curves which hold a central position in the dataset. Points in the lower right part are curves with a high but a low variability of their cross-sectional
values. This happens for shift outliers, i.e. curves which have the same shape as the majority but are shifted on the whole domain. Points in the upper left part have a low
but a high variability in cross-sectional
. Typical examples are isolated outliers, i.e. cuves which only display outlyingness over a small part of their domain.
The points in the upper right part of the FOM have both a high
and a high cross-sectional
. These correspond to curves which are strongly outlying on a substantial part of their domain.
P. Segaert
Hubert M., Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection (with rejoinder). Statistical Methods & Applications, 24, 177–202.
Rousseeuw P.J., Raymaekers J., Hubert M., (2018). A measure of directional outlyingness with applications to image data and video. Journal of Computational and Graphical Statistics, 27, 345–359.
data(octane) # To construct the FOM, one first need to compute # the functional outlyingness. # Note that the option diagnostic in fOutl must be # set to TRUE. If not calling fom will result in an # error Result <- fOutl(octane, alpha = 0, type = "fAO", diagnostic = TRUE) fom(Result) # The user may opt to draw a cut off line seperating the outliers. # which will be plotted in red fom(Result, cutoff = TRUE) # Six observations are flagged as outliers. These correspond to # the samples with added ethanol.
data(octane) # To construct the FOM, one first need to compute # the functional outlyingness. # Note that the option diagnostic in fOutl must be # set to TRUE. If not calling fom will result in an # error Result <- fOutl(octane, alpha = 0, type = "fAO", diagnostic = TRUE) fom(Result) # The user may opt to draw a cut off line seperating the outliers. # which will be plotted in red fom(Result, cutoff = TRUE) # Six observations are flagged as outliers. These correspond to # the samples with added ethanol.
Computes several measures of functional outlyingness for multivariate functional data.
fOutl(x, z = NULL, type = "fAO", alpha = 0, time = NULL, diagnostic = FALSE, distOptions = NULL)
fOutl(x, z = NULL, type = "fAO", alpha = 0, time = NULL, diagnostic = FALSE, distOptions = NULL)
x |
A three dimensional |
z |
An optional three-dimensional |
type |
The outlyingness measure used in the computations.
One of the following options: |
alpha |
Specifies the weights at every cross-section.
When |
time |
If the measurements are not equidistant,
a sorted numeric vector containing a set of time points. |
diagnostic |
If set to |
distOptions |
A list of options to pass to the function
computing the cross-sectional distances. |
The functional outlyingness of a multivariate curve with respect to a
given set of multivariate curves is defined as the weighted average of its
multivariate outlyingness at each time point (Hubert et al., 2015).
The functional outlyingness can be computed in all dimensions using
the adjusted outlyingness (
fAO
), the directional outlyingness (fDO
),
the Stahel-Donoho outlyingness (fSDO
) or the bagdistance (fbd
).
When the data array z
is specified, the functional outlyingness and diagnostic
information for the data array x
is also returned whenever the underlying
outlyingness routine allows it. For more information see the specific routines
listed in the section "See Also".
In some situations, additional diagnostics are available to flag outlying time
points. At each time point, observations from the data array x
are marked
if they are flagged as outliers. The observations from the data array x
are marked if their scaled outlyingness is larger than a prescribed cutoff value
from the chi-square distribution. For more details see the respective
outlyingness routines.
It is possible that at certain time points a part of the algorithm can not be executed due to e.g. exact fits. In that case the weight of that particular time point is set to zero. A warning is issued at the end of the algorithm to signal these time points. Furthermore the output contains an extra argument giving the indices of the time points where problems occured.
A list with the following components:
fOutlyingnessX |
Vector of length |
fOutlyingnessZ |
Vector of length |
weights |
Vector of weights according to the input parameter
|
crossDistsX |
An |
crossDistsZ |
An |
locOutlX |
An |
locOutlZ |
An |
IndFlagExactFit |
Vector containing the indices of the time points for which an exact fit is detected. |
P. Segaert
Hubert M., Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection (with rejoinder). Statistical Methods and Applications, 24, 177–202.
Hubert M., Rousseeuw P.J., Segaert P. (2017). Multivariate and functional classification using depth and distance. Advances in Data Analysis and Classification, 11, 445–466.
bagdistance
, outlyingness
, adjOutl
, dirOutl
, fom
data(octane) Data <- octane # When the option diagnostic is set to TRUE, a crude diagnostic # to detect outliers can be extracted from the local outlyingness # indicators. Result <- fOutl(x = Data, type = "fAO", diagnostic = TRUE) matplot(Data[,,1], type = "l", col = "black", lty = 1) for (i in 1:dim(Data)[2]) { if(sum(Result$locOutlZ[i, ]) > 0) { obsData <- matrix(Data[,i,1], nrow = 1) obsData[!Result$locOutlZ[i,]] <- NA obsData <- rbind(obsData, obsData) matpoints(t(obsData), col = "red", pch = 15) } } # For more advanced outlier detection techniques, see the # fom routine.
data(octane) Data <- octane # When the option diagnostic is set to TRUE, a crude diagnostic # to detect outliers can be extracted from the local outlyingness # indicators. Result <- fOutl(x = Data, type = "fAO", diagnostic = TRUE) matplot(Data[,,1], type = "l", col = "black", lty = 1) for (i in 1:dim(Data)[2]) { if(sum(Result$locOutlZ[i, ]) > 0) { obsData <- matrix(Data[,i,1], nrow = 1) obsData[!Result$locOutlZ[i,]] <- NA obsData <- rbind(obsData, obsData) matpoints(t(obsData), col = "red", pch = 15) } } # For more advanced outlier detection techniques, see the # fom routine.
This data originates from a geological survey on the composition in agricultural soils from 10 countries surrounding the Baltic Sea. Top soil (0-25 cm) and bottom soil (50-75 cm) samples from 768 sites were analysed. This data frame contains the measurements corresponding to the total concentration of four elements in the top soil samples.
data(geological)
data(geological)
A data frame containing the following variables:
Fe2O3
Iron Oxide
MgO
Magnesium oxide
MnO
Manganese oxide
TiO2
Titanium dioxide
Reimann C., Siewers U., Tarvainen T., Bityukova L., Eriksson J., Gilucis A., Gregorauskiene V., Lukashev V., Matinian N.N., Pasieczna A. (2000). Baltic soil survey: total concentrations of major and selected trace elements in arable soils from 10 countries around the Baltic Sea. Science of the Total Environment, 257, 155–170.
Hubert M., Van der Veeken S. (2008). Outlier detection for skewed data. Journal of Chemometrics, 22, 235–246.
data(geological) plot(geological)
data(geological) plot(geological)
The glass data set studied by Lemberge et al. (2000) consists of 180 different 16-17th century archeological glass samples. Electron Probe X-ray Microanalysis (EPXMA) intensities across energy levels are recorded using a Jeol JSM 6300 scanning electron microscope equipped with an energy-dispersive Si(Li) X-ray detection system (SEM-EDX).
data(glass)
data(glass)
A three-dimensional by
by
array.
Lemberge P., De Raedt I., Janssens K.H., Wei F., Van Espen P.J. (2000).
Quantitative Z-analysis of 16th-17th century archaeological glass vessels using
PLS regression of EPXMA and -XRF data. Journal of Chemometrics, 14,
751–763.
Hubert M., Rousseeuw P.J., Vanden Branden K. (2005). ROBPCA: a new approach to robust principal component analysis. Technometrics, 47, 64–79.
data(glass) matplot(glass[,,1], type = "l", lty = 1, col = "black")
data(glass) matplot(glass[,,1], type = "l", lty = 1, col = "black")
Computes the halfspace depth of -dimensional points
z
relative to a -dimensional dataset
x
. Computation is exact for and approximate when
. For the approximate algorithm the halfspace depth is computed as the minimal univariate halfspace depth over many directions. To obtain the univariate halfspace depth in the direction
, the dataset
x
is projected on , and the univariate location depth of the points of
to
is computed.
hdepth(x, z = NULL, options = list())
hdepth(x, z = NULL, options = list())
x |
An |
z |
An optional |
options |
A list of available options:
|
Halfspace depth has been introduced by Tukey (1975). The halfspace depth of a point is defined as the minimal number of observations from
x
that are contained in any closed halfspace with boundary through .
In dimensions and
the computations are by default carried out exactly using the algorithms described in Rousseeuw and Ruts (1996) and Rousseeuw and Struyf (1998). This yields an affine invariant measure of depth.
Approximate algorithms are also implemented which are affine, rotation or shift invariant, depending on the value chosen for
type
. They can be used in any dimension. The shift invariant algorithm coincides with the random Tukey depth (Cuesta-Albertos and Nieto-Reyes, 2008).
The resulting halfspace depth values are invariant to affine transformations when the exact algorithm is used and invariant to affine transformations, rotations and shifts depending on the choice for type
, provided that the seed
is kept fixed at different runs of the algorithm. Note that the halfspace depth values values are guaranteed to decrease when more directions are considered, provided the seed is kept fixed, as this ensures that the random directions are generated in a fixed order.
If the halfspace depth needs to be computed for points
, it is recommended to apply the function once with the matrix
as input, instead of applying it
times with input vectors
, as numerous computations can be saved. The approximate algorithms automatically then also compute the depth values of the observations in
x
.
When only the halfspace depth of the observations in x
is required, the call to the function should be hdepth(x)
or equivalently hdepth(x,x)
. In that case the depth values will be stored in the 'depthZ' output field. For bivariate data these will be the exact values by default.
To visualize the depth of bivariate data one can apply the mrainbowplot
function. It plots the data colored according to their depth.
It is first checked whether the data lie in a subspace of dimension smaller than . If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it.
A list with components:
depthX |
Vector of length |
depthZ |
Vector of length |
singularSubsets |
When the input parameter type is equal to |
dimension |
When the data |
hyperplane |
When the data |
P. Segaert based on Fortran code by P.J. Rousseeuw, I. Ruts and A. Struyf, and C++
code by P. Segaert and K. Vakili.
Tukey J. (1975). Mathematics and the picturing of data. Proceedings of the International Congress of Mathematicians, 2, 523–531, Vancouver.
Rousseeuw P.J., Ruts I. (1996). AS 307: Bivariate location depth. Journal of the Royal Statistical Society: Series C, 45, 516–526.
Rousseeuw P.J., Struyf A. (1998). Computing location depth and regression depth in higher dimensions. Statistics and Computing, 8, 193–203.
Cuesta-Albertos J., Nieto-Reyes A. (2008). The random Tukey depth. Computational Statistics & Data Analysis, 52, 4979–4988.
hdepthmedian
, mrainbowplot
, bagdistance
, bagplot
# Compute the halfspace depth of a simple # two-dimensional dataset. data(cardata90) Result <- hdepth(x = cardata90) mrainbowplot(cardata90, depths = Result$depthZ) # In two dimensions we may also opt to use the # approximate algorithm. The number of directions # may be specified through the option list. options <- list(type = "Rotation", ndir = 750, approx = TRUE) Result <- hdepth(x = cardata90, options = options) # The resulting halfspace depth is monotone decreasing # in the number of directions. options <- list(type = "Rotation", ndir = 10, approx = TRUE) Result1 <- hdepth(x = cardata90, options = options) options <- list(type = "Rotation", ndir = 500, approx = TRUE) Result2 <- hdepth(x = cardata90, options = options) which(Result1$depthZ - Result2$depthZ < 0) # This is however not the case when the seed is changed options <- list(type = "Rotation", ndir = 10, approx = TRUE) Result1 <- hdepth(x = cardata90, options = options) options <- list(type = "Rotation", ndir = 50, approx = TRUE, seed = 897) Result2 <- hdepth(x = cardata90, options = options) which(Result1$depthZ - Result2$depthZ < 0) plot(Result1$depthZ - Result2$depthZ, xlab = "Index", ylab = "Difference in halfspace depth") # We can also consider directions through two data # points. If the sample is small enough one may opt # to search over all choose(n,2) directions. # Note that the computational load increases substantially # as n becomes larger. options <- list(type = "Rotation", ndir = "all", approx = TRUE) Result1 <- hdepth(x = cardata90, options = options) # Alternatively one may consider randomly generated directions. options <- list(type = "Shift", ndir = 250, approx = TRUE) Result1 <- hdepth(x = cardata90, options = options)
# Compute the halfspace depth of a simple # two-dimensional dataset. data(cardata90) Result <- hdepth(x = cardata90) mrainbowplot(cardata90, depths = Result$depthZ) # In two dimensions we may also opt to use the # approximate algorithm. The number of directions # may be specified through the option list. options <- list(type = "Rotation", ndir = 750, approx = TRUE) Result <- hdepth(x = cardata90, options = options) # The resulting halfspace depth is monotone decreasing # in the number of directions. options <- list(type = "Rotation", ndir = 10, approx = TRUE) Result1 <- hdepth(x = cardata90, options = options) options <- list(type = "Rotation", ndir = 500, approx = TRUE) Result2 <- hdepth(x = cardata90, options = options) which(Result1$depthZ - Result2$depthZ < 0) # This is however not the case when the seed is changed options <- list(type = "Rotation", ndir = 10, approx = TRUE) Result1 <- hdepth(x = cardata90, options = options) options <- list(type = "Rotation", ndir = 50, approx = TRUE, seed = 897) Result2 <- hdepth(x = cardata90, options = options) which(Result1$depthZ - Result2$depthZ < 0) plot(Result1$depthZ - Result2$depthZ, xlab = "Index", ylab = "Difference in halfspace depth") # We can also consider directions through two data # points. If the sample is small enough one may opt # to search over all choose(n,2) directions. # Note that the computational load increases substantially # as n becomes larger. options <- list(type = "Rotation", ndir = "all", approx = TRUE) Result1 <- hdepth(x = cardata90, options = options) # Alternatively one may consider randomly generated directions. options <- list(type = "Shift", ndir = 250, approx = TRUE) Result1 <- hdepth(x = cardata90, options = options)
Computes the halfspace median and its corresponding
halfspace depth for a -dimensional data set
x
. Computation is exact for and approximate for
.
hdepthmedian(x, maxdir = NULL)
hdepthmedian(x, maxdir = NULL)
x |
An |
maxdir |
The number of projections used in the approximate algorithm. |
The halfspace median, or Tukey median, is the multivariate point with largest halfspace depth with respect to the data x
. This point is not always unique. In that case the halfspace median corresponds to the center of gravity of the convex set of deepest points.
It is first checked whether the data is found to lie in a subspace of
dimension lower than . If so, the routine will give a warning, giving
back the dimension of the subspace together with a direction describing a
hyperplane containing this subspace.
For bivariate data the exact algorithm of Rousseeuw and Ruts (1998) is applied.
When the data are not in general position (i.e. when there is a line containing more than two observations) dithering is performed by adding random Gaussian noise to the data. In this case the output argument dithered
will contain a flag.
When the approximate algorithm of Struyf and Rousseeuw (2000) is applied. It is an iterative procedure based on projections. Their number can be chosen by the input parameter
maxdir
.
A list containing:
median |
The coordinates of the halfspace median. |
depth |
The halfspace depth of the halfspace median. |
dithered |
Logical indicating whether dithering has been applied in the exact algorithm. |
ndir |
The number of projections used by the approximate algorithm. Due to the possibility of singularity of certain |
AlgStopFlag |
Indicates which stopping rule is used by the approximate algorithm. |
dimension |
If the data are lying in a lower dimensional subspace, the dimension of this subspace. |
hyperplane |
If the data are lying in a lower dimensional subspace, a direction orthogonal to this subspace. |
P. Segaert based on Fortran code by P.J. Rousseeuw, I. Ruts and A. Struyf
Rousseeuw P.J., Ruts I. (1998). Constructing the bivariate Tukey median. Statistica Sinica, 8, 827–839.
Struyf A., Rousseeuw P.J. (2000). High-dimensional computation of the deepest location. Computational Statistics & Data Analysis, 34, 415–436.
# Compute a location estimate of a two-dimensional dataset. data(cardata90) Result <- hdepthmedian(x = cardata90) plot(cardata90, pch = 16) points(Result$median, col = "red", pch = 18, cex = 1.5)
# Compute a location estimate of a two-dimensional dataset. data(cardata90) Result <- hdepthmedian(x = cardata90) plot(cardata90, pch = 16) points(Result$median, col = "red", pch = 18, cex = 1.5)
Computes the medcouple, a robust measure of skewness for univariate data. For multivariate data the medcouple is computed on each column of the data matrix.
medcouple(x, do.reflect = NULL)
medcouple(x, do.reflect = NULL)
x |
An |
do.reflect |
Logical indicating whether the medcouple should also be computed on the reflected sample |
The medcouple is a robust measure of skewness yielding values between and
. For left- and right-skewed data the medcouple is negative and positive respectively.
The medcouple is defined as the median of the kernel function
evaluated over all couples
where
is smaller than the median of
x
and larger than the median of
x
. When there are multiple observations tied to the median, the kernel is defined separately as the denominator is not defined for these observations. Let denote the indices of the observations which are tied to the median. Then
is defined to equal
if
,
when
and
if
. To compute the medcouple an algorithm with time complexity
is applied. For details, see https://en.wikipedia.org/wiki/Medcouple.
For numerical accuracy it is advised, for small data sets, to compute the medcouple on both x
and -x
. The final value of the medcouple may then be obtained as a linear combination of both calculations. This procedure is warranted by the properties of the medcouple. Indeed the medcouple of the distribution equals minus the medcouple of the reflected distribution
. Moreover the medcouple is location and scale invariant.
Note that missing values are not allowed.
mc |
A |
P. Segaert with original code from M. Maechler and G. Brys.
Brys G., Hubert M., Struyf A. (2004). A robust measure of skewness. Journal of Computational and Graphical Statistics, 13, 996–1017.
# Calculate the medcouple of a bivariate data set. # Note that the medcouple of each variable is returned. data(bloodfat) medcouple(bloodfat) # For smaller data sets it is advisable to compute # the medcouple on both the sample and the reflected sample. small.data <- bloodfat[1:25,] medcouple(small.data, do.reflect = FALSE) -medcouple(-small.data, do.reflect = FALSE) # Small difference are due to numerical instabilities. # Use the option do.reflect to increase expected accuracy. medcouple(small.data, do.reflect = TRUE)
# Calculate the medcouple of a bivariate data set. # Note that the medcouple of each variable is returned. data(bloodfat) medcouple(bloodfat) # For smaller data sets it is advisable to compute # the medcouple on both the sample and the reflected sample. small.data <- bloodfat[1:25,] medcouple(small.data, do.reflect = FALSE) -medcouple(-small.data, do.reflect = FALSE) # Small difference are due to numerical instabilities. # Use the option do.reflect to increase expected accuracy. medcouple(small.data, do.reflect = TRUE)
Computes the multivariate functional depth for multivariate functional data.
mfd(x, z = NULL, type = "hdepth", alpha = 0, time = NULL, diagnostic = FALSE, depthOptions = NULL)
mfd(x, z = NULL, type = "hdepth", alpha = 0, time = NULL, diagnostic = FALSE, depthOptions = NULL)
x |
A three-dimensional |
z |
An optional three-dimensional |
type |
The depth used in the computations.
One of the following options: |
alpha |
Specifies the weights at every cross-section.
When |
time |
If the measurements are not equidistant,
a sorted numeric vector containing a set of time points. |
diagnostic |
If set to |
depthOptions |
A list of options to pass to the function
computing the cross-sectional depths. |
The multivariate functional depth of a multivariate curve with respect to a
given set of multivariate curves is defined as the weighted average of its
multivariate depth at each time point (Claeskens et al., 2014).
The MFD can be computed in all dimensions using halfspace depth,
projection depth, skewness-adjusted projection depth and directional projection depth.
For
also simplicial depth is available.
When the data array z
is specified, the MFD depth and diagnostic
information for the data array x
is also returned whenever the underlying
depth routine allows it. For more information see the specific depth routines
listed in the section "See Also".
For the weight vector, three options are available: uniform weights,
user-defined weights or weights proportional to the volume of the
-depth contour at each time point.
The
-depth contours are computed using the
depthContour
function.
In some situations, additional diagnostics are available to flag outlying time
points, as described in Hubert et al. (2012). At each time point, observations
from the data array x
are marked if they are flagged as outliers.
When using any of the projection depth measures, this flag is automatically
returned by the corresponding functions. When using halfspace depth,
the diagnostic is only available for bivariate curves. The observations
from the data array x
are marked if they are flagged as outliers by the
bagplot, or similarly if their bagdistance
is larger than 3 at that time
point. This can be seen as a measure of local outlyingness. The option is not
available for simplicial depth.
A heatmap of the cross-sectional depth values can be drawn by setting diagnostic
to TRUE and passing the results to fHeatmap
.
It is possible that at certain time points a part of the algorithm can not be executed due to e.g. exact fits. In that case the weight of that particular time point is set to zero. A warning is issued at the end of the algorithm to signal these time points. Furthermore the output contains an extra argument giving the indices of the time points where problems occured.
A list with the following components:
MFDdepthX |
Vector of length |
MFDdepthZ |
Vector of length |
weights |
Vector of weights according to the input parameter
|
crossDepthsX |
An |
crossDepthsZ |
An |
locOutlX |
An |
locOutlZ |
An |
IndFlagExactFit |
Vector containing the indices of the time points for which an exact fit is detected. |
IndFlagBag |
Vector containing the indices of the time points for which the bagplot could not be computed. |
IndFlagIso |
Vector containing the indices of the time
points for which the cross-sectional |
IndFlagAlpha |
Vector containing the indices of the
time points for which the volume of the
cross-sectional |
P. Segaert and M. Hubert
Claeskens G., Hubert M., Slaets L., Vakili K. (2014). Multivariate functional halfspace depth. Journal of the American Statistical Association, 109, 411–423.
Hubert M., Claeskens G., De Ketelaere B., Vakili K. (2012). A new depth-based approach for detecting outlying curves. In Proceedings of COMPSTAT 2012, edited by A. Colubi, K. Fokianos, G. Gonzalez-Rodriguez, E.J. Kontoghiorghes, 329–340.
depthContour
, hdepth
, projdepth
,
sprojdepth
, dprojdepth
, sdepth
,
fHeatmap
data(octane) Result <- mfd(x = octane, alpha = 0.125, diagnostic = TRUE) Plot <- fHeatmap(rowValues = Result$MFDdepthZ, cellValues = Result$crossdepthZ, type = "depth", legend.title = "HD") Plot
data(octane) Result <- mfd(x = octane, alpha = 0.125, diagnostic = TRUE) Plot <- fHeatmap(rowValues = Result$MFDdepthZ, cellValues = Result$crossdepthZ, type = "depth", legend.title = "HD") Plot
Computes the multivariate functional median, an estimate for the central tendency of multivariate functional data.
mfdmedian(x, type = "hdepth", crossDepthsX = NULL, depthOptions = NULL, centerOption = "maxdepth")
mfdmedian(x, type = "hdepth", crossDepthsX = NULL, depthOptions = NULL, centerOption = "maxdepth")
x |
A three-dimensional |
type |
The depth used in the computations.
One of the following options: |
crossDepthsX |
Depth values at each time point. Can be used to save computing time. |
depthOptions |
A list of options to pass to the function that computes the cross-sectional depths. |
centerOption |
When equal to |
The multivariate functional median of a multivariate functional data set is defined
as the multivariate curve connecting the cross-sectional multivariate depth
medians (Claeskens et al., 2014).
The MFD median can be computed in all dimensions using halfspace depth,
projection depth, skewness-adjusted projection depth or directional projection depth.
The simplicial depth can only be used for
.
It is possible that at certain time points a part of the algorithm can not be executed due to e.g. exact fits. In that case the estimate for the center will be set to NaN. A warning is issued at the end of the algorithm to signal these time points. Furthermore the output contains an extra argument giving the indices of the time points where problems occured.
A list with the following component:
MFDmedian |
An |
IndFlagExactFit |
Vector containing the indices of the time points for which an exact fit is detected. |
P. Segaert, M. Hubert
Claeskens G., Hubert M., Slaets L., Vakili K. (2014). Multivariate functional halfspace depth. Journal of the American Statistical Association, 109, 411–423.
hdepth
, projdepth
,
sprojdepth
, dprojdepth
, sdepth
,
mfd
# Consider a bivariate functional sample. data(characterA) Data <- characterA[, 1:50, ] Result <- mfdmedian(Data) # Plot the data and the functional median par(mfrow = c(1,2)) matplot(Data[, , 1], type = "l", col = "black", lty = 1, ylab = "x-coordinate") matlines(Result$MFDmedian[, 1], type = "l", col = "red", lwd = 2) matplot(Data[, , 2], type = "l", col = "black", lty = 1, ylab = "y-coordinate") matlines(Result$MFDmedian[, 2], type = "l", col = "red", lwd = 2) par(mfrow = c(1,1)) # Other depth functions such as the adjusted outlyingness may also # be used to determine the cross-sectional depth median. # In this case the depth median is computed by the # sprojmedian routine. # Optional arguments used by the sprojmedian routine may be specified # using the depthOptions. For example one might choose the # "Rotation" option with 300 directions. Result <- mfdmedian(Data, type = "sprojdepth", depthOptions = list(type = "Rotation", ndir = 300)) par(mfrow = c(1,2)) matplot(Data[, , 1], type = "l", col = "black", lty = 1, ylab = "x-coordinate") matlines(Result$MFDmedian[, 1], type = "l", col = "red", lwd = 2) matplot(Data[, , 2], type = "l", col = "black", lty = 1, ylab = "y-coordinate") matlines(Result$MFDmedian[, 2], type = "l", col = "red", lwd = 2) par(mfrow = c(1,1)) # If the user already placed a call to the mfd routine # with the diagnostic options set to TRUE, the # mfdmedian can easily be obtained by passing the cross-sectional # depths. This considerably saves computing time. tResult <- mfd(x = Data, type = "sprojdepth", diagnostic = TRUE) Result <- mfdmedian(Data, type = "sprojdepth", crossDepthsX = tResult$crossdepthX, ) # Univariate curves should also be represented as arrays Data.x <- Data[, , 1, drop = FALSE] Result <- mfdmedian(Data.x) matplot(Data.x[ , , 1], type = "l", col = "black", lty = 1, ylab = "x-coordinate") matlines(Result$MFDmedian[, 1], type = "l", col = "red", lwd = 2)
# Consider a bivariate functional sample. data(characterA) Data <- characterA[, 1:50, ] Result <- mfdmedian(Data) # Plot the data and the functional median par(mfrow = c(1,2)) matplot(Data[, , 1], type = "l", col = "black", lty = 1, ylab = "x-coordinate") matlines(Result$MFDmedian[, 1], type = "l", col = "red", lwd = 2) matplot(Data[, , 2], type = "l", col = "black", lty = 1, ylab = "y-coordinate") matlines(Result$MFDmedian[, 2], type = "l", col = "red", lwd = 2) par(mfrow = c(1,1)) # Other depth functions such as the adjusted outlyingness may also # be used to determine the cross-sectional depth median. # In this case the depth median is computed by the # sprojmedian routine. # Optional arguments used by the sprojmedian routine may be specified # using the depthOptions. For example one might choose the # "Rotation" option with 300 directions. Result <- mfdmedian(Data, type = "sprojdepth", depthOptions = list(type = "Rotation", ndir = 300)) par(mfrow = c(1,2)) matplot(Data[, , 1], type = "l", col = "black", lty = 1, ylab = "x-coordinate") matlines(Result$MFDmedian[, 1], type = "l", col = "red", lwd = 2) matplot(Data[, , 2], type = "l", col = "black", lty = 1, ylab = "y-coordinate") matlines(Result$MFDmedian[, 2], type = "l", col = "red", lwd = 2) par(mfrow = c(1,1)) # If the user already placed a call to the mfd routine # with the diagnostic options set to TRUE, the # mfdmedian can easily be obtained by passing the cross-sectional # depths. This considerably saves computing time. tResult <- mfd(x = Data, type = "sprojdepth", diagnostic = TRUE) Result <- mfdmedian(Data, type = "sprojdepth", crossDepthsX = tResult$crossdepthX, ) # Univariate curves should also be represented as arrays Data.x <- Data[, , 1, drop = FALSE] Result <- mfdmedian(Data.x) matplot(Data.x[ , , 1], type = "l", col = "black", lty = 1, ylab = "x-coordinate") matlines(Result$MFDmedian[, 1], type = "l", col = "red", lwd = 2)
Makes a scatterplot of bivariate data and colors the observations according to their depth value.
mrainbowplot(x, depths, col = NULL, plot.options = list())
mrainbowplot(x, depths, col = NULL, plot.options = list())
x |
An |
depths |
A column vector of length |
col |
An |
plot.options |
A list of available options:
|
The plot is made using ggplot2. The plot itself is returned by the function and is fully customisable using standard ggplot2 commands. Similar plots for multivariate data with can be made using the ggpairs function in the library
GGally
.
P. Segaert
data(cardata90) Result <- projdepth(x = cardata90) plot.options <- list(legend.title = "PD") plot <- mrainbowplot(cardata90, depths = Result$depthZ, plot.options = plot.options) plot + ggtitle("Rainbowplot based on projection depth") # The default color range may be adjusted using the col argument. RGBmatrix <- c(1, 0, 0, #Red 1, 1, 1, #White 0, 1, 0) #Green RGBmatrix <- matrix(RGBmatrix, ncol = 3, byrow = TRUE) plot <- mrainbowplot(cardata90, depths = Result$depthZ, col = RGBmatrix, plot.options = plot.options) plot + ggtitle("Rainbowplot based on projection depth")
data(cardata90) Result <- projdepth(x = cardata90) plot.options <- list(legend.title = "PD") plot <- mrainbowplot(cardata90, depths = Result$depthZ, plot.options = plot.options) plot + ggtitle("Rainbowplot based on projection depth") # The default color range may be adjusted using the col argument. RGBmatrix <- c(1, 0, 0, #Red 1, 1, 1, #White 0, 1, 0) #Green RGBmatrix <- matrix(RGBmatrix, ncol = 3, byrow = TRUE) plot <- mrainbowplot(cardata90, depths = Result$depthZ, col = RGBmatrix, plot.options = plot.options) plot + ggtitle("Rainbowplot based on projection depth")
Felipe et al. (2005) obtained intensities of MRI images of 9 different parts of the human body (plus a group consisting of all remaining body regions, which was of course very heterogeneous). They then transformed their data to univariate curves.
data(mri)
data(mri)
A list of arrays corresponding to each bodypart. For each bodypart, a three-dimensional by
by
array is available. The index
corresponds to the different points of measurement, the index
to the different observations.
When using this data set please cite both Felipe et al. (2005) and Hubert et al. (2017).
Felipe J.C., Traina A.J.M., Traina C. (2005). Global warp metric distance: boosting content-based image retrieval through histograms. Proceedings of the Seventh IEEE International Symposium on Multimedia (ISM05), p.8.
Chen Y., Keogh E., Hu B., Begum N., Bagnall A., Mueen A., Batista G.J. (2015). The UCR Time Series Classification Archive. [http://www.cs.ucr.edu/~eamonn/time_series_data]
Hubert M., Rousseeuw P.J., Segaert P. (2017). Multivariate and functional classification using depth and distance. Advances in Data Analysis and Classification, 11, 445–466.
data(mri) par(mfrow = c(2,1)) matplot(y = mri$bodypart1[,,1], type = "l", col = "black", lty = 1, xlab = "", ylab="", main = "bodypart 1") matplot(y = mri$bodypart2[,,1], type = "l", col = "black", lty = 1, xlab = "", ylab="", main = "bodypart 2") par(mfrow = c(1,1))
data(mri) par(mfrow = c(2,1)) matplot(y = mri$bodypart1[,,1], type = "l", col = "black", lty = 1, xlab = "", ylab="", main = "bodypart 1") matplot(y = mri$bodypart2[,,1], type = "l", col = "black", lty = 1, xlab = "", ylab="", main = "bodypart 2") par(mfrow = c(1,1))
Near infrared absorbance spectra (NIR) of 39 gasoline samples over 226 wavelengths ranging from 1102nm to 1552nm with measurements every two nanometers. Six of the samples (25, 26, 36-39) contain added alcohol.
data(octane)
data(octane)
A three-dimensional by
by
array.
Esbensen K., Schonkopf S., Midtgaard T. (1994). Multivariate Analysis in Practice. Trondheim: Camo.
Hubert M., Rousseeuw P.J., Vanden Branden K. (2005). ROBPCA: a new approach to robust principal component analysis. Technometrics, 47, 64–79.
data(octane) matplot(octane[,1:30,1], type = "l", lty = 1, col = "black")
data(octane) matplot(octane[,1:30,1], type = "l", lty = 1, col = "black")
Computes the Stahel-Donoho outlyingness (SDO) of -dimensional points
z
relative to a -dimensional dataset
x
. For each multivariate point , its outlyingness relative to
x
is defined as its maximal univariate Stahel-Donoho outlyingness measured over all directions. To obtain the univariate Stahel-Donoho outlyingness in the direction , the dataset
x
is projected on , and the robustly standardized distance of
to the robust center of the projected data points
x
is computed.
outlyingness(x, z = NULL, options = list())
outlyingness(x, z = NULL, options = list())
x |
An |
z |
An optional |
options |
A list of available options:
|
The Stahel-Donoho outlyingness has been introduced by Stahel (1981) and Donoho (1982). It is mostly suited to measure the degree of outlyingness of multivariate points with respect to a data cloud from an elliptical distribution.
Depending on the dimension , different approximate algorithms are implemented. The affine invariant algorithm can only be used when
. It draws
ndir
times at random observations from
x
and considers the direction orthogonal to the hyperplane spanned by these observations. At most
out of
directions can be considered. The orthogonal invariant version can be applied to high-dimensional data. It draws
ndir
times at random 2 observations from x
and considers the direction through these observations. Here, at most 2 out of directions can be considered. Finally, the shift invariant version randomly draws
ndir
vectors from the unit sphere.
The resulting Stahel-Donoho outlyingness values are invariant to affine transformations, rotations and shifts respectively provided that the seed
is kept fixed at different runs of the algorithm. Note that the SDO values are guaranteed to increase when more directions are considered provided the seed is kept fixed, as this ensures that the random directions are generated in a fixed order.
An observation from x
and z
is flagged as an outlier if its SDO exceeds a cutoff value. This cutoff value is determined using the procedure in Rousseeuw et al. (2018). First, the logarithm of the SDO values is taken to render their distribution more symmetric, after which a normal approximation yields a cutoff on these values. The cutoff is then transformed back by applying the exponential function.
It is first checked whether the data lie in a subspace of dimension smaller than . If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it. Moreover, from the definition of the Stahel-Donoho outlyingness it follows that the outlyingness is ill-defined when the robust scale of the data projected on the direction
equals zero. In this case the algorithm will stop and give a warning. The returned values then include the direction
as well as an indicator specifying which of the observations of
x
belong to the hyperplane orthogonal to .
A list with components:
outlyingnessX |
Vector of length |
outlyingnessZ |
Vector of length |
cutoff |
Points whose outlyingness exceeds this cutoff can be considered as outliers with respect to |
flagX |
Observations of |
flagZ |
Points of |
singularSubsets |
When the input parameter type is equal to |
dimension |
When the data |
hyperplane |
When the data |
inSubspace |
When a direction |
P. Segaert using C++
code by K. Vakili and P. Segaert.
Stahel W.A. (1981). Robuste Schatzungen: infinitesimale Optimalitat und Schatzungen von Kovarianzmatrizen. PhD Thesis, ETH Zurich.
Donoho D.L. (1982). Breakdown properties of multivariate location estimators. Ph.D. Qualifying paper, Dept. Statistics, Harvard University, Boston.
Maronna R.A., Yohai V. (1995). The behavior of the Stahel-Donoho robust multivariate estimator. Journal of the American Statistical Association, 90, 330–341.
Rousseeuw P.J., Raymaekers J., Hubert M., (2018). A measure of directional outlyingness with applications to image data and video. Journal of Computational and Graphical Statistics, 27, 345–359.
projdepth
, projmedian
, adjOutl
, dirOutl
# Compute the outlyingness of a simple two-dimensional dataset. # Outliers are plotted in red. if (requireNamespace("robustbase", quietly = TRUE)) { BivData <- log(robustbase::Animals2) } else { BivData <- matrix(rnorm(120), ncol = 2) BivData <- rbind(BivData, matrix(c(6,6,6,-2), ncol = 2)) } Result <- outlyingness(x = BivData) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # The number of directions may be specified through # the option list. The resulting outlyingness is # monotone increasing in the number of directions. Result1 <- outlyingness(x = BivData, options = list(ndir = 50) ) Result2 <- outlyingness(x = BivData, options = list(ndir = 100) ) which(Result2$outlyingnessX - Result1$outlyingnessX < 0) # This is however not the case when the seed is changed Result1 <- outlyingness(x = BivData, options = list(ndir = 50) ) Result2 <- outlyingness(x = BivData, options = list(ndir = 100, seed = 950) ) plot(Result2$outlyingnessX - Result1$outlyingnessX, xlab = "Index", ylab = "Difference in outlyingness") # We can also consider directions through two data # points. If the sample is small enough one may opt # to search over all choose(n,2) directions. # Note that the computational load increases considerably # as n becomes larger. Result <- outlyingness(x = BivData, options = list(type = "Rotation", ndir = "all") ) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # Alternatively one may consider randomly generated directions. Result <- outlyingness(x = BivData, options = list(type = "Shift", ndir = 1000) ) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # The default option of using the MAD for the scale may be # changed to using the univariate mcd. Result <- outlyingness(x = BivData, options = list(type = "Affine", ndir = 1000, stand = "unimcd", h = 0.75*nrow(BivData)) ) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red")
# Compute the outlyingness of a simple two-dimensional dataset. # Outliers are plotted in red. if (requireNamespace("robustbase", quietly = TRUE)) { BivData <- log(robustbase::Animals2) } else { BivData <- matrix(rnorm(120), ncol = 2) BivData <- rbind(BivData, matrix(c(6,6,6,-2), ncol = 2)) } Result <- outlyingness(x = BivData) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # The number of directions may be specified through # the option list. The resulting outlyingness is # monotone increasing in the number of directions. Result1 <- outlyingness(x = BivData, options = list(ndir = 50) ) Result2 <- outlyingness(x = BivData, options = list(ndir = 100) ) which(Result2$outlyingnessX - Result1$outlyingnessX < 0) # This is however not the case when the seed is changed Result1 <- outlyingness(x = BivData, options = list(ndir = 50) ) Result2 <- outlyingness(x = BivData, options = list(ndir = 100, seed = 950) ) plot(Result2$outlyingnessX - Result1$outlyingnessX, xlab = "Index", ylab = "Difference in outlyingness") # We can also consider directions through two data # points. If the sample is small enough one may opt # to search over all choose(n,2) directions. # Note that the computational load increases considerably # as n becomes larger. Result <- outlyingness(x = BivData, options = list(type = "Rotation", ndir = "all") ) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # Alternatively one may consider randomly generated directions. Result <- outlyingness(x = BivData, options = list(type = "Shift", ndir = 1000) ) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # The default option of using the MAD for the scale may be # changed to using the univariate mcd. Result <- outlyingness(x = BivData, options = list(type = "Affine", ndir = 1000, stand = "unimcd", h = 0.75*nrow(BivData)) ) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red")
The fighter plane dataset of Thakoor and Gao (2005) describes 7 shapes: of the Mirage, Eurofighter, F-14 with wings closed, F-14 with wings opened, Harrier, F-22 and F-15. Each class contains 30 shape samples obtained from digital pictures, which Thakoor and Gao (2005) then reduced to univariate functions.
data(plane)
data(plane)
A list of arrays corresponding to each plane. For each plane, a three-dimensional by
by
array is available. The index
corresponds to the different measurement points, the index
to the different observations.
When using this data set please cite both Thakoor et al. (2005) and Hubert et al. (2017).
Thakoor N., Gao J. (2005). Shape classifier based on generalized probabilistic descent method with hidden Markov descriptor. Tenth IEEE International Conference on Computer Vision (ICCV 2005), Vol. 1: 495–502.
Chen Y., Keogh E., Hu B., Begum N., Bagnall A., Mueen A., Batista G.J. (2015). The UCR time series classification archive. [http://www.cs.ucr.edu/~eamonn/time_series_data]
Hubert M., Rousseeuw P.J., Segaert P. (2017). Multivariate and functional classification using depth and distance. Advances in Data Analysis and Classification, 11, 445–466.
data(plane) par(mfrow = c(2,1)) matplot(y = plane$plane1[,,1], type ="l", col = "black", lty = 1, xlab = "", ylab = "", main = "plane 1") matplot(y = plane$plane2[,,1], type = "l", col = "black", lty = 1, xlab = "", ylab = "", main = "plane 2") par(mfrow = c(1,1))
data(plane) par(mfrow = c(2,1)) matplot(y = plane$plane1[,,1], type ="l", col = "black", lty = 1, xlab = "", ylab = "", main = "plane 1") matplot(y = plane$plane2[,,1], type = "l", col = "black", lty = 1, xlab = "", ylab = "", main = "plane 2") par(mfrow = c(1,1))
Draws the depth contours of bivariate data computed with depthContour
.
plotContours(x, depthContour, data = TRUE)
plotContours(x, depthContour, data = TRUE)
x |
An |
depthContour |
The result of a call to |
data |
Logical value indicating whether the data |
The plot is made using ggplot2
. The plot itself is returned by the function and is fully customisable using standard ggplot2
commands.
P. Segaert
Ruts I., Rousseeuw P.J. (1996). Computing depth contours of bivariate point clouds. Computational Statistics & Data Analysis, 23, 153-168.
# Compute and plot some halfspace depth contours of a two-dimensional dataset. # The returned object is a ggplot2 object that may be edited # using standard ggplot2 commands. data(cardata90) Result <- depthContour(x = cardata90, alpha = c(0.125, 0.25)) plot <- plotContours(x = cardata90, depthContour = Result) plot + ylab("Engine displacement") + xlab("Weight in pounds") # One may consider different depth functions such as projection depth # by changing the input parameter 'type' in the depthcontour function. Result <- depthContour(x = cardata90, alpha = c(0.35, 0.55), type = "projdepth") plotContours(x = cardata90, depthContour = Result)
# Compute and plot some halfspace depth contours of a two-dimensional dataset. # The returned object is a ggplot2 object that may be edited # using standard ggplot2 commands. data(cardata90) Result <- depthContour(x = cardata90, alpha = c(0.125, 0.25)) plot <- plotContours(x = cardata90, depthContour = Result) plot + ylab("Engine displacement") + xlab("Weight in pounds") # One may consider different depth functions such as projection depth # by changing the input parameter 'type' in the depthcontour function. Result <- depthContour(x = cardata90, alpha = c(0.35, 0.55), type = "projdepth") plotContours(x = cardata90, depthContour = Result)
Computes the projection depth of -dimensional points
z
relative to a -dimensional dataset
x
.
projdepth(x, z = NULL, options = list())
projdepth(x, z = NULL, options = list())
x |
An |
z |
An optional |
options |
A list of options to pass to the underlying |
Projection depth is based on the Stahel-Donoho outlyingness (SDO) and is computed as . It is mostly suited to measure the degree of outlyingness of multivariate points with respect to a data cloud from an elliptical distribution.
It is first checked whether the data lie in a subspace of dimension smaller than . If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it.
See outlyingness
for more details on the computation of the SDO. To visualize the depth of bivariate data one can apply the mrainbowplot
function. It plots the data colored according to their depth.
The output values of this function are based on the output of the outlyingness
function. More details can be found there.
A list with components:
depthX |
Vector of length |
depthZ |
Vector of length |
cutoff |
Points whose projection depth is smaller than this cutoff can be considered as outliers. Equivalently the outliers are those points whose Stahel-Donoho outlyingness exceed the corresponding cutoff. |
flagX |
Observations of |
flagZ |
Points of |
singularSubsets |
When the input parameter type is equal to |
dimension |
When the data |
hyperplane |
When the data |
inSubspace |
When a direction |
P. Segaert
Zuo Y. (2003). Projection-based depth functions and associated medians. The Annals of Statistics, 31, 1460–1490.
outlyingness
, projmedian
, mrainbowplot
, adjOutl
, dirOutl
# Compute the projection depth of a simple two-dimensional dataset. # Outliers are plotted in red. if (requireNamespace("robustbase", quietly = TRUE)) { BivData <- log(robustbase::Animals2) } else { BivData <- matrix(rnorm(120), ncol = 2) BivData <- rbind(BivData, matrix(c(6,6, 6, -2), ncol = 2)) } Result <- projdepth(x = BivData) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # A multivariate rainbowplot may be obtained using mrainbowplot. plot.options = list(legend.title = "PD") mrainbowplot(x = BivData, depths = Result$depthX, plot.options = plot.options) # Options for the underlying outlyingness routine may be passed # using the options argument. Result <- projdepth(x = BivData, options = list(type = "Affine", ndir = 1000, stand = "MedMad", h = nrow(BivData) ) )
# Compute the projection depth of a simple two-dimensional dataset. # Outliers are plotted in red. if (requireNamespace("robustbase", quietly = TRUE)) { BivData <- log(robustbase::Animals2) } else { BivData <- matrix(rnorm(120), ncol = 2) BivData <- rbind(BivData, matrix(c(6,6, 6, -2), ncol = 2)) } Result <- projdepth(x = BivData) IndOutliers <- which(!Result$flagX) plot(BivData) points(BivData[IndOutliers,], col = "red") # A multivariate rainbowplot may be obtained using mrainbowplot. plot.options = list(legend.title = "PD") mrainbowplot(x = BivData, depths = Result$depthX, plot.options = plot.options) # Options for the underlying outlyingness routine may be passed # using the options argument. Result <- projdepth(x = BivData, options = list(type = "Affine", ndir = 1000, stand = "MedMad", h = nrow(BivData) ) )
Computes a projection depth based location estimate of a -dimensional
dataset
x
.
projmedian(x, projection.depths = NULL, options = NULL)
projmedian(x, projection.depths = NULL, options = NULL)
x |
An |
projection.depths |
Vector containing the projection depth of the observations in |
options |
A list of options to pass to the |
The algorithm depends on the function projdepth
to compute the
projection depth of the observations in x
. If these depth values have already been computed, they can be passed as an optional
argument to save computing time. If not, the projection depth values will be
computed and the user can pass a list with options to the
projdepth
function.
It is first checked whether the data lie in a subspace of dimension smaller
than . If so, a warning is given, as well as the dimension of the
subspace and a direction which is orthogonal to it.
A list with components:
max |
The point of |
Huber |
A weighted center of gravity of all observations.
The weights are defined by the Huber
function with parameter
|
P. Segaert
Maronna, R.A., Yohai, V.J. (1995). The behavior of the Stahel-Donoho robust multivariate estimator. Journal of the American Statistical Association, 90, 330–341.
outlyingness
, projdepth
, adjOutl
, dirOutl
# Compute a location estimate of a two-dimensional dataset. if (requireNamespace("robustbase", quietly = TRUE)) { BivData <- log(robustbase::Animals2) } else { BivData <- matrix(rnorm(120), ncol = 2) BivData <- rbind(BivData, matrix(c(6, 6, 6, -2), ncol = 2)) } result <- projmedian(x = BivData) plot(BivData, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) points(result$Huber, col = "blue", pch = 3, cex = 1.5) # Options for the underlying projdepth routine may be passed # using the options argument. result <- projmedian(x = BivData, options = list(type = "Rotation", ndir = 100, stand = "unimcd", h = 0.75*nrow(BivData))) plot(BivData, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) points(result$Huber, col = "blue", pch = 3, cex = 1.5) # One may also compute the depth values of the observations in the data # separately. This avoids having to recompute them when computing the median. depth.result <- projdepth(x = BivData) result <- projmedian(x = BivData, projection.depths = depth.result$depthX)
# Compute a location estimate of a two-dimensional dataset. if (requireNamespace("robustbase", quietly = TRUE)) { BivData <- log(robustbase::Animals2) } else { BivData <- matrix(rnorm(120), ncol = 2) BivData <- rbind(BivData, matrix(c(6, 6, 6, -2), ncol = 2)) } result <- projmedian(x = BivData) plot(BivData, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) points(result$Huber, col = "blue", pch = 3, cex = 1.5) # Options for the underlying projdepth routine may be passed # using the options argument. result <- projmedian(x = BivData, options = list(type = "Rotation", ndir = 100, stand = "unimcd", h = 0.75*nrow(BivData))) plot(BivData, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) points(result$Huber, col = "blue", pch = 3, cex = 1.5) # One may also compute the depth values of the observations in the data # separately. This avoids having to recompute them when computing the median. depth.result <- projdepth(x = BivData) result <- projmedian(x = BivData, projection.depths = depth.result$depthX)
Computes the regression depth of several hyperplanes with
respect to a multiple regression dataset with explanatory variables.
The computation is exact for
.
An approximate algorithm is used for
.
rdepth(x, z = NULL, ndir = NULL)
rdepth(x, z = NULL, ndir = NULL)
x |
An |
z |
An |
ndir |
Controls the number of directions when the approximate
algorithm is used. Defaults to |
Regression depth has been introduced in Rousseeuw and Hubert (1999). To compute the regression depth of a hyperplane, different algorithms are used. When it can be computed exactly. In higher dimensions an approximate algorithm is used (Rousseeuw and Struyf 1998).
It is first checked whether the data x
lie in a subspace of
dimension smaller than . If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it.
A list with components:
depthZ |
A vector containing the regression depth of the hyperplanes in |
singularSubsets |
A vector of length |
dimension |
When the data |
hyperplane |
When the data |
P. Segaert using Fortran code by P.J. Rousseeuw, I. Ruts and A. Struyf
Rousseeuw P.J., Hubert M. (1999). Regression depth. Journal of the American Statistical Association, 94, 388–402.
Rousseeuw P.J., Struyf A. (1998). Computing location depth and regression depth in higher dimensions. Statistics and Computing, 45, 193–203.
# Illustrate the concept of regression depth in the case of simple # linear regression. data(stars) # Compute the least squares fit. Due to outliers # this fit will be bad and thus should result in a small # regression depth. temp <- lsfit(x = stars[,1], y = stars[,2])$coefficients intercept <- temp[1] slope <- temp[2] plot(stars, ylim = c(4,7)) abline(a = intercept, b = slope) abline(a = -9.2, b = 3.2, col = "red") # Let us compare the regression depth of the least squares fit # with the depth of the better fitting red regression line. z <- rbind(cbind(intercept, slope), cbind(-9.2, 3.2)) result <- rdepth(x = stars, z = z) result$depthZ text(x = 3.8, y = 5.3, labels = round(result$depthZ[1], digits =2)) text(x = 4.45, y = 4.8, labels = round(result$depthZ[2], digits =2), col = "red") # Compute the depth of some other regression lines to illustrate the concept. # Note that the stars data set has 47 observations and 1/47 = 0.0212766. z <- rbind(cbind(6.2, 0), cbind(6.5, 0)) result <- rdepth(x = stars, z = z) abline(a = 6.2, b = 0, col = "blue") abline(a = 6.5, b = 0, col = "darkgreen") text(x = 3.8, y = 6.25, labels = round(result$depthZ[1], digits = 2), col = "blue") text(x = 4, y = 6.55, labels = round(result$depthZ[2], digits = 2), col = "darkgreen") # One point needs to removed to make the blue line a nonfit. The green line is # clearly a nonfit.
# Illustrate the concept of regression depth in the case of simple # linear regression. data(stars) # Compute the least squares fit. Due to outliers # this fit will be bad and thus should result in a small # regression depth. temp <- lsfit(x = stars[,1], y = stars[,2])$coefficients intercept <- temp[1] slope <- temp[2] plot(stars, ylim = c(4,7)) abline(a = intercept, b = slope) abline(a = -9.2, b = 3.2, col = "red") # Let us compare the regression depth of the least squares fit # with the depth of the better fitting red regression line. z <- rbind(cbind(intercept, slope), cbind(-9.2, 3.2)) result <- rdepth(x = stars, z = z) result$depthZ text(x = 3.8, y = 5.3, labels = round(result$depthZ[1], digits =2)) text(x = 4.45, y = 4.8, labels = round(result$depthZ[2], digits =2), col = "red") # Compute the depth of some other regression lines to illustrate the concept. # Note that the stars data set has 47 observations and 1/47 = 0.0212766. z <- rbind(cbind(6.2, 0), cbind(6.5, 0)) result <- rdepth(x = stars, z = z) abline(a = 6.2, b = 0, col = "blue") abline(a = 6.5, b = 0, col = "darkgreen") text(x = 3.8, y = 6.25, labels = round(result$depthZ[1], digits = 2), col = "blue") text(x = 4, y = 6.55, labels = round(result$depthZ[2], digits = 2), col = "darkgreen") # One point needs to removed to make the blue line a nonfit. The green line is # clearly a nonfit.
Computes the deepest regression, i.e. the hyperplane with maximal
regression depth given a regression dataset with explanatory variables. The computation is exact for simple regression, and approximate otherwise.
rdepthmedian(x, maxit = NULL)
rdepthmedian(x, maxit = NULL)
x |
An |
maxit |
The maximum number of iterations. |
In simple regression the deepest regression fit can be computed exactly by considering all lines through two data points and taking the one with maximal regression depth. If several lines have the same maximal regression depth, their average is taken.
When , the approximate MEDSWEEP algorithm is applied (Van Aelst et al, 2002).
It is first checked whether the data lie in a subspace of
dimension smaller than . If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it.
A list with components:
deepest |
A |
depth |
The depth of the deepest regression hyperplane. |
niter |
The number of performed iterations used in the medsweep algorithm. |
dimension |
When the data |
hyperplane |
When the data |
P. Segaert using Fortran code by S. Van Aelst
Van Aelst S., Rousseeuw P.J., Hubert M., Struyf A. (2002). The deepest regression method. Journal of Multivariate Analysis, 81, 138–166.
# Illustrate the concept of deepest regression line in the case of simple # linear regression. data(stars) plot(stars, pch=16) result <- rdepthmedian(x = stars) abline(result$deepest, col="blue", lwd=2) x <- matrix(rnorm(3000), ncol = 3) + 10 rdepthmedian(x = x)
# Illustrate the concept of deepest regression line in the case of simple # linear regression. data(stars) plot(stars, pch=16) result <- rdepthmedian(x = stars) abline(result$deepest, col="blue", lwd=2) x <- matrix(rnorm(3000), ncol = 3) + 10 rdepthmedian(x = x)
Computes the simplicial depth of -dimensional points
z
relative to a -dimensional dataset
x
. Only dimension is supported.
sdepth(x, z = NULL)
sdepth(x, z = NULL)
x |
An |
z |
An optional |
The simplicial depth has been introduced by Liu (1990). The simplicial depth of a point is defined as the number of simplices with vertices in
x
that contain .
Exact computation of the simplicial depth for bivariate data is performed by means of the algorithm described in Rousseeuw and Ruts (1996).
To visualize the depth of bivariate data one can apply the
mrainbowplot
function. It plots the data with coloring according to their depth.
It is first checked whether the data lie in a subspace of dimension smaller than . If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it.
A list with components:
depthZ |
Vector of length |
dimension |
When the data |
hyperplane |
When the data |
P. Segaert, based on Fortran code by P.J. Rousseeuw, I. Ruts and A. Struyf.
Liu R. (1990). On a notion of data depth based on random simplices. The Annals of Statistics, 18, 405–414.
Rousseeuw P.J., Ruts I. (1996). AS 307: Bivariate location depth. Applied Statistics, 45, 516–526.
data(bloodfat) Result <- sdepth(x = bloodfat) mrainbowplot(bloodfat, depth = Result$depthZ)
data(bloodfat) Result <- sdepth(x = bloodfat) mrainbowplot(bloodfat, depth = Result$depthZ)
Computes the skewness-adjusted projection depth of -dimensional
points
z
relative to a -dimensional dataset
x
.
sprojdepth(x, z = NULL, options = NULL)
sprojdepth(x, z = NULL, options = NULL)
x |
An |
z |
An optional |
options |
A list of options to pass to the underlying |
Skewness-adjusted projection depth is based on the adjusted
outlyingness and is computed as . As adjusted
outlyingness extends the Stahel-Donoho outlyingness towards
skewed distributions, the skewness-adjusted projection depth
is suited for both elliptical distributions and skewed
multivariate data.
It is first checked whether the data is found to lie in a subspace of
dimension lower than . If so, a warning is given, as well as the
dimension of the subspace and a direction which is orthogonal to it.
See adjOutl
for more details on the computation of the AO.
To visualize the depth of bivariate data one can apply the
mrainbowplot
function. It plots the data colored according to
their depth.
The output values of this function are based on the output of the
adjOutl
function. More details can be found there.
A list with components:
depthX |
Vector of length |
depthZ |
Vector of length |
cutoff |
Points whose skew-adjusted projection depth is smaller than this cutoff can be considered as outliers. |
flagX |
Observations of |
flagZ |
Points of |
singularSubsets |
When the input parameter type is equal to |
dimension |
When the data |
hyperplane |
When the data |
inSubspace |
When a direction |
P. Segaert with original code from M. Maechler, G. Brys, K. Vakili
Hubert M., Van der Veeken S. (2008). Outlier detection for skewed data. Journal of Chemometrics, 22, 235–246.
Hubert M, Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection. Statistical Methods & Applications, 24, 177–202.
adjOutl
, sprojmedian
, mrainbowplot
, dirOutl
, outlyingness
# Compute the skewness-adjusted projection depth # of a two-dimensional dataset. data(bloodfat) Result <- sprojdepth(x = bloodfat) # A multivariate rainbowplot may be obtained using mrainbowplot. plot.options = list(legend.title = "SPD") mrainbowplot(x = bloodfat, depths = Result$depthX, plot.options = plot.options) # Options for the underlying outlyingness routine may be passed # using the options argument. Result <- sprojdepth(x = bloodfat, options = list(type = "Affine", ndir = 1000, seed = 12345 ) )
# Compute the skewness-adjusted projection depth # of a two-dimensional dataset. data(bloodfat) Result <- sprojdepth(x = bloodfat) # A multivariate rainbowplot may be obtained using mrainbowplot. plot.options = list(legend.title = "SPD") mrainbowplot(x = bloodfat, depths = Result$depthX, plot.options = plot.options) # Options for the underlying outlyingness routine may be passed # using the options argument. Result <- sprojdepth(x = bloodfat, options = list(type = "Affine", ndir = 1000, seed = 12345 ) )
Computes a skewness-adjusted projection depth based location estimate of a
-dimensional dataset
x
.
sprojmedian(x, sprojection.depths = NULL, options = NULL)
sprojmedian(x, sprojection.depths = NULL, options = NULL)
x |
An |
sprojection.depths |
Vector containing the skewness-adjusted projection
depth values of the observations in |
options |
A list of options to pass to the |
The algorithm depends on the function sprojdepth
to compute the
skewness-adjusted projection depth values of the observations in x
.
If these depth have already been
computed they can be passed as an optional argument to save computing time.
If not, the skewness-adjusted projection depth values will be computed and the user
can pass a list with options to the sprojdepth
function.
It is first checked whether the data lie in a subspace of dimension smaller
than . If so, a warning is given, as well as the dimension of the
subspace and a direction which is orthogonal to it.
A list with component:
max |
The point of |
P. Segaert
Hubert M., Van der Veeken S. (2008). Outlier detection for skewed data. Journal of Chemometrics, 22, 235–246.
Hubert M., Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection. Statistical Methods & Applications, 24, 177–202.
adjOutl
, sprojdepth
, dirOutl
, outlyingness
# Compute a location estimate of a two-dimensional dataset. data(bloodfat) result <- sprojmedian(x = bloodfat) plot(bloodfat, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) # Options for the underlying sprojdepth routine may be passed # using the options argument. result <- sprojmedian(x = bloodfat, options = list(type = "Rotation", ndir = 1000 ) ) plot(bloodfat, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) # One may also compute the depth values of the observations in the data # separately. This avoids having to recompute them when computing the median. depth.result <- sprojdepth(x = bloodfat) result <- sprojmedian(x = bloodfat, sprojection.depths = depth.result$depthX)
# Compute a location estimate of a two-dimensional dataset. data(bloodfat) result <- sprojmedian(x = bloodfat) plot(bloodfat, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) # Options for the underlying sprojdepth routine may be passed # using the options argument. result <- sprojmedian(x = bloodfat, options = list(type = "Rotation", ndir = 1000 ) ) plot(bloodfat, pch = 16) points(result$max, col = "red", pch = 18, cex = 1.5) # One may also compute the depth values of the observations in the data # separately. This avoids having to recompute them when computing the median. depth.result <- sprojdepth(x = bloodfat) result <- sprojmedian(x = bloodfat, sprojection.depths = depth.result$depthX)
Data corresponding to a Hertzsprung-Russel diagram of the star cluster CYG OB1 containing 47 stars in the direction of Cygnus. A typical Hertzsprung-Russel diagram shows the logarithm of the temperature in reverse order from high to low.
The data has been taken from Humphreys (1978) by C. Doom who calibrated them according to Vansina and De Greve (1982).
data(stars)
data(stars)
A data frame containing the following variables:
LogTemp
Logarithm of the effective temperature at the star's surface.
LogLight
Logarithm of a star's light intensity.
Humphreys R.M. (1978). Studies of luminous stars in nearby galaxies. I. Supergiants and O stars in the milky way. Astrophysics Journal Supplementary Series, 38, 309–350.
Vansima F., De Greve J.P. (1982). Close binary systems before and after mass transfer. Astrophysics and Space Science, 87, 377–401.
Rousseeuw P.J., Leroy A. (1987). Robust Regression and Outlier Detection. New York: Wiley.
Hand D.J., Daly F., Lunn A., McConway A. (1994). A Handbook of Small Data Sets. Londen: Chapman and Hall, dataset 367.
data(stars) plot(stars, xlim = rev(range(stars[,1])))
data(stars) plot(stars, xlim = rev(range(stars[,1])))
A test based on halfspace depth for angular symmetry of the bivariate data set x
around the point z
. The test is only valid when x
contains no duplicates.
symtest(x, z, options=list())
symtest(x, z, options=list())
x |
An |
z |
A vector of length |
options |
A list of options to pass to |
The following hypothesis test is performed: : The data come from a continuous distribution
which is angularly symmetric about
z
.
The test statistic being used is the halfspace depth of z
with respect to x
. Under the null hypothesis the halfspace depth of z
equals . The distribution of the teststatistic under
is
. The
-value of the test is
with
.
pval |
The |
P. Segaert
Rousseeuw P.J, Struyf A. (2002). A depth test for symmetry. In: Goodness-of-Fit Tests and Model Validity, Birkhäuser Boston, pages 401–412.
# Perform the test on a simple data example. data(cardata90) deepest <- hdepthmedian(cardata90)$median symtest(x = cardata90[!duplicated(cardata90), ], z = deepest) plot(cardata90) points(deepest[1], deepest[2], pch=18, col="red", cex=1.2)
# Perform the test on a simple data example. data(cardata90) deepest <- hdepthmedian(cardata90)$median symtest(x = cardata90[!duplicated(cardata90), ], z = deepest) plot(cardata90) points(deepest[1], deepest[2], pch=18, col="red", cex=1.2)
The original data set consists of Near Infrared Spectroscopy (NIR) data of a batch of pills with different levels of the active compound. This data set considers 70 observations weighing 90 mg and 20 observations weighing 250 mg and also differ in the amount of active compound.
The data correspond to the form discussed in Hubert et al. (2015), see below.
data(tablets)
data(tablets)
A three dimensional by
by
array,
with
the number of observed time points,
the number of functional observations
and
the number of measurements
for every functional observation at every wavelength.
For each wavelength and each curve, a three-dimensional vector of observations is available. The second coordinate corresponds to the original data, the first coordinate corresponds to the mean of the corresponding curve. The last component corresponds to the derivative of the original curve data.
When using this data, please cite both of the references listed below.
Dyrby M., Engelsen S.B., Norgaard L., Bruhn M., Nielsen L. (2002). Chemometric quantitation of the active substance in a pharmaceutical tablet Using near infrared (NIR) transmittance and NIR FT-Raman spectra. Applied Spectroscopy, 56, 579–585.
Hubert M., Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection (with rejoinder). Statistical Methods & Applications, 24, 177–202.
data(tablets) par(mfrow = c(3,1)) matplot(tablets[,,1], type = "l", lty = 1, col = "black") matplot(tablets[,,2], type = "l", lty = 1, col = "black") matplot(tablets[,,3], type = "l", lty = 1, col = "black") par(mfrow = c(1,1))
data(tablets) par(mfrow = c(3,1)) matplot(tablets[,,1], type = "l", lty = 1, col = "black") matplot(tablets[,,2], type = "l", lty = 1, col = "black") matplot(tablets[,,3], type = "l", lty = 1, col = "black") par(mfrow = c(1,1))
The original data set consists of Proton Nuclear Magnetic Resonance (NMR) spectra of 40 different wine samples in the spectral region from 6.00ppm to 0.50ppm. This data set corresponds to the region between wavelengths 5.62ppm and 5.37ppm only for which measurements are available for each curve. The data has been analyzed in Hubert et al. (2015), see below.
data(wine)
data(wine)
A three dimensional by
by
array,
with
the number of observed time points,
the number of functional observations
and
the number of measurements
for every functional observation at every wavelength.
When using this data set, please cite both of the references below.
Larsen F., van den Berg F., Engelsen S. (2006). An exploratory chemometric study of NMR spectra of table wines. Journal of Chemometrics, 20, 198–208.
Hubert M., Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection (with rejoinder). Statistical Methods & Applications, 24, 177–202.
data(wine) matplot(wine[,,1], type = "l", lty = 1, col = "black")
data(wine) matplot(wine[,,1], type = "l", lty = 1, col = "black")