Title: | Outlier Detection of Functional Data Based on the Minimum Regularized Covariance Trace Estimator |
---|---|
Description: | Detect outlying observations in functional data sets based on the minimum regularized covariance trace (MRCT) estimator. Includes implementation of Oguamalam et al. (2023) <arXiv:2307.13509>. |
Authors: | Jeremy Oguamalam [aut, cre], Una Radojičić [aut], Peter Filzmoser [aut] |
Maintainer: | Jeremy Oguamalam <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.0.1.0 |
Built: | 2024-11-25 06:31:56 UTC |
Source: | CRAN |
functionsCalculate all pairwise inner products between elements from supplied to this function. The integral is approximated by the Trapezoidal rule for uniform grids:
whereas is an uniform grid on
such that
and
the step size, i.e.
.
Therefore, it is assumed that the functions are evaluated at the same, equidistant grid.
innerProduct(grid, data)
innerProduct(grid, data)
grid |
A numeric vector of the uniform grid on which the functions are evaluated. |
data |
A numeric matrix. Each function has to be a vector stored in a column of |
Numeric symmetric matrix containing the approximated pairwise inner products between the functions supplied by data
. The entry of the result is the inner product
between the
-th and
-th column of
data
.
# Create orthogonal fourier basis via `fdapace` package library(fdapace) basis <- fdapace::CreateBasis(K = 10, type = "fourier") iP <- innerProduct(grid = seq(0, 1, length.out = 50), # default grid in CreateBasis() data = basis) round(iP,3) # Since the basis is orthogonal, the resulting matrix will be the identity matrix.
# Create orthogonal fourier basis via `fdapace` package library(fdapace) basis <- fdapace::CreateBasis(K = 10, type = "fourier") iP <- innerProduct(grid = seq(0, 1, length.out = 50), # default grid in CreateBasis() data = basis) round(iP,3) # Since the basis is orthogonal, the resulting matrix will be the identity matrix.
Functional outlier detection based on the minimum
regularized
covariance
trace
estimator (Oguamalam et al. 2023) as a robust covariance estimator.
This estimator uses a generalization of the Mahalanobis distance for the functional setting (Berrendero et al. 2020) and a corresponding theoretical cutoff value.
mrct( data, h = 0.75, alpha = 0.01, initializations = 5, subset.iteration = 10, seed = 123, scaling.iterations = 10, scaling.tolerance = 10^(-4), criterion = "sum", sum.percentage = 0.75 )
mrct( data, h = 0.75, alpha = 0.01, initializations = 5, subset.iteration = 10, seed = 123, scaling.iterations = 10, scaling.tolerance = 10^(-4), criterion = "sum", sum.percentage = 0.75 )
data |
Numeric matrix of a functional data set for which the esimator has to be calculated. Each row contains an observation. They are assumed to be observed on the same regular grid. |
h |
Numeric value between |
alpha |
Numeric (default is |
initializations |
Integer (default is |
subset.iteration |
Integer (default is |
seed |
Integer (default is |
scaling.iterations |
Integer (default is |
scaling.tolerance |
Numeric (default is |
criterion |
Character. Criterion based on which the optimal subset is chosen among the final subsets. Possible options are: " |
sum.percentage |
Numeric value between |
A list:
theoretical |
Integer vector of the indices corresponding to the outliers based on the MRCT estimator. |
theoretical.w |
Same as |
aMHD |
Numeric vector containing the functional Mahalanobis distances of all observations based on the MRCT estimator. |
aMHD.w |
Same as |
quant |
Numeric. Theoretical cutoff value for outlier detection. |
quant.w |
Same as |
k |
Numeric. Scalingparameter |
k.w |
Same as |
optimal.subset |
Integer vector of the optimal h-subset. |
subsets |
Numeric matrix containing all final subsets. Each row of |
objval |
Numeric vector with the objective values of the final subsets based on |
Berrendero JR, Bueno-Larraz B, Cuevas A (2020). “On Mahalanobis Distance in Functional Settings.” J. Mach. Learn. Res., 21(9), 1–33..
Oguamalam J, Radojičić U, Filzmoser P (2023). “Minimum regularized covariance trace estimator and outlier detection for functional data.” https://doi.org/10.48550/arXiv.2307.13509..
# Fix seed for reproducibility set.seed(123) # Sample outlying indices cont.ind <- sample(1:50, size=10) # Generate 50 curves on the interval [0,1] at 50 timepoints with 20% outliers y <- mrct.rgauss(x.grid=seq(0,1,length.out=50), N=50, model=1, outliers=cont.ind, method="linear") # Visualize curves (regular curves grey, outliers black) colormap <- rep("grey",50); colormap[cont.ind] <- "black" matplot(x=seq(0,1,length.out=50), y=t(y), type="l", lty="solid", col=colormap, xlab="t",ylab="") # Run MRCT mrct.y <- mrct(data=y, h=0.75, alpha=0.1, initializations=10, criterion="sum") # Visualize alpha-Mahalanobis distance with cutoff (horizontal black line) # Colors correspond to simulated outliers, shapes to estimated (MRCT) ones # (circle regular and triangle irregular curves) shapemap <- rep(1,50); shapemap[mrct.y$theoretical.w] <- 2 plot(x=1:50, y=mrct.y$aMHD.w, col=colormap, pch=shapemap, xlab="Index", ylab=expression(alpha*"-MHD")) abline(h = mrct.y$quant.w) # If you dont have any information on possible outliers, # alternatively you could use the S3 method plot.mrct() mrct.plot(mrct.y)
# Fix seed for reproducibility set.seed(123) # Sample outlying indices cont.ind <- sample(1:50, size=10) # Generate 50 curves on the interval [0,1] at 50 timepoints with 20% outliers y <- mrct.rgauss(x.grid=seq(0,1,length.out=50), N=50, model=1, outliers=cont.ind, method="linear") # Visualize curves (regular curves grey, outliers black) colormap <- rep("grey",50); colormap[cont.ind] <- "black" matplot(x=seq(0,1,length.out=50), y=t(y), type="l", lty="solid", col=colormap, xlab="t",ylab="") # Run MRCT mrct.y <- mrct(data=y, h=0.75, alpha=0.1, initializations=10, criterion="sum") # Visualize alpha-Mahalanobis distance with cutoff (horizontal black line) # Colors correspond to simulated outliers, shapes to estimated (MRCT) ones # (circle regular and triangle irregular curves) shapemap <- rep(1,50); shapemap[mrct.y$theoretical.w] <- 2 plot(x=1:50, y=mrct.y$aMHD.w, col=colormap, pch=shapemap, xlab="Index", ylab=expression(alpha*"-MHD")) abline(h = mrct.y$quant.w) # If you dont have any information on possible outliers, # alternatively you could use the S3 method plot.mrct() mrct.plot(mrct.y)
Calculates the approximation of the integrated square error between the estimated covariance based on non-outlying curves of a data set determined by the MRCT estimator and the true kernel for one of the three outlier settings in the simulation study of Oguamalam et al. 2023.
mrct.ise(data, outliers.est, model)
mrct.ise(data, outliers.est, model)
data |
Numeric matrix of a functional data set for which the esimator has to be calculated. Each row contains an observation. They are assumed to be observed on the same regular grid. |
outliers.est |
Integer vector containing the indices of outliers. |
model |
Integer. |
Numeric value containing the approximated integrated square error between estimated and theoretical covariance.
Oguamalam J, Radojičić U, Filzmoser P (2023). “Minimum regularized covariance trace estimator and outlier detection for functional data.” https://doi.org/10.48550/arXiv.2307.13509..
# Fix seed for reproducibility set.seed(124) # Sample outlying indices cont.ind <- sample(1:100,size=10) # Generate 100 curves on the interval [0,1] at 150 timepoints with 20% outliers. y <- mrct.rgauss(x.grid=seq(0,1,length.out=150), N=100, model=1, outliers=cont.ind, method="linear") # Run MRCT mrct.y <- mrct(data=y, h=0.75, alpha=0.1, initializations=10, criterion="sum") # Two additional curves are regarded as outlying according to the algorithm mrct.y$theoretical.w %in% cont.ind # Compare the ISE between true kernel and 1) true non-outliers, # 2) estimated non-outliers and 3) the complete data ise1 <- mrct.ise(data=y, outliers.est=cont.ind, model=1) ise2 <- mrct.ise(data=y, outliers.est=mrct.y$theoretical.w, model=1) ise3 <- mrct.ise(data=y, outliers.est=c(), model=1) ise1; ise2; ise3
# Fix seed for reproducibility set.seed(124) # Sample outlying indices cont.ind <- sample(1:100,size=10) # Generate 100 curves on the interval [0,1] at 150 timepoints with 20% outliers. y <- mrct.rgauss(x.grid=seq(0,1,length.out=150), N=100, model=1, outliers=cont.ind, method="linear") # Run MRCT mrct.y <- mrct(data=y, h=0.75, alpha=0.1, initializations=10, criterion="sum") # Two additional curves are regarded as outlying according to the algorithm mrct.y$theoretical.w %in% cont.ind # Compare the ISE between true kernel and 1) true non-outliers, # 2) estimated non-outliers and 3) the complete data ise1 <- mrct.ise(data=y, outliers.est=cont.ind, model=1) ise2 <- mrct.ise(data=y, outliers.est=mrct.y$theoretical.w, model=1) ise3 <- mrct.ise(data=y, outliers.est=c(), model=1) ise1; ise2; ise3
mrct()
A function for descriptive plots for an object resulting from a call to mrct()
.
mrct.plot(mrct.object)
mrct.plot(mrct.object)
mrct.object |
A result from a call to |
Descriptive plots
aMHD.plot |
Alpha-Mahalanobis distances, corresponding cutoff values and coloring according to estimated outliers (grey regular, black irregular). |
aMHD.plot.w |
Same as |
# Similar to example in mrct() helpfile # Fix seed for reproducibility set.seed(123) # Sample outlying indices cont.ind <- sample(1:50, size=10) # Generate 50 curves on the interval [0,1] at 50 timepoints with 20% outliers y <- mrct.rgauss(x.grid=seq(0,1,length.out=50), N=50, model=1, outliers=cont.ind, method="linear") # Visualize curves (regular curves grey, outliers black) colormap <- rep("grey",50); colormap[cont.ind] <- "black" matplot(x=seq(0,1,length.out=50), y=t(y), type="l", lty="solid", col=colormap, xlab="t",ylab="") # Run MRCT mrct.y <- mrct(data=y, h=0.75, alpha=0.1, initializations=10, criterion="sum") # Visualize alpha-Mahalanobis distance # Colorinfromation according to estimated outliers (grey regular, black irregular) mrct.plot(mrct.y)
# Similar to example in mrct() helpfile # Fix seed for reproducibility set.seed(123) # Sample outlying indices cont.ind <- sample(1:50, size=10) # Generate 50 curves on the interval [0,1] at 50 timepoints with 20% outliers y <- mrct.rgauss(x.grid=seq(0,1,length.out=50), N=50, model=1, outliers=cont.ind, method="linear") # Visualize curves (regular curves grey, outliers black) colormap <- rep("grey",50); colormap[cont.ind] <- "black" matplot(x=seq(0,1,length.out=50), y=t(y), type="l", lty="solid", col=colormap, xlab="t",ylab="") # Run MRCT mrct.y <- mrct(data=y, h=0.75, alpha=0.1, initializations=10, criterion="sum") # Visualize alpha-Mahalanobis distance # Colorinfromation according to estimated outliers (grey regular, black irregular) mrct.plot(mrct.y)
Generate random samples of Gaussian process on a uniform grid for different settings of the simulation study in Oguamalam et al. 2023.
mrct.rgauss( x.grid, N, seed = 123, model, outliers, sigma = 1, l = 1, method = "linear" )
mrct.rgauss( x.grid, N, seed = 123, model, outliers, sigma = 1, l = 1, method = "linear" )
x.grid |
Numeric vector containing a uniform grid on which the process is defined. |
N |
Integer number of observations to generate. |
seed |
Integer (default is |
model |
Integer. Either |
outliers |
Integer vector containing the indices of outliers. If empty, then only regular curves will be generated. |
sigma , l
|
Numeric values with default equal to |
method |
Different types of covariance kernels. Possible options are "
"
or "
. |
Numeric matrix with rows and
length(x.grid)
columns containing the randomly generated curves following a Gaussian process.
Each observations is a row of the result.
Oguamalam J, Radojičić U, Filzmoser P (2023). “Minimum regularized covariance trace estimator and outlier detection for functional data.” https://doi.org/10.48550/arXiv.2307.13509..
# Fix seed for reproducibility set.seed(123) # Sample outlying indices cont.ind <- sample(1:50,size=10) # Generate 50 curves on the interval [0,1] at 50 timepoints with 20% outliers y <- mrct.rgauss(x.grid=seq(0,1,length.out=50), N=50 ,model=1, outliers=cont.ind) # Visualize curves (regular curves grey, outliers black) colormap <- rep("grey",50); colormap[cont.ind] <- "black" matplot(x=seq(0,1,length.out=50), y=t(y), type="l", lty="solid", col=colormap, xlab="t",ylab="")
# Fix seed for reproducibility set.seed(123) # Sample outlying indices cont.ind <- sample(1:50,size=10) # Generate 50 curves on the interval [0,1] at 50 timepoints with 20% outliers y <- mrct.rgauss(x.grid=seq(0,1,length.out=50), N=50 ,model=1, outliers=cont.ind) # Visualize curves (regular curves grey, outliers black) colormap <- rep("grey",50); colormap[cont.ind] <- "black" matplot(x=seq(0,1,length.out=50), y=t(y), type="l", lty="solid", col=colormap, xlab="t",ylab="")
Robust outlier detection for sparse functional data as a generalization of the minimum
regularized
covariance
trace
(MRCT) estimator (Oguamalam et al. 2023). At first the observations are smoothed
by a B-spline basis and afterwards the MRCT algorithm is performed with the matrix of basis coefficients.
mrct.sparse( data, nbasis = dim(data)[2], new.p = dim(data)[2], h = 0.75, alpha = 0.01, initializations = 5, seed = 123, scaling.iterations = 10, scaling.tolerance = 10^(-4), criterion = "sum", sum.percentage = 0.75 )
mrct.sparse( data, nbasis = dim(data)[2], new.p = dim(data)[2], h = 0.75, alpha = 0.01, initializations = 5, seed = 123, scaling.iterations = 10, scaling.tolerance = 10^(-4), criterion = "sum", sum.percentage = 0.75 )
data |
Numeric matrix of a functional data set for which the esimator has to be calculated. Each row contains an observation. They are assumed to be observed on the same (probably sparse) regular grid. The number of grid points must be at least |
nbasis |
Integer. Number of B-spline basis functions for smoothing. The basis will be of order |
new.p |
Integer. Length of the grid of the smoothed curves. The resulting grid will be an equidistant partition of |
h |
Numeric value between |
alpha |
Numeric (default is |
initializations |
Integer (default is |
seed |
Integer (default is |
scaling.iterations |
Integer (default is |
scaling.tolerance |
Numeric (default is |
criterion |
Character. Criterion based on which the optimal subset is chosen among the final subsets. Possible options are: " |
sum.percentage |
Numeric value between |
A list with two entries
mrct.output |
List. The same output as the function |
data.smooth |
Numeric matrix. Collection of the smoothed curves of |
Oguamalam J, Radojičić U, Filzmoser P (2023). “Minimum regularized covariance trace estimator and outlier detection for functional data.” https://doi.org/10.48550/arXiv.2307.13509..
# Fix seed for reproducibility set.seed(123) # Sample outlying indices cont.ind <- sample(1:50,size=10) # Generate 50 sparse curves on the interval [0,1] at 10 timepoints with 20% outliers y <- mrct.rgauss(x.grid=seq(0,1,length.out=10), N=50, model=1, outliers=cont.ind, method="linear") # Visualize curves (regular curves grey, outliers black) colormap <- rep("grey",50); colormap[cont.ind] <- "black" matplot(x = seq(0,1,length.out=10), y = t(y), type="l", lty="solid", col=colormap, xlab="t",ylab="") # Run sparse MRCT sparse.mrct.y <- mrct.sparse(data = y, nbasis = 10, h = 0.75, new.p = 50, alpha = 0.1, initializations = 10, criterion = "sum" ) # Visualize smoothed functions matplot(x=seq(0,1,length.out=50), y=t(sparse.mrct.y$data.smooth), type="l", lty="solid", col=colormap, xlab="t", ylab="") # Visualize alpha-Mahalanobis distance with cutoff (horizontal black line) # Colors correspond to simulated outliers, shapes to estimated (sparse MRCT) ones # (circle regular and triangle irregular curves) shapemap <- rep(1,50); shapemap[sparse.mrct.y$mrct.output$theoretical.w] <- 2 plot(x = 1:50, y = sparse.mrct.y$mrct.output$aMHD.w, col=colormap, pch = shapemap, xlab = "Index", ylab = expression(alpha*"-MHD")) abline(h = sparse.mrct.y$mrct.output$quant.w) # If you dont have any information on possible outliers, # alternatively you could use the S3 method plot.mrctsparse() mrct.sparse.plot(mrct.sparse.object = sparse.mrct.y)
# Fix seed for reproducibility set.seed(123) # Sample outlying indices cont.ind <- sample(1:50,size=10) # Generate 50 sparse curves on the interval [0,1] at 10 timepoints with 20% outliers y <- mrct.rgauss(x.grid=seq(0,1,length.out=10), N=50, model=1, outliers=cont.ind, method="linear") # Visualize curves (regular curves grey, outliers black) colormap <- rep("grey",50); colormap[cont.ind] <- "black" matplot(x = seq(0,1,length.out=10), y = t(y), type="l", lty="solid", col=colormap, xlab="t",ylab="") # Run sparse MRCT sparse.mrct.y <- mrct.sparse(data = y, nbasis = 10, h = 0.75, new.p = 50, alpha = 0.1, initializations = 10, criterion = "sum" ) # Visualize smoothed functions matplot(x=seq(0,1,length.out=50), y=t(sparse.mrct.y$data.smooth), type="l", lty="solid", col=colormap, xlab="t", ylab="") # Visualize alpha-Mahalanobis distance with cutoff (horizontal black line) # Colors correspond to simulated outliers, shapes to estimated (sparse MRCT) ones # (circle regular and triangle irregular curves) shapemap <- rep(1,50); shapemap[sparse.mrct.y$mrct.output$theoretical.w] <- 2 plot(x = 1:50, y = sparse.mrct.y$mrct.output$aMHD.w, col=colormap, pch = shapemap, xlab = "Index", ylab = expression(alpha*"-MHD")) abline(h = sparse.mrct.y$mrct.output$quant.w) # If you dont have any information on possible outliers, # alternatively you could use the S3 method plot.mrctsparse() mrct.sparse.plot(mrct.sparse.object = sparse.mrct.y)
mrct.sparse()
A function for descriptive plots for an object resulting from a call to mrct.sparse()
.
mrct.sparse.plot( x = seq(0, 1, length.out = dim(mrct.sparse.object[[2]])[2]), mrct.sparse.object )
mrct.sparse.plot( x = seq(0, 1, length.out = dim(mrct.sparse.object[[2]])[2]), mrct.sparse.object )
x |
Gridpoints on which the smoothed data is to be plotted on. Default is |
mrct.sparse.object |
A result from a call to |
Descriptive plots.
aMHD.plot |
Alpha-Mahalanobis distances, corresponding cutoff values and coloring according to estimated outliers (grey regular, black irregular). |
aMHD.plot.w |
Same as |
data.plot |
Plot of the smoothed curves, colors corresponding to estimated outliers (gery regular, black irregular). Per default, the x-axis is plotted over
an equidistant grid of the interval |
# Similar to example in mrct.sparse() helpfile # Fix seed for reproducibility set.seed(123) # Sample outlying indices cont.ind <- sample(1:50,size=10) # Generate 50 sparse curves on the interval [0,1] at 10 timepoints with 20% outliers y <- mrct.rgauss(x.grid=seq(0,1,length.out=10), N=50, model=1, outliers=cont.ind, method="linear") # Visualize curves (regular curves grey, outliers black) colormap <- rep("grey",50); colormap[cont.ind] <- "black" matplot(x = seq(0,1,length.out=10), y = t(y), type="l", lty="solid", col=colormap, xlab="t",ylab="") # Run sparse MRCT sparse.mrct.y <- mrct.sparse(data = y, nbasis = 10, h = 0.75, new.p = 50, alpha = 0.1, initializations = 10, criterion = "sum" ) # Visualize alpha-Mahalanobis distances and smoothed curves # Colorinformation according to estimated outliers (grey regular, black irregular) mrct.sparse.plot(mrct.sparse.object = sparse.mrct.y)
# Similar to example in mrct.sparse() helpfile # Fix seed for reproducibility set.seed(123) # Sample outlying indices cont.ind <- sample(1:50,size=10) # Generate 50 sparse curves on the interval [0,1] at 10 timepoints with 20% outliers y <- mrct.rgauss(x.grid=seq(0,1,length.out=10), N=50, model=1, outliers=cont.ind, method="linear") # Visualize curves (regular curves grey, outliers black) colormap <- rep("grey",50); colormap[cont.ind] <- "black" matplot(x = seq(0,1,length.out=10), y = t(y), type="l", lty="solid", col=colormap, xlab="t",ylab="") # Run sparse MRCT sparse.mrct.y <- mrct.sparse(data = y, nbasis = 10, h = 0.75, new.p = 50, alpha = 0.1, initializations = 10, criterion = "sum" ) # Visualize alpha-Mahalanobis distances and smoothed curves # Colorinformation according to estimated outliers (grey regular, black irregular) mrct.sparse.plot(mrct.sparse.object = sparse.mrct.y)