Title: | Robust Matrix-Variate Parameter Estimation |
---|---|
Description: | Robust covariance estimation for matrix-valued data and data with Kronecker-covariance structure using the Matrix Minimum Covariance Determinant (MMCD) estimators and outlier explanation using and Shapley values. |
Authors: | Marcus Mayrhofer [aut, cre], Una Radojičić [aut], Peter Filzmoser [aut] |
Maintainer: | Marcus Mayrhofer <[email protected]> |
License: | GPL-3 |
Version: | 0.1.3 |
Built: | 2024-11-17 06:56:12 UTC |
Source: | CRAN |
mmcd
function.Probability of obtaining at least one clean h-subset in the mmcd
function.
clean_prob_mmcd(p, q, n_subsets = 500, contamination = 0.5)
clean_prob_mmcd(p, q, n_subsets = 500, contamination = 0.5)
p |
number of rows. |
q |
number of columns. |
n_subsets |
number of elemental h-substs (default is 500). |
contamination |
level of contamination (default is 0.5). |
Probability of obtaining at least one clean h-subset in the mmcd
function.
This function is part of the FastMMCD algorithm (Mayrhofer et al. 2024).
cstep( X, alpha = 0.5, h_init = -1L, init = TRUE, max_iter = 100L, max_iter_MLE = 100L, lambda = 0, adapt_alpha = TRUE )
cstep( X, alpha = 0.5, h_init = -1L, init = TRUE, max_iter = 100L, max_iter_MLE = 100L, lambda = 0, adapt_alpha = TRUE )
X |
a 3d array of dimension |
alpha |
numeric parameter between 0.5 (default) and 1. Controls the size |
h_init |
Integer. Size of initial h-subset. If smaller than 0 (default) size is chosen automatically. |
init |
Logical. If TRUE (default) elemental subsets are used to initialize the procedure. |
max_iter |
upper limit of C-step iterations (default is 100) |
max_iter_MLE |
upper limit of MLE iterations (default is 100) |
lambda |
a smooting parameter for the rowwise and columnwise covariance matrices. |
adapt_alpha |
Logical. If TRUE (default) alpha is adapted to take the dimension of the data into account. |
A list containing the following:
mu |
Estimated |
cov_row |
Estimated |
cov_col |
Estimated |
cov_row_inv |
Inverse of |
cov_col_inv |
Inverse of |
md |
Squared Mahalanobis distances. |
md_raw |
Squared Mahalanobis distances based on raw MMCD estimators. |
det |
Value of objective function (determinant of Kronecker product of rowwise and columnwise covariane). |
dets |
Objective values for the final h-subsets. |
h_subset |
Final h-subset of raw MMCD estimators. |
iterations |
Number of C-steps. |
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(1,0.5,0.5,1), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = 1000, mu, cov_row, cov_col) ind <- sample(1:n, 0.3*n) X[,,ind] <- rmatnorm(n = length(ind), matrix(rep(10, p*q), nrow = p, ncol = q), cov_row, cov_col) par_mmle <- mmle(X) par_cstep <- cstep(X) distances_mmle <- mmd(X, par_mmle$mu, par_mmle$cov_row, par_mmle$cov_col) distances_cstep <- mmd(X, par_cstep$mu, par_cstep$cov_row, par_cstep$cov_col) plot(distances_mmle, distances_cstep) abline(h = qchisq(0.99, p*q), lty = 2, col = "red") abline(v = qchisq(0.99, p*q), lty = 2, col = "red")
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(1,0.5,0.5,1), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = 1000, mu, cov_row, cov_col) ind <- sample(1:n, 0.3*n) X[,,ind] <- rmatnorm(n = length(ind), matrix(rep(10, p*q), nrow = p, ncol = q), cov_row, cov_col) par_mmle <- mmle(X) par_cstep <- cstep(X) distances_mmle <- mmd(X, par_mmle$mu, par_mmle$cov_row, par_mmle$cov_col) distances_cstep <- mmd(X, par_cstep$mu, par_cstep$cov_row, par_cstep$cov_col) plot(distances_mmle, distances_cstep) abline(h = qchisq(0.99, p*q), lty = 2, col = "red") abline(v = qchisq(0.99, p*q), lty = 2, col = "red")
The DARWIN (Diagnosis AlzheimeR WIth haNdwriting) dataset comprises handwriting samples from 174 individuals. Among them, 89 have been diagnosed with Alzheimer's disease (AD), while the remaining 85 are considered healthy subjects (H). Each participant completed 25 handwriting tasks on paper, and their pen movements were recorded using a graphic tablet. From the raw handwriting data, a set of 18 features was extracted.
data(darwin)
data(darwin)
An array of dimension , comprising
observations,
each represented by a
times
dimensional matrix.
The observed parameters are:
Total Time
Air Time
Paper Time
Mean Speed on paper
Mean Acceleration on paper
Mean Acceleration in air
Mean Jerk on paper
Pressure Mean
Pressure Variance
Generalization of the Mean Relative Tremor (GMRT) on paper
GMTR in air
Mean GMRT
Pendowns Number
Max X Extension
Max Y Extension
Dispersion Index
UC Irvine Machine Learning Repository - DARWIN - doi:10.24432/C55D0K
Cilia ND, De Stefano C, Fontanella F, Di Freca AS (2018).
“An experimental protocol to support cognitive impairment diagnosis by using handwriting analysis.”
Procedia Computer Science, 141, 466–471.
Cilia ND, De Gregorio G, De Stefano C, Fontanella F, Marcelli A, Parziale A (2022).
“Diagnosing Alzheimer’s disease from on-line handwriting: a novel dataset and performance benchmarking.”
Engineering Applications of Artificial Intelligence, 111, 104822.
matrixShapley
decomposes the squared matrix Mahalanobis distance (mmd
) into additive outlyingness contributions of
the rows, columns, or cell of a matrix (Mayrhofer and Filzmoser 2023; Mayrhofer et al. 2024).
matrixShapley(X, mu = NULL, cov_row, cov_col, inverted = FALSE, type = "cell")
matrixShapley(X, mu = NULL, cov_row, cov_col, inverted = FALSE, type = "cell")
X |
a 3d array of dimension |
mu |
a |
cov_row |
a |
cov_col |
a |
inverted |
Logical. FALSE by default.
If TRUE |
type |
Character. Either "row", "col", or "cell" (default) to compute rowwise, columnwise, or cellwise Shapley values. |
Rowwise, columnwise, or cellwise Shapley value(s).
Mayrhofer M, Filzmoser P (2023).
“Multivariate outlier explanations using Shapley values and Mahalanobis distances.”
Econometrics and Statistics.
Mayrhofer M, Radojičić U, Filzmoser P (2024).
“Robust covariance estimation and explainable outlier detection for matrix-valued data.”
arXiv preprint arXiv:2403.03975.
mmd
.
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(5,2,2,4), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = 1000, mu, cov_row, cov_col) distances <- mmd(X, mu, cov_row, cov_col)
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(5,2,2,4), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = 1000, mu, cov_row, cov_col) distances <- mmd(X, mu, cov_row, cov_col)
mmcd
computes the robust MMCD estimators of location and covariance for matrix-variate data
using the FastMMCD algorithm (Mayrhofer et al. 2024).
mmcd( X, nsamp = 500L, alpha = 0.5, lambda = 0, max_iter_cstep = 100L, max_iter_MLE = 100L, max_iter_cstep_init = 2L, max_iter_MLE_init = 2L, adapt_alpha = TRUE, reweight = TRUE, scale_consistency = "quant", outlier_quant = 0.975, nthreads = 1L )
mmcd( X, nsamp = 500L, alpha = 0.5, lambda = 0, max_iter_cstep = 100L, max_iter_MLE = 100L, max_iter_cstep_init = 2L, max_iter_MLE_init = 2L, adapt_alpha = TRUE, reweight = TRUE, scale_consistency = "quant", outlier_quant = 0.975, nthreads = 1L )
X |
a 3d array of dimension |
nsamp |
number of initial h-subsets (default is 500). |
alpha |
numeric parameter between 0.5 (default) and 1. Controls the size |
lambda |
a smooting parameter for the rowwise and columnwise covariance matrices. |
max_iter_cstep |
upper limit of C-step iterations (default is 100) |
max_iter_MLE |
upper limit of MLE iterations (default is 100) |
max_iter_cstep_init |
upper limit of C-step iterations for initial h-subsets (default is 2) |
max_iter_MLE_init |
upper limit of MLE iterations for initial h-subsets (default is 2) |
adapt_alpha |
Logical. If TRUE (default) alpha is adapted to take the dimension of the data into account. |
reweight |
Logical. If TRUE (default) the reweighted MMCD estimators are computed. |
scale_consistency |
Character. Either "quant" (default) or "mmd_med". If "quant", the consistency factor is chosen to achieve consistency under the matrix normal distribution. If "mmd_med", the consistency factor is chosen based on the Mahalanobis distances of the observations. |
outlier_quant |
numeric parameter between 0 and 1. Chi-square quantile used in the reweighting step. |
nthreads |
Integer. If 1 (default), all computations are carried out sequentially.
If larger then 1, C-steps are carried out in parallel using |
The MMCD estimators generalize the well-known Minimum Covariance Determinant (MCD)
(Rousseeuw 1985; Rousseeuw and Driessen 1999) to the matrix-variate setting.
It looks for the observations,
, whose covariance matrix has the smallest determinant.
The FastMMCD algorithm is used for computation and is described in detail in (Mayrhofer et al. 2024).
NOTE: The procedure depends on random initial subsets. Currently setting a seed is only possible if
nthreads = 1
.
A list containing the following:
mu |
Estimated |
cov_row |
Estimated |
cov_col |
Estimated |
cov_row_inv |
Inverse of |
cov_col_inv |
Inverse of |
md |
Squared Mahalanobis distances. |
md_raw |
Squared Mahalanobis distances based on raw MMCD estimators. |
det |
Value of objective function (determinant of Kronecker product of rowwise and columnwise covariane). |
alpha |
The (adjusted) value of alpha used to determine the size of the h-subset. |
consistency_factors |
Consistency factors for raw and reweighted MMCD estimators. |
dets |
Objective values for the final h-subsets. |
best_i |
ID of subset with best objective. |
h_subset |
Final h-subset of raw MMCD estimators. |
h_subset_reweighted |
Final h-subset of reweighted MMCD estimators. |
iterations |
Number of C-steps. |
dets_init_first |
Objective values for the |
subsets_first |
Subsets created in subsampling procedure for large |
dets_init_second |
Objective values of the 10 best initial subsets after executing C-steps until convergence. |
Mayrhofer M, Radojičić U, Filzmoser P (2024).
“Robust covariance estimation and explainable outlier detection for matrix-valued data.”
arXiv preprint arXiv:2403.03975.
Rousseeuw P (1985).
“Multivariate Estimation With High Breakdown Point.”
Mathematical Statistics and Applications Vol. B, 283-297.
doi:10.1007/978-94-009-5438-0_20.
Rousseeuw PJ, Driessen KV (1999).
“A Fast Algorithm for the Minimum Covariance Determinant Estimator.”
Technometrics, 41(3), 212-223.
doi:10.1080/00401706.1999.10485670.
The mmcd
algorithm uses the cstep
and mmle
functions.
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(1,0.5,0.5,1), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = n, mu, cov_row, cov_col) ind <- sample(1:n, 0.3*n) X[,,ind] <- rmatnorm(n = length(ind), matrix(rep(10, p*q), nrow = p, ncol = q), cov_row, cov_col) par_mmle <- mmle(X) par_mmcd <- mmcd(X) distances_mmle <- mmd(X, par_mmle$mu, par_mmle$cov_row, par_mmle$cov_col) distances_mmcd <- mmd(X, par_mmcd$mu, par_mmcd$cov_row, par_mmcd$cov_col) plot(distances_mmle, distances_mmcd) abline(h = qchisq(0.99, p*q), lty = 2, col = "red") abline(v = qchisq(0.99, p*q), lty = 2, col = "red")
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(1,0.5,0.5,1), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = n, mu, cov_row, cov_col) ind <- sample(1:n, 0.3*n) X[,,ind] <- rmatnorm(n = length(ind), matrix(rep(10, p*q), nrow = p, ncol = q), cov_row, cov_col) par_mmle <- mmle(X) par_mmcd <- mmcd(X) distances_mmle <- mmd(X, par_mmle$mu, par_mmle$cov_row, par_mmle$cov_col) distances_mmcd <- mmd(X, par_mmcd$mu, par_mmcd$cov_row, par_mmcd$cov_col) plot(distances_mmle, distances_mmcd) abline(h = qchisq(0.99, p*q), lty = 2, col = "red") abline(v = qchisq(0.99, p*q), lty = 2, col = "red")
Matrix Mahalanobis distance
mmd(X, mu, cov_row, cov_col, inverted = FALSE)
mmd(X, mu, cov_row, cov_col, inverted = FALSE)
X |
a 3d array of dimension |
mu |
a |
cov_row |
a |
cov_col |
a |
inverted |
Logical. FALSE by default.
If TRUE |
Squared Mahalanobis distance(s) of observation(s) in X
.
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(1,0.5,0.5,1), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = 1000, mu, cov_row, cov_col) ind <- sample(1:n, 0.3*n) X[,,ind] <- rmatnorm(n = length(ind), matrix(rep(10, p*q), nrow = p, ncol = q), cov_row, cov_col) distances <- mmd(X, mu, cov_row, cov_col) plot(distances) abline(h = qchisq(0.99, p*q), lty = 2, col = "red")
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(1,0.5,0.5,1), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = 1000, mu, cov_row, cov_col) ind <- sample(1:n, 0.3*n) X[,,ind] <- rmatnorm(n = length(ind), matrix(rep(10, p*q), nrow = p, ncol = q), cov_row, cov_col) distances <- mmd(X, mu, cov_row, cov_col) plot(distances) abline(h = qchisq(0.99, p*q), lty = 2, col = "red")
mmle
computes the Maximum Likelihood Estimators (MLEs) for the matrix normal distribution
using the iterative flip-flop algorithm (Dutilleul 1999).
mmle(X, max_iter = 100L, lambda = 0, silent = FALSE)
mmle(X, max_iter = 100L, lambda = 0, silent = FALSE)
X |
a 3d array of dimension |
max_iter |
upper limit of iterations. |
lambda |
a smooting parameter for the rowwise and columnwise covariance matrices. |
silent |
Logical. If FALSE (default) warnings and errors are printed. |
A list containing the following:
mu |
Estimated |
cov_row |
Estimated |
cov_col |
Estimated |
cov_row_inv |
Inverse of |
cov_col_inv |
Inverse of |
norm |
Forbenius norm of squared differences between covariance matrices in final iteration. |
iterations |
Number of iterations of the mmle procedure. |
Dutilleul P (1999). “The mle algorithm for the matrix normal distribution.” Journal of Statistical Computation and Simulation, 64(2), 105-123. doi:10.1080/00949659908811970.
For robust parameter estimation use mmcd
.
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(1,0.5,0.5,1), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = 1000, mu, cov_row, cov_col) par_mmle <- mmle(X)
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(1,0.5,0.5,1), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = 1000, mu, cov_row, cov_col) par_mmle <- mmle(X)
mmcd
function with probability prob
.Number of subsets that are required to obtain at least one clean h-subset in the mmcd
function with probability prob
.
n_subsets_mmcd(p, q, prob = 0.99, contamination = 0.5)
n_subsets_mmcd(p, q, prob = 0.99, contamination = 0.5)
p |
number of rows. |
q |
number of columns. |
prob |
probability (default is 0.99). |
contamination |
level of contamination (default is 0.5). |
Number of subsets that are required to obtain at least one clean h-subset in the mmcd
function with probability prob
.
Simulate from a Matrix Normal Distribution
rmatnorm(n, mu = NULL, cov_row, cov_col)
rmatnorm(n, mu = NULL, cov_row, cov_col)
n |
the number of samples required. |
mu |
a |
cov_row |
a |
cov_col |
a |
If a matrix with
rows and
columns, o
otherwise a 3d array of dimensions
with a sample in each slice.
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(5,2,2,4), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = 1000, mu, cov_row, cov_col) X[,,9] #printing the 9th sample.
n = 1000; p = 2; q = 3 mu = matrix(rep(0, p*q), nrow = p, ncol = q) cov_row = matrix(c(5,2,2,4), nrow = p, ncol = p) cov_col = matrix(c(3,2,1,2,3,2,1,2,3), nrow = q, ncol = q) X <- rmatnorm(n = 1000, mu, cov_row, cov_col) X[,,9] #printing the 9th sample.
Weather data from Austria's highest weather station, situated in the Austrian Central Alps on the glaciated mountain "Hoher Sonnblick", standing 3106 meters above sea level.
data(weather)
data(weather)
An array of dimension , comprising
observations,
each represented by a
times
dimensional matrix.
Observed parameters are monthly averages of
air pressure (AP)
precipitation (P)
sunshine hours (SH)
temperature (T)
proportion of solid precipitation (SP)
from 1891 to 2022.
Datasource: GeoSphere Austria - https://data.hub.geosphere.at