Title: | Multivariate Outlier Explanations using Shapley Values and Mahalanobis Distances |
---|---|
Description: | Based on Shapley values to explain multivariate outlyingness and to detect and impute cellwise outliers. Includes implementations of methods described in Mayrhofer and Filzmoser (2023) <doi:10.1016/j.ecosta.2023.04.003>. |
Authors: | Marcus Mayrhofer [aut, cre], Peter Filzmoser [aut] |
Maintainer: | Marcus Mayrhofer <[email protected]> |
License: | GPL-3 |
Version: | 0.1.2 |
Built: | 2024-12-17 07:02:21 UTC |
Source: | CRAN |
The MOE
function indicates outlying cells for
a data vector with entries or data matrix with
entries containing only numeric entries
x
for a given center mu
and covariance matrix Sigma
using the Shapley value.
It is a more sophisticated alternative to the SCD
algorithm,
which uses the information of the regular cells to derive an alternative reference point (Mayrhofer and Filzmoser 2023).
MOE( x, mu, Sigma, Sigma_inv = NULL, step_size = 0.1, min_deviation = 0, max_step = NULL, local = TRUE, max_iter = 1000, q = 0.99, check_outlyingness = FALSE, check = TRUE, cells = NULL, method = "cellMCD" )
MOE( x, mu, Sigma, Sigma_inv = NULL, step_size = 0.1, min_deviation = 0, max_step = NULL, local = TRUE, max_iter = 1000, q = 0.99, check_outlyingness = FALSE, check = TRUE, cells = NULL, method = "cellMCD" )
x |
Data vector with |
mu |
Either |
Sigma |
Either |
Sigma_inv |
Either |
step_size |
Numeric. Step size for the imputation of outlying cells, with |
min_deviation |
Numeric. Detection threshold, with |
max_step |
Either |
local |
Logical. If TRUE (default), the non-central Chi-Squared distribution is used to determine the cutoff value based on |
max_iter |
Integer. The maximum number of iterations. |
q |
Numeric. The quantile of the Chi-squared distribution for detection and imputation of outliers. Defaults to |
check_outlyingness |
Logical. If TRUE (default), the outlyingness is rechecked after applying |
check |
Logical. If |
cells |
Either |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
A list of class shapley_algorithm
(new_shapley_algorithm
) containing the following:
x |
A |
phi |
A |
mu_tilde |
A |
x_original |
A |
x_original |
The non-centrality parameters for the Chi-Squared distribution |
x_history |
A list with |
phi_history |
A list with |
mu_tilde_history |
A list with |
S_history |
A list with |
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) MOE_x <- MOE(x = x, mu = mu, Sigma = Sigma) plot(MOE_x) library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) MOE_X <- MOE(X, mu, Sigma) plot(MOE_X, subset = 20)
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) MOE_x <- MOE(x = x, mu = mu, Sigma = Sigma) plot(MOE_x) library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) MOE_X <- MOE(X, mu, Sigma) plot(MOE_X, subset = 20)
shapley
.This function creates an object of class shapley
that is returned by the shapley
function.
new_shapley(phi = numeric(), mu_tilde = NULL, non_centrality = NULL)
new_shapley(phi = numeric(), mu_tilde = NULL, non_centrality = NULL)
phi |
A |
mu_tilde |
Optional. A |
non_centrality |
Optional. The non-centrality parameters for the Chi-Squared distribution,
which are given by |
Named list of class shapley
, containing the input parameters.
shapley_algorithm
.This function creates an object of class shapley_algorithm
that is returned
by the SCD
and MOE
functions.
new_shapley_algorithm( x = numeric(), phi = numeric(), x_original = numeric(), mu_tilde = NULL, non_centrality = NULL, x_history = NULL, phi_history = NULL, mu_tilde_history = NULL, S_history = NULL )
new_shapley_algorithm( x = numeric(), phi = numeric(), x_original = numeric(), mu_tilde = NULL, non_centrality = NULL, x_history = NULL, phi_history = NULL, mu_tilde_history = NULL, S_history = NULL )
x |
A |
phi |
A |
x_original |
A |
mu_tilde |
Optional. A |
non_centrality |
Optional. The non-centrality parameters for the Chi-Squared distribution,
which are given by |
x_history |
Optional. A list with |
phi_history |
Optional. A list with |
mu_tilde_history |
Optional. A list with |
S_history |
Optional. A list with |
Named list of class shapley_algorithm
, containing the input parameters.
shapley_interaction
.This function creates an object of class shapley_interaction
that is returned
by the shapley_interaction
function.
new_shapley_interaction(PHI = numeric())
new_shapley_interaction(PHI = numeric())
PHI |
A |
Matrix of class shapley_interaction
, containing input matrix PHI
.
Barplot of Shapley values
## S3 method for class 'shapley' plot( x, subset = NULL, chi2.q = 0.99, abbrev.var = 3, abbrev.obs = 10, sort.var = FALSE, sort.obs = FALSE, plot_md = TRUE, md_squared = TRUE, rotate_x = TRUE, ... )
## S3 method for class 'shapley' plot( x, subset = NULL, chi2.q = 0.99, abbrev.var = 3, abbrev.obs = 10, sort.var = FALSE, sort.obs = FALSE, plot_md = TRUE, md_squared = TRUE, rotate_x = TRUE, ... )
x |
A list of class |
subset |
Either an integer, |
chi2.q |
Quantile, only used if |
abbrev.var |
Integer. If |
abbrev.obs |
Integer. If |
sort.var |
Logical. If |
sort.obs |
Logical. If |
plot_md |
Logical. If |
md_squared |
Logical. If |
rotate_x |
Logical. If |
... |
Optional arguments passed to methods. |
Returns a barplot that displays the Shapley values (shapley
)for each observation and optionally (plot_md = TRUE
)
includes the squared Mahalanobis distance (black bar) and the corresponding (non-)central chi-square quantile (dotted line).
library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma) plot(call_shapley, subset = 1:20) plot(call_shapley, subset = 5, rotate_x = FALSE) plot(call_shapley, subset = 5, md_squared = FALSE, rotate_x = FALSE)
library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma) plot(call_shapley, subset = 1:20) plot(call_shapley, subset = 5, rotate_x = FALSE) plot(call_shapley, subset = 5, md_squared = FALSE, rotate_x = FALSE)
Barplot and tileplot of Shapley values.
## S3 method for class 'shapley_algorithm' plot( x, type = "both", subset = NULL, abbrev.var = FALSE, abbrev.obs = FALSE, sort.var = FALSE, sort.obs = FALSE, n_digits = 2, rotate_x = TRUE, continuous_rowname = FALSE, ... )
## S3 method for class 'shapley_algorithm' plot( x, type = "both", subset = NULL, abbrev.var = FALSE, abbrev.obs = FALSE, sort.var = FALSE, sort.obs = FALSE, n_digits = 2, rotate_x = TRUE, continuous_rowname = FALSE, ... )
x |
A list of class |
type |
Either |
subset |
Either an integer, |
abbrev.var |
Integer. If |
abbrev.obs |
Integer. If |
sort.var |
Logical. If |
sort.obs |
Logical. If |
n_digits |
Integer. If |
rotate_x |
Logical. If |
continuous_rowname |
Logical. If |
... |
Arguments passed on to |
Returns plots for a list of class shapley_algorithm
.
If type
is "bar"
, a barplot is generated. It displays the Shapley values (shapley
)
for each observation and optionally (plot_md = TRUE
) includes the squared Mahalanobis distance (black bar)
and the corresponding (non-)central chi-square quantile (dotted line).
If type
is "cell"
a tileplot is generated. It displays each cells of the dataset and shows the original value from the observations,
color coding indicates whether those values were higher (red) or lower (blue) than the imputed values,
and the color intensity is based on the magnitude of the Shapley value.
If type
is "both"
, the barplot and the tileplot are generated.
library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) MOE_X <- MOE(X, mu, Sigma) plot(MOE_X, subset = 20, n_digits = 0)
library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) MOE_X <- MOE(X, mu, Sigma) plot(MOE_X, subset = 20, n_digits = 0)
Plot of Shapley interaction indices
## S3 method for class 'shapley_interaction' plot( x, abbrev = 4, title = "Shapley Interaction", legend = TRUE, text_size = 22, ... )
## S3 method for class 'shapley_interaction' plot( x, abbrev = 4, title = "Shapley Interaction", legend = TRUE, text_size = 22, ... )
x |
A |
abbrev |
Integer. If |
title |
Character. Title of the plot. |
legend |
Logical. If TRUE (default), a legend is plotted. |
text_size |
Integer. Size of the text in the plot |
... |
Optional arguments passed to methods. |
Returns a figure consisting of two panels. The upper panel shows the Shapley values, and the lower panel the Shapley interaction indices.
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) PHI <- shapley_interaction(x, mu, Sigma) plot(PHI)
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) PHI <- shapley_interaction(x, mu, Sigma) plot(PHI)
shapley
.Print function for class shapley
.
## S3 method for class 'shapley' print(x, ...)
## S3 method for class 'shapley' print(x, ...)
x |
List of class |
... |
Optional arguments passed to methods. |
Prints the list entries of x
that are not NULL
.
shapley_algorithm
.Print function for class shapley_algorithm
.
## S3 method for class 'shapley_algorithm' print(x, ...)
## S3 method for class 'shapley_algorithm' print(x, ...)
x |
List of class |
... |
Optional arguments passed to methods. |
Prints the imputed data and the Shapley values.
shapley_interaction
.Print function for class shapley_interaction
.
## S3 method for class 'shapley_interaction' print(x, ...)
## S3 method for class 'shapley_interaction' print(x, ...)
x |
Matrix of class |
... |
Optional arguments passed to methods. |
Prints the Shapley interaction indices.
The SCD
function indicates outlying cells for
a data vector with entries or data matrix with
entries containing only numeric entries
x
for a given center mu
and covariance matrix Sigma
using the Shapley value (Mayrhofer and Filzmoser 2023).
SCD( x, mu, Sigma, Sigma_inv = NULL, step_size = 0.1, min_deviation = 0, max_step = NULL, max_iter = 1000, q = 0.99, method = "cellMCD", check = TRUE, cells = NULL )
SCD( x, mu, Sigma, Sigma_inv = NULL, step_size = 0.1, min_deviation = 0, max_step = NULL, max_iter = 1000, q = 0.99, method = "cellMCD", check = TRUE, cells = NULL )
x |
Data vector with |
mu |
Either |
Sigma |
Either |
Sigma_inv |
Either |
step_size |
Numeric. Step size for the imputation of outlying cells, with |
min_deviation |
Numeric. Detection threshold, with |
max_step |
Either |
max_iter |
Integer. The maximum number of iterations. |
q |
Numeric. The quantile of the Chi-squared distribution for detection and imputation of outliers. Defaults to |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
check |
Logical. If |
cells |
Either |
A list of class shapley_algorithm
(new_shapley_algorithm
) containing the following:
x |
A |
phi |
A |
x_original |
A |
x_history |
The path of how the original data vector was modified. |
phi_history |
The Shapley values corresponding to |
S_history |
The indices of the outlying cells in each iteration. |
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) SCD_x <- SCD(x = x, mu = mu, Sigma = Sigma) plot(SCD_x) library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) SCD_X <- SCD(X, mu, Sigma) plot(SCD_X, subset = 20)
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) SCD_x <- SCD(x = x, mu = mu, Sigma = Sigma) plot(SCD_x) library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) SCD_X <- SCD(X, mu, Sigma) plot(SCD_X, subset = 20)
The shapley
function computes a -dimensional vector containing the decomposition of the
squared Mahalanobis distance of
x
(with respect to mu
and Sigma
)
into outlyingness contributions of the individual variables (Mayrhofer and Filzmoser 2023).
The value of the -th coordinate of this vector represents the
average marginal contribution of the
-th variable to the squared Mahalanobis distance of
the individual observation
x
.
If cells
is provided, Shapley values of x
are computed with respect to a local reference point,
that is based on a cellwise prediction of each coordinate, using the information of the regular cells of x
, see (Mayrhofer and Filzmoser 2023).
If x
is a matrix, a
matrix is returned, containing the decomposition for each row.
shapley( x, mu = NULL, Sigma = NULL, inverted = FALSE, method = "cellMCD", check = TRUE, cells = NULL )
shapley( x, mu = NULL, Sigma = NULL, inverted = FALSE, method = "cellMCD", check = TRUE, cells = NULL )
x |
Data vector with |
mu |
Either |
Sigma |
Either |
inverted |
Logical. If |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
check |
Logical. If |
cells |
Either |
phi |
A |
mu_tilde |
A |
non_centrality |
The non-centrality parameters for the Chi-Squared distribution, given by |
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
## Without outlying cells as input in the 'cells' argument#' # Single observation p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) shapley(x, mu, Sigma) phi <- shapley(x, mu, Sigma_inv, inverted = TRUE) plot(phi) # Multiple observations library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma) plot(call_shapley, subset = 20) ## Giving outlying cells as input in the 'cells' argument # Single observation p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) call_shapley <- shapley(x, mu, Sigma_inv, inverted = TRUE, method = "cellMCD", check = TRUE, cells = c(1,1,0,0,0)) plot(call_shapley) # Multiple observations library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma, cells = (X_clean - X)!=0) plot(call_shapley, subset = 20)
## Without outlying cells as input in the 'cells' argument#' # Single observation p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) shapley(x, mu, Sigma) phi <- shapley(x, mu, Sigma_inv, inverted = TRUE) plot(phi) # Multiple observations library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma) plot(call_shapley, subset = 20) ## Giving outlying cells as input in the 'cells' argument # Single observation p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) call_shapley <- shapley(x, mu, Sigma_inv, inverted = TRUE, method = "cellMCD", check = TRUE, cells = c(1,1,0,0,0)) plot(call_shapley) # Multiple observations library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma, cells = (X_clean - X)!=0) plot(call_shapley, subset = 20)
The shapley_interaction
function computes a matrix
containing pairwise outlyingness scores based on Shapley interaction indices.
It decomposes the squared Mahalanobis distance of
x
(with respect to mu
and Sigma
)
into outlyingness contributions of pairs of variables (Mayrhofer and Filzmoser 2023).
shapley_interaction(x, mu, Sigma, inverted = FALSE)
shapley_interaction(x, mu, Sigma, inverted = FALSE)
x |
Data vector with |
mu |
Either |
Sigma |
Either |
inverted |
Logical. If |
A matrix containing the decomposition of the squared Mahalanobis distance of
x
into outlyingness scores for pairs of variables with respect to mu
and Sigma
.
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) shapley_interaction(x, mu, Sigma) PHI <- shapley_interaction(x, mu, Sigma_inv, inverted = TRUE) plot(PHI)
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) shapley_interaction(x, mu, Sigma) PHI <- shapley_interaction(x, mu, Sigma_inv, inverted = TRUE) plot(PHI)
Monthly data from the weather station Hohe Warte since April 1872 - Vienna (Stadt Wien 2022).
WeatherVienna
WeatherVienna
A data frame with 1,804 rows and 25 columns:
year
Year
month
Month
t
Daily mean air temperature in °C, (t7 mean + t19 mean + tmax mean + tmin mean)/4; before 1971: t7 mean + t14 mean + 2 x t21 mean)
t_max
Absolute maximum air temperature in °C
t_min
Absolute air temperature minimum in °C
avg_t_max
Mean daily maximum air temperature in °C
avg_t_min
Mean daily minimum air temperature in °C
num_frost
Number of frost days (days with a temperature maximum tmin < 0.0 °C)
num_ice
Number of ice days (days with a temperature maximum tmax < 0.0 °C)
num_summer
Number of summer days (days with a temperature maximum tmax >= 25.0 °C)
num_heat
Number of hot days (days with a temperature maximum tmax >= 30.0 °C)
p
Daily mean air pressure in hPa (mean of all measurements at 7 a.m., 2 p.m., 7 p.m. CET; before 1971 9 p.m. instead of 7 p.m.)
p_max
Maximum air pressure in hPa (maximum of all measurements7 am, 2 pm, 7 pm CET; before 1971 9 pm instead of 7 pm)
p_min
Minimum air pressure in hPa (minimum of all measurements7 am, 2 pm, 7 pm CET; before 1971 9 pm instead of 7 pm)
sun_h
Monthly total sunshine duration in hours
num_clear
Number of clear days (daily mean cloudiness < 20/100)
num_cloud
Number of cloudy days (daily mean cloudiness > 80/100)
rel_hum
Daily mean relative humidity in percent (2 x RH7 mean + RH14 mean + RH19 mean)/4; before 1971 9 p.m. instead of 7 p.m.)
rel_hum_max
Relative humidity maximum in percent
rel_hum_min
Relative humidity minimum in percent
wind_v
Monthly average wind speed in km/h
num_wind_v60
Number of days with wind peaks >= 60 km/h
wind_v_max
Maximum wind speed in km/h
precp_sum
Monthly total precipitation in mm
num_precp_01
Number of days with precipitation >= 0.1 mm
The data were downloaded from https://www.data.gv.at/katalog/dataset/wetter-seit-1872-hohe-warte-wien
in September 2022.
Stadt Wien (2022). “Monthly data from the weather station Hohe Warte since April 1872 - Vienna.” https://www.data.gv.at/katalog/dataset/wetter-seit-1872-hohe-warte-wien.
data("WeatherVienna") summary(WeatherVienna)
data("WeatherVienna") summary(WeatherVienna)