Title: | Prediction Explanation with Dependence-Aware Shapley Values |
---|---|
Description: | Complex machine learning models are often hard to interpret. However, in many situations it is crucial to understand and explain why a model made a specific prediction. Shapley values are the only prediction explanation framework with a solid theoretical foundation. Previously known methods for estimating Shapley values do, however, assume feature independence. This package implements the method described in Aas, Jullum and Løland (2019) <arXiv:1903.10464>, which accounts for any feature dependence and thereby produces more accurate estimates of the true Shapley values. |
Authors: | Nikolai Sellereite [aut] , Martin Jullum [cre, aut] , Annabelle Redelmeier [aut], Anders Løland [ctb], Jens Christian Wahl [ctb], Camilla Lingjærde [ctb], Norsk Regnesentral [cph, fnd] |
Maintainer: | Martin Jullum <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.2 |
Built: | 2024-11-10 06:20:45 UTC |
Source: | CRAN |
Explain the output of machine learning models with more accurately estimated Shapley values
explain(x, explainer, approach, prediction_zero, ...)

## S3 method for class 'empirical'
explain(
  x,
  explainer,
  approach,
  prediction_zero,
  type = "fixed_sigma",
  fixed_sigma_vec = 0.1,
  n_samples_aicc = 1000,
  eval_max_aicc = 20,
  start_aicc = 0.1,
  w_threshold = 0.95,
  ...
)

## S3 method for class 'gaussian'
explain(x, explainer, approach, prediction_zero, mu = NULL, cov_mat = NULL, ...)

## S3 method for class 'copula'
explain(x, explainer, approach, prediction_zero, ...)

## S3 method for class 'ctree'
explain(
  x,
  explainer,
  approach,
  prediction_zero,
  mincriterion = 0.95,
  minsplit = 20,
  minbucket = 7,
  sample = TRUE,
  ...
)

## S3 method for class 'combined'
explain(x, explainer, approach, prediction_zero, mu = NULL, cov_mat = NULL, ...)

## S3 method for class 'ctree_comb_mincrit'
explain(x, explainer, approach, prediction_zero, mincriterion, ...)
x |
A matrix or data.frame containing the features whose predictions ought to be explained (test data). |
explainer |
An explainer object to use for explaining the observations, i.e. the output of shapr(). |
approach |
Character vector of length 1 or of length equal to the number of features in the model. All elements should be one of "empirical", "gaussian", "copula" or "ctree". See Details. |
prediction_zero |
Numeric. The prediction value for unseen data, typically equal to the mean of the response. |
... |
Additional arguments passed on to the approach-specific methods. |
type |
Character. Determines how the kernel bandwidth of the "empirical" approach is set; "fixed_sigma" (the default) uses the bandwidth given by fixed_sigma_vec, while the AICc-based alternatives optimize it (see n_samples_aicc, eval_max_aicc and start_aicc). |
fixed_sigma_vec |
Numeric. Represents the kernel bandwidth. Note that this argument is only
applicable when approach = "empirical" and type = "fixed_sigma". |
n_samples_aicc |
Positive integer. Number of samples to consider in the AICc optimization.
Note that this argument is only applicable when approach = "empirical" and an AICc-based type is used. |
eval_max_aicc |
Positive integer. Maximum number of iterations when optimizing the AICc.
Note that this argument is only applicable when approach = "empirical" and an AICc-based type is used. |
start_aicc |
Numeric. Start value of the kernel bandwidth when optimizing the AICc. Only applicable when approach = "empirical" and an AICc-based type is used. |
w_threshold |
Numeric. Should be between 0 and 1 (default 0.95). Only used when approach = "empirical". |
mu |
Numeric vector. (Optional) The mean of the data generating distribution.
If NULL, it is estimated from the data. Only used when approach = "gaussian" (see the sketch after this argument list). |
cov_mat |
Numeric matrix. (Optional) The covariance matrix of the data
generating distribution. If NULL, it is estimated from the data. Only used when approach = "gaussian". |
mincriterion |
Numeric value or vector of length equal to the number of features in the model. The value equals 1 - alpha, where alpha is the nominal level of the conditional independence tests. If a vector, it specifies which mincriterion to use when conditioning on the various numbers of features. |
minsplit |
Numeric value. The minimum sum of weights that the left and right daughter nodes combined need to exceed for a split to be attempted. |
minbucket |
Numeric value. Equal to the minimum sum of weights in a terminal node. |
sample |
Boolean. If TRUE, the method always samples the requested number of observations from the relevant leaf (with replacement). If FALSE and the leaf contains fewer observations than requested, all observations in the leaf are used. |
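The approach-specific arguments above are passed directly through explain(). A minimal sketch of doing so (assuming the explainer, x_train, x_test and prediction_zero p from the examples further below; the concrete values are illustrative only, not recommendations):

# Empirical approach with a user-chosen fixed kernel bandwidth
e_emp <- explain(x_test, explainer,
  approach = "empirical", prediction_zero = p,
  type = "fixed_sigma", fixed_sigma_vec = 0.5
)

# Gaussian approach with a user-supplied mean and covariance
# (here estimated from the training features purely for illustration)
x_vars <- c("lstat", "rm", "dis", "indus")
e_gauss <- explain(x_test, explainer,
  approach = "gaussian", prediction_zero = p,
  mu = colMeans(x_train[, x_vars]),
  cov_mat = cov(x_train[, x_vars])
)

# ctree approach with a stricter split criterion and larger terminal nodes
e_ctree <- explain(x_test, explainer,
  approach = "ctree", prediction_zero = p,
  mincriterion = 0.99, minbucket = 10
)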
The most important thing to notice is that shapr has implemented four different approaches for estimating the conditional distributions of the data, namely "empirical", "gaussian", "copula" and "ctree". In addition, the user also has the option of combining the four approaches. E.g. if you are in a situation where you have trained a model that consists of 10 features, and you would like to use the "gaussian" approach when you condition on a single feature, the "empirical" approach if you condition on 2-5 features, and the "copula" approach if you condition on more than 5 features, this can be done by simply passing approach = c("gaussian", rep("empirical", 4), rep("copula", 5)). If approach[i] = "gaussian", it means that you would like to use the "gaussian" approach when conditioning on i features. A short sketch of this is given below.
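A minimal sketch of the combined approach just described (the 10-feature model is hypothetical, so the explain() call is shown commented out; any explainer built on 10 features would do):

# One entry per number of conditioning features (positions 1-10):
# "gaussian" for 1 feature, "empirical" for 2-5, "copula" for 6-10
approach_vec <- c("gaussian", rep("empirical", 4), rep("copula", 5))
length(approach_vec)  # 10, i.e. one entry per possible conditioning-set size
# explanation <- explain(x_test, explainer, approach = approach_vec,
#                        prediction_zero = p)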
Object of class c("shapr", "list")
. Contains the following items:
data.table
Model object
Numeric vector
data.table
Note that the returned items model, p and x_test are mostly added due to the implementation of plot.shapr. If you only want to look at the numerical results, it is sufficient to focus on dt. dt is a data.table where the number of rows equals the number of observations you would like to explain, and the number of columns equals m + 1, where m is the total number of features in your model.

If dt[i, j + 1] > 0, the j-th feature increased the prediction for the i-th observation; likewise, if dt[i, j + 1] < 0, the j-th feature decreased the prediction for the i-th observation. The magnitude of the value also matters: if dt[i, k + 1] and dt[i, j + 1] are both greater than 0, where j != k, and dt[i, k + 1] > dt[i, j + 1], then features j and k both increased the value of the prediction, but the effect of the k-th feature was larger than that of the j-th feature.

The first column in dt, called 'none', is the prediction value not assigned to any of the features (phi0). It is equal for all observations and set by the user through the argument prediction_zero. In theory this value should be the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices, such as the mean of the predictions in the training data, are also reasonable.
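A minimal sketch of reading dt (assuming the x_test, explainer and p objects from the example below):

explanation <- explain(x_test, explainer,
  approach = "empirical", prediction_zero = p
)
# One row per explained observation; columns are 'none' (phi0)
# followed by one Shapley value per feature
print(explanation$dt)
# By the efficiency property of Shapley values, each row should sum
# (approximately) to the corresponding prediction:
rowSums(explanation$dt)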
Camilla Lingjærde, Nikolai Sellereite, Martin Jullum, Annabelle Redelmeier
if (requireNamespace("MASS", quietly = TRUE)) { # Load example data data("Boston", package = "MASS") # Split data into test- and training data x_train <- head(Boston, -3) x_test <- tail(Boston, 3) # Fit a linear model model <- lm(medv ~ lstat + rm + dis + indus, data = x_train) # Create an explainer object explainer <- shapr(x_train, model) # Explain predictions p <- mean(x_train$medv) # Empirical approach explain1 <- explain(x_test, explainer, approach = "empirical", prediction_zero = p, n_samples = 1e2 ) # Gaussian approach explain2 <- explain(x_test, explainer, approach = "gaussian", prediction_zero = p, n_samples = 1e2 ) # Gaussian copula approach explain3 <- explain(x_test, explainer, approach = "copula", prediction_zero = p, n_samples = 1e2 ) # ctree approach explain4 <- explain(x_test, explainer, approach = "ctree", prediction_zero = p ) # Combined approach approach <- c("gaussian", "gaussian", "empirical", "empirical") explain5 <- explain(x_test, explainer, approach = approach, prediction_zero = p, n_samples = 1e2 ) # Print the Shapley values print(explain1$dt) # Plot the results if (requireNamespace("ggplot2", quietly = TRUE)) { plot(explain1) } }
if (requireNamespace("MASS", quietly = TRUE)) { # Load example data data("Boston", package = "MASS") # Split data into test- and training data x_train <- head(Boston, -3) x_test <- tail(Boston, 3) # Fit a linear model model <- lm(medv ~ lstat + rm + dis + indus, data = x_train) # Create an explainer object explainer <- shapr(x_train, model) # Explain predictions p <- mean(x_train$medv) # Empirical approach explain1 <- explain(x_test, explainer, approach = "empirical", prediction_zero = p, n_samples = 1e2 ) # Gaussian approach explain2 <- explain(x_test, explainer, approach = "gaussian", prediction_zero = p, n_samples = 1e2 ) # Gaussian copula approach explain3 <- explain(x_test, explainer, approach = "copula", prediction_zero = p, n_samples = 1e2 ) # ctree approach explain4 <- explain(x_test, explainer, approach = "ctree", prediction_zero = p ) # Combined approach approach <- c("gaussian", "gaussian", "empirical", "empirical") explain5 <- explain(x_test, explainer, approach = approach, prediction_zero = p, n_samples = 1e2 ) # Print the Shapley values print(explain1$dt) # Plot the results if (requireNamespace("ggplot2", quietly = TRUE)) { plot(explain1) } }
Define feature combinations, and fetch additional information about each unique combination
feature_combinations(m, exact = TRUE, n_combinations = 200, weight_zero_m = 10^6)
m |
Positive integer. Total number of features. |
exact |
Logical. If TRUE, all 2^m feature combinations are generated; otherwise a subsample of the combinations is used. |
n_combinations |
Positive integer. The number of feature combinations to sample. Note that this argument is ignored if exact = TRUE. |
weight_zero_m |
Numeric. The value to use as a replacement for infinite combination weights when doing numerical operations. |
A data.table that contains the following columns:

Positive integer. Represents a unique key for each combination. Note that the table is sorted by id_combination, so x[["id_combination"]] always equals 1:nrow(x).

List. Each item of the list is an integer vector where features[[i]] represents the indices of the features included in combination i. Note that all the items are sorted such that features[[i]] == sort(features[[i]]) is always true.

Vector of positive integers. n_features[i] equals the number of features in combination i, i.e. n_features[i] = length(features[[i]]).

Positive integer. The number of unique ways to sample n_features[i] features from m different features, without replacement (i.e. choose(m, n_features[i])).
Nikolai Sellereite, Martin Jullum
# All combinations
x <- feature_combinations(m = 3)
nrow(x) # Equals 2^3 = 8

# Subsample of combinations
x <- feature_combinations(exact = FALSE, m = 10, n_combinations = 1e2)
Initiate the making of dummy variables
make_dummies(traindata, testdata)
traindata |
data.table or data.frame. |
testdata |
data.table or data.frame. New data that has the same
feature names, types, and levels as traindata. |
A list that contains the following entries:

List. Output from check_features.

A data.frame containing all of the factors in traindata as one-hot encoded variables.

A data.frame containing all of the factors in testdata as one-hot encoded variables.

The original traindata with correct column ordering and factor levels, to be passed to shapr.

The original testdata with correct column ordering and factor levels, to be passed to explain.
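A minimal sketch of inspecting the returned list (the entry names are not shown above, so check them with names() or str() in your installed version; x_train and x_test are the factor-containing data from the example below):

dummylist <- make_dummies(traindata = x_train, testdata = x_test)
str(dummylist, max.level = 1)  # lists the five entries described above
# The one-hot encoded data can be used to fit a model that requires a numeric
# matrix, while the "new" train/test data are passed on to shapr() and
# explain(), respectively.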
Annabelle Redelmeier, Martin Jullum
if (requireNamespace("MASS", quietly = TRUE)) { data("Boston", package = "MASS") x_var <- c("lstat", "rm", "dis", "indus") y_var <- "medv" x_train <- as.data.frame(Boston[401:411, x_var]) y_train <- Boston[401:408, y_var] x_test <- as.data.frame(Boston[1:4, x_var]) # convert to factors for illustational purpose x_train$rm <- factor(round(x_train$rm)) x_test$rm <- factor(round(x_test$rm), levels = levels(x_train$rm)) dummylist <- make_dummies(traindata = x_train, testdata = x_test) }
if (requireNamespace("MASS", quietly = TRUE)) { data("Boston", package = "MASS") x_var <- c("lstat", "rm", "dis", "indus") y_var <- "medv" x_train <- as.data.frame(Boston[401:411, x_var]) y_train <- Boston[401:408, y_var] x_test <- as.data.frame(Boston[1:4, x_var]) # convert to factors for illustational purpose x_train$rm <- factor(round(x_train$rm)) x_test$rm <- factor(round(x_test$rm), levels = levels(x_train$rm)) dummylist <- make_dummies(traindata = x_train, testdata = x_test) }
Plots the individual prediction explanations.
## S3 method for class 'shapr'
plot(
  x,
  digits = 3,
  plot_phi0 = TRUE,
  index_x_test = NULL,
  top_k_features = NULL,
  ...
)
x |
An object of class shapr, i.e. the output from explain(). |
digits |
Integer. Number of significant digits to use in the feature description |
plot_phi0 |
Logical. Whether to include phi0 (the 'none' column) in the plot. |
index_x_test |
Integer vector. Which of the test observations to plot. E.g. if you have
explained 10 observations using explain(), you can plot the first 5 by setting index_x_test = 1:5. If NULL (the default), all explained observations are plotted. |
top_k_features |
Integer. How many features to include in the plot. E.g. if you have 15
features in your model, you can plot the 5 most important features for each explanation by setting
top_k_features = 5. If NULL (the default), all features are included. |
... |
Currently not used. |
See vignette("understanding_shapr", package = "shapr")
for an example of
how you should use the function.
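A minimal sketch of the plotting options described above (assuming the explanation object from the example below):

# Plot only the first test observation, show the 3 most important features,
# and drop the phi0 ('none') bar
plot(explanation, plot_phi0 = FALSE, index_x_test = 1, top_k_features = 3)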
ggplot object with plots of the Shapley value explanations
Martin Jullum
if (requireNamespace("MASS", quietly = TRUE)) { #' # Load example data data("Boston", package = "MASS") # Split data into test- and training data x_train <- head(Boston, -3) x_test <- tail(Boston, 3) # Fit a linear model model <- lm(medv ~ lstat + rm + dis + indus, data = x_train) # Create an explainer object explainer <- shapr(x_train, model) # Explain predictions p <- mean(x_train$medv) # Empirical approach explanation <- explain(x_test, explainer, approach = "empirical", prediction_zero = p, n_samples = 1e2 ) if (requireNamespace("ggplot2", quietly = TRUE)) { # Plot the explantion (this function) plot(explanation) } }
if (requireNamespace("MASS", quietly = TRUE)) { #' # Load example data data("Boston", package = "MASS") # Split data into test- and training data x_train <- head(Boston, -3) x_test <- tail(Boston, 3) # Fit a linear model model <- lm(medv ~ lstat + rm + dis + indus, data = x_train) # Create an explainer object explainer <- shapr(x_train, model) # Explain predictions p <- mean(x_train$medv) # Empirical approach explanation <- explain(x_test, explainer, approach = "empirical", prediction_zero = p, n_samples = 1e2 ) if (requireNamespace("ggplot2", quietly = TRUE)) { # Plot the explantion (this function) plot(explanation) } }
Create an explainer object with Shapley weights for test data.
shapr(x, model, n_combinations = NULL)
x |
Numeric matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula. |
model |
The model whose predictions we want to explain, e.g. a model fitted with
lm(), glm(), mgcv::gam(), ranger::ranger() or xgboost::xgboost(). |
n_combinations |
Integer. The number of feature combinations to sample. If NULL (the default), the exact method is used, considering all 2^m feature combinations, where m is the number of features. |
Named list that contains the following items:

Boolean. Equals TRUE if all 2^ncol(x) feature combinations are used, i.e. if n_combinations = NULL or n_combinations is at least 2^ncol(x); otherwise FALSE.

Positive integer. The number of columns in x.

Binary matrix. The number of rows equals the number of unique combinations, and the number of columns equals the total number of features. E.g. with three features there are 2^3 = 8 unique combinations. If element (i, j) equals 1, the j-th feature is present in the i-th combination; otherwise it equals 0.

data.table. The returned object from feature_combinations.

data.table. The data x transformed into a data.table.

List. The updated_feature_list output from preprocess_data.

In addition to the items above, model and n_combinations are also present in the returned object.
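A minimal sketch of inspecting the explainer (the element name X is taken from the examples below; other element names may differ in your installed version):

explainer <- shapr(df1, model)   # as in the first example below
str(explainer, max.level = 1)    # lists the items described above
explainer$X                      # the feature_combinations table (16 rows for 4 features)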
Nikolai Sellereite
if (requireNamespace("MASS", quietly = TRUE)) { # Load example data data("Boston", package = "MASS") df <- Boston # Example using the exact method x_var <- c("lstat", "rm", "dis", "indus") y_var <- "medv" df1 <- df[, x_var] model <- lm(medv ~ lstat + rm + dis + indus, data = df) explainer <- shapr(df1, model) print(nrow(explainer$X)) # 16 (which equals 2^4) # Example using approximation y_var <- "medv" x_var <- setdiff(colnames(df), y_var) model <- lm(medv ~ ., data = df) df2 <- df[, x_var] explainer <- shapr(df2, model, n_combinations = 1e3) print(nrow(explainer$X)) # Example using approximation where n_combinations > 2^m x_var <- c("lstat", "rm", "dis", "indus") y_var <- "medv" df3 <- df[, x_var] model <- lm(medv ~ lstat + rm + dis + indus, data = df) explainer <- shapr(df1, model, n_combinations = 1e3) print(nrow(explainer$X)) # 16 (which equals 2^4) }
if (requireNamespace("MASS", quietly = TRUE)) { # Load example data data("Boston", package = "MASS") df <- Boston # Example using the exact method x_var <- c("lstat", "rm", "dis", "indus") y_var <- "medv" df1 <- df[, x_var] model <- lm(medv ~ lstat + rm + dis + indus, data = df) explainer <- shapr(df1, model) print(nrow(explainer$X)) # 16 (which equals 2^4) # Example using approximation y_var <- "medv" x_var <- setdiff(colnames(df), y_var) model <- lm(medv ~ ., data = df) df2 <- df[, x_var] explainer <- shapr(df2, model, n_combinations = 1e3) print(nrow(explainer$X)) # Example using approximation where n_combinations > 2^m x_var <- c("lstat", "rm", "dis", "indus") y_var <- "medv" df3 <- df[, x_var] model <- lm(medv ~ lstat + rm + dis + indus, data = df) explainer <- shapr(df1, model, n_combinations = 1e3) print(nrow(explainer$X)) # 16 (which equals 2^4) }