Forward Marginal Effects (FMEs) are probably the most intuitive way
to interpret feature effects in supervised ML models. Remember how we
interpret the beta coefficient of a numerical feature in a linear
regression model $\mathbb{E}[Y] = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p}$:
If $x_{j}$ increases by one unit, the predicted target variable
increases by $\beta_{j}$.
FMEs make use of this instinct and apply it straightforwardly to any
model.
In short, FMEs are the answer to the following question:
What is the change in the predicted target variable if we
change the value of the feature by $h$ units?
A few examples: What is the change in predicted blood pressure if a patient's weight increases by $h = 1$ kg? What is the change in predicted life satisfaction if a person's monthly income increases by $h = 1{,}000$ US dollars? By default, $h$ is 1. However, $h$ can be chosen to match the desired scale of interpretation.
The big advantage of FMEs is that they are very simple. The FME is
defined observation-wise, i.e., it is computed separately for each
observation in the data. Often, we are more interested in estimating a
global effect, so we do the following:
1. Compute the FME for each observation in the data
2. Compute the Average Marginal Effect (AME)
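For a given observation $i$ and step size $h_{j}$, the FME of a
single numerical feature $x_{j}$ is computed as:
$\textrm{FME}_{\mathbf{x}^{(i)}, \, h_{j}} =
\widehat{f}(x_{1}^{(i)}, \, \ldots, \, x_{j}^{(i)} + h_{j}, \, \ldots, \,
x_{p}^{(i)}) - \widehat{f}(\mathbf{x}^{(i)})$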
As can be seen from the formula, the FME is simply the difference in
predictions between the original observation $\mathbf{x}^{(i)}$ and the
changed observation $(x_{1}^{(i)}, \, \ldots, \, x_{j}^{(i)} + h_{j}, \, \ldots, \, x_{p}^{(i)})$,
where $h_{j}$ is added to the feature $x_{j}$.
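The difference of predictions can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any package: `predict` is a made-up stand-in for a fitted model $\widehat{f}$, and the feature names, coefficients, and the helper `fme` are assumptions chosen for the example.

```python
def predict(x):
    """Hypothetical stand-in for a fitted prediction function f-hat.

    x = [weight_kg, age]; the coefficients are invented for illustration.
    """
    weight_kg, age = x
    return 90.0 + 0.4 * weight_kg + 0.002 * age ** 2


def fme(predict_fn, x, j, h=1.0):
    """Forward marginal effect of feature j for a single observation x."""
    x_changed = list(x)      # copy, so the original observation stays intact
    x_changed[j] += h        # step feature j by h, leave all others unchanged
    return predict_fn(x_changed) - predict_fn(x)


x_i = [80.0, 50.0]                      # one observation
effect = fme(predict, x_i, j=0, h=1.0)  # approximately 0.4
print(effect)
```

Because the weight term of this toy model is linear, the FME for a step of 1 kg is approximately 0.4 regardless of the observation; for the quadratic age term it would depend on where the step starts.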
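The bivariate FME is just the extension of the univariate FME to two
features $x_{j}, \, x_{k}$ that are affected simultaneously by a step.
Therefore, the step size becomes a vector $\mathbf{h} = (h_{j}, \, h_{k})$,
where $h_{j}$ denotes the change in $x_{j}$ and $h_{k}$ the change in
$x_{k}$:
$\textrm{FME}_{\mathbf{x}^{(i)}, \, \mathbf{h}} =
\widehat{f}(x_{1}^{(i)}, \, \ldots, \, x_{j}^{(i)} + h_{j}, \, \ldots, \,
x_{k}^{(i)} + h_{k}, \, \ldots, \, x_{p}^{(i)}) - \widehat{f}(\mathbf{x}^{(i)})$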
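A simultaneous step in two features can be sketched as follows. The prediction function `predict` (with an interaction term) and the helper `fme_multi` are hypothetical names introduced only for this illustration.

```python
def predict(x):
    """Hypothetical prediction function with an interaction between x1 and x2."""
    x1, x2 = x
    return 2.0 * x1 + 3.0 * x2 + x1 * x2


def fme_multi(predict_fn, x, steps):
    """FME for a simultaneous step in several features.

    steps maps a feature index to its step size, e.g. {0: 1.0, 1: 1.0}.
    """
    x_changed = list(x)
    for j, h in steps.items():
        x_changed[j] += h
    return predict_fn(x_changed) - predict_fn(x)


x_i = [1.0, 2.0]
joint = fme_multi(predict, x_i, {0: 1.0, 1: 1.0})  # 19.0 - 10.0 = 9.0
only_x1 = fme_multi(predict, x_i, {0: 1.0})        # 14.0 - 10.0 = 4.0
only_x2 = fme_multi(predict, x_i, {1: 1.0})        # 14.0 - 10.0 = 4.0
print(joint, only_x1 + only_x2)                    # 9.0 8.0
```

In this toy model the joint effect (9.0) differs from the sum of the two univariate effects (8.0) because of the interaction term, which is one reason to compute the bivariate FME directly rather than adding univariate ones.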
Analogous to the step size $h_{j}$ of a numerical feature, we select
the category of interest $c_{h}$ for a categorical feature $x_{j}$.
For a given observation $i$ and category $c_{h}$, the FME is:
$\textrm{FME}_{\mathbf{x}^{(i)}, \, c_{h}} =
\widehat{f}(c_{h}, \, \mathbf{x}_{-j}^{(i)}) - \widehat{f}(\mathbf{x}^{(i)}),
\qquad x_{j} \neq c_{h}$
where we simply change the categorical feature to $c_{h}$, leave all other features $\mathbf{x}_{-j}^{(i)}$ unchanged, and compare the predicted value of this changed observation to the predicted value of the unchanged observation. We can only compute this for observations whose original category is not the category of interest, i.e., $x_{j} \neq c_{h}$. See here for an example.
The AME is the mean of all observations' FMEs and serves as a global
estimate of the feature effect:
$\textrm{AME} = \frac{1}{n}\sum_{i = 1}^{n}{\,
\textrm{FME}_{\mathbf{x}^{(i)}, \, h_{j}}}$
Therefore, the AME is the expected difference in the predicted target
variable if the feature $x_{j}$ is changed by $h_{j}$ units.
For $h_{j} = 1$, this corresponds to the way we interpret the
coefficient $\beta_{j}$ of a linear regression model. However, be
careful: the choice of $h_{j}$ can have a strong effect on the
estimated FMEs and AME for non-linear prediction functions, such as
random forests or gradient-boosted trees.
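The two steps (observation-wise FMEs, then their mean) and the sensitivity to the step size can be sketched in plain Python. The two prediction functions `linear` and `nonlinear`, the three observations, and the helpers `fme` and `ame` are assumptions made up for this illustration.

```python
def fme(predict_fn, x, j, h):
    """Forward marginal effect of feature j for one observation."""
    x_changed = list(x)
    x_changed[j] += h
    return predict_fn(x_changed) - predict_fn(x)


def ame(predict_fn, data, j, h):
    """Average marginal effect: the mean of the observation-wise FMEs."""
    effects = [fme(predict_fn, x, j, h) for x in data]
    return sum(effects) / len(effects)


def linear(x):
    """Hypothetical linear model with coefficient 2.0 for the first feature."""
    return 1.0 + 2.0 * x[0] + 0.5 * x[1]


def nonlinear(x):
    """Hypothetical non-linear model, quadratic in the first feature."""
    return x[0] ** 2 + 0.5 * x[1]


data = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]

ame_linear_h1 = ame(linear, data, j=0, h=1.0)        # 2.0, the coefficient
ame_linear_h2 = ame(linear, data, j=0, h=2.0)        # 4.0, scales with h
ame_nonlinear_h1 = ame(nonlinear, data, j=0, h=1.0)  # (1 + 3 + 5) / 3 = 3.0
ame_nonlinear_h2 = ame(nonlinear, data, j=0, h=2.0)  # (4 + 8 + 12) / 3 = 8.0
```

For the linear function, the AME with a step of 1 recovers the coefficient and doubling the step doubles the AME. For the quadratic function, doubling the step changes the AME from 3.0 to 8.0, so the result cannot simply be rescaled to a different step size.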
Marginal effects (ME) are already a widely used concept to interpret
statistical models. However, we believe they are ill-suited to
interpret feature effects in most ML models. Here, we explain why you
should abandon MEs in favor of FMEs:
In most implementations (e.g., Leeper's margins package), MEs are
computed as a numerical approximation of the partial derivative of the
prediction function w.r.t. the feature $x_{j}$. In other
words, they compute a finite difference quotient, similar to this:
$\textrm{dME}_{\mathbf{x}^{(i)}, \, j} =
\cfrac{\widehat{f}(x_{1}^{(i)}, \, \ldots,\, x_{j}^{(i)}+h,\, \ldots, \,
x_{p}^{(i)})-\widehat{f}(\mathbf{x}^{(i)})}{h}$
where $h$ typically is very small (e.g., $10^{-7}$). As is explained here, these derivative-based MEs (dME) have a number of shortcomings:
Number 1: The formula above computes an estimate for the partial derivative, i.e., the slope of the tangent to the prediction function at point $\mathbf{x}^{(i)}$. The default way to interpret this is to say: if $x_{j}$ increases by one unit, the predicted target variable can be expected to increase by $\textrm{dME}_{\mathbf{x}^{(i)}, \, j}$. Unconsciously, we use a unit change ($h = 1$) to interpret the computed dME even though we computed an instantaneous rate of change. For non-linear prediction functions, this can lead to substantial misinterpretations. The image below illustrates this:
The yellow line is the prediction function, the grey line is the
tangent at point $x = 0.5$. If interpreted with a unit change, the dME
is subject to an error, due to the non-linearity of the prediction
function. The FME, however, corresponds to the true change in
prediction along the secant (green line) between $x = 0.5$ and
$x = 1.5$. This holds by design: the FME describes exactly the
unit-change intuition we apply when interpreting partial derivatives.
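The gap between the tangent-based and the secant-based reading can be reproduced numerically. The function `f` below is an assumed quadratic chosen only for illustration; it is not the function shown in the image.

```python
def f(x):
    """Hypothetical non-linear prediction function of a single feature."""
    return x ** 2


x0 = 0.5
eps = 1e-7

# Derivative-based ME: finite difference quotient with a very small step.
dme_value = (f(x0 + eps) - f(x0)) / eps   # close to the true slope 2 * x0 = 1.0

# Forward marginal effect: actual change in prediction for a unit step.
fme_value = f(x0 + 1.0) - f(x0)           # 2.25 - 0.25 = 2.0

# Reading the dME as "effect of a unit change" misses the true change by:
error = fme_value - dme_value * 1.0       # close to 1.0
```

In this toy case, interpreting the dME with a unit change would suggest an increase of about 1, while the prediction actually increases by 2 between $x = 0.5$ and $x = 1.5$.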
Number 2: In general, dMEs are ill-suited to describe models based on piecewise constant prediction functions (e.g., CART, random forests, or gradient-boosted trees), as most observations are located on piecewise constant parts of the prediction function where the derivative equals zero. In contrast, FMEs allow for the choice of a sensible step size $h$ that is large enough to traverse a jump discontinuity, as can be seen in the example below. At $x = -2.5$ (green), the dME is zero. Using the FME with $h = 1$, we get the red point with a different (lower) function value. Here, the FME is negative, corresponding to what happens when $x = -2.5$ increases by one unit.
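A minimal sketch of this behavior, assuming a made-up step function `predict` with a single downward jump at $x = -2$ (a stand-in for a tree-based model, not the function from the example image):

```python
def predict(x):
    """Hypothetical piecewise constant prediction function with one jump at x = -2."""
    return 1.0 if x < -2.0 else 0.0


x0 = -2.5
eps = 1e-7

# Finite difference quotient with a tiny step: both points lie on the same
# constant piece, so the estimated derivative is exactly zero.
dme_value = (predict(x0 + eps) - predict(x0)) / eps   # 0.0

# A unit step crosses the jump at x = -2, so the FME captures the drop.
fme_value = predict(x0 + 1.0) - predict(x0)           # 0.0 - 1.0 = -1.0

print(dme_value, fme_value)  # 0.0 -1.0
```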
In a way, the FME is the much smarter little brother of the dME.
Scholbeck, Christian A., et al. "Marginal Effects for Non-Linear Prediction Functions." arXiv preprint arXiv:2201.08837 (2022).