The HLMdiag
package provides functionality to examine
diagnostics for hierarchical linear models, including residuals values
and influence diagnostics. This vignette aims to:
hlm_influence()
, which provides any easy way to obtain
multiple influence diagnostics in one tibblehlm_augment()
,
which combines hlm_influence()
and hlm_resid()
to return influence diagnostics and residualslme
created from the
nlme
packageHLMdiag
and those returned by the lme4
and
car
packageshlm_influence()
functionThe hlm_influence()
function provides a wrapper that
returns influence diagnostics appended to the original model frame. It
is useful for assessing the influence of individual observations, or
groups of observations, and can also be used with
dotplot_diag()
to visually explore the influence
diagnostics. The diagnostics returned by hlm_influence()
include Cook’s distance, MDFFITS, covariance trace (covtrace),
covariance ratio (covratio), leverage, and relative variance change
(RVC).
Cook’s distance and MDFFITS both measure the distance between fixed
effects estimated from the full data set and those obtained from the
reduced data set. For Cook’s distance, the change in parameter estimates
is scaled by the estimated covariance matrix of the original parameter
estimates, while MDFFITS scales this change by the estimated covariance
matrix of the deletion estimates. The covariance trace and covariance
ratio both measure how precision is affected by the deletion of a
particular observation or group of observations i. Covariance
trace is a measure of the ratio between the covariance matrices with and
without unit i to the identity matrix, while covariance ratio
is a comparison of the two covariance matrices with and without unit
i using their determinants. Relative variance change (RVC) is a
measurement of the ratio of estimates of the l th variance
component with and without unit i. The final influence
diagnostic returned by hlm_influence()
, leverage, is the
rate of change in the predicted response with respect to the observed
response. For a full explanation of these influence diagnostics,
including formulas, please refer to Loy and Hofmann (2014).
To explore the functionality of hlm_influence()
, we will
use the data set classroom
(West et. al,
2014). The data set consists of 1,190 observations of students.
Students are grouped within classes, which are nested within schools.
There are 312 distinct classes nested within 107 schools.
#> # A tibble: 1,190 × 12
#> sex minority mathkind mathgain ses yearstea mathknow housepov mathprep
#> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 448 32 0.46 1 NA 0.082 2
#> 2 0 1 460 109 -0.27 1 NA 0.082 2
#> 3 1 1 511 56 -0.03 1 NA 0.082 2
#> 4 0 1 449 83 -0.38 2 -0.11 0.082 3.25
#> 5 0 1 425 53 -0.03 2 -0.11 0.082 3.25
#> 6 1 1 450 65 0.76 2 -0.11 0.082 3.25
#> 7 0 1 452 51 -0.03 2 -0.11 0.082 3.25
#> 8 0 1 443 66 0.2 2 -0.11 0.082 3.25
#> 9 1 1 422 88 0.64 2 -0.11 0.082 3.25
#> 10 0 1 480 -7 0.13 2 -0.11 0.082 3.25
#> classid schoolid childid
#> <int> <int> <int>
#> 1 160 1 1
#> 2 160 1 2
#> 3 160 1 3
#> 4 217 1 4
#> 5 217 1 5
#> 6 217 1 6
#> 7 217 1 7
#> 8 217 1 8
#> 9 217 1 9
#> 10 217 1 10
#> # ℹ 1,180 more rows
We will fit a simple three-level hierarchical linear model using the
lme4
package:
level
hlm_influence()
calculates influence diagnostics for
individual cases, or larger groups. For example, to obtain a tibble with
influence diagnostics for all 1,190 students we use the following:
#> # A tibble: 1,190 × 14
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 109 460 0 1 -0.27 0.082 1 160
#> 3 3 56 511 1 1 -0.03 0.082 1 160
#> 4 4 83 449 0 1 -0.38 0.082 1 217
#> 5 5 53 425 0 1 -0.03 0.082 1 217
#> 6 6 65 450 1 1 0.76 0.082 1 217
#> 7 7 51 452 0 1 -0.03 0.082 1 217
#> 8 8 66 443 0 1 0.2 0.082 1 217
#> 9 9 88 422 1 1 0.64 0.082 1 217
#> 10 10 -7 480 0 1 0.13 0.082 1 217
#> cooksd mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.00106 0.00106 0.00289 1.00 0.121
#> 2 0.00143 0.00143 0.00246 1.00 0.121
#> 3 0.000336 0.000335 0.00378 1.00 0.122
#> 4 0.000225 0.000225 0.00169 1.00 0.0780
#> 5 0.000173 0.000172 0.00178 1.00 0.0780
#> 6 0.000000807 0.000000804 0.00321 1.00 0.0794
#> 7 0.0000233 0.0000233 0.00113 1.00 0.0775
#> 8 0.0000000504 0.0000000504 0.00120 1.00 0.0775
#> 9 0.000130 0.000129 0.00401 1.00 0.0801
#> 10 0.00108 0.00108 0.00147 1.00 0.0778
#> # ℹ 1,180 more rows
Note that the parameter used to select individual observations is now
called level
, not group
. Previous versions of
HLMdiag
used the group parameter for influence diagnostics,
where group = NULL
was the default, and users could choose
to delete groups instead by setting group
equal to level
names. As of HLMdiag
version 0.4.0, group
has
been replaced by level
. level
defaults to
level = 1
which deletes individual observations, exactly
how group = NULL
used to work.
The resulting tibble can be used along with
dotplot_diag()
to identify potentially influential
observations using either an internal cutoff of 3*IQR or a
user-specified cutoff. For example, the plot shown below reveals and
labels all observations above the internally calculated cutoff for
Cook’s distance.
The plot illustrates that there are many observations with a Cook’s
distance value above the internally calculated cutoff.
dotplot_diag()
labels the top 5, which is difficult to see
in this case. Rather than investigate all observations above the cutoff,
we may be interested in the observations with the highest Cook’s
distance values. The tibble returned by hlm_influence()
provides an easy way to do this when used with the
arrange()
function from dplyr
.
#> # A tibble: 1,190 × 14
#> id mathgain mathkind sex minority ses housepov schoolid classid cooksd
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl>
#> 1 539 253 290 0 1 -0.4 0.257 47 82 0.0536
#> 2 41 -110 434 0 1 -0.37 0.365 4 179 0.0265
#> 3 1078 203 290 1 1 -0.32 0.067 95 144 0.0256
#> 4 664 -84 629 1 0 2.33 0.17 62 22 0.0245
#> 5 312 138 542 1 0 2.27 0.05 27 104 0.0221
#> 6 754 218 368 0 1 -1.14 0.43 70 152 0.0202
#> 7 337 197 290 1 1 -0.46 0.214 28 203 0.0194
#> 8 723 214 290 1 1 -0.67 0.209 68 244 0.0187
#> 9 812 147 510 0 0 -0.47 0.367 75 42 0.0148
#> 10 1146 11 376 0 1 0.62 0.126 102 44 0.0132
#> mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0527 0.0172 1.02 0.112
#> 2 0.0264 0.00406 1.00 0.143
#> 3 0.0250 0.0250 1.03 0.147
#> 4 0.0242 0.0125 1.01 0.0869
#> 5 0.0219 0.00988 1.01 0.0919
#> 6 0.0200 0.00702 1.01 0.0896
#> 7 0.0190 0.0198 1.02 0.147
#> 8 0.0183 0.0195 1.02 0.155
#> 9 0.0147 0.00797 1.01 0.0722
#> 10 0.0131 0.00825 1.01 0.0962
#> # ℹ 1,180 more rows
There does not appear to be a clear pattern among students who have
relatively higher Cook’s distance values; they do not tend to be from a
particular school or class. In order to investigate these observations
further, one could look to summary statistics for the different
explanatory variables. Additionally, a similar analysis could be done
with the other influence diagnostics provided in the tibble from
hlm_influence()
.
In addition to identifying influential observations, it may also be
useful to identify influential groups. The level
parameter
in hlm_influence()
can be used for this purpose. For
example, we can obtain influence diagnostics for each class:
#> # A tibble: 312 × 6
#> `classid:schoolid` cooksd mdffits covtrace covratio leverage.overall
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 160:1 0.00224 0.00224 0.00990 1.01 0.0980
#> 2 217:1 0.00133 0.00131 0.0238 1.02 0.128
#> 3 197:2 0.00323 0.00320 0.0106 1.01 0.123
#> 4 211:2 0.00724 0.00719 0.0202 1.02 0.0891
#> 5 307:2 0.00346 0.00343 0.0197 1.02 0.152
#> 6 11:3 0.00116 0.00115 0.0141 1.01 0.0960
#> 7 137:3 0.000873 0.000868 0.0131 1.01 0.150
#> 8 145:3 0.00517 0.00515 0.0248 1.02 0.0983
#> 9 228:3 0.000479 0.000478 0.00797 1.01 0.126
#> 10 48:4 0.00371 0.00367 0.0212 1.02 0.134
#> # ℹ 302 more rows
Note that the level
parameter is set as
classid:schoolid
, not classid
. The level
parameter should be specified as found in model@flist
for
lmerMod objects. For three level models, this usually takes the form of
“level2:level3”, where level2 is the name of the second level variable,
and level3 is the name of the third level variable.
Using dotplot_diag()
again with one of the columns from
the resulting tibble of hlm_influence
reveals that class
218 has a relatively high Cook’s distance value.
We can repeat a similar analysis to flag influential schools.
#> # A tibble: 107 × 6
#> schoolid cooksd mdffits covtrace covratio leverage.overall
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.000561 0.000557 0.0381 1.04 0.0901
#> 2 2 0.00681 0.00672 0.0554 1.06 0.115
#> 3 3 0.0343 0.0333 0.0723 1.07 0.108
#> 4 4 0.0248 0.0246 0.0360 1.04 0.125
#> 5 5 0.00224 0.00223 0.0583 1.06 0.0998
#> 6 6 0.0109 0.0107 0.0690 1.07 0.104
#> 7 7 0.00576 0.00560 0.0785 1.08 0.108
#> 8 8 0.0121 0.0118 0.0785 1.08 0.0926
#> 9 9 0.00624 0.00590 0.0793 1.08 0.141
#> 10 10 0.00848 0.00826 0.0723 1.07 0.0928
#> # ℹ 97 more rows
Similarly, we use dotplot_diag
to visually represent the
differences. In this case, there are only four observations above the
cutoff, so we set modify = "dotplot"
in order to better
visualize the influential observations. modify = "dotplot"
works well when there are a relatively small number of observations
above the cutoff.
The plot reveals that schools 68, 75, 70, and 27 have relatively high Cook’s distance values.
delete
The delete
parameter allows the user to calculate
influence diagnostics for a specified group of observations, or group
factor level. For example, influence diagnostics for the influential
schools flagged above (68, 75, 70, and 27) can be calculated as shown
below, deleting all four schools at once:
hlm_influence(class.lmer, level = "schoolid", delete = c("27", "70", "75", "68"))
#> # A tibble: 1 × 4
#> cooksd mdffits covtrace covratio
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.238 0.222 0.370 1.43
Note that in this case, delete
is specified as a
character vector consisting of the school ID’s of interest.
delete
can also be set as a numeric vector of indices; in
this case, setting delete
to the row numbers of all
students in the data frame who attend schools 68, 75, 70, or 27 would be
equivalent to the line above.
leverage
argumentIn the previous examples, hlm_influence()
only returned
overall leverage. However, hlm_influence()
also allows for
other types of leverage, including the leverage corresponding to the
fixed effects, the leverage corresponding to the random effects as
proposed by Demidenko and Stukel (2005), and the unconfounded leverage
corresponding to the random effects proposed by Nobre and Singer (2011).
These types of leverage are referred to as fixef
,
ranef
, and ranef.uc
, respectively.
None of our observations are flagged for high leverage when looking only at overall leverage:
However, we can obtain the other types of leverage as follows:
infl2 <- hlm_influence(class.lmer, level = 1, leverage = c("overall", "fixef", "ranef", "ranef.uc"))
This example illustrates how to select all four types of leverage, but any one or more may also be selected.
We can then use dotplot_diag()
with one of the new
leverage columns to investigate outlier for that type of leverage.
However, further analysis reveals that several observations have high leverage when considering only the leverage corresponding to the fixed effects.
approx
hlm_influence()
defaults to calculating influence
diagnostics based off a one step approximation (Christensen
et.al 1992; Shi and Chen 2008; Zewotir
2008). However, the approx
parameter allows the
user to calculate influence diagnostics based off a full refit of the
data using hlm_influence()
. For example, if we wished to
calculate influence diagnostics for each school by fully refitting the
model each time, we could use:
#> # A tibble: 107 × 9
#> schoolid cooksd mdffits covtrace covratio rvc.sigma2 rvc.D11 rvc.D22
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.000642 0.000639 0.0565 1.06 -0.00262 0.0154 0.0183
#> 2 2 0.00684 0.00676 0.0423 1.04 -0.00409 0.0161 -0.00403
#> 3 3 0.0389 0.0386 0.0256 0.973 0.00299 0.000124 -0.0952
#> 4 4 0.0249 0.0253 0.0971 0.906 -0.0315 -0.0293 0.0174
#> 5 5 0.00222 0.00221 0.0691 1.07 -0.00229 0.0160 0.00997
#> 6 6 0.0111 0.0109 0.0671 1.07 0.00443 0.00323 -0.0198
#> 7 7 0.00564 0.00546 0.0970 1.10 0.00436 0.0230 -0.0119
#> 8 8 0.0117 0.0115 0.0653 1.07 -0.00744 -0.0804 0.0574
#> 9 9 0.00634 0.00598 0.0922 1.09 0.00313 0.00326 -0.00216
#> 10 10 0.00816 0.00790 0.109 1.11 0.00796 0.0220 -0.00935
#> leverage.overall
#> <dbl>
#> 1 0.0901
#> 2 0.115
#> 3 0.108
#> 4 0.125
#> 5 0.0998
#> 6 0.104
#> 7 0.108
#> 8 0.0926
#> 9 0.141
#> 10 0.0928
#> # ℹ 97 more rows
In most cases, using the default of approx = TRUE
is
sufficient for influence analysis. Setting approx = FALSE
also takes much longer than the default setting since the model must be
refit each time with each group or observation deleted. However, the
full refit method also allows for relative variance change (RVC) to be
returned. If this is desired, approx = FALSE
should be
used.
na.action
and the data
argumenthlm_influence()
was written to respect the
na.action
parameter from the lme4
package.
This argument defaults to na.omit
, which means all rows in
the data sets with NA
s present are simply deleted from the
model. However, there is also an na.exclude
option, which
pads the resulting tibbles with NA
s in the the rows that
contained NA
s in the original data set in place of deleting
them altogether. In order for hlm_influence()
to do this,
the original data set must be passed into hlm_influence()
via the data
argument whenever the na.action
was set to na.exclude
in the model fitting process.
For example, while the class
data set does not have any
NA
s in the data set, we can introduce a couple for the
purposes of an example.
We can then fit the same model using the lme4
package as
before. Below, we fit two models, one with
na.action = "na.exclude"
and one with the default
na.action = "na.omit"
.
class.lmer.exclude <- lme4::lmer(mathgain ~ mathkind + sex + minority + ses + housepov + (1|schoolid/classid), data = classNA, na.action = "na.exclude")
class.lmer.omit <- lme4::lmer(mathgain ~ mathkind + sex + minority + ses + housepov + (1|schoolid/classid), data = classNA, na.action = "na.omit")
We then run hlm_influence()
on the model where
na.action = "na.omit"
.
#> # A tibble: 1,188 × 14
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 83 449 0 1 -0.38 0.082 1 217
#> 3 3 53 425 0 1 -0.03 0.082 1 217
#> 4 4 65 450 1 1 0.76 0.082 1 217
#> 5 5 51 452 0 1 -0.03 0.082 1 217
#> 6 6 66 443 0 1 0.2 0.082 1 217
#> 7 7 88 422 1 1 0.64 0.082 1 217
#> 8 8 -7 480 0 1 0.13 0.082 1 217
#> 9 9 60 502 0 1 0.83 0.082 1 217
#> 10 10 -2 502 1 1 0.06 0.082 2 197
#> cooksd mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000710 0.000708 0.00351 1.00 0.160
#> 2 0.000295 0.000294 0.00182 1.00 0.0801
#> 3 0.000140 0.000140 0.00183 1.00 0.0801
#> 4 0.00000818 0.00000815 0.00325 1.00 0.0814
#> 5 0.0000145 0.0000144 0.00124 1.00 0.0795
#> 6 0.00000224 0.00000224 0.00127 1.00 0.0796
#> 7 0.000184 0.000183 0.00400 1.00 0.0821
#> 8 0.00111 0.00111 0.00163 1.00 0.0799
#> 9 0.000375 0.000374 0.00338 1.00 0.0815
#> 10 0.00250 0.00248 0.00508 1.01 0.136
#> # ℹ 1,178 more rows
Note than there are only 1,188 rows in the returned tibble, although there were 1,190 observations in the original data set. The two rows with NAs were deleted from the returned tibble.
We repeat this with the model where
na.action = "na.exclude"
.
#> # A tibble: 1,190 × 14
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 109 NA 0 1 -0.27 0.082 1 160
#> 3 3 NA 511 1 1 -0.03 0.082 1 160
#> 4 4 83 449 0 1 -0.38 0.082 1 217
#> 5 5 53 425 0 1 -0.03 0.082 1 217
#> 6 6 65 450 1 1 0.76 0.082 1 217
#> 7 7 51 452 0 1 -0.03 0.082 1 217
#> 8 8 66 443 0 1 0.2 0.082 1 217
#> 9 9 88 422 1 1 0.64 0.082 1 217
#> 10 10 -7 480 0 1 0.13 0.082 1 217
#> cooksd mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000710 0.000708 0.00351 1.00 0.160
#> 2 NA NA NA NA NA
#> 3 NA NA NA NA NA
#> 4 0.000295 0.000294 0.00182 1.00 0.0801
#> 5 0.000140 0.000140 0.00183 1.00 0.0801
#> 6 0.00000818 0.00000815 0.00325 1.00 0.0814
#> 7 0.0000145 0.0000144 0.00124 1.00 0.0795
#> 8 0.00000224 0.00000224 0.00127 1.00 0.0796
#> 9 0.000184 0.000183 0.00400 1.00 0.0821
#> 10 0.00111 0.00111 0.00163 1.00 0.0799
#> # ℹ 1,180 more rows
In this tibble, there are 1,190 rows. Furthermore, the two rows with
NAs display NAs for the influence diagnostics, instead of being entirely
absent as in the above example. It is important to note that the
data
argument is necessary. Failing to provide the data set
through the data
argument in this situation will result in
an error.
The hlm_augment()
function combines the
hlm_resid()
and hlm_influence()
functions to
return a tibble containing information about the residuals and the
influence diagnostics appended to the data. For example, we can obtain
residuals and influence diagnostics at once for all students in the
class
data set with the following:
#> # A tibble: 1,190 × 20
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <dbl> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 109 460 0 1 -0.27 0.082 1 160
#> 3 3 56 511 1 1 -0.03 0.082 1 160
#> 4 4 83 449 0 1 -0.38 0.082 1 217
#> 5 5 53 425 0 1 -0.03 0.082 1 217
#> 6 6 65 450 1 1 0.76 0.082 1 217
#> 7 7 51 452 0 1 -0.03 0.082 1 217
#> 8 8 66 443 0 1 0.2 0.082 1 217
#> 9 9 88 422 1 1 0.64 0.082 1 217
#> 10 10 -7 480 0 1 0.13 0.082 1 217
#> .resid .fitted .ls.resid .ls.fitted .mar.resid .mar.fitted cooksd
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -37.7 69.7 0 32 -34.6 66.6 0.00106
#> 2 47.5 61.5 0 109 50.6 58.4 0.00143
#> 3 18.5 37.5 0 56 21.6 34.4 0.000336
#> 4 23.3 59.7 35.4 47.6 20.0 63.0 0.000225
#> 5 -19.9 72.9 -15.4 68.4 -23.1 76.1 0.000173
#> 6 1.01 64.0 -4.18 69.2 -2.22 67.2 0.000000807
#> 7 -9.14 60.1 -1.19 52.2 -12.4 63.4 0.0000233
#> 8 0.413 65.6 4.24 61.8 -2.82 68.8 0.0000000504
#> 9 11.5 76.5 4.18 83.8 8.22 79.8 0.000130
#> 10 -54.8 47.8 -45.3 38.3 -58.0 51.0 0.00108
#> mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.00106 0.00289 1.00 0.121
#> 2 0.00143 0.00246 1.00 0.121
#> 3 0.000335 0.00378 1.00 0.122
#> 4 0.000225 0.00169 1.00 0.0780
#> 5 0.000172 0.00178 1.00 0.0780
#> 6 0.000000804 0.00321 1.00 0.0794
#> 7 0.0000233 0.00113 1.00 0.0775
#> 8 0.0000000504 0.00120 1.00 0.0775
#> 9 0.000129 0.00401 1.00 0.0801
#> 10 0.00108 0.00147 1.00 0.0778
#> # ℹ 1,180 more rows
This is useful for inspecting residuals values and influence
diagnostics values at the same time. However, hlm_augment()
lacks some of the functionality that hlm_influence()
and
hlm_resid()
have. The delete
and
approx
parameters available for
hlm_influence()
are not available in
hlm_augment()
, so the function will always use a one step
approximation and delete all observations or groups instead of a
selected few. The standardize
parameter from
hlm_resid()
is also not included, so
hlm_influence()
or hlm_resid()
should be used
instead if this functionality is desired. For more information about
available functionality in hlm_resid()
, see the
hlm_resid
vignette.
hlm_augment()
is especially useful for inspecting
residual values of observations with relatively high influence
diagnostics values, or vice versa.
#> # A tibble: 1,190 × 20
#> id mathgain mathkind sex minority ses housepov schoolid classid .resid
#> <dbl> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl>
#> 1 539 253 290 0 1 -0.4 0.257 47 82 110.
#> 2 41 -110 434 0 1 -0.37 0.365 4 179 -157.
#> 3 1078 203 290 1 1 -0.32 0.067 95 144 62.0
#> 4 664 -84 629 1 0 2.33 0.17 62 22 -88.7
#> 5 312 138 542 1 0 2.27 0.05 27 104 94.6
#> 6 754 218 368 0 1 -1.14 0.43 70 152 107.
#> 7 337 197 290 1 1 -0.46 0.214 28 203 60.7
#> 8 723 214 290 1 1 -0.67 0.209 68 244 59.8
#> 9 812 147 510 0 0 -0.47 0.367 75 42 87.2
#> 10 1146 11 376 0 1 0.62 0.126 102 44 -79.8
#> .fitted .ls.resid .ls.fitted .mar.resid .mar.fitted cooksd mdffits covtrace
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 143. 4.04e+ 1 213. 117. 136. 0.0536 0.0527 0.0172
#> 2 47.0 0 -110 -177. 66.8 0.0265 0.0264 0.00406
#> 3 141. 0 203 65.9 137. 0.0256 0.0250 0.0250
#> 4 4.67 -2.79e+ 1 -56.1 -81.9 -2.08 0.0245 0.0242 0.0125
#> 5 43.4 -4.44e-15 138 98.1 39.9 0.0221 0.0219 0.00988
#> 6 111. -6.66e-15 218 125. 93.1 0.0202 0.0200 0.00702
#> 7 136. 0 197 62.3 135. 0.0194 0.0190 0.0198
#> 8 154. 0 214 80.4 134. 0.0187 0.0183 0.0195
#> 9 59.8 2.78e+ 1 119. 109. 38.3 0.0148 0.0147 0.00797
#> 10 90.8 -2.60e+ 1 37.0 -91.1 102. 0.0132 0.0131 0.00825
#> covratio leverage.overall
#> <dbl> <dbl>
#> 1 1.02 0.112
#> 2 1.00 0.143
#> 3 1.03 0.147
#> 4 1.01 0.0869
#> 5 1.01 0.0919
#> 6 1.01 0.0896
#> 7 1.02 0.147
#> 8 1.02 0.155
#> 9 1.01 0.0722
#> 10 1.01 0.0962
#> # ℹ 1,180 more rows
We can sort by a particular column in order to inspect the values of other influence diagnostics and the residuals of influential observations.
Previously, only the individual one step approximation influence
functions worked on lme
models fit using the
nlme
package. However, hlm_influence()
can
also be used on lme
objects, as can
hlm_augment()
. This allows the user to calculate influence
diagnostics for lme
models by fulling refitting the model
using the approx = FALSE
argument.
In most cases, using a lme
object for
hlm_influence()
or hlm_augment()
is identical
to their usage with lmerMod
objects from lme4
.
However, there are a few notable exceptions.
For both hlm_influence()
and hlm_augment()
,
levels should be specified by names that appear in
model@flist
. For the second level of a three level
lmerMod
model, this usually looks like “level2:level3”
where level2 and level3 are the names of the second and third level
variables, respectively. However, for a lme
model, levels
should be specified by names that appear in model$groups
.
For example, we can obtain influence diagnostics for each classroom from
the class
data set in the following way:
class.lme <- nlme::lme(mathgain ~ mathkind + sex + minority + ses + housepov, random = ~ 1|schoolid/classid, data = classroom)
hlm_influence(class.lme, level = "classid")
#> # A tibble: 312 × 6
#> classid cooksd mdffits covtrace covratio leverage.overall
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1/160 0.00224 0.00224 0.00990 1.01 0.121
#> 2 1/217 0.00133 0.00131 0.0238 1.02 0.0784
#> 3 2/197 0.00323 0.00320 0.0106 1.01 0.126
#> 4 2/211 0.00724 0.00719 0.0202 1.02 0.102
#> 5 2/307 0.00346 0.00343 0.0197 1.02 0.0752
#> 6 3/11 0.00116 0.00115 0.0141 1.01 0.102
#> 7 3/137 0.000873 0.000868 0.0131 1.01 0.0819
#> 8 3/145 0.00517 0.00515 0.0248 1.02 0.107
#> 9 3/228 0.000479 0.000478 0.00797 1.01 0.148
#> 10 4/48 0.00371 0.00367 0.0212 1.02 0.149
#> # ℹ 302 more rows
For the lmerMod
model, we specified
level = classid:schoolid
. However, for lme
models, the name of the second level alone is sufficient. However,
specifying names of specific groups to delete for lme
models is also slightly different. For example, we can obtain influence
diagnostics for a model created when classes 160 and 217 are deleted for
a lmerMod
model in the following way:
hlm_influence(class.lmer, level = "classid:schoolid", delete = c("160:1", "217:1"))
#> # A tibble: 1 × 4
#> cooksd mdffits covtrace covratio
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000561 0.000557 0.0381 1.04
Obtaining influence diagnostics for a model created with the deletion
of classes 160 and 217 from a lme
model is a bit
different:
hlm_influence(class.lme, level = "classid", delete = c("1/160", "1/217"))
#> # A tibble: 1 × 4
#> cooksd mdffits covtrace covratio
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000561 0.000557 0.0381 1.04
Note that both examples specify units for deletion in the same way
they are specified in model@flist
(lmerMod
models) or model$groups
(lme
models).
Additionally, lmerMod
models require that the
data
argument is passed into hlm_influence()
and hlm_augment()
when
na.action = "na.exclude"
during the model fitting process.
However, this is unnecessary for lme
models.
The HLMdiag
package has two different ways to calculate
Cook’s distance. One of them, which is more exact, refits the model with
each observation or group deleted from the model and calculates Cook’s
distance from the resulting coefficient estimates and variance
components estimates. The second is a one-step approximation. For more
information about how the one step approximation works, we refer the
reader to Christensen et.al (1992); Shi and
Chen (2008); and Zewotir (2008). Other R
packages also have functions that can calculate Cook’s distance. In this
section, we highlight the differences between the ways of calculating
Cook’s distance in HLMdiag
versus other methods in the
car
and lme4
packages.
car
packageThe cooks_distance()
function from HLMdiag
calculates Cook’s distance through a full refit of the model using the
following formula:
$C_i(\hat{\beta}) = \frac{1}{p}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})$
Notice that the difference in the change in the parameter estimates
is scaled by the estimated covariance matrix of the original parameter
estimates. We can calculate the Cook’s distance values in the
HLMdiag
package by first using the
case_delete()
function, followed by the
cooks.distance()
function.
hlm_case <- HLMdiag:::case_delete.lmerMod(class.lmer)
hlm_cd_full <- HLMdiag:::cooks.distance.case_delete(hlm_case)
head(hlm_cd_full)
#> [1] 9.327238e-04 1.415243e-03 3.316859e-04 2.282337e-04 1.797497e-04
#> [6] 6.968432e-07
Conversely, the mdffits()
function from
HLMdiag
calculates MDFFITS, which is a multivariate version
of the DFFITS statistic. This is calculated as follows:
$MDFFITS_i(\hat{\beta}) = \frac{1}{p}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}_{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})$
Instead of scaling by the estimated covariance matrix of the original
parameter estimates, calculations for MDFFITS are scaled by the
estimated covariance matrix of the deletion estimates. For each deleted
observation or group, the model is refit and the covariance structure
re-estimated without unit i. We can calculate this similarly to
how we calculated Cook’s distance, using the same
case_delete
object.
hlm_mdffits <- mdffits(hlm_case)
head(hlm_mdffits)
#> [1] 9.304263e-04 1.412796e-03 3.302360e-04 2.278134e-04 1.793468e-04
#> [6] 6.942264e-07
These estimates are pretty similar to those produced by
cooks_distance()
; the difference is due to the use of the
covariance matrix of the deletion estimates, rather than the original
estimates.
The car
package calculates Cook’s distance similarly to
how the HLMdiag
package calculates MDFFITS. Instead of
using the covariance matrix of the original parameter estimates like
HLMdiag
’s cooks_distance()
function, the
cooks.distance()
function from car
uses the
estimated covariance matrix of the deletion estimates. However, the
cooks_distance()
function from car is not identical to the
mdffits
function from HLMdiag
; the
car::cooks.distance()
function also scaled each
observation, i, by dividing it by the estimated variance of the
model calculated without observation i. Therefore, the formula
used to calculate Cook’s distance in the car
package is as
follows:
$C_i(\hat{\beta}) = \frac{1}{p\hat{\sigma_{i}}^2}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}_{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})$
We can calculate this by first using the influence()
function followed by the cooks.distance()
function.
library(car)
car_case <- influence(class.lmer)
car_cd <- cooks.distance(car_case)
head(car_cd)
#> 1 2 3 4 5 6
#> 9.280228e-04 1.408088e-03 3.300041e-04 2.270854e-04 1.788868e-04 6.909834e-07
These values initially seem fairly different from the MDFFITS and
Cook’s distance values produced by HLMdiag
, due to the
additional scaling by the inverse of the variance of each refitted
model. We can adjust for this by multiplying the vector of Cook’s
distance values from car
by the estimated variance from
each refitted model.
sig.sq <- car_case[[4]][,1]
car_cd_adjusted <- car_cd * sig.sq
head(car_cd_adjusted)
#> 1 2 3 4 5 6
#> 0.6793697383 1.0316179048 0.2425235007 0.1667251598 0.1314640726 0.0005079862
Once the values from car
have been adjusted by sigma
squared, they now appear very similar to the MDFFITS values from
HLMdiag
. The plot below shows the difference between the
MDFFITS estimates from HLMdiag
and the variance-adjusted
Cook’s distance estimates from car
for all 1,190
observations in the classroom
data set.
While the difference between the estimates varies slightly due to differences in how the fixed effects and variance components are calculated for refit models in both packages, the difference in values tends to be very small.
lme4
packageSimilar to HLMdiag
, the lme4
package has
two methods that calculate Cook’s distance. One of them, similar to the
case_delete
method, conducts a full refit of models without
each observation or group. The second is a quicker approximation.
lme4
approximationIn order to calculate the approximation of Cook’s distance values
from lme4
, we use the cooks.distance()
function.
lme_cd_approx <- lme4:::cooks.distance.merMod(class.lmer)
head(lme_cd_approx)
#> 1 2 3 4 5 6
#> 0.050602009 0.080124360 0.012322170 0.011276222 0.008216468 0.000021657
However, this approximation is fairly different from the one produced
from HLMdiag
.
hlm_cd_approx <- HLMdiag:::cooks.distance.lmerMod(class.lmer)
head(hlm_cd_approx)
#> [1] 1.060728e-03 1.433652e-03 3.358071e-04 2.249644e-04 1.726769e-04
#> [6] 8.069538e-07
The approximation from HLMdiag
is also closer to the
full refit calculated by HLMdiag
than the lme4
approximation is. The plots below show the difference between the full
refit from HLMdiag
and the HLMdiag
approximation (left) and the lme4
approximation
(right).
The estimates produced by the lme4
approximation are
never smaller than the values from the HLMdiag
full refit,
but they tend to be further off the HLMdiag
full refit
values than the HLMdiag
approximation values are. The
difference between the full refit and the approximation from
HLMdiag
tends to be very small (< 0.0005), while the
difference between the full refit and the lme4
approximation values tends to be much larger.
lme4
full refitlme4
also has a function that performs a full refit of
the models without each observation or group and calculates Cook’s
distance values based off those refits. We can calculate this by using
the influence
function from lme4
.
lme_case <- lme4:::influence.merMod(class.lmer, data = classroom)
cd_lme_full <- cooks.distance(lme_case)
lme4
’s cook’s distance function for objects created from
the influence
function has a bug that prevents us from
obtaining Cook’s distance values from the influence
object
using lme4
; however, we can use the
cooks.distance
function from the stats
package
on the influence
object from lme4
.
The values obtained from lme4
match those from
car
, meaning that lme4
is also scaling the
estimates by dividing by the estimated variance of each refit model, in
addition to using the estimated covariance matrix from the deletion
estimates, rather than the original estimates. Therefore, the
lme4
package is also using this formula to calculate Cook’s
distance:
$C_i(\hat{\beta}) = \frac{1}{p\hat{\sigma_{i}}^2}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}_{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})$
The full refits from the lme4
and car
packages both choose to calculate what is MDFFITS in the
HLMdiag
package, and additionally choose to scale by
dividing by the variance from each refit model. Those choices explain
almost all of the differences between Cook’s distance values from the
three different packages; additional variation is due to differences in
how the new models are fit and coefficients estimated.
##References Bates D, Maechler M, Bolker B, Walker S (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48.
Christensen R, Pearson L, Johnson W (1992). “Case-Deletion Diagnostics for Mixed Models.” Technometrics, 34(1), 38–45.
Fox J, Weisburg S (2019). A {R} Companion to Applied Regression, Third Addition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
Loy A, Hofmann H (2014). HLMdiag: A Suite of Diagnostics for Hierarchical Linear Models in R. Journal of Statistical Software, 56(5), 1-28.
Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team (2020). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-148, <URL: https://CRAN.R-project.org/package=nlme>.
Shi L, Chen G (2008). “Case Deletion Diagnostics in Multilevel Models.” Journal of Multivariate Analysis, 99(9), 1860–1877.
West, B., Welch, K. & Galecki, A. (2006) Linear Mixed Models: A Practical Guide Using Statistical Software. First Edition. Chapman Hall / CRC Press. ISBN 1584884800
Zewotir T (2008). “Multiple Cases Deletion Diagnostics for Linear Mixed Models.” Communications in Statistics – Theory and Methods, 37(7), 1071–1084.