library(tidygam)
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:nlme':
#>
#> collapse
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
theme_set(theme_light())
The tidymv package offers two main user-oriented functions:
predict_gam()
: returns predictions of the outcome
variable based on the predictors in the GAM model. The user can specify
specific values for any predictor, and exclude model terms.
get_difference()
: returns the difference between two
smooths and those intervals along the smooth that do not include
0.
The output of these function can then be plotted with
plot()
, through the methods plot.tidygam()
and
plot.tidygam.diff()
.
Let’s start with a simple model and get model-based predictions.
We will use the gest
data table, available in tidygam.
The table consists of counts of gestures performed by infants of three
cultural backgrounds who participating in a longitudinal study (see
?gest
for details and references).
data("gest")
gest
#> # A tibble: 540 × 5
#> dyad background months gesture count
#> <fct> <fct> <dbl> <fct> <dbl>
#> 1 b01 Bengali 10 ho_gv 0
#> 2 b01 Bengali 10 point 0
#> 3 b01 Bengali 10 reach 5
#> 4 b01 Bengali 11 ho_gv 0
#> 5 b01 Bengali 11 point 1
#> 6 b01 Bengali 11 reach 8
#> 7 b01 Bengali 12 ho_gv 3
#> 8 b01 Bengali 12 point 0
#> 9 b01 Bengali 12 reach 0
#> 10 b02 Bengali 10 ho_gv 1
#> # ℹ 530 more rows
The following GAM models the overall trend in number of gestures from 10 to 12 months of age.
gs <- gam(
count ~ s(months, k = 3),
data = gest,
family = poisson
)
summary(gs)
#>
#> Family: poisson
#> Link function: log
#>
#> Formula:
#> count ~ s(months, k = 3)
#>
#> Parametric coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.27491 0.02361 53.99 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> s(months) 1.861 1.981 248.9 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.0372 Deviance explained = 6.61%
#> UBRE = 6.1921 Scale est. = 1 n = 540
Now we can obtain the predicted counts with
predict_gam()
.
gs_pred <- predict_gam(gs)
gs_pred
#> # A tibble: 11 × 5
#> months count se lower_ci[,1] upper_ci[,1]
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 0.768 0.0498 0.670 0.865
#> 2 10.2 0.896 0.0396 0.818 0.973
#> 3 10.4 1.02 0.0340 0.954 1.09
#> 4 10.6 1.14 0.0334 1.07 1.21
#> 5 10.8 1.25 0.0352 1.18 1.32
#> 6 11 1.35 0.0363 1.28 1.42
#> 7 11.2 1.44 0.0346 1.37 1.51
#> 8 11.4 1.51 0.0308 1.45 1.58
#> 9 11.6 1.58 0.0269 1.53 1.64
#> 10 11.8 1.65 0.0265 1.59 1.70
#> 11 12 1.71 0.0315 1.64 1.77
predict_gam()
returns an object of class
tidygam
, which can be plotted with plot()
.
The user has to specify the “series” used as the x-axis. The outcome variable is automatically selected for the y-axis.
Since the gs
model used a log-link function, the output
of predict_gam()
is in log-odds, rather than in counts.
We can convert the log-odds to counts by exponentiating them. The
tran_fun
argument allows the user to specify a function to
transform the predicted outcome values with.
by
-variablesSmooths can be fitted to different levels of a factor using so-called
by
-variables, specified within the smooth function
s()
with the by
argument. Note that smooths
are automatically centred so you need to include the
by
-variable as a parametric term too.
In this model, we fit a smooth along months
for each
background in the data.
gs_by <- gam(
count ~ background + s(months, by = background, k = 3),
data = gest,
family = poisson
)
summary(gs_by)
#>
#> Family: poisson
#> Link function: log
#>
#> Formula:
#> count ~ background + s(months, by = background, k = 3)
#>
#> Parametric coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.36119 0.03930 34.634 < 2e-16 ***
#> backgroundChinese -0.06916 0.05694 -1.215 0.22451
#> backgroundEnglish -0.21696 0.05814 -3.731 0.00019 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> s(months):backgroundBengali 1.935 1.996 83.24 <2e-16 ***
#> s(months):backgroundChinese 1.003 1.006 143.06 <2e-16 ***
#> s(months):backgroundEnglish 1.000 1.000 42.38 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.0384 Deviance explained = 7.72%
#> UBRE = 6.1223 Scale est. = 1 n = 540
The predictor for comparison is selected with the
comparison
argument in plot()
.
gs_by %>%
predict_gam(length_out = 20, series = "months", tran_fun = exp) %>%
plot(comparison = "background")
Note that the output of plot()
is a ggplot2 object,
which can be modified using ggplot2 functions.
gs_by %>%
predict_gam(length_out = 20, series = "months", tran_fun = exp) %>%
plot(comparison = "background") +
scale_color_brewer(type = "qual") + scale_fill_brewer(type = "qual")
Let’s try now a model with both gesture
and
background
as by
-variables.
gs_by_2 <- gam(
count ~ gesture + background +
s(months, by = background, k = 3) +
s(months, by = gesture, k = 3),
data = gest,
family = poisson
)
summary(gs_by_2)
#>
#> Family: poisson
#> Link function: log
#>
#> Formula:
#> count ~ gesture + background + s(months, by = background, k = 3) +
#> s(months, by = gesture, k = 3)
#>
#> Parametric coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.63549 0.04821 33.927 < 2e-16 ***
#> gesturepoint -0.41072 0.05695 -7.213 5.49e-13 ***
#> gesturereach -0.53554 0.05791 -9.248 < 2e-16 ***
#> backgroundChinese -0.06937 0.05693 -1.219 0.222975
#> backgroundEnglish -0.21719 0.05812 -3.737 0.000186 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> s(months):backgroundBengali 1.9226651 1.9925158 54.925 < 2e-16 ***
#> s(months):backgroundChinese 1.1043907 1.1961394 84.781 < 2e-16 ***
#> s(months):backgroundEnglish 1.0009022 1.0017885 28.978 < 2e-16 ***
#> s(months):gestureho_gv 1.0014574 1.0028600 4.207 0.0403 *
#> s(months):gesturepoint 0.0003739 0.0007391 0.000 0.9959
#> s(months):gesturereach 1.2383962 1.4167559 25.127 1.51e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 16/17
#> R-sq.(adj) = 0.0824 Deviance explained = 13%
#> UBRE = 5.7338 Scale est. = 1 n = 540
Note that models like this one are conceptually equivalent to linear
models without interactions between the by
-variables.
This is clear when plotting the predictions: notice how the shapes of
the smooths are very similar within each background, and they only
differ in slope (this is the effect of including separate
by
-variables).
gs_by_2 %>%
predict_gam(length_out = 20, series = "months", tran_fun = exp) %>%
plot(comparison = "gesture") +
scale_color_brewer(type = "qual") + scale_fill_brewer(type = "qual") +
facet_grid(~ background)
If you wish to plot the effect of specific by
-variables,
you can exclude terms like in the following code chunk. Note that the
name of terms has to match precisely the name in the model summary and
that you should exclude both parametric and smooth terms with the same
by
-variable. You also need to pick any level of the
excluded variable (otherwise the predictions will be repeated for each
level in the excluded variable, but since the variable is excluded, the
predictions will be the same).
to_exclude <- c("s(months):gestureho_gv", "s(months):gesturepoint", "s(months):gesturereach",
"gesturepoint", "gesturereach")
gs_by_2 %>%
predict_gam(length_out = 20, series = "months", tran_fun = exp,
exclude_terms = to_exclude,
# pick any value of the excluded variables.
values = list(gesture = "point")) %>%
plot(comparison = "background") +
scale_color_brewer(type = "qual") + scale_fill_brewer(type = "qual")
The following section illustrates how to specify and plot models with
the GAM equivalent of classical interactions
(e.g. background * gesture
).
Classical interactions between factors as usually obtained in linear
models with the :
syntax
(e.g. background:gesture
) are not possible in GAMs.
An alternative way to specify what are called interactions in
generalised linear models is by creating a new factor which is the
interaction of the two or more factors using the
interaction()
function, and include this “factor
interaction” predictors as a by
-variable.
gest <- gest %>%
mutate(back_gest = interaction(background, gesture))
gs_i <- gam(
count ~ back_gest + s(months, by = back_gest, k = 3),
data = gest,
family = poisson
)
summary(gs_i)
#>
#> Family: poisson
#> Link function: log
#>
#> Formula:
#> count ~ back_gest + s(months, by = back_gest, k = 3)
#>
#> Parametric coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.76112 0.05783 30.452 < 2e-16 ***
#> back_gestChinese.ho_gv 0.01249 0.08045 0.155 0.8766
#> back_gestEnglish.ho_gv -0.87624 0.10592 -8.273 < 2e-16 ***
#> back_gestBengali.point -1.07758 0.11358 -9.487 < 2e-16 ***
#> back_gestChinese.point -1.05327 0.12189 -8.641 < 2e-16 ***
#> back_gestEnglish.point -0.19015 0.08283 -2.296 0.0217 *
#> back_gestBengali.reach -0.48164 0.08943 -5.386 7.22e-08 ***
#> back_gestChinese.reach -0.81182 0.09928 -8.177 2.91e-16 ***
#> back_gestEnglish.reach -1.03646 0.10751 -9.640 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> s(months):back_gestBengali.ho_gv 1.773 1.949 87.901 < 2e-16 ***
#> s(months):back_gestChinese.ho_gv 1.002 1.003 72.955 < 2e-16 ***
#> s(months):back_gestEnglish.ho_gv 1.001 1.002 40.670 < 2e-16 ***
#> s(months):back_gestBengali.point 1.941 1.997 22.153 2.08e-05 ***
#> s(months):back_gestChinese.point 1.001 1.001 91.257 < 2e-16 ***
#> s(months):back_gestEnglish.point 1.000 1.000 8.542 0.00347 **
#> s(months):back_gestBengali.reach 1.738 1.931 2.725 0.19781
#> s(months):back_gestChinese.reach 1.749 1.937 3.538 0.11830
#> s(months):back_gestEnglish.reach 1.000 1.000 4.261 0.03902 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.0999 Deviance explained = 18.8%
#> UBRE = 5.3243 Scale est. = 1 n = 540
When predicting values, the user can use the separate
argument to specify factor-interaction variables in the model that can
be split back into their individual components.
This gives greater flexibility when plotting.
bs = "fs"
)Factor smooth interactions are the GAM equivalent of random/group-level effects (intercepts and slopes).
Let’s work with the struct
data, which contains
event-related potentials measures of subjects listening to music and
speech. For each type (music vs language), the stimuli were either
“grammatical” or “ungrammatical” (i.e. the stimuli either respected
structural rules or they did not).
This is a subset of the original data, including voltage values only for electrode 62.
data("struct")
struct
#> # A tibble: 4,400 × 7
#> t electrode voltage subject stimulus.condition grammar.condition
#> <dbl> <dbl> <dbl> <fct> <fct> <fct>
#> 1 -100 62 -0.315 03 Language Grammatical
#> 2 -90 62 -0.320 03 Language Grammatical
#> 3 -80 62 -0.297 03 Language Grammatical
#> 4 -70 62 -0.628 03 Language Grammatical
#> 5 -60 62 -1.05 03 Language Grammatical
#> 6 -50 62 -0.734 03 Language Grammatical
#> 7 -40 62 0.0544 03 Language Grammatical
#> 8 -30 62 0.623 03 Language Grammatical
#> 9 -20 62 1.05 03 Language Grammatical
#> 10 -10 62 1.14 03 Language Grammatical
#> # ℹ 4,390 more rows
#> # ℹ 1 more variable: stim_gram <fct>
Let’s fit the model with factor smooth interactions
(bs = "fs"
).
struct <- struct %>%
mutate(stim_gram = interaction(stimulus.condition, grammar.condition))
st <- bam(
voltage ~ stim_gram +
s(t, by = stim_gram, k = 5) +
s(t, subject, bs = "fs", m = 1),
data = struct
)
summary(st)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> voltage ~ stim_gram + s(t, by = stim_gram, k = 5) + s(t, subject,
#> bs = "fs", m = 1)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.61210 0.19052 3.213 0.00132 **
#> stim_gramMusic.Grammatical 0.22318 0.09401 2.374 0.01763 *
#> stim_gramLanguage.Ungrammatical -0.98969 0.09401 -10.528 < 2e-16 ***
#> stim_gramMusic.Ungrammatical 0.37377 0.09401 3.976 7.12e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(t):stim_gramLanguage.Grammatical 2.846 3.249 3.545 0.02137 *
#> s(t):stim_gramMusic.Grammatical 2.627 3.022 4.218 0.00531 **
#> s(t):stim_gramLanguage.Ungrammatical 3.270 3.629 2.875 0.09791 .
#> s(t):stim_gramMusic.Ungrammatical 3.848 3.962 24.074 < 2e-16 ***
#> s(t,subject) 50.511 89.000 6.597 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.202 Deviance explained = 21.4%
#> fREML = 9800.7 Scale est. = 4.8605 n = 4400
When predicting values we want to exclude the factor smooth interaction, as we would with random/group-level effects in linear models.
Note that GAM terms to be excluded must be specified as they are
named in the output of summary()
.
predict_gam(
st,
length_out = 50,
series = "t",
exclude_terms = "s(t,subject)",
# Pick any subject: since we are removing the random effect, it does not
# matter which one you pick, the predictions will be the same
values = c(subject = "03"),
separate = list(stim_gram = c("stimulus", "grammar"))
) %>%
plot(comparison = "grammar") +
geom_hline(yintercept = 0) +
facet_grid(~ stimulus)
If the fs interaction is not removed, the predicted smooth for each individual level in the fs interaction is returned.