Title: Gradient Boosting for Generalized Additive Mixed Models
Description: Provides a novel framework to estimate mixed models via gradient boosting. The implemented functions are based on 'mboost' and 'lme4'. Hence, the family range is predetermined by 'lme4'. A correction mechanism for cluster-constant covariates is implemented, as well as an estimation of the random effects' covariance.
Authors: Lars Knieper [aut, cre], Torsten Hothorn [aut], Elisabeth Bergherr [aut], Colin Griesbach [aut]
Maintainer: Lars Knieper <[email protected]>
License: GPL-2
Version: 0.1.0
Built: 2025-02-24 18:30:44 UTC
Source: CRAN
A filtered sub-sample of the Himalayan Database distributed through the R for Data Science TidyTuesday project. This dataset includes information on the results and conditions for various Himalayan climbing expeditions. Each row corresponds to a single member of a climbing expedition team.
data("climbers_sub")
data("climbers_sub")
A filtered version of the corresponding dataset from the bayesrules package. A data frame with 2068 observations (1 per climber) and 19 variables:
unique expedition identifier
unique climber identifier
unique identifier of the expedition's destination peak
name of the expedition's destination peak
year of expedition
season of expedition (Autumn, Spring, Summer, Winter)
climber gender identity which the database oversimplifies to a binary category
climber age
climber citizenship
climber's role in the expedition (e.g., Co-Leader)
whether the climber was a hired member of the expedition
whether the climber successfully reached the destination
whether the climber was on a solo expedition
whether the climber utilized supplemental oxygen
whether the climber died during the expedition
whether the climber was injured on the expedition
number of climbers in the expedition
height of the peak in meters
the year of the first recorded summit of the peak (though not necessarily the actual first summit!)
Original source: https://www.himalayandatabase.com/. Complete dataset distributed by: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-09-22/.
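Since each climber is nested within an expedition, the data lend themselves to the mixed-model boosting functions of this package. A minimal sketch, assuming the TidyTuesday column names (success, age, oxygen_used, expedition_id) carry over to this sub-sample:
# Sketch only: the column names success, age, oxygen_used and expedition_id
# are assumed to match the TidyTuesday source and may differ here.
data("climbers_sub")
climbers_sub$expedition_id <- as.factor(climbers_sub$expedition_id)
mod_climb <- glmermboost(success ~ age + oxygen_used + (1 | expedition_id),
                         data = climbers_sub, family = binomial(),
                         control = boost_control(mstop = 50))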
Data from an observational study of whether there is a difference in microinvertebrate communities between estuaries that have been heavily modified by human activity and those that have not, across seven estuaries along the coast of New South Wales, Australia (Clark et al. 2015).
data("estuaries")
data("estuaries")
A data frame containing (among other things):
A factor describing whether the sample was taken from a 'Modified' or 'Pristine' estuary.
Whether the sample was taken from Inner (upstream) or Outer (downstream) zone of the estuary.
A factor with seven levels identifying which estuary the sample was taken from.
Total abundance of all invertebrates in the sample
Richness of taxa in the sample – the number of responses (of those in columns 8-94) taking a non-zero value
Other variables in the dataset give invertebrate counts separately for different taxa.
Clark, G. F., Kelaher, B. P., Dafforn, K. A., Coleman, M. A., Knott, N. A., Marzinelli, E. M., & Johnston, E. L. (2015). What does impacted look like? High diversity and abundance of epibiota in modified estuaries. Environmental Pollution, 196, 12-20.
data("estuaries") plot(Total~Estuary,data=estuaries,col=c(4,2,2,4,2,4,2))
data("estuaries") plot(Total~Estuary,data=estuaries,col=c(4,2,2,4,2,4,2))
This function returns logical indicators of whether a variable is cluster-constant.
find_ccc(df, id_char)
df: a data frame.
id_char: a character string giving the column name of the cluster identifier.
For uncorrected boosting of a mixed model, the estimates of cluster-constant covariates might be biased, as part of their effect is absorbed by the random effects. This bias is corrected by the underlying mermboost package.
Returns a logical vector that indicates which variables of the data frame have the same realisation for all observations of one cluster/individual.
See glmermboost and mermboost.
data(Orthodont)
find_ccc(Orthodont, "Subject")
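To make the notion of cluster-constant concrete, a small hand-built illustration (the toy data frame is hypothetical, not part of the package):
# 'sex' takes the same value for every observation of an id (cluster-constant),
# while 'weight' varies within clusters.
toy <- data.frame(id     = rep(1:3, each = 2),
                  sex    = rep(c("m", "f", "m"), each = 2),
                  weight = c(70, 72, 60, 61, 80, 79))
find_ccc(toy, "id")
# expected: TRUE for sex, FALSE for weight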
Gradient boosting for optimizing negative log-likelihoods as loss functions, where component-wise linear models are utilized as base-learners and an estimation of the random components is guaranteed via a maximum likelihood approach.
glmermboost(formula, data = list(), weights = NULL,
            offset = NULL, family = gaussian, na.action = na.omit,
            contrasts.arg = NULL, center = TRUE,
            control = boost_control(), oobweights = NULL, ...)
formula: a symbolic description of the model to be fit in the lme4 format, including random effects.
data: a data frame containing the variables in the model.
weights: an optional vector of weights to be used in the fitting process.
offset: a numeric vector to be used as offset (optional).
family: a family object as used by lme4 (e.g., gaussian, binomial()). Note that this is in contrast to usual mboost: "only" families as in lme4 are possible, since the family range is predetermined by lme4.
na.action: a function which indicates what should happen when the data contain NAs.
contrasts.arg: a list whose entries are contrasts suitable for input to model.matrix.
center: logical, indicating whether the predictor variables are centered before fitting.
control: a list of parameters controlling the algorithm; for more details see boost_control.
oobweights: an additional vector of out-of-bag weights, which is used for the out-of-bag risk (i.e., if some of the weights are zero).
...: additional arguments passed to the underlying mboost fitting routines.
The warning "model with centered covariates does not contain intercept" is given correctly: the intercept is estimated via the mixed model.
A (generalized) linear mixed model is fitted using a boosting algorithm based on component-wise univariate linear models. Additionally, a mixed model is estimated in every iteration and added to the current fit. The fit, i.e., the regression coefficients and random effects, can be interpreted in the usual way. This particular methodology is described in Knieper et al. (2025).
The description of glmboost holds, while some methods are newly implemented, such as predict.mermboost, plot.mer_cv and mstop.mer_cv. Only the first of these requires a further argument. Additionally, the methods VarCorr.mermboost and ranef.mermboost are implemented specifically.
See mermboost for the same approach using additive models.
See mer_cvrisk for a cluster-sensitive cross-validation.
data(Orthodont)

# are there cluster-constant covariates?
find_ccc(Orthodont, "Subject")

# fit initial model
mod <- glmermboost(distance ~ age + Sex + (1 | Subject),
                   data = Orthodont, family = gaussian,
                   control = boost_control(mstop = 100))

# let mermboost do the cluster-sensitive cross-validation for you
norm_cv <- mer_cvrisk(mod, no_of_folds = 10)
opt_m <- mstop(norm_cv)

# fit model with optimal stopping iteration
mod_opt <- glmermboost(distance ~ age + Sex + (1 | Subject),
                       data = Orthodont, family = gaussian,
                       control = boost_control(mstop = opt_m))

# use the model as known from mboost;
# in addition, there are some methods known from lme4
ranef(mod_opt)
VarCorr(mod_opt)

#######################

set.seed(123)

# Parameters
n_groups <- 10       # number of groups
n_per_group <- 50    # number of observations per group
beta_fixed <- c(0.5, -0.3, 0.7)  # fixed effects: intercept, covariate1, covariate2
sigma_random <- 1    # random effect standard deviation

# Simulate random effects (group-specific)
group_effects <- rnorm(n_groups, mean = 0, sd = sigma_random)

# Simulate covariates
covariate1 <- rnorm(n_groups * n_per_group)
covariate2 <- rnorm(n_groups * n_per_group)

# Simulate data
group <- rep(1:n_groups, each = n_per_group)
random_effect <- group_effects[group]

# Linear predictor including fixed and random effects
linear_predictor <- beta_fixed[1] + beta_fixed[2] * covariate1 +
  beta_fixed[3] * covariate2 + random_effect
prob <- plogis(linear_predictor)  # convert to probabilities

# Simulate binomial outcomes
y <- rbinom(n_groups * n_per_group, size = 1, prob = prob)

# Combine into a data frame
sim_data <- data.frame(group = group, y = y,
                       covariate1 = covariate1, covariate2 = covariate2)
sim_data$group <- as.factor(sim_data$group)

mod3 <- glmermboost(y ~ covariate1 + covariate2 + (1 | group),
                    data = sim_data, family = binomial())

bin_cv <- mer_cvrisk(mod3, no_of_folds = 10)
mstop(bin_cv)
Cross-validated estimation of the empirical risk for hyper-parameter selection. Folds are created in a cluster-sensitive way, so splitting the data into training and test sets respects the cluster structure.
mer_cvrisk(object, folds, no_of_folds, cores = 1)
object: an object of class mermboost or glmermboost.
folds: a weight matrix with number of rows equal to the number of observations. The number of columns corresponds to the number of cross-validation runs. Can be constructed manually if the automatic split is not suitable (see Details).
no_of_folds: number of folds; if given, the function creates the folds itself, taking the cluster structure into account.
cores: number of cores used for the cross-validation runs; passed on to the parallelization backend.
The number of boosting iterations is a hyper-parameter of the boosting algorithms implemented in this package. Honest, i.e., cross-validated, estimates of the empirical risk for different stopping parameters mstop are computed by this function, which can be used to choose an appropriate number of boosting iterations.
This function uses the cluster identifier held in the mermboost object to split the data into cluster-sensitive folds if the corresponding argument no_of_folds is given. As this might lead to imbalanced splits, the 1/0 matrix of folds can instead be given manually via the folds argument.
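A sketch of such a manual folds matrix for the Orthodont data, assigning whole subjects to folds (mod is a fitted mermboost model as in the example below):
# Each column is one cross-validation run: 1 keeps an observation in the
# training set, 0 moves its whole cluster (Subject) to the test set.
data(Orthodont)
subjects <- unique(Orthodont$Subject)
fold_of <- sample(rep(1:5, length.out = length(subjects)))
names(fold_of) <- as.character(subjects)
folds <- sapply(1:5, function(k)
  as.integer(fold_of[as.character(Orthodont$Subject)] != k))
## manual_cv <- mer_cvrisk(mod, folds = folds)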
An object of class mer_cv, containing the k folds as a matrix, the corresponding estimates of the empirical risk, their average, and the resulting optimal stopping iteration. plot and mstop methods are available.
data(Orthodont)

mod <- mermboost(distance ~ bbs(age, knots = 4) + bols(Sex) + (1 | Subject),
                 data = Orthodont, family = gaussian,
                 control = boost_control(mstop = 100))

# let mermboost do the cluster-sensitive cross-validation for you
norm_cv <- mer_cvrisk(mod, no_of_folds = 10)
opt_m <- mstop(norm_cv)
plot(norm_cv)
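On systems that support it, the cross-validation runs can be parallelized via the cores argument; a sketch, assuming a fork-based parallel backend is available on your platform:
# distribute the 10 cross-validation runs over 2 cores
norm_cv_par <- mer_cvrisk(mod, no_of_folds = 10, cores = 2)
mstop(norm_cv_par)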
Gradient boosting for optimizing negative log-likelihoods as loss functions, where arbitrary component-wise base-learners, e.g., smoothing procedures, are utilized additively. In addition, every iteration estimates the random components via a maximum likelihood approach using the current fit.
mermboost(formula, data = list(), na.action = na.omit, weights = NULL,
          offset = NULL, family = gaussian, control = boost_control(),
          oobweights = NULL,
          baselearner = c("bbs", "bols", "btree", "bss", "bns"), ...)
formula: a symbolic description of the model to be fit in the lme4 format, including random effects.
data: a data frame containing the variables in the model.
na.action: a function which indicates what should happen when the data contain NAs.
weights: (optional) a numeric vector of weights to be used in the fitting process.
offset: a numeric vector to be used as offset (optional).
family: a family object as used by lme4 (e.g., gaussian, binomial()). Note that this is in contrast to usual mboost: "only" families as in lme4 are possible, since the family range is predetermined by lme4.
control: a list of parameters controlling the algorithm; for more details see boost_control.
oobweights: an additional vector of out-of-bag weights, which is used for the out-of-bag risk (i.e., if some of the weights are zero).
baselearner: a character specifying the default component-wise base-learner to be used: "bbs" (penalized B-splines), "bols" (linear models) or "btree" (stumps); "bss" and "bns" are deprecated (see Details).
...: additional arguments passed to the underlying mboost fitting routines.
A (generalized) additive mixed model is fitted using a boosting algorithm based on component-wise base-learners. Additionally, a mixed model is estimated in every iteration and added to the current fit.
The base-learners can either be specified via the formula object or via the baselearner argument. The latter sets the default base-learner, which is used for all variables in the formula without an explicit base-learner specification (i.e., if a base-learner is explicitly specified in formula, the baselearner argument is ignored for this variable). Of note, "bss" and "bns" are deprecated and remain in the list only for backward compatibility. Note that more base-learners (i.e., in addition to the ones provided via baselearner) can be specified in formula; see baselearners for details.
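A sketch of this interplay, assuming mermboost handles unspecified variables as mboost does: age keeps its explicit smooth base-learner, while Sex falls back to the default given by the baselearner argument.
# 'age' is explicitly assigned a P-spline base-learner; 'Sex' has no explicit
# specification and therefore receives the default "bols" base-learner.
data(Orthodont)
mod_mix <- mermboost(distance ~ bbs(age, knots = 4) + Sex + (1 | Subject),
                     data = Orthodont, family = gaussian,
                     baselearner = "bols",
                     control = boost_control(mstop = 50))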
The description of mboost holds, while some methods are newly implemented, such as predict.mermboost, plot.mer_cv and mstop.mer_cv. Only the first of these requires a further argument. Additionally, the methods VarCorr.mermboost and ranef.mermboost are implemented specifically.
See glmermboost for the same approach using linear models.
See mer_cvrisk for a cluster-sensitive cross-validation.
data("Orthodont") # are there cluster-constant covariates? find_ccc(Orthodont, "Subject") mod <- mermboost(distance ~ bbs(age, knots = 4) + bols(Sex) + (1 |Subject), data = Orthodont, family = gaussian, control = boost_control(mstop = 100)) # let mermboost do the cluster-sensitive cross-validation for you norm_cv <- mer_cvrisk(mod, no_of_folds = 10) opt_m <- mstop(norm_cv) # fit model with optimal stopping iteration mod_opt <- mermboost(distance ~ bbs(age, knots = 4) + bols(Sex) + (1 |Subject), data = Orthodont, family = gaussian, control = boost_control(mstop = opt_m)) # use the model as known from mboost # in additional, there are some methods knwon from lme4 ranef(mod_opt) VarCorr(mod_opt)
data("Orthodont") # are there cluster-constant covariates? find_ccc(Orthodont, "Subject") mod <- mermboost(distance ~ bbs(age, knots = 4) + bols(Sex) + (1 |Subject), data = Orthodont, family = gaussian, control = boost_control(mstop = 100)) # let mermboost do the cluster-sensitive cross-validation for you norm_cv <- mer_cvrisk(mod, no_of_folds = 10) opt_m <- mstop(norm_cv) # fit model with optimal stopping iteration mod_opt <- mermboost(distance ~ bbs(age, knots = 4) + bols(Sex) + (1 |Subject), data = Orthodont, family = gaussian, control = boost_control(mstop = opt_m)) # use the model as known from mboost # in additional, there are some methods knwon from lme4 ranef(mod_opt) VarCorr(mod_opt)
Methods for models fitted by mixed model boosting algorithms.
## S3 method for class 'mermboost'
predict(object, newdata = NULL, RE = TRUE,
        type = c("link", "response", "class"),
        which = NULL, aggregate = c("sum", "cumsum", "none"), ...)

## S3 method for class 'glmermboost'
predict(object, newdata = NULL, RE = TRUE,
        type = c("link", "response", "class"),
        which = NULL, aggregate = c("sum", "cumsum", "none"), ...)

## S3 method for class 'mermboost'
ranef(object, iteration = mstop(object), ...)

## S3 method for class 'glmermboost'
ranef(object, iteration = mstop(object), ...)

## S3 method for class 'mermboost'
VarCorr(x, sigma = 1, iteration = mstop(x), ...)

## S3 method for class 'glmermboost'
VarCorr(x, sigma = 1, iteration = mstop(x), ...)

## S3 method for class 'mer_cv'
mstop(object, ...)

## S3 method for class 'mer_cv'
plot(x, ...)
object: objects of class mermboost or glmermboost; for mstop, an object of class mer_cv.
newdata: optionally, a data frame in which to look for variables with which to predict.
RE: a logical value (default TRUE) indicating whether the random effects should enter the predictions.
which: a subset of base-learners to take into account for computing predictions or coefficients. If which is given, only the selected base-learners are considered.
type: the type of prediction required. The default is on the scale of the predictors; the alternative "response" is on the scale of the response variable; "class" returns predicted classes for classification models.
aggregate: a character specifying how to aggregate predictions or coefficients of single base-learners. The default "sum" returns the prediction or coefficient for the final number of boosting iterations; "cumsum" returns them cumulated over the iterations and "none" the individual contributions.
iteration: an integer specifying from which iteration the random component is to be drawn.
sigma: a scaling argument as in lme4's VarCorr; see there for details.
x: an object of class mermboost or glmermboost for VarCorr, or a cross-validation object of class mer_cv for plot.
...: additional arguments passed to the underlying methods.
The methods correspond to the equivalent mboost and lme4 functions. However, additional arguments concerning the handling of random effects might be of interest.
The predict.mermboost methods give a vector, matrix or list, depending on the arguments. ranef.mermboost results in a matrix with the cluster identifiers as row names and the random effects as elements. Applying VarCorr.mermboost to a mermboost model yields a VarCorr.merMod object. For cross-validation objects of class mer_cv, mstop.mer_cv gives the numeric value of the optimal stopping iteration, while plot.mer_cv plots the cross-validation risk paths.
data(Orthodont)

mod <- glmermboost(distance ~ age + Sex + (1 | Subject),
                   data = Orthodont, family = gaussian,
                   control = boost_control(mstop = 50))

any(predict(mod, RE = FALSE) == predict(mod, RE = TRUE))
all(predict(mod, RE = FALSE) ==
      predict.glmboost(mod) + mod$nuisance()[[mstop(mod)]]$ff)

ranef(mod)
VarCorr(mod, iteration = 10)
The Orthodont data frame has 108 rows and 4 columns, recording the change in an orthodontic measurement over time for several young subjects.
data("Orthodont")
data("Orthodont")
This data frame contains the following columns:
distance: a numeric vector of distances from the pituitary to the pterygomaxillary fissure (mm). These distances are measured on x-ray images of the skull.
age: a numeric vector of ages of the subject (yr).
Subject: an ordered factor indicating the subject on which the measurement was made. The levels are labeled M01 to M16 for the males and F01 to F13 for the females. The ordering is by increasing average distance within sex.
Sex: a factor with levels Male and Female.
Investigators at the University of North Carolina Dental School followed the growth of 27 children (16 males, 11 females) from age 8 until age 14. Every two years they measured the distance between the pituitary and the pterygomaxillary fissure, two points that are easily identified on x-ray exposures of the side of the head.
Pinheiro, J. C. and Bates, D. M. (2000), Mixed-Effects Models in S and S-PLUS, Springer, New York. (Appendix A.17)
Potthoff, R. F. and Roy, S. N. (1964), “A generalized multivariate analysis of variance model useful especially for growth curve problems”, Biometrika, 51, 313–326.
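A quick look at the data (base graphics only):
data("Orthodont")
# distance against age, coloured by sex
plot(distance ~ age, data = Orthodont,
     col = ifelse(Orthodont$Sex == "Male", 4, 2))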