Title: | Combining Tree-Boosting with Gaussian Process and Mixed Effects Models |
---|---|
Description: | An R package that allows for combining tree-boosting with Gaussian process and mixed effects models. It also allows for independently doing tree-boosting as well as inference and prediction for Gaussian process and mixed effects models. See <https://github.com/fabsig/GPBoost> for more information on the software and Sigrist (2022, JMLR) <https://www.jmlr.org/papers/v23/20-322.html> and Sigrist (2023, TPAMI) <doi:10.1109/TPAMI.2022.3168152> for more information on the methodology. |
Authors: | Fabio Sigrist [aut, cre], Tim Gyger [aut], Pascal Kuendig [aut], Benoit Jacob [cph], Gael Guennebaud [cph], Nicolas Carre [cph], Pierre Zoppitelli [cph], Gauthier Brun [cph], Jean Ceccato [cph], Jitse Niesen [cph], Other authors of Eigen for the included version of Eigen [ctb, cph], Timothy A. Davis [cph], Guolin Ke [ctb], Damien Soukhavong [ctb], James Lamb [ctb], Other authors of LightGBM for the included version of LightGBM [ctb], Microsoft Corporation [cph], Dropbox, Inc. [cph], Jay Loden [cph], Dave Daeschler [cph], Giampaolo Rodola [cph], Alberto Ferreira [ctb], Daniel Lemire [ctb], Victor Zverovich [cph], IBM Corporation [ctb], Keith O'Hara [cph], Stephen L. Moshier [cph], Jorge Nocedal [cph], Naoaki Okazaki [cph], Yixuan Qiu [cph], Dirk Toewe [cph] |
Maintainer: | Fabio Sigrist <[email protected]> |
License: | Apache License (== 2.0) | file LICENSE |
Version: | 1.5.4 |
Built: | 2024-11-25 15:10:11 UTC |
Source: | CRAN |
This data set is originally from the Mushroom data set, UCI Machine Learning Repository. This data set includes the following fields:
label
: the label for each record
data
: a sparse Matrix of dgCMatrix class, with 126 columns.
data(agaricus.test)
A list containing a label vector, and a dgCMatrix object with 1611 rows and 126 variables
https://archive.ics.uci.edu/ml/datasets/Mushroom
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
This data set is originally from the Mushroom data set, UCI Machine Learning Repository. This data set includes the following fields:
label
: the label for each record
data
: a sparse Matrix of dgCMatrix class, with 126 columns.
data(agaricus.train)
A list containing a label vector, and a dgCMatrix object with 6513 rows and 127 variables
https://archive.ics.uci.edu/ml/datasets/Mushroom
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
This data set is originally from the Bank Marketing data set, UCI Machine Learning Repository.
It contains only the following: bank.csv with 10% of the examples, randomly selected from the full data set (an older version of this data set with fewer inputs).
data(bank)
A data.table with 4521 rows and 17 variables
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
S. Moro, P. Cortez and P. Rita. (2014) A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems
A matrix with spatial coordinates for the example data of the GPBoost package
data(GPBoost_data)
A matrix with spatial coordinates for predictions for the example data of the GPBoost package
data(GPBoost_data)
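Beyond the coordinate matrices, GPBoost_data also provides the covariate, response, and grouping objects used in the examples below; a minimal sketch of loading and inspecting them (object names as used in those examples):
data(GPBoost_data, package = "gpboost")
dim(coords)        # spatial coordinates for the training data
dim(coords_test)   # spatial coordinates of the prediction locations
dim(X)             # covariate (fixed effects) data
length(y)          # response variable
head(group_data)   # categorical grouping variables for random effects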
Returns a vector with the numbers of rows and columns in a gpb.Dataset.
## S3 method for class 'gpb.Dataset' dim(x, ...)
x |
Object of class |
... |
other parameters |
Note: since nrow and ncol internally use dim, they can also be directly used with a gpb.Dataset object.
a vector of numbers of rows and of columns
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
stopifnot(nrow(dtrain) == nrow(train$data))
stopifnot(ncol(dtrain) == ncol(train$data))
stopifnot(all(dim(dtrain) == dim(train$data)))
Only column names are supported for gpb.Dataset; setting row names would have no effect and the returned row names would be NULL.
## S3 method for class 'gpb.Dataset' dimnames(x) ## S3 replacement method for class 'gpb.Dataset' dimnames(x) <- value
x |
object of class |
value |
a list of two elements: the first one is ignored and the second one is column names |
Generic dimnames methods are used by colnames. Since row names are irrelevant, it is recommended to use colnames directly.
A list with the dimension names of the dataset
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
gpb.Dataset.construct(dtrain)
dimnames(dtrain)
colnames(dtrain)
colnames(dtrain) <- make.names(seq_len(ncol(train$data)))
print(dtrain, verbose = TRUE)
Generic 'fit' method for a GPModel
fit(gp_model, y, X, params, offset = NULL, fixed_effects = NULL)
gp_model |
a |
y |
A |
X |
A |
params |
A
|
offset |
A |
fixed_effects |
This is discontinued. Use the renamed equivalent argument |
Fabio Sigrist
Estimates the parameters of a GPModel by maximizing the marginal likelihood.
## S3 method for class 'GPModel' fit(gp_model, y, X = NULL, params = list(), offset = NULL, fixed_effects = NULL)
gp_model |
a |
y |
A |
X |
A |
params |
A
|
offset |
A |
fixed_effects |
This is discontinued. Use the renamed equivalent argument |
A fitted GPModel
Fabio Sigrist
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples data(GPBoost_data, package = "gpboost") # Add intercept column X1 <- cbind(rep(1,dim(X)[1]),X) X_test1 <- cbind(rep(1,dim(X_test)[1]),X_test) #--------------------Grouped random effects model: single-level random effect---------------- gp_model <- GPModel(group_data = group_data[,1], likelihood="gaussian") fit(gp_model, y = y, X = X1, params = list(std_dev = TRUE)) summary(gp_model) # Make predictions pred <- predict(gp_model, group_data_pred = group_data_test[,1], X_pred = X_test1, predict_var = TRUE) pred$mu # Predicted mean pred$var # Predicted variances # Also predict covariance matrix pred <- predict(gp_model, group_data_pred = group_data_test[,1], X_pred = X_test1, predict_cov_mat = TRUE) pred$mu # Predicted mean pred$cov # Predicted covariance #--------------------Gaussian process model---------------- gp_model <- GPModel(gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5, likelihood="gaussian") fit(gp_model, y = y, X = X1, params = list(std_dev = TRUE)) summary(gp_model) # Make predictions pred <- predict(gp_model, gp_coords_pred = coords_test, X_pred = X_test1, predict_cov_mat = TRUE) pred$mu # Predicted (posterior) mean of GP pred$cov # Predicted (posterior) covariance matrix of GP
Estimates the parameters of a GPModel by maximizing the marginal likelihood.
fitGPModel(likelihood = "gaussian", group_data = NULL, group_rand_coef_data = NULL, ind_effect_group_rand_coef = NULL, drop_intercept_group_rand_effect = NULL, gp_coords = NULL, gp_rand_coef_data = NULL, cov_function = "matern", cov_fct_shape = 1.5, gp_approx = "none", cov_fct_taper_range = 1, cov_fct_taper_shape = 1, num_neighbors = 20L, vecchia_ordering = "random", ind_points_selection = "kmeans++", num_ind_points = 500L, cover_tree_radius = 1, matrix_inversion_method = "cholesky", seed = 0L, cluster_ids = NULL, free_raw_data = FALSE, y, X = NULL, params = list(), vecchia_approx = NULL, vecchia_pred_type = NULL, num_neighbors_pred = NULL, offset = NULL, fixed_effects = NULL, likelihood_additional_param = 1)
likelihood |
A
|
group_data |
A |
group_rand_coef_data |
A |
ind_effect_group_rand_coef |
A |
drop_intercept_group_rand_effect |
A |
gp_coords |
A |
gp_rand_coef_data |
A |
cov_function |
A
|
cov_fct_shape |
A |
gp_approx |
A
|
cov_fct_taper_range |
A |
cov_fct_taper_shape |
A |
num_neighbors |
An |
vecchia_ordering |
A
|
ind_points_selection |
A
|
num_ind_points |
An |
cover_tree_radius |
A |
matrix_inversion_method |
A
|
seed |
An |
cluster_ids |
A |
free_raw_data |
A |
y |
A |
X |
A |
params |
A
|
vecchia_approx |
Discontinued. Use the argument |
vecchia_pred_type |
A |
num_neighbors_pred |
an |
offset |
A |
fixed_effects |
This is discontinued. Use the renamed equivalent argument |
likelihood_additional_param |
A |
A fitted GPModel
Fabio Sigrist
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples data(GPBoost_data, package = "gpboost") # Add intercept column X1 <- cbind(rep(1,dim(X)[1]),X) X_test1 <- cbind(rep(1,dim(X_test)[1]),X_test) #--------------------Grouped random effects model: single-level random effect---------------- gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1, likelihood="gaussian", params = list(std_dev = TRUE)) summary(gp_model) # Make predictions pred <- predict(gp_model, group_data_pred = group_data_test[,1], X_pred = X_test1, predict_var = TRUE) pred$mu # Predicted mean pred$var # Predicted variances # Also predict covariance matrix pred <- predict(gp_model, group_data_pred = group_data_test[,1], X_pred = X_test1, predict_cov_mat = TRUE) pred$mu # Predicted mean pred$cov # Predicted covariance #--------------------Two crossed random effects and a random slope---------------- gp_model <- fitGPModel(group_data = group_data, likelihood="gaussian", group_rand_coef_data = X[,2], ind_effect_group_rand_coef = 1, y = y, X = X1, params = list(std_dev = TRUE)) summary(gp_model) #--------------------Gaussian process model---------------- gp_model <- fitGPModel(gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5, likelihood="gaussian", y = y, X = X1, params = list(std_dev = TRUE)) summary(gp_model) # Make predictions pred <- predict(gp_model, gp_coords_pred = coords_test, X_pred = X_test1, predict_cov_mat = TRUE) pred$mu # Predicted (posterior) mean of GP pred$cov # Predicted (posterior) covariance matrix of GP #--------------------Gaussian process model with Vecchia approximation---------------- gp_model <- fitGPModel(gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5, gp_approx = "vecchia", num_neighbors = 20, likelihood="gaussian", y = y) summary(gp_model) #--------------------Gaussian process model with random coefficients---------------- gp_model <- fitGPModel(gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5, gp_rand_coef_data = X[,2], y=y, likelihood = "gaussian", params = list(std_dev = TRUE)) summary(gp_model) #--------------------Combine Gaussian process with grouped random effects---------------- gp_model <- fitGPModel(group_data = group_data, gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5, likelihood = "gaussian", y = y, X = X1, params = list(std_dev = TRUE)) summary(gp_model)
Get (estimated) auxiliary (additional) parameters of the likelihood such as the shape parameter of a gamma or a negative binomial distribution. Some likelihoods (e.g., bernoulli_logit or poisson) have no auxiliary parameters
get_aux_pars(gp_model)
gp_model |
A |
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
X1 <- cbind(rep(1,dim(X)[1]),X) # Add intercept column
y_pos <- exp(y)
gp_model <- fitGPModel(group_data = group_data[,1], y = y_pos, X = X1, likelihood="gamma")
get_aux_pars(gp_model)
Get (estimated) auxiliary (additional) parameters of the likelihood such as the shape parameter of a gamma or a negative binomial distribution. Some likelihoods (e.g., bernoulli_logit or poisson) have no auxiliary parameters
## S3 method for class 'GPModel' get_aux_pars(gp_model)
gp_model |
A |
A GPModel
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
X1 <- cbind(rep(1,dim(X)[1]),X) # Add intercept column
y_pos <- exp(y)
gp_model <- fitGPModel(group_data = group_data[,1], y = y_pos, X = X1, likelihood="gamma")
get_aux_pars(gp_model)
Get (estimated) linear regression coefficients and standard deviations (if std_dev=TRUE was set in fit).
get_coef(gp_model)
gp_model |
A |
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
X1 <- cbind(rep(1,dim(X)[1]),X) # Add intercept column
gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1, likelihood="gaussian")
get_coef(gp_model)
Get (estimated) linear regression coefficients and standard deviations (if std_dev=TRUE was set in fit).
## S3 method for class 'GPModel' get_coef(gp_model)
gp_model |
A |
A GPModel
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
X1 <- cbind(rep(1,dim(X)[1]),X) # Add intercept column
gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1, likelihood="gaussian")
get_coef(gp_model)
Get (estimated) covariance parameters and standard deviations (if std_dev=TRUE was set in fit).
get_cov_pars(gp_model)
gp_model |
A |
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
X1 <- cbind(rep(1,dim(X)[1]),X) # Add intercept column
gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1, likelihood="gaussian")
get_cov_pars(gp_model)
Get (estimated) covariance parameters and standard deviations (if std_dev=TRUE was set in fit).
## S3 method for class 'GPModel' get_cov_pars(gp_model)
gp_model |
A |
A GPModel
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
X1 <- cbind(rep(1,dim(X)[1]),X) # Add intercept column
gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1, likelihood="gaussian")
get_cov_pars(gp_model)
Auxiliary function to create categorical variables for nested grouped random effects
get_nested_categories(outer_var, inner_var)
outer_var |
A |
inner_var |
A |
A vector containing a categorical variable such that inner_var is nested in outer_var.
Fabio Sigrist
# Fit a model with Time as categorical fixed effects variables and Diet and Chick # as random effects, where Chick is nested in Diet using lme4 chick_nested_diet <- get_nested_categories(ChickWeight$Diet, ChickWeight$Chick) fixed_effects_matrix <- model.matrix(weight ~ as.factor(Time), data = ChickWeight) mod_gpb <- fitGPModel(X = fixed_effects_matrix, group_data = cbind(diet=ChickWeight$Diet, chick_nested_diet), y = ChickWeight$weight, params = list(std_dev = TRUE)) summary(mod_gpb) # This does (almost) the same thing as the following code using lme4: # mod_lme4 <- lmer(weight ~ as.factor(Time) + (1 | Diet/Chick), data = ChickWeight, REML = FALSE) # summary(mod_lme4)
Get one attribute of a gpb.Dataset
getinfo(dataset, ...) ## S3 method for class 'gpb.Dataset' getinfo(dataset, name, ...)
dataset |
Object of class |
... |
other parameters |
name |
the name of the information field to get (see details) |
The name field can be one of the following:
label
: the label that GPBoost learns from;
weight
: weights to rescale each observation;
group
: used for learning-to-rank tasks. An integer vector describing how to group rows together as ordered results from the same set of candidate results to be ranked. For example, if you have a 100-document dataset with group = c(10, 20, 40, 10, 10, 10), that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
init_score
: the initial score is the base prediction GPBoost will boost from.
info data
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
gpb.Dataset.construct(dtrain)
labels <- gpboost::getinfo(dtrain, "label")
gpboost::setinfo(dtrain, "label", 1 - labels)
labels2 <- gpboost::getinfo(dtrain, "label")
stopifnot(all(labels2 == 1 - labels))
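As a sketch of the weight field described above (reusing train and dtrain from the example; the constant weights are purely illustrative):
w <- rep(1.0, length(train$label))   # hypothetical per-observation weights
gpboost::setinfo(dtrain, "weight", w)
stopifnot(all(gpboost::getinfo(dtrain, "weight") == w))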
Attempts to prepare a clean dataset for use in a gpb.Dataset.
Factor, character, and logical columns are converted to integer. Missing values in factors and characters will be filled with 0L. Missing values in logicals will be filled with -1L.
This function returns and optionally takes in "rules" that describe exactly how to convert values in columns.
Columns that contain only NA values will be converted by this function but will not show up in the returned rules.
gpb.convert_with_rules(data, rules = NULL)
data |
A data.frame or data.table to prepare. |
rules |
A set of rules from the data preparator, if already used. This should be an R list,
where names are column names in |
A list with the cleaned dataset (data) and the rules (rules).
Note that the data must be converted to a matrix format (as.matrix) for input into gpb.Dataset.
data(iris) str(iris) new_iris <- gpb.convert_with_rules(data = iris) str(new_iris$data) data(iris) # Erase iris dataset iris$Species[1L] <- "NEW FACTOR" # Introduce junk factor (NA) # Use conversion using known rules # Unknown factors become 0, excellent for sparse datasets newer_iris <- gpb.convert_with_rules(data = iris, rules = new_iris$rules) # Unknown factor is now zero, perfect for sparse datasets newer_iris$data[1L, ] # Species became 0 as it is an unknown factor newer_iris$data[1L, 5L] <- 1.0 # Put back real initial value # Is the newly created dataset equal? YES! all.equal(new_iris$data, newer_iris$data) # Can we test our own rules? data(iris) # Erase iris dataset # We remapped values differently personal_rules <- list( Species = c( "setosa" = 3L , "versicolor" = 2L , "virginica" = 1L ) ) newest_iris <- gpb.convert_with_rules(data = iris, rules = personal_rules) str(newest_iris$data) # SUCCESS!
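Following the note above that the cleaned data must be passed in matrix format, a minimal sketch reusing new_iris from the example (the converted Species column is used as a purely hypothetical label):
iris_mat <- as.matrix(new_iris$data)                            # convert cleaned data to a matrix
diris <- gpb.Dataset(data = iris_mat[, -5L], label = iris_mat[, 5L])  # hypothetical label choice
gpb.Dataset.construct(diris)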
Cross-validation function for determining the number of boosting iterations
gpb.cv(params = list(), data, gp_model = NULL, nrounds = 1000L, early_stopping_rounds = NULL, folds = NULL, nfold = 5L, metric = NULL, verbose = 1L, line_search_step_length = FALSE, use_gp_model_for_validation = TRUE, fit_GP_cov_pars_OOS = FALSE, train_gp_model_cov_pars = TRUE, label = NULL, weight = NULL, obj = NULL, eval = NULL, record = TRUE, eval_freq = 1L, showsd = FALSE, stratified = TRUE, init_model = NULL, colnames = NULL, categorical_feature = NULL, callbacks = list(), reset_data = FALSE, delete_boosters_folds = FALSE, ...)
params |
list of "tuning" parameters. See the parameter documentation for more information. A few key parameters:
|
data |
a |
gp_model |
A |
nrounds |
number of boosting iterations (= number of trees). This is the most important tuning parameter for boosting |
early_stopping_rounds |
int. Activates early stopping. Requires at least one validation data
and one metric. When this parameter is non-null,
training will stop if the evaluation of any metric on any validation set
fails to improve for |
folds |
|
nfold |
the original dataset is randomly partitioned into |
metric |
Evaluation metric to be monitored when doing CV and parameter tuning.
Can be a |
verbose |
verbosity of output; if <= 0, this also disables printing of evaluation results during training |
line_search_step_length |
Boolean. If TRUE, a line search is done to find the optimal step length for every boosting update
(see, e.g., Friedman 2001). This is then multiplied by the |
use_gp_model_for_validation |
Boolean. If TRUE, the |
fit_GP_cov_pars_OOS |
Boolean (default = FALSE). If TRUE, the covariance parameters of the
|
train_gp_model_cov_pars |
Boolean. If TRUE, the covariance parameters
of the |
label |
Vector of labels, used if |
weight |
vector of observation weights. If not NULL, will be set on the dataset |
obj |
(character) The distribution of the response variable (=label) conditional on fixed and random effects. This only needs to be set when doing independent boosting without random effects / Gaussian processes. |
eval |
Evaluation metric to be monitored when doing CV and parameter tuning. This can be a string, function, or list with a mixture of strings and functions.
|
record |
Boolean, TRUE will record iteration message to |
eval_freq |
evaluation output frequency; only has an effect when verbose > 0 |
showsd |
|
stratified |
a |
init_model |
path of model file of |
colnames |
feature names, if not null, will use this to overwrite the names in dataset |
categorical_feature |
categorical features. This can either be a character vector of feature
names or an integer vector with the indices of the features (e.g.
|
callbacks |
List of callback functions that are applied at each iteration. |
reset_data |
Boolean, setting it to TRUE (not the default value) will transform the booster model into a predictor model which frees up memory and the original datasets |
delete_boosters_folds |
Boolean, setting it to TRUE (not the default value) will delete the boosters of the individual folds |
... |
other parameters, see Parameters.rst for more information. |
a trained model gpb.CVBooster
.
"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.
If multiple arguments are given to eval, their order will be preserved. If you enable early stopping by setting early_stopping_rounds in params, by default all metrics will be considered for early stopping. If you want to only consider the first metric for early stopping, pass first_metric_only = TRUE in params. Note that if you also specify metric in params, that metric will be considered the "first" one. If you omit metric, a default metric will be used based on your choice for the parameter obj (keyword argument) or objective (passed into params).
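A minimal sketch of early stopping on the first metric only (assuming the dtrain and gp_model objects created in the example below; "l2" and "l1" are standard built-in metrics):
params <- list(learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5,
               first_metric_only = TRUE)  # only the first metric triggers early stopping
cvbst <- gpb.cv(params = params, data = dtrain, gp_model = gp_model,
                nrounds = 100, nfold = 4, eval = list("l2", "l1"),
                early_stopping_rounds = 5)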
Authors of the LightGBM R package, Fabio Sigrist
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples library(gpboost) data(GPBoost_data, package = "gpboost") # Create random effects model and dataset gp_model <- GPModel(group_data = group_data[,1], likelihood="gaussian") dtrain <- gpb.Dataset(X, label = y) params <- list(learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5) # Run CV cvbst <- gpb.cv(params = params, data = dtrain, gp_model = gp_model, nrounds = 100, nfold = 4, eval = "l2", early_stopping_rounds = 5, use_gp_model_for_validation = TRUE) print(paste0("Optimal number of iterations: ", cvbst$best_iter, ", best test error: ", cvbst$best_score))
Construct a gpb.Dataset object from a dense matrix, a sparse matrix, or a local file (that was created previously by saving a gpb.Dataset).
gpb.Dataset(data, params = list(), reference = NULL, colnames = NULL, categorical_feature = NULL, free_raw_data = FALSE, info = list(), ...)
data |
a |
params |
a list of parameters. See the "Dataset Parameters" section of the parameter documentation for a list of parameters and valid values. |
reference |
reference dataset. When GPBoost creates a Dataset, it does some preprocessing like binning
continuous features into histograms. If you want to apply the same bin boundaries from an existing
dataset to new |
colnames |
names of columns |
categorical_feature |
categorical features. This can either be a character vector of feature
names or an integer vector with the indices of the features (e.g.
|
free_raw_data |
GPBoost constructs its data format, called a "Dataset", from tabular data.
By default, this Dataset object on the R side does keep a copy of the raw data.
If you set |
info |
a list of information of the |
... |
other information to pass to |
constructed dataset
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
data_file <- tempfile(fileext = ".data")
gpb.Dataset.save(dtrain, data_file)
dtrain <- gpb.Dataset(data_file)
gpb.Dataset.construct(dtrain)
Construct Dataset explicitly
gpb.Dataset.construct(dataset)
dataset |
Object of class |
constructed dataset
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
gpb.Dataset.construct(dtrain)
Construct validation data according to training data
gpb.Dataset.create.valid(dataset, data, info = list(), ...)
dataset |
|
data |
a |
info |
a list of information of the |
... |
other information to pass to |
constructed dataset
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "gpboost")
test <- agaricus.test
dtest <- gpb.Dataset.create.valid(dtrain, test$data, label = test$label)
Save a gpb.Dataset to a binary file. Please note that init_score is not saved in the binary file. If you need it, please set it again after loading the Dataset.
gpb.Dataset.save(dataset, fname)
dataset |
object of class |
fname |
object filename of output file |
the dataset you passed in
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
gpb.Dataset.save(dtrain, tempfile(fileext = ".bin"))
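Since init_score is not stored in the binary file, it has to be re-attached after reloading; a sketch reusing dtrain from the example (the zero init_score below is purely illustrative):
data_file <- tempfile(fileext = ".bin")
gpb.Dataset.save(dtrain, data_file)
dtrain2 <- gpb.Dataset(data_file)
gpb.Dataset.construct(dtrain2)
setinfo(dtrain2, "init_score", rep(0.0, nrow(dtrain2)))  # hypothetical init_score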
Set the categorical features of a gpb.Dataset object. Use this function to tell GPBoost which features should be treated as categorical.
gpb.Dataset.set.categorical(dataset, categorical_feature)
dataset |
object of class |
categorical_feature |
categorical features. This can either be a character vector of feature
names or an integer vector with the indices of the features (e.g.
|
the dataset you passed in
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
data_file <- tempfile(fileext = ".data")
gpb.Dataset.save(dtrain, data_file)
dtrain <- gpb.Dataset(data_file)
gpb.Dataset.set.categorical(dtrain, 1L:2L)
Set reference of a gpb.Dataset: if you want to use validation data, you should set its reference to the training data.
gpb.Dataset.set.reference(dataset, reference)
dataset |
object of class |
reference |
object of class |
the dataset you passed in
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "gpboost")
test <- agaricus.test
dtest <- gpb.Dataset(test$data, label = test$label)
gpb.Dataset.set.reference(dtest, dtrain)
Dump a GPBoost model to JSON
gpb.dump(booster, num_iteration = NULL)
booster |
Object of class |
num_iteration |
number of iterations to predict with; NULL or <= 0 means use the best iteration |
JSON format of the model
library(gpboost) data(agaricus.train, package = "gpboost") train <- agaricus.train dtrain <- gpb.Dataset(train$data, label = train$label) data(agaricus.test, package = "gpboost") test <- agaricus.test dtest <- gpb.Dataset.create.valid(dtrain, test$data, label = test$label) params <- list(objective = "regression", metric = "l2") valids <- list(test = dtest) model <- gpb.train( params = params , data = dtrain , nrounds = 10L , valids = valids , min_data = 1L , learning_rate = 1.0 , early_stopping_rounds = 5L ) json_model <- gpb.dump(model)
Given a gpb.Booster, return evaluation results for a particular metric on a particular dataset.
gpb.get.eval.result(booster, data_name, eval_name, iters = NULL, is_err = FALSE)
booster |
Object of class |
data_name |
Name of the dataset to return evaluation results for. |
eval_name |
Name of the evaluation metric to return results for. |
iters |
An integer vector of iterations you want to get evaluation results for. If NULL (the default), evaluation results for all iterations will be returned. |
is_err |
TRUE will return evaluation error instead |
numeric vector of evaluation result
# train a regression model data(agaricus.train, package = "gpboost") train <- agaricus.train dtrain <- gpb.Dataset(train$data, label = train$label) data(agaricus.test, package = "gpboost") test <- agaricus.test dtest <- gpb.Dataset.create.valid(dtrain, test$data, label = test$label) params <- list(objective = "regression", metric = "l2") valids <- list(test = dtest) model <- gpb.train( params = params , data = dtrain , nrounds = 5L , valids = valids , min_data = 1L , learning_rate = 1.0 ) # Examine valid data_name values print(setdiff(names(model$record_evals), "start_iter")) # Examine valid eval_name values for dataset "test" print(names(model$record_evals[["test"]])) # Get L2 values for "test" dataset gpb.get.eval.result(model, "test", "l2")
Function that allows for choosing tuning parameters from a grid in a deterministic or random way using cross-validation or validation data sets.
gpb.grid.search.tune.parameters(param_grid, num_try_random = NULL, data, gp_model = NULL, params = list(), nrounds = 1000L, early_stopping_rounds = NULL, folds = NULL, nfold = 5L, metric = NULL, verbose_eval = 1L, cv_seed = NULL, line_search_step_length = FALSE, use_gp_model_for_validation = TRUE, train_gp_model_cov_pars = TRUE, label = NULL, weight = NULL, obj = NULL, eval = NULL, stratified = TRUE, init_model = NULL, colnames = NULL, categorical_feature = NULL, callbacks = list(), return_all_combinations = FALSE, ...)
param_grid |
|
num_try_random |
|
data |
a |
gp_model |
A |
params |
|
nrounds |
number of boosting iterations (= number of trees). This is the most important tuning parameter for boosting |
early_stopping_rounds |
int. Activates early stopping. Requires at least one validation data
and one metric. When this parameter is non-null,
training will stop if the evaluation of any metric on any validation set
fails to improve for |
folds |
|
nfold |
the original dataset is randomly partitioned into |
metric |
Evaluation metric to be monitored when doing CV and parameter tuning.
Can be a |
verbose_eval |
|
cv_seed |
Seed for generating folds when doing |
line_search_step_length |
Boolean. If TRUE, a line search is done to find the optimal step length for every boosting update
(see, e.g., Friedman 2001). This is then multiplied by the |
use_gp_model_for_validation |
Boolean. If TRUE, the |
train_gp_model_cov_pars |
Boolean. If TRUE, the covariance parameters
of the |
label |
Vector of labels, used if |
weight |
vector of observation weights. If not NULL, will be set on the dataset |
obj |
(character) The distribution of the response variable (=label) conditional on fixed and random effects. This only needs to be set when doing independent boosting without random effects / Gaussian processes. |
eval |
Evaluation metric to be monitored when doing CV and parameter tuning. This can be a string, function, or list with a mixture of strings and functions.
|
stratified |
a |
init_model |
path of model file of |
colnames |
feature names, if not null, will use this to overwrite the names in dataset |
categorical_feature |
categorical features. This can either be a character vector of feature
names or an integer vector with the indices of the features (e.g.
|
callbacks |
List of callback functions that are applied at each iteration. |
return_all_combinations |
a |
... |
other parameters, see Parameters.rst for more information. |
A list with the best parameter combination and score.
The list has the following format:
list("best_params" = best_params, "best_iter" = best_iter, "best_score" = best_score)
If return_all_combinations is TRUE, then the list contains an additional entry 'all_combinations'
"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.
If multiple arguments are given to eval, their order will be preserved. If you enable early stopping by setting early_stopping_rounds in params, by default all metrics will be considered for early stopping. If you want to only consider the first metric for early stopping, pass first_metric_only = TRUE in params. Note that if you also specify metric in params, that metric will be considered the "first" one. If you omit metric, a default metric will be used based on your choice for the parameter obj (keyword argument) or objective (passed into params).
Fabio Sigrist
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples library(gpboost) data(GPBoost_data, package = "gpboost") n <- length(y) param_grid <- list("learning_rate" = c(0.001, 0.01, 0.1, 1, 10), "min_data_in_leaf" = c(1, 10, 100, 1000), "max_depth" = c(-1), "num_leaves" = 2^(1:10), "lambda_l2" = c(0, 1, 10, 100), "max_bin" = c(250, 500, 1000, min(n,10000)), "line_search_step_length" = c(TRUE, FALSE)) # Note: "max_depth" = c(-1) means no depth limit as we tune 'num_leaves'. # Can also additionally tune 'max_depth', e.g., "max_depth" = c(-1, 1, 2, 3, 5, 10) metric = "mse" # Define metric # Note: can also use metric = "test_neg_log_likelihood". # See https://github.com/fabsig/GPBoost/blob/master/docs/Parameters.rst#metric-parameters gp_model <- GPModel(group_data = group_data[,1], likelihood="gaussian") data_train <- gpb.Dataset(data = X, label = y) set.seed(1) opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, data = data_train, gp_model = gp_model, num_try_random = 100, nfold = 5, nrounds = 1000, early_stopping_rounds = 20, verbose_eval = 1, metric = metric, cv_seed = 4) print(paste0("Best parameters: ", paste0(unlist(lapply(seq_along(opt_params$best_params), function(y, n, i) { paste0(n[[i]],": ", y[[i]]) }, y=opt_params$best_params, n=names(opt_params$best_params))), collapse=", "))) print(paste0("Best number of iterations: ", opt_params$best_iter)) print(paste0("Best score: ", round(opt_params$best_score, digits=3))) # Alternatively and faster: using manually defined validation data instead of cross-validation # use 20% of the data as validation data valid_tune_idx <- sample.int(length(y), as.integer(0.2*length(y))) folds <- list(valid_tune_idx) opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, data = data_train, gp_model = gp_model, num_try_random = 100, folds = folds, nrounds = 1000, early_stopping_rounds = 20, verbose_eval = 1, metric = metric, cv_seed = 4)
Creates a data.table of feature importances in a model.
gpb.importance(model, percentage = TRUE)
model |
object of class |
percentage |
whether to show importance in relative percentage. |
For a tree model, a data.table with the following columns:
Feature
: Feature names in the model.
Gain
: The total gain of this feature's splits.
Cover
: The number of observations related to this feature.
Frequency
: The number of times a feature is used in tree splits.
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
params <- list(objective = "binary", learning_rate = 0.1, max_depth = -1L,
               min_data_in_leaf = 1L, min_sum_hessian_in_leaf = 1.0)
model <- gpb.train(params = params, data = dtrain, nrounds = 5L)
tree_imp1 <- gpb.importance(model, percentage = TRUE)
tree_imp2 <- gpb.importance(model, percentage = FALSE)
Computes feature contribution components of the raw score prediction.
gpb.interprete(model, data, idxset, num_iteration = NULL)
model |
object of class |
data |
a matrix object or a dgCMatrix object. |
idxset |
an integer vector of indices of rows needed. |
num_iteration |
number of iterations to predict with; NULL or <= 0 means use the best iteration. |
For regression, binary classification, and lambdarank models, a list of data.table objects with the following columns:
Feature
: Feature names in the model.
Contribution
: The total contribution of this feature's splits.
For multiclass classification, a list of data.table objects with the Feature column and Contribution columns for each class.
Logit <- function(x) log(x / (1.0 - x)) data(agaricus.train, package = "gpboost") train <- agaricus.train dtrain <- gpb.Dataset(train$data, label = train$label) setinfo(dtrain, "init_score", rep(Logit(mean(train$label)), length(train$label))) data(agaricus.test, package = "gpboost") test <- agaricus.test params <- list( objective = "binary" , learning_rate = 0.1 , max_depth = -1L , min_data_in_leaf = 1L , min_sum_hessian_in_leaf = 1.0 ) model <- gpb.train( params = params , data = dtrain , nrounds = 3L ) tree_interpretation <- gpb.interprete(model, test$data, 1L:5L)
Load a GPBoost model. Takes in either a file path or a model string. If both are provided, loading from file is used by default. Boosters with gp_models can only be loaded from a file.
gpb.load(filename = NULL, model_str = NULL)
filename |
path of model file |
model_str |
a str containing the model |
gpb.Booster
Fabio Sigrist, authors of the LightGBM R package
library(gpboost) data(GPBoost_data, package = "gpboost") # Train model and make prediction gp_model <- GPModel(group_data = group_data[,1], likelihood = "gaussian") bst <- gpboost(data = X, label = y, gp_model = gp_model, nrounds = 16, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 0) pred <- predict(bst, data = X_test, group_data_pred = group_data_test[,1], predict_var= TRUE, pred_latent = TRUE) # Save model to file filename <- tempfile(fileext = ".json") gpb.save(bst,filename = filename) # Load from file and make predictions again bst_loaded <- gpb.load(filename = filename) pred_loaded <- predict(bst_loaded, data = X_test, group_data_pred = group_data_test[,1], predict_var= TRUE, pred_latent = TRUE) # Check equality pred$fixed_effect - pred_loaded$fixed_effect pred$random_effect_mean - pred_loaded$random_effect_mean pred$random_effect_cov - pred_loaded$random_effect_cov
Parse a GPBoost model JSON dump into a data.table structure.
gpb.model.dt.tree(model, num_iteration = NULL)
model |
object of class |
num_iteration |
number of iterations you want to predict with. NULL or <= 0 means use best iteration |
A data.table with detailed information about the model's tree nodes and leaves. The columns of the data.table are:
tree_index
: ID of a tree in a model (integer)
split_index
: ID of a node in a tree (integer)
split_feature
: for a node, it's a feature name (character);
for a leaf, it simply labels it as "NA"
node_parent
: ID of the parent node for current node (integer)
leaf_index
: ID of a leaf in a tree (integer)
leaf_parent
: ID of the parent node for current leaf (integer)
split_gain
: Split gain of a node
threshold
: Splitting threshold value of a node
decision_type
: Decision type of a node
default_left
: Determines how NA values are handled, TRUE -> Left, FALSE -> Right
internal_value
: Node value
internal_count
: The number of observations collected by a node
leaf_value
: Leaf value
leaf_count
: The number of observations collected by a leaf
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
params <- list(objective = "binary", learning_rate = 0.01, num_leaves = 63L,
               max_depth = -1L, min_data_in_leaf = 1L, min_sum_hessian_in_leaf = 1.0)
model <- gpb.train(params, dtrain, 10L)
tree_dt <- gpb.model.dt.tree(model)
Plot previously calculated feature importance: Gain, Cover and Frequency, as a bar graph.
gpb.plot.importance(tree_imp, top_n = 10L, measure = "Gain", left_margin = 10L, cex = NULL, ...)
tree_imp |
a |
top_n |
maximal number of top features to include into the plot. |
measure |
the name of importance measure to plot, can be "Gain", "Cover" or "Frequency". |
left_margin |
(base R barplot) allows to adjust the left margin size to fit feature names. |
cex |
(base R barplot) passed as |
... |
other parameters passed to graphics::barplot |
The graph represents each feature as a horizontal bar whose length is proportional to the feature's importance. Features are shown ranked in decreasing order of importance.
The gpb.plot.importance
function creates a barplot
and silently returns a processed data.table with top_n
features sorted by defined importance.
data(agaricus.train, package = "gpboost") train <- agaricus.train dtrain <- gpb.Dataset(train$data, label = train$label) params <- list( objective = "binary" , learning_rate = 0.1 , min_data_in_leaf = 1L , min_sum_hessian_in_leaf = 1.0 ) model <- gpb.train( params = params , data = dtrain , nrounds = 5L ) tree_imp <- gpb.importance(model, percentage = TRUE) gpb.plot.importance(tree_imp, top_n = 5L, measure = "Gain")
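Since the processed data.table is returned silently, it can also be captured for further use. A minimal sketch (assuming the tree_imp object from the example above):
# assign the silently returned, processed importance table
imp_dt <- gpb.plot.importance(tree_imp, top_n = 5L, measure = "Cover")
print(imp_dt)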
Plot previously calculated feature contribution as a bar graph.
gpb.plot.interpretation(tree_interpretation_dt, top_n = 10L, cols = 1L, left_margin = 10L, cex = NULL)
tree_interpretation_dt |
a |
top_n |
maximal number of top features to include into the plot. |
cols |
the number of columns in the plot layout; used only for multiclass classification feature contributions. |
left_margin |
(base R barplot) allows adjusting the left margin size to fit feature names. |
cex |
(base R barplot) passed as |
The graph represents each feature as a horizontal bar whose length is proportional to the feature's contribution. Features are shown ranked in decreasing order of contribution.
The gpb.plot.interpretation
function creates a barplot
.
Logit <- function(x) { log(x / (1.0 - x)) } data(agaricus.train, package = "gpboost") labels <- agaricus.train$label dtrain <- gpb.Dataset( agaricus.train$data , label = labels ) setinfo(dtrain, "init_score", rep(Logit(mean(labels)), length(labels))) data(agaricus.test, package = "gpboost") params <- list( objective = "binary" , learning_rate = 0.1 , max_depth = -1L , min_data_in_leaf = 1L , min_sum_hessian_in_leaf = 1.0 ) model <- gpb.train( params = params , data = dtrain , nrounds = 5L ) tree_interpretation <- gpb.interprete( model = model , data = agaricus.test$data , idxset = 1L:5L ) gpb.plot.interpretation( tree_interpretation_dt = tree_interpretation[[1L]] , top_n = 3L )
Plot interaction partial dependence plots
gpb.plot.part.dep.interact(model, data, variables, n.pt.per.var = 20, subsample = pmin(1, n.pt.per.var^2 * 100/nrow(data)), discrete.variables = c(FALSE, FALSE), which.class = NULL, type = "filled.contour", nlevels = 20, xlab = variables[1], ylab = variables[2], zlab = "", main = "", return_plot_data = FALSE, ...)
model |
A |
data |
A |
variables |
A |
n.pt.per.var |
Number of grid points per variable (used only if a variable is not discrete). For continuous variables, the two-dimensional grid for the interaction plot has dimension c(n.pt.per.var, n.pt.per.var) |
subsample |
Fraction of random samples in |
discrete.variables |
A |
which.class |
An |
type |
A |
nlevels |
Parameter passed to the |
xlab |
Parameter passed to the |
ylab |
Parameter passed to the |
zlab |
Parameter passed to the |
main |
Parameter passed to the |
return_plot_data |
A |
... |
Additional parameters passed to the |
A list
with three entries for creating the partial dependence plot:
the first two entries are vector
s with x and y coordinates.
The third is a two-dimensional matrix
of dimension c(length(x), length(y))
with z-coordinates. This is only returned if return_plot_data==TRUE
Fabio Sigrist
library(gpboost) data(GPBoost_data, package = "gpboost") gp_model <- GPModel(group_data = group_data[,1], likelihood = "gaussian") gpboost_model <- gpboost(data = X, label = y, gp_model = gp_model, nrounds = 16, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 0) gpb.plot.part.dep.interact(gpboost_model, X, variables = c(1,2))
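If the underlying plot data are needed for a customized figure, they can be requested with return_plot_data = TRUE. A minimal sketch (assuming the gpboost_model object fitted in the example above; accessing the list entries by position follows the ordering described in the value section and is an assumption):
pd <- gpb.plot.part.dep.interact(gpboost_model, X, variables = c(1,2), return_plot_data = TRUE)
x_grid <- pd[[1]]  # grid values for the first variable
y_grid <- pd[[2]]  # grid values for the second variable
z_mat  <- pd[[3]]  # matrix of partial dependence values
# re-draw the interaction plot with a custom color palette
filled.contour(x_grid, y_grid, z_mat, color.palette = terrain.colors, xlab = "Variable 1", ylab = "Variable 2")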
Plot partial dependence plots
gpb.plot.partial.dependence(model, data, variable, n.pt = 100, subsample = pmin(1, n.pt * 100/nrow(data)), discrete.x = FALSE, which.class = NULL, xlab = deparse(substitute(variable)), ylab = "", type = if (discrete.x) "p" else "b", main = "", return_plot_data = FALSE, ...)
model |
A |
data |
A |
variable |
A |
n.pt |
Evaluation grid size (used only if x is not discrete) |
subsample |
Fraction of random samples in |
discrete.x |
A |
which.class |
An |
xlab |
Parameter passed to |
ylab |
Parameter passed to |
type |
Parameter passed to |
main |
Parameter passed to |
return_plot_data |
A |
... |
Additional parameters passed to |
A two-dimensional matrix
with data for creating the partial dependence plot.
This is only returned if return_plot_data==TRUE
Fabio Sigrist (adapted from a version by Michael Mayer)
library(gpboost) data(GPBoost_data, package = "gpboost") gp_model <- GPModel(group_data = group_data[,1], likelihood = "gaussian") gpboost_model <- gpboost(data = X, label = y, gp_model = gp_model, nrounds = 16, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 0) gpb.plot.partial.dependence(gpboost_model, X, variable = 1)
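Analogously, the plot data can be retrieved with return_plot_data = TRUE and plotted manually. A minimal sketch (assuming the gpboost_model object fitted in the example above; treating the first column as the grid values and the second column as the corresponding partial dependence values is an assumption):
pd <- gpb.plot.partial.dependence(gpboost_model, X, variable = 1, return_plot_data = TRUE)
plot(pd[, 1], pd[, 2], type = "l", xlab = "Variable 1", ylab = "Partial dependence")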
Save GPBoost model
gpb.save(booster, filename, start_iteration = NULL, num_iteration = NULL, save_raw_data = FALSE, ...)
booster |
Object of class |
filename |
saved filename |
start_iteration |
int or NULL, optional (default=NULL) Start index of the iteration to predict. If NULL or <= 0, starts from the first iteration. |
num_iteration |
int or NULL, optional (default=NULL) Limit number of iterations in the prediction. If NULL, if the best iteration exists and start_iteration is NULL or <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used. If <= 0, all iterations from start_iteration are used (no limits). |
save_raw_data |
If TRUE, the raw data (predictor / covariate data) for the Booster is also saved.
Enable this option if you want to change |
... |
Additional named arguments passed to the |
gpb.Booster
Fabio Sigrist, authors of the LightGBM R package
library(gpboost) data(GPBoost_data, package = "gpboost") # Train model and make prediction gp_model <- GPModel(group_data = group_data[,1], likelihood = "gaussian") bst <- gpboost(data = X, label = y, gp_model = gp_model, nrounds = 16, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 0) pred <- predict(bst, data = X_test, group_data_pred = group_data_test[,1], predict_var= TRUE, pred_latent = TRUE) # Save model to file filename <- tempfile(fileext = ".json") gpb.save(bst,filename = filename) # Load from file and make predictions again bst_loaded <- gpb.load(filename = filename) pred_loaded <- predict(bst_loaded, data = X_test, group_data_pred = group_data_test[,1], predict_var= TRUE, pred_latent = TRUE) # Check equality pred$fixed_effect - pred_loaded$fixed_effect pred$random_effect_mean - pred_loaded$random_effect_mean pred$random_effect_cov - pred_loaded$random_effect_cov
Logic to train with GPBoost
gpb.train(params = list(), data, nrounds = 100L, gp_model = NULL, use_gp_model_for_validation = TRUE, train_gp_model_cov_pars = TRUE, valids = list(), obj = NULL, eval = NULL, verbose = 1L, record = TRUE, eval_freq = 1L, init_model = NULL, colnames = NULL, categorical_feature = NULL, early_stopping_rounds = NULL, callbacks = list(), reset_data = FALSE, ...)
params |
list of "tuning" parameters. See the parameter documentation for more information. A few key parameters:
|
data |
a |
nrounds |
number of boosting iterations (= number of trees). This is the most important tuning parameter for boosting |
gp_model |
A |
use_gp_model_for_validation |
Boolean. If TRUE, the |
train_gp_model_cov_pars |
Boolean. If TRUE, the covariance parameters
of the |
valids |
a list of |
obj |
(character) The distribution of the response variable (=label) conditional on fixed and random effects. This only needs to be set when doing independent boosting without random effects / Gaussian processes. |
eval |
Evaluation metric to be monitored when doing CV and parameter tuning. This can be a string, function, or list with a mixture of strings and functions.
|
verbose |
verbosity of output; if <= 0, printing of evaluation results during training is also disabled |
record |
Boolean, TRUE will record iteration message to |
eval_freq |
evaluation output frequency; only has an effect when verbose > 0 |
init_model |
path of model file of |
colnames |
feature names; if not NULL, these are used to overwrite the names in the dataset |
categorical_feature |
categorical features. This can either be a character vector of feature
names or an integer vector with the indices of the features (e.g.
|
early_stopping_rounds |
int. Activates early stopping. Requires at least one validation data set
and one metric. When this parameter is non-null,
training will stop if the evaluation of any metric on any validation set
fails to improve for |
callbacks |
List of callback functions that are applied at each iteration. |
reset_data |
Boolean. Setting it to TRUE (not the default value) transforms the booster model into a predictor model, which frees up memory and the original datasets |
... |
other parameters, see the parameter documentation for more information. |
a trained booster model gpb.Booster
.
"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.
If multiple arguments are given to eval
, their order will be preserved. If you enable
early stopping by setting early_stopping_rounds
in params
, by default all
metrics will be considered for early stopping.
If you want to only consider the first metric for early stopping, pass
first_metric_only = TRUE
in params
. Note that if you also specify metric
in params
, that metric will be considered the "first" one. If you omit metric
,
a default metric will be used based on your choice for the parameter obj
(keyword argument)
or objective
(passed into params
).
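For illustration, a minimal sketch of early stopping based only on the first metric (this is not one of the package's own examples; it assumes the dtrain, valids, and gp_model objects created in the "Using validation data" example below):
params <- list(learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5,
               first_metric_only = TRUE)  # only the first metric triggers early stopping
bst <- gpb.train(params = params, data = dtrain, gp_model = gp_model, nrounds = 100,
                 valids = valids, eval = list("l2", "l1"),  # "l2" is considered the first metric
                 early_stopping_rounds = 10, use_gp_model_for_validation = FALSE, verbose = 0)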
Fabio Sigrist, authors of the LightGBM R package
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples library(gpboost) data(GPBoost_data, package = "gpboost") #--------------------Combine tree-boosting and grouped random effects model---------------- # Create random effects model gp_model <- GPModel(group_data = group_data[,1], likelihood = "gaussian") # The default optimizer for covariance parameters (hyperparameters) is # Nesterov-accelerated gradient descent. # This can be changed to, e.g., Nelder-Mead as follows: # re_params <- list(optimizer_cov = "nelder_mead") # gp_model$set_optim_params(params=re_params) # Use trace = TRUE to monitor convergence: # re_params <- list(trace = TRUE) # gp_model$set_optim_params(params=re_params) dtrain <- gpb.Dataset(data = X, label = y) # Train model bst <- gpb.train(data = dtrain, gp_model = gp_model, nrounds = 16, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 0) # Estimated random effects model summary(gp_model) # Make predictions pred <- predict(bst, data = X_test, group_data_pred = group_data_test[,1], predict_var= TRUE) pred$random_effect_mean # Predicted mean pred$random_effect_cov # Predicted variances pred$fixed_effect # Predicted fixed effect from tree ensemble # Sum them up to obtain a single prediction pred$random_effect_mean + pred$fixed_effect #--------------------Combine tree-boosting and Gaussian process model---------------- # Create Gaussian process model gp_model <- GPModel(gp_coords = coords, cov_function = "exponential", likelihood = "gaussian") # Train model dtrain <- gpb.Dataset(data = X, label = y) bst <- gpb.train(data = dtrain, gp_model = gp_model, nrounds = 16, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 0) # Estimated random effects model summary(gp_model) # Make predictions pred <- predict(bst, data = X_test, gp_coords_pred = coords_test, predict_cov_mat =TRUE) pred$random_effect_mean # Predicted (posterior) mean of GP pred$random_effect_cov # Predicted (posterior) covariance matrix of GP pred$fixed_effect # Predicted fixed effect from tree ensemble # Sum them up to obtain a single prediction pred$random_effect_mean + pred$fixed_effect #--------------------Using validation data------------------------- set.seed(1) train_ind <- sample.int(length(y),size=250) dtrain <- gpb.Dataset(data = X[train_ind,], label = y[train_ind]) dtest <- gpb.Dataset.create.valid(dtrain, data = X[-train_ind,], label = y[-train_ind]) valids <- list(test = dtest) gp_model <- GPModel(group_data = group_data[train_ind,1], likelihood="gaussian") # Need to set prediction data for gp_model gp_model$set_prediction_data(group_data_pred = group_data[-train_ind,1]) # Training with validation data and use_gp_model_for_validation = TRUE bst <- gpb.train(data = dtrain, gp_model = gp_model, nrounds = 100, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 1, valids = valids, early_stopping_rounds = 10, use_gp_model_for_validation = TRUE) print(paste0("Optimal number of iterations: ", bst$best_iter, ", best test error: ", bst$best_score)) # Plot validation error val_error <- unlist(bst$record_evals$test$l2$eval) plot(1:length(val_error), val_error, type="l", lwd=2, col="blue", xlab="iteration", ylab="Validation error", main="Validation error vs. boosting iteration")
#--------------------Do Newton updates for tree leaves--------------- # Note: run the above examples first bst <- gpb.train(data = dtrain, gp_model = gp_model, nrounds = 100, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 1, valids = valids, early_stopping_rounds = 5, use_gp_model_for_validation = FALSE, leaves_newton_update = TRUE) print(paste0("Optimal number of iterations: ", bst$best_iter, ", best test error: ", bst$best_score)) # Plot validation error val_error <- unlist(bst$record_evals$test$l2$eval) plot(1:length(val_error), val_error, type="l", lwd=2, col="blue", xlab="iteration", ylab="Validation error", main="Validation error vs. boosting iteration") #--------------------GPBoostOOS algorithm: GP parameters estimated out-of-sample---------------- # Create random effects model and dataset gp_model <- GPModel(group_data = group_data[,1], likelihood="gaussian") dtrain <- gpb.Dataset(X, label = y) params <- list(learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5) # Stage 1: run cross-validation to (i) determine the optimal number of iterations # and (ii) to estimate the GPModel on the out-of-sample data cvbst <- gpb.cv(params = params, data = dtrain, gp_model = gp_model, nrounds = 100, nfold = 4, eval = "l2", early_stopping_rounds = 5, use_gp_model_for_validation = TRUE, fit_GP_cov_pars_OOS = TRUE) print(paste0("Optimal number of iterations: ", cvbst$best_iter)) # Estimated random effects model # Note: ideally, one would have to find the optimal combination of # other tuning parameters such as the learning rate, tree depth, etc. summary(gp_model) # Stage 2: Train tree-boosting model while holding the GPModel fixed bst <- gpb.train(data = dtrain, gp_model = gp_model, nrounds = cvbst$best_iter, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 0, train_gp_model_cov_pars = FALSE) # The GPModel has not changed: summary(gp_model)
Simple interface for training a GPBoost model.
gpboost(data, label = NULL, weight = NULL, params = list(), nrounds = 100L, gp_model = NULL, line_search_step_length = FALSE, use_gp_model_for_validation = TRUE, train_gp_model_cov_pars = TRUE, valids = list(), obj = NULL, eval = NULL, verbose = 1L, record = TRUE, eval_freq = 1L, early_stopping_rounds = NULL, init_model = NULL, colnames = NULL, categorical_feature = NULL, callbacks = list(), ...)
data |
a |
label |
Vector of response values / labels, used if |
weight |
Vector of weights. The GPBoost algorithm currently does not support weights |
params |
list of "tuning" parameters. See the parameter documentation for more information. A few key parameters:
|
nrounds |
number of boosting iterations (= number of trees). This is the most important tuning parameter for boosting |
gp_model |
A |
line_search_step_length |
Boolean. If TRUE, a line search is done to find the optimal step length for every boosting update
(see, e.g., Friedman 2001). This is then multiplied by the |
use_gp_model_for_validation |
Boolean. If TRUE, the |
train_gp_model_cov_pars |
Boolean. If TRUE, the covariance parameters
of the |
valids |
a list of |
obj |
(character) The distribution of the response variable (=label) conditional on fixed and random effects. This only needs to be set when doing independent boosting without random effects / Gaussian processes. |
eval |
Evaluation metric to be monitored when doing CV and parameter tuning. This can be a string, function, or list with a mixture of strings and functions.
|
verbose |
verbosity of output; if <= 0, printing of evaluation results during training is also disabled |
record |
Boolean, TRUE will record iteration message to |
eval_freq |
evaluation output frequency; only has an effect when verbose > 0 |
early_stopping_rounds |
int. Activates early stopping. Requires at least one validation data set
and one metric. When this parameter is non-null,
training will stop if the evaluation of any metric on any validation set
fails to improve for |
init_model |
path of model file of |
colnames |
feature names; if not NULL, these are used to overwrite the names in the dataset |
categorical_feature |
categorical features. This can either be a character vector of feature
names or an integer vector with the indices of the features (e.g.
|
callbacks |
List of callback functions that are applied at each iteration. |
... |
Additional arguments passed to
|
a trained gpb.Booster
"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.
If multiple arguments are given to eval
, their order will be preserved. If you enable
early stopping by setting early_stopping_rounds
in params
, by default all
metrics will be considered for early stopping.
If you want to only consider the first metric for early stopping, pass
first_metric_only = TRUE
in params
. Note that if you also specify metric
in params
, that metric will be considered the "first" one. If you omit metric
,
a default metric will be used based on your choice for the parameter obj
(keyword argument)
or objective
(passed into params
).
Fabio Sigrist, authors of the LightGBM R package
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples library(gpboost) data(GPBoost_data, package = "gpboost") #--------------------Combine tree-boosting and grouped random effects model---------------- # Create random effects model gp_model <- GPModel(group_data = group_data[,1], likelihood = "gaussian") # The default optimizer for covariance parameters (hyperparameters) is # Nesterov-accelerated gradient descent. # This can be changed to, e.g., Nelder-Mead as follows: # re_params <- list(optimizer_cov = "nelder_mead") # gp_model$set_optim_params(params=re_params) # Use trace = TRUE to monitor convergence: # re_params <- list(trace = TRUE) # gp_model$set_optim_params(params=re_params) # Train model bst <- gpboost(data = X, label = y, gp_model = gp_model, nrounds = 16, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 0) # Estimated random effects model summary(gp_model) # Make predictions # Predict latent variables pred <- predict(bst, data = X_test, group_data_pred = group_data_test[,1], predict_var = TRUE, pred_latent = TRUE) pred$random_effect_mean # Predicted latent random effects mean pred$random_effect_cov # Predicted random effects variances pred$fixed_effect # Predicted fixed effects from tree ensemble # Predict response variable pred_resp <- predict(bst, data = X_test, group_data_pred = group_data_test[,1], predict_var = TRUE, pred_latent = FALSE) pred_resp$response_mean # Predicted response mean # For Gaussian data: pred$random_effect_mean + pred$fixed_effect = pred_resp$response_mean pred$random_effect_mean + pred$fixed_effect - pred_resp$response_mean #--------------------Combine tree-boosting and Gaussian process model---------------- # Create Gaussian process model gp_model <- GPModel(gp_coords = coords, cov_function = "exponential", likelihood = "gaussian") # Train model bst <- gpboost(data = X, label = y, gp_model = gp_model, nrounds = 8, learning_rate = 0.1, max_depth = 6, min_data_in_leaf = 5, verbose = 0) # Estimated random effects model summary(gp_model) # Make predictions pred <- predict(bst, data = X_test, gp_coords_pred = coords_test, predict_var = TRUE, pred_latent = TRUE) pred$random_effect_mean # Predicted latent random effects mean pred$random_effect_cov # Predicted random effects variances pred$fixed_effect # Predicted fixed effects from tree ensemble # Predict response variable pred_resp <- predict(bst, data = X_test, gp_coords_pred = coords_test, predict_var = TRUE, pred_latent = FALSE) pred_resp$response_mean # Predicted response mean
Simulated example data for the GPBoost package. This data set includes the following fields:
y
: response variable
X
: a matrix with covariate information
group_data
: a matrix with categorical grouping variables
coords
: a matrix with spatial coordinates
X_test
: a matrix with covariate information for predictions
group_data_test
: a matrix with categorical grouping variables for predictions
coords_test
: a matrix with spatial coordinates for predictions
data(GPBoost_data)
Create a GPModel object
Create a GPModel which contains a Gaussian process and / or mixed effects model with grouped random effects
GPModel(likelihood = "gaussian", group_data = NULL, group_rand_coef_data = NULL, ind_effect_group_rand_coef = NULL, drop_intercept_group_rand_effect = NULL, gp_coords = NULL, gp_rand_coef_data = NULL, cov_function = "matern", cov_fct_shape = 1.5, gp_approx = "none", cov_fct_taper_range = 1, cov_fct_taper_shape = 1, num_neighbors = 20L, vecchia_ordering = "random", ind_points_selection = "kmeans++", num_ind_points = 500L, cover_tree_radius = 1, matrix_inversion_method = "cholesky", seed = 0L, cluster_ids = NULL, free_raw_data = FALSE, vecchia_approx = NULL, vecchia_pred_type = NULL, num_neighbors_pred = NULL, likelihood_additional_param = 1)
likelihood |
A
|
group_data |
A |
group_rand_coef_data |
A |
ind_effect_group_rand_coef |
A |
drop_intercept_group_rand_effect |
A |
gp_coords |
A |
gp_rand_coef_data |
A |
cov_function |
A
|
cov_fct_shape |
A |
gp_approx |
A
|
cov_fct_taper_range |
A |
cov_fct_taper_shape |
A |
num_neighbors |
An |
vecchia_ordering |
A
|
ind_points_selection |
A
|
num_ind_points |
An |
cover_tree_radius |
A |
matrix_inversion_method |
A
|
seed |
An |
cluster_ids |
A |
free_raw_data |
A |
vecchia_approx |
Discontinued. Use the argument |
vecchia_pred_type |
A |
num_neighbors_pred |
an |
likelihood_additional_param |
A |
A GPModel
containing a Gaussian process and / or mixed effects model with grouped random effects
Fabio Sigrist
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples data(GPBoost_data, package = "gpboost") #--------------------Grouped random effects model: single-level random effect---------------- gp_model <- GPModel(group_data = group_data[,1], likelihood="gaussian") #--------------------Gaussian process model---------------- gp_model <- GPModel(gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5, likelihood="gaussian") #--------------------Combine Gaussian process with grouped random effects---------------- gp_model <- GPModel(group_data = group_data, gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5, likelihood="gaussian")
A matrix with categorical grouping variables for the example data of the GPBoost package
data(GPBoost_data)
A matrix with categorical grouping variables for predictions for the example data of the GPBoost package
data(GPBoost_data)
Load a GPModel from a file
loadGPModel(filename)
filename |
filename for loading |
A GPModel
Fabio Sigrist
data(GPBoost_data, package = "gpboost") # Add intercept column X1 <- cbind(rep(1,dim(X)[1]),X) X_test1 <- cbind(rep(1,dim(X_test)[1]),X_test) gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1, likelihood="gaussian") pred <- predict(gp_model, group_data_pred = group_data_test[,1], X_pred = X_test1, predict_var = TRUE) # Save model to file filename <- tempfile(fileext = ".json") saveGPModel(gp_model,filename = filename) # Load from file and make predictions again gp_model_loaded <- loadGPModel(filename = filename) pred_loaded <- predict(gp_model_loaded, group_data_pred = group_data_test[,1], X_pred = X_test1, predict_var = TRUE) # Check equality pred$mu - pred_loaded$mu pred$var - pred_loaded$var
Evaluate the negative log-likelihood. If there is a linear fixed effects predictor term, this needs to be calculated "manually" prior to calling this function (see example below)
neg_log_likelihood(gp_model, cov_pars, y, fixed_effects = NULL, aux_pars = NULL)
gp_model |
A |
cov_pars |
A |
y |
A |
fixed_effects |
A |
aux_pars |
A |
Fabio Sigrist
data(GPBoost_data, package = "gpboost") gp_model <- GPModel(group_data = group_data, likelihood="gaussian") X1 <- cbind(rep(1,dim(X)[1]), X) coef <- c(0.1, 0.1, 0.1) fixed_effects <- as.numeric(X1 %*% coef) neg_log_likelihood(gp_model, y = y, cov_pars = c(0.1,1,1), fixed_effects = fixed_effects)
Evaluate the negative log-likelihood. If there is a linear fixed effects predictor term, this needs to be calculated "manually" prior to calling this function (see example below)
## S3 method for class 'GPModel' neg_log_likelihood(gp_model, cov_pars, y, fixed_effects = NULL, aux_pars = NULL)
gp_model |
A |
cov_pars |
A |
y |
A |
fixed_effects |
A |
aux_pars |
A |
A GPModel
Fabio Sigrist
data(GPBoost_data, package = "gpboost") gp_model <- GPModel(group_data = group_data, likelihood="gaussian") X1 <- cbind(rep(1,dim(X)[1]), X) coef <- c(0.1, 0.1, 0.1) fixed_effects <- as.numeric(X1 %*% coef) neg_log_likelihood(gp_model, y = y, cov_pars = c(0.1,1,1), fixed_effects = fixed_effects)
Predict ("estimate") training data random effects for a GPModel
predict_training_data_random_effects(gp_model, predict_var = FALSE)
gp_model |
A |
predict_var |
A |
A GPModel
Fabio Sigrist
data(GPBoost_data, package = "gpboost") # Add intercept column X1 <- cbind(rep(1,dim(X)[1]),X) X_test1 <- cbind(rep(1,dim(X_test)[1]),X_test) gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1, likelihood="gaussian") all_training_data_random_effects <- predict_training_data_random_effects(gp_model) first_occurences <- match(unique(group_data[,1]), group_data[,1]) unique_training_data_random_effects <- all_training_data_random_effects[first_occurences] head(unique_training_data_random_effects)
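Predictive variances can be requested in addition to the means. A minimal sketch (assuming the gp_model fitted in the example above):
# also return predictive variances of the training data random effects
re_pred <- predict_training_data_random_effects(gp_model, predict_var = TRUE)
head(re_pred)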
Predict ("estimate") training data random effects for a GPModel
## S3 method for class 'GPModel' predict_training_data_random_effects(gp_model, predict_var = FALSE)
gp_model |
A |
predict_var |
A |
A GPModel
Fabio Sigrist
data(GPBoost_data, package = "gpboost") # Add intercept column X1 <- cbind(rep(1,dim(X)[1]),X) X_test1 <- cbind(rep(1,dim(X_test)[1]),X_test) gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1, likelihood="gaussian") all_training_data_random_effects <- predict_training_data_random_effects(gp_model) first_occurences <- match(unique(group_data[,1]), group_data[,1]) unique_training_data_random_effects <- all_training_data_random_effects[first_occurences] head(unique_training_data_random_effects)
Prediction function for gpb.Booster objects
## S3 method for class 'gpb.Booster' predict(object, data, start_iteration = NULL, num_iteration = NULL, pred_latent = FALSE, predleaf = FALSE, predcontrib = FALSE, header = FALSE, reshape = FALSE, group_data_pred = NULL, group_rand_coef_data_pred = NULL, gp_coords_pred = NULL, gp_rand_coef_data_pred = NULL, cluster_ids_pred = NULL, predict_cov_mat = FALSE, predict_var = FALSE, cov_pars = NULL, ignore_gp_model = FALSE, rawscore = NULL, vecchia_pred_type = NULL, num_neighbors_pred = NULL, ...)
object |
Object of class |
data |
a |
start_iteration |
int or NULL, optional (default=NULL) Start index of the iteration to predict. If NULL or <= 0, starts from the first iteration. |
num_iteration |
int or NULL, optional (default=NULL) Limit number of iterations in the prediction. If NULL, if the best iteration exists and start_iteration is NULL or <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used. If <= 0, all iterations from start_iteration are used (no limits). |
pred_latent |
If TRUE, latent variables, both fixed effects (tree-ensemble)
and random effects ( |
predleaf |
whether predict leaf index instead. |
predcontrib |
return per-feature contributions for each record. |
header |
only used for prediction for text file. True if text file has header |
reshape |
whether to reshape the vector of predictions to a matrix form when there are several prediction outputs per case. |
group_data_pred |
A |
group_rand_coef_data_pred |
A |
gp_coords_pred |
A |
gp_rand_coef_data_pred |
A |
cluster_ids_pred |
A |
predict_cov_mat |
A |
predict_var |
A |
cov_pars |
A |
ignore_gp_model |
A |
rawscore |
This is discontinued. Use the renamed equivalent argument
|
vecchia_pred_type |
A |
num_neighbors_pred |
an |
... |
Additional named arguments passed to the |
either a list with vectors or a single vector / matrix depending on
whether there is a gp_model
or not.
If there is a gp_model
, the returned list contains the following entries.
1. If pred_latent
is TRUE, the list contains the following 3 entries:
- result["fixed_effect"] are the predictions from the tree-ensemble.
- result["random_effect_mean"] are the predicted means of the gp_model
.
- result["random_effect_cov"] are the predicted covariances or variances of the gp_model
(only if 'predict_var' or 'predict_cov' is TRUE).
2. If pred_latent
is FALSE, the list contains the following 2 entries:
- result["response_mean"] are the predicted means of the response variable (Label) taking into account
both the fixed effects (tree-ensemble) and the random effects (gp_model
)
- result["response_var"] are the predicted covariances or variances of the response variable
(only if 'predict_var' or 'predict_cov' is TRUE)
If there is no gp_model
or predcontrib
or ignore_gp_model
are TRUE, the result contains predictions from the tree-booster only.
Fabio Sigrist, authors of the LightGBM R package
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples library(gpboost) data(GPBoost_data, package = "gpboost") #--------------------Combine tree-boosting and grouped random effects model---------------- # Create random effects model gp_model <- GPModel(group_data = group_data[,1], likelihood = "gaussian") # The default optimizer for covariance parameters (hyperparameters) is # Nesterov-accelerated gradient descent. # This can be changed to, e.g., Nelder-Mead as follows: # re_params <- list(optimizer_cov = "nelder_mead") # gp_model$set_optim_params(params=re_params) # Use trace = TRUE to monitor convergence: # re_params <- list(trace = TRUE) # gp_model$set_optim_params(params=re_params) # Train model bst <- gpboost(data = X, label = y, gp_model = gp_model, nrounds = 16, learning_rate = 0.05, max_depth = 6, min_data_in_leaf = 5, verbose = 0) # Estimated random effects model summary(gp_model) # Make predictions # Predict latent variables pred <- predict(bst, data = X_test, group_data_pred = group_data_test[,1], predict_var = TRUE, pred_latent = TRUE) pred$random_effect_mean # Predicted latent random effects mean pred$random_effect_cov # Predicted random effects variances pred$fixed_effect # Predicted fixed effects from tree ensemble # Predict response variable pred_resp <- predict(bst, data = X_test, group_data_pred = group_data_test[,1], predict_var = TRUE, pred_latent = FALSE) pred_resp$response_mean # Predicted response mean # For Gaussian data: pred$random_effect_mean + pred$fixed_effect = pred_resp$response_mean pred$random_effect_mean + pred$fixed_effect - pred_resp$response_mean #--------------------Combine tree-boosting and Gaussian process model---------------- # Create Gaussian process model gp_model <- GPModel(gp_coords = coords, cov_function = "exponential", likelihood = "gaussian") # Train model bst <- gpboost(data = X, label = y, gp_model = gp_model, nrounds = 8, learning_rate = 0.1, max_depth = 6, min_data_in_leaf = 5, verbose = 0) # Estimated random effects model summary(gp_model) # Make predictions pred <- predict(bst, data = X_test, gp_coords_pred = coords_test, predict_var = TRUE, pred_latent = TRUE) pred$random_effect_mean # Predicted latent random effects mean pred$random_effect_cov # Predicted random effects variances pred$fixed_effect # Predicted fixed effects from tree ensemble # Predict response variable pred_resp <- predict(bst, data = X_test, gp_coords_pred = coords_test, predict_var = TRUE, pred_latent = FALSE) pred_resp$response_mean # Predicted response mean
Make predictions for a GPModel
## S3 method for class 'GPModel' predict(object, y = NULL, group_data_pred = NULL, group_rand_coef_data_pred = NULL, gp_coords_pred = NULL, gp_rand_coef_data_pred = NULL, cluster_ids_pred = NULL, predict_cov_mat = FALSE, predict_var = FALSE, cov_pars = NULL, X_pred = NULL, use_saved_data = FALSE, predict_response = TRUE, offset = NULL, offset_pred = NULL, fixed_effects = NULL, fixed_effects_pred = NULL, vecchia_pred_type = NULL, num_neighbors_pred = NULL, ...)
object |
a |
y |
Observed data (can be NULL, e.g. when the model has been estimated already and the same data is used for making predictions) |
group_data_pred |
A |
group_rand_coef_data_pred |
A |
gp_coords_pred |
A |
gp_rand_coef_data_pred |
A |
cluster_ids_pred |
A |
predict_cov_mat |
A |
predict_var |
A |
cov_pars |
A |
X_pred |
A |
use_saved_data |
A |
predict_response |
A |
offset |
A |
offset_pred |
A |
fixed_effects |
This is discontinued. Use the renamed equivalent argument |
fixed_effects_pred |
This is discontinued. Use the renamed equivalent argument |
vecchia_pred_type |
A |
num_neighbors_pred |
an |
... |
(not used; included only so that there is no CRAN warning) |
Predictions from a GPModel
. A list with three entries is returned:
"mu" (first entry): predictive (=posterior) mean. For (generalized) linear mixed effects models, i.e., models with a linear regression term, this consists of the sum of fixed effects and random effects predictions
"cov" (second entry): predictive (=posterior) covariance matrix. This is NULL if 'predict_cov_mat=FALSE'
"var" (third entry) : predictive (=posterior) variances. This is NULL if 'predict_var=FALSE'
Fabio Sigrist
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples data(GPBoost_data, package = "gpboost") # Add intercept column X1 <- cbind(rep(1,dim(X)[1]),X) X_test1 <- cbind(rep(1,dim(X_test)[1]),X_test) #--------------------Grouped random effects model: single-level random effect---------------- gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1, likelihood="gaussian", params = list(std_dev = TRUE)) summary(gp_model) # Make predictions pred <- predict(gp_model, group_data_pred = group_data_test[,1], X_pred = X_test1, predict_var = TRUE) pred$mu # Predicted mean pred$var # Predicted variances # Also predict covariance matrix pred <- predict(gp_model, group_data_pred = group_data_test[,1], X_pred = X_test1, predict_cov_mat = TRUE) pred$mu # Predicted mean pred$cov # Predicted covariance #--------------------Gaussian process model---------------- gp_model <- fitGPModel(gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5, likelihood="gaussian", y = y, X = X1, params = list(std_dev = TRUE)) summary(gp_model) # Make predictions pred <- predict(gp_model, gp_coords_pred = coords_test, X_pred = X_test1, predict_cov_mat = TRUE) pred$mu # Predicted (posterior) mean of GP pred$cov # Predicted (posterior) covariance matrix of GP
readRDS for gpb.Booster models: attempts to load a model stored in a .rds file, using readRDS
readRDS.gpb.Booster(file, refhook = NULL)
file |
a connection or the name of the file where the R object is saved to or read from. |
refhook |
a hook function for handling reference objects. |
gpb.Booster
library(gpboost)
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "gpboost")
test <- agaricus.test
dtest <- gpb.Dataset.create.valid(dtrain, test$data, label = test$label)
params <- list(objective = "regression", metric = "l2")
valids <- list(test = dtest)
model <- gpb.train(
  params = params
  , data = dtrain
  , nrounds = 10L
  , valids = valids
  , min_data = 1L
  , learning_rate = 1.0
  , early_stopping_rounds = 5L
)
model_file <- tempfile(fileext = ".rds")
saveRDS.gpb.Booster(model, model_file)
new_model <- readRDS.gpb.Booster(model_file)
Save a GPModel
saveGPModel(gp_model, filename)
gp_model |
a GPModel |
filename |
filename for saving |
A GPModel
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
# Add intercept column
X1 <- cbind(rep(1, dim(X)[1]), X)
X_test1 <- cbind(rep(1, dim(X_test)[1]), X_test)
gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1, likelihood = "gaussian")
pred <- predict(gp_model, group_data_pred = group_data_test[,1],
                X_pred = X_test1, predict_var = TRUE)
# Save model to file
filename <- tempfile(fileext = ".json")
saveGPModel(gp_model, filename = filename)
# Load from file and make predictions again
gp_model_loaded <- loadGPModel(filename = filename)
pred_loaded <- predict(gp_model_loaded, group_data_pred = group_data_test[,1],
                       X_pred = X_test1, predict_var = TRUE)
# Check equality
pred$mu - pred_loaded$mu
pred$var - pred_loaded$var
saveRDS for gpb.Booster models
Attempts to save a model using RDS. Has an additional parameter (raw) which decides whether to save the raw model or not.
saveRDS.gpb.Booster(object, file, ascii = FALSE, version = NULL,
                    compress = TRUE, refhook = NULL, raw = TRUE)
object |
R object to serialize. |
file |
a connection or the name of the file where the R object is saved to or read from. |
ascii |
a logical. If TRUE or NA, an ASCII representation is written; otherwise (default), a binary one is used. See the comments in the help for save. |
version |
the workspace format version to use. |
compress |
a logical specifying whether saving to a named file is to use "gzip" compression,
or one of "gzip", "bzip2" or "xz" to indicate the type of compression to be used. Ignored if file is a connection. |
refhook |
a hook function for handling reference objects. |
raw |
whether to save the model in a raw variable or not, recommended to leave it to TRUE |
NULL invisibly.
library(gpboost)
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "gpboost")
test <- agaricus.test
dtest <- gpb.Dataset.create.valid(dtrain, test$data, label = test$label)
params <- list(objective = "regression", metric = "l2")
valids <- list(test = dtest)
model <- gpb.train(
  params = params
  , data = dtrain
  , nrounds = 10L
  , valids = valids
  , min_data = 1L
  , learning_rate = 1.0
  , early_stopping_rounds = 5L
)
model_file <- tempfile(fileext = ".rds")
saveRDS.gpb.Booster(model, model_file)
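As a brief follow-up sketch (not part of the original examples, assuming the objects created above): the saved Booster can be restored with readRDS.gpb.Booster and used for prediction as usual.

# Sketch (assumes model_file and test from the example above):
restored_model <- readRDS.gpb.Booster(model_file)
preds <- predict(restored_model, test$data)
head(preds)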
Set parameters for optimization of the covariance parameters of a GPModel
set_optim_params(gp_model, params = list())
gp_model |
A GPModel |
params |
A list with parameters for the optimization of the covariance parameters, e.g., the optimizer used (optimizer_cov); see fitGPModel for the full list of options |
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
gp_model <- GPModel(group_data = group_data, likelihood = "gaussian")
set_optim_params(gp_model, params = list(optimizer_cov = "nelder_mead"))
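A complementary sketch, not part of the original examples: several optimization settings can be passed in the same list. The parameter names maxit and trace below are assumed to match the optimization options documented for fitGPModel.

# Sketch (parameter names maxit and trace are assumed from the fitGPModel options):
set_optim_params(gp_model,
                 params = list(optimizer_cov = "gradient_descent",
                               maxit = 100, trace = TRUE))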
Set parameters for optimization of the covariance parameters of a GPModel
## S3 method for class 'GPModel'
set_optim_params(gp_model, params = list())
gp_model |
A GPModel |
params |
A list with parameters for the optimization of the covariance parameters, e.g., the optimizer used (optimizer_cov); see fitGPModel for the full list of options |
A GPModel
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
gp_model <- GPModel(group_data = group_data, likelihood = "gaussian")
set_optim_params(gp_model, params = list(optimizer_cov = "nelder_mead"))
Set the data required for making predictions with a GPModel
set_prediction_data(gp_model, vecchia_pred_type = NULL, num_neighbors_pred = NULL,
                    cg_delta_conv_pred = NULL, nsim_var_pred = NULL,
                    rank_pred_approx_matrix_lanczos = NULL, group_data_pred = NULL,
                    group_rand_coef_data_pred = NULL, gp_coords_pred = NULL,
                    gp_rand_coef_data_pred = NULL, cluster_ids_pred = NULL, X_pred = NULL)
gp_model |
A GPModel |
vecchia_pred_type |
A string specifying the type of Vecchia approximation used for making predictions |
num_neighbors_pred |
an integer specifying the number of neighbors for the Vecchia approximation for making predictions |
cg_delta_conv_pred |
a numeric specifying the tolerance level for the L2 norm of the residuals for checking convergence in conjugate gradient algorithms when used for prediction |
nsim_var_pred |
an integer specifying the number of samples when simulation is used for calculating predictive variances |
rank_pred_approx_matrix_lanczos |
an integer specifying the rank of the matrix for approximating predictive covariances obtained using the Lanczos algorithm |
group_data_pred |
A vector or matrix with elements being group levels for which predictions are made (if there are grouped random effects in the GPModel) |
group_rand_coef_data_pred |
A vector or matrix with covariate data for grouped random coefficients (if there are some in the GPModel) |
gp_coords_pred |
A matrix with prediction coordinates (=features) for the Gaussian process (if there is a GP in the GPModel) |
gp_rand_coef_data_pred |
A vector or matrix with covariate data for Gaussian process random coefficients (if there are some in the GPModel) |
cluster_ids_pred |
A vector with elements indicating the realizations of random effects / Gaussian processes for which predictions are made |
X_pred |
A matrix with prediction covariate data for the fixed effects linear regression term (if there is one in the GPModel) |
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
set.seed(1)
train_ind <- sample.int(length(y), size = 250)
gp_model <- GPModel(group_data = group_data[train_ind, 1], likelihood = "gaussian")
set_prediction_data(gp_model, group_data_pred = group_data[-train_ind, 1])
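A follow-up sketch, not part of the original examples: once the prediction data has been set, predictions can be made with use_saved_data = TRUE in predict; the call to fit below is one assumed way of estimating the model before predicting.

# Sketch (assumes the objects created in the example above):
fit(gp_model, y = y[train_ind])
pred <- predict(gp_model, use_saved_data = TRUE, predict_var = TRUE)
pred$mu # Predicted means for the groups set via set_prediction_data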
Set the data required for making predictions with a GPModel
## S3 method for class 'GPModel'
set_prediction_data(gp_model, vecchia_pred_type = NULL, num_neighbors_pred = NULL,
                    cg_delta_conv_pred = NULL, nsim_var_pred = NULL,
                    rank_pred_approx_matrix_lanczos = NULL, group_data_pred = NULL,
                    group_rand_coef_data_pred = NULL, gp_coords_pred = NULL,
                    gp_rand_coef_data_pred = NULL, cluster_ids_pred = NULL, X_pred = NULL)
gp_model |
A GPModel |
vecchia_pred_type |
A string specifying the type of Vecchia approximation used for making predictions |
num_neighbors_pred |
an integer specifying the number of neighbors for the Vecchia approximation for making predictions |
cg_delta_conv_pred |
a numeric specifying the tolerance level for the L2 norm of the residuals for checking convergence in conjugate gradient algorithms when used for prediction |
nsim_var_pred |
an integer specifying the number of samples when simulation is used for calculating predictive variances |
rank_pred_approx_matrix_lanczos |
an integer specifying the rank of the matrix for approximating predictive covariances obtained using the Lanczos algorithm |
group_data_pred |
A vector or matrix with elements being group levels for which predictions are made (if there are grouped random effects in the GPModel) |
group_rand_coef_data_pred |
A vector or matrix with covariate data for grouped random coefficients (if there are some in the GPModel) |
gp_coords_pred |
A matrix with prediction coordinates (=features) for the Gaussian process (if there is a GP in the GPModel) |
gp_rand_coef_data_pred |
A vector or matrix with covariate data for Gaussian process random coefficients (if there are some in the GPModel) |
cluster_ids_pred |
A vector with elements indicating the realizations of random effects / Gaussian processes for which predictions are made |
X_pred |
A matrix with prediction covariate data for the fixed effects linear regression term (if there is one in the GPModel) |
A GPModel
Fabio Sigrist
data(GPBoost_data, package = "gpboost")
set.seed(1)
train_ind <- sample.int(length(y), size = 250)
gp_model <- GPModel(group_data = group_data[train_ind, 1], likelihood = "gaussian")
set_prediction_data(gp_model, group_data_pred = group_data[-train_ind, 1])
Set one attribute of a gpb.Dataset object
setinfo(dataset, ...)

## S3 method for class 'gpb.Dataset'
setinfo(dataset, name, info, ...)
dataset |
Object of class gpb.Dataset |
... |
other parameters |
name |
the name of the field to set |
info |
the specific field of information to set |
The name field can be one of the following:
label: vector of labels to use as the target variable
weight: to do a weight rescale
init_score: initial score is the base prediction gpboost will boost from
group: used for learning-to-rank tasks. An integer vector describing how to group rows together as ordered results from the same set of candidate results to be ranked. For example, if you have a 100-document dataset with group = c(10, 20, 40, 10, 10, 10), that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
the dataset you passed in
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
gpb.Dataset.construct(dtrain)
labels <- gpboost::getinfo(dtrain, "label")
gpboost::setinfo(dtrain, "label", 1 - labels)
labels2 <- gpboost::getinfo(dtrain, "label")
stopifnot(all.equal(labels2, 1 - labels))
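A complementary sketch, not part of the original examples: the same mechanism can set the other fields listed in the Details section, e.g., observation weights (the weight values below are purely illustrative).

# Sketch (assumes dtrain and labels from the example above):
weights <- rep(1.0, length(labels))
gpboost::setinfo(dtrain, "weight", weights)
stopifnot(all.equal(gpboost::getinfo(dtrain, "weight"), weights))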
Get a new gpb.Dataset containing the specified rows of an original gpb.Dataset object
slice(dataset, ...)

## S3 method for class 'gpb.Dataset'
slice(dataset, idxset, ...)
dataset |
Object of class gpb.Dataset |
... |
other parameters (currently not used) |
idxset |
an integer vector of indices of rows needed |
constructed sub dataset
data(agaricus.train, package = "gpboost")
train <- agaricus.train
dtrain <- gpb.Dataset(train$data, label = train$label)
dsub <- gpboost::slice(dtrain, seq_len(42L))
gpb.Dataset.construct(dsub)
labels <- gpboost::getinfo(dsub, "label")
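A complementary sketch, not part of the original examples: slice can also be used to create a simple random train/validation split (the split size below is purely illustrative).

# Sketch (assumes train and dtrain from the example above):
set.seed(1)
idx_valid <- sample.int(nrow(train$data), size = 1000L)
dvalid <- gpboost::slice(dtrain, idx_valid)
dtrain_sub <- gpboost::slice(dtrain, setdiff(seq_len(nrow(train$data)), idx_valid))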
Summary for a GPModel
## S3 method for class 'GPModel'
summary(object, ...)
object |
a fitted GPModel |
... |
(not used, ignore this, simply here so that there is no CRAN warning) |
Summary of a (fitted) GPModel
Fabio Sigrist
# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples
data(GPBoost_data, package = "gpboost")
# Add intercept column
X1 <- cbind(rep(1, dim(X)[1]), X)
X_test1 <- cbind(rep(1, dim(X_test)[1]), X_test)

#--------------------Grouped random effects model: single-level random effect----------------
gp_model <- fitGPModel(group_data = group_data[,1], y = y, X = X1,
                       likelihood = "gaussian", params = list(std_dev = TRUE))
summary(gp_model)

#--------------------Gaussian process model----------------
gp_model <- fitGPModel(gp_coords = coords, cov_function = "matern", cov_fct_shape = 1.5,
                       likelihood = "gaussian", y = y, X = X1, params = list(std_dev = TRUE))
summary(gp_model)
A matrix with covariate data for the example data of the GPBoost package
data(GPBoost_data)
A matrix with covariate information for the predictions for the example data of the GPBoost package
data(GPBoost_data)
Response variable for the example data of the GPBoost package
data(GPBoost_data)