Title: | Light Gradient Boosting Machine |
---|---|
Description: | Tree based algorithms can be improved by introducing boosting frameworks. 'LightGBM' is one such framework, based on Ke, Guolin et al. (2017) <https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision>. This package offers an R interface to work with it. It is designed to be distributed and efficient with the following advantages: 1. Faster training speed and higher efficiency. 2. Lower memory usage. 3. Better accuracy. 4. Parallel learning supported. 5. Capable of handling large-scale data. In recognition of these advantages, 'LightGBM' has been widely-used in many winning solutions of machine learning competitions. Comparison experiments on public datasets suggest that 'LightGBM' can outperform existing boosting frameworks on both efficiency and accuracy, with significantly lower memory consumption. In addition, parallel experiments suggest that in certain circumstances, 'LightGBM' can achieve a linear speed-up in training time by using multiple machines. |
Authors: | Yu Shi [aut], Guolin Ke [aut], Damien Soukhavong [aut], James Lamb [aut, cre], Qi Meng [aut], Thomas Finley [aut], Taifeng Wang [aut], Wei Chen [aut], Weidong Ma [aut], Qiwei Ye [aut], Tie-Yan Liu [aut], Nikita Titov [aut], Yachen Yan [ctb], Microsoft Corporation [cph], Dropbox, Inc. [cph], Alberto Ferreira [ctb], Daniel Lemire [ctb], Victor Zverovich [cph], IBM Corporation [ctb], David Cortes [aut], Michael Mayer [ctb] |
Maintainer: | James Lamb <[email protected]> |
License: | MIT + file LICENSE |
Version: | 4.5.0 |
Built: | 2024-10-26 06:22:28 UTC |
Source: | CRAN |
This data set is originally from the Mushroom data set, UCI Machine Learning Repository. This data set includes the following fields:
label: the label for each record
data: a sparse Matrix of dgCMatrix class, with 126 columns.
data(agaricus.test)
A list containing a label vector, and a dgCMatrix object with 1611 rows and 126 variables
https://archive.ics.uci.edu/ml/datasets/Mushroom
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
This data set is originally from the Mushroom data set, UCI Machine Learning Repository. This data set includes the following fields:
label: the label for each record
data: a sparse Matrix of dgCMatrix class, with 126 columns.
data(agaricus.train)
A list containing a label vector, and a dgCMatrix object with 6513 rows and 127 variables
https://archive.ics.uci.edu/ml/datasets/Mushroom
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
This data set is originally from the Bank Marketing data set, UCI Machine Learning Repository.
It contains only the following: bank.csv, a 10% sample of the examples randomly selected from the full data set (an older version of this data set with fewer inputs).
data(bank)
A data.table with 4521 rows and 17 variables
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
S. Moro, P. Cortez and P. Rita. (2014) A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems
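The bank data contains character columns, so before building an lgb.Dataset it has to be made numeric. A minimal sketch using lgb.convert_with_rules (column names such as "y" are taken from the UCI description and should be checked against the loaded table):
library(lightgbm)
data(bank, package = "lightgbm")
str(bank)
# encode character columns (e.g. job, marital, y) as integers
bank_converted <- lgb.convert_with_rules(data = bank)
str(bank_converted$data)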
Dimensions of an lgb.Dataset: returns a vector of numbers of rows and of columns in an lgb.Dataset.
## S3 method for class 'lgb.Dataset'
dim(x)
x |
Object of class lgb.Dataset |
Note: since nrow and ncol internally use dim, they can also be directly used with an lgb.Dataset object.
a vector of numbers of rows and of columns
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
stopifnot(nrow(dtrain) == nrow(train$data))
stopifnot(ncol(dtrain) == ncol(train$data))
stopifnot(all(dim(dtrain) == dim(train$data)))
Handling of column names of lgb.Dataset: only column names are supported for lgb.Dataset, thus setting of row names would have no effect and returned row names would be NULL.
## S3 method for class 'lgb.Dataset'
dimnames(x)

## S3 replacement method for class 'lgb.Dataset'
dimnames(x) <- value
x |
object of class lgb.Dataset |
value |
a list of two elements: the first one is ignored and the second one is column names |
Generic dimnames methods are used by colnames. Since row names are irrelevant, it is recommended to use colnames directly.
A list with the dimension names of the dataset
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
lgb.Dataset.construct(dtrain)
dimnames(dtrain)
colnames(dtrain)
colnames(dtrain) <- make.names(seq_len(ncol(train$data)))
print(dtrain, verbose = TRUE)
Get one attribute of a lgb.Dataset
get_field(dataset, field_name)

## S3 method for class 'lgb.Dataset'
get_field(dataset, field_name)
dataset |
Object of class lgb.Dataset |
field_name |
String with the name of the attribute to get. One of the following: "label", "weight", "init_score", "group". |
requested attribute
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
lgb.Dataset.construct(dtrain)
labels <- lightgbm::get_field(dtrain, "label")
lightgbm::set_field(dtrain, "label", 1 - labels)
labels2 <- lightgbm::get_field(dtrain, "label")
stopifnot(all(labels2 == 1 - labels))
LightGBM attempts to speed up many operations by using multi-threading.
The number of threads used in those operations can be controlled via the
num_threads
parameter passed through params
to functions like
lgb.train and lgb.Dataset. However, some operations (like materializing
a model from a text file) are done via code paths that don't explicitly accept thread-control
configuration.
Use this function to see the default number of threads LightGBM will use for such operations.
getLGBMthreads()
number of threads as an integer. -1 means that in situations where parameter num_threads is not explicitly supplied, LightGBM will choose a number of threads to use automatically.
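A minimal sketch of checking and overriding this default; it assumes the companion setter setLGBMthreads() exported by the package:
library(lightgbm)
# query the package-level default used where num_threads is not supplied
getLGBMthreads()
# pin such operations to 2 threads, then confirm
setLGBMthreads(2L)
getLGBMthreads()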
Pre-configures a LightGBM model object to produce fast single-row predictions for a given input data type, prediction type, and parameters.
lgb.configure_fast_predict( model, csr = FALSE, start_iteration = NULL, num_iteration = NULL, type = "response", params = list() )
model |
LightGBM model object (class lgb.Booster). The object will be modified in-place. |
csr |
Whether the prediction function is going to be called on sparse CSR inputs.
If |
start_iteration |
int or None, optional (default=None) Start index of the iteration to predict. If None or <= 0, starts from the first iteration. |
num_iteration |
int or None, optional (default=None) Limit number of iterations in the prediction. If None, if the best iteration exists and start_iteration is None or <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used. If <= 0, all iterations from start_iteration are used (no limits). |
type |
Type of prediction to output. Allowed types are:
Note that, if using custom objectives, types "class" and "response" will not be available and will default towards using "raw" instead. If the model was fit through function lightgbm and it was passed a factor as labels,
passing the prediction type through New in version 4.0.0 |
params |
a list of additional named parameters. See
the "Predict Parameters" section of the documentation for a list of parameters and
valid values. Where these conflict with the values of keyword arguments to this function,
the values in |
Calling this function multiple times with different parameters might not override the previous configuration and might trigger undefined behavior.
Any saved configuration for fast predictions might be lost after making a single-row prediction of a different type than what was configured (except for types "response" and "class", which can be switched between each other at any time without losing the configuration).
In some situations, setting a fast prediction configuration for one type of prediction might cause the prediction function to keep using that configuration for single-row predictions even if the requested type of prediction is different from what was configured.
Note that this function will not accept argument type="class" - for such cases, one can pass type="response" to this function and then type="class" to the predict function - the fast configuration will not be lost or altered if the switch is between "response" and "class".
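A sketch of that workflow, assuming a fitted binary model model and a feature matrix X (both hypothetical here):
# configure the fast path for single-row probability predictions
lgb.configure_fast_predict(model, type = "response")
# "class" predictions still use the configuration, since switching
# between "response" and "class" does not discard it
predict(model, X[1L, , drop = FALSE], type = "class")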
The configuration does not survive de-serializations, so it has to be generated anew in every R process that is going to use it (e.g. if loading a model object through readRDS, whatever configuration was there previously will be lost).
Requesting a different prediction type or passing parameters to predict.lgb.Booster will cause it to ignore the fast-predict configuration and take the slow route instead (but be aware that an existing configuration might not always be overridden by supplying different parameters or prediction type, so make sure to check that the output is what was expected when a prediction is to be made on a single row for something different than what is configured).
Note that, if configuring a non-default prediction type (such as leaf indices), then that type must also be passed in the call to predict.lgb.Booster in order for it to use the configuration. This also applies for start_iteration and num_iteration, but the params list must be empty in the call to predict.
Predictions about feature contributions do not allow a fast route for CSR inputs, and as such, this function will produce an error if passing csr=TRUE and type = "contrib" together.
The same model
that was passed as input, invisibly, with the desired
configuration stored inside it and available to be used in future calls to
predict.lgb.Booster.
library(lightgbm) data(mtcars) X <- as.matrix(mtcars[, -1L]) y <- mtcars[, 1L] dtrain <- lgb.Dataset(X, label = y, params = list(max_bin = 5L)) params <- list( min_data_in_leaf = 2L , num_threads = 2L ) model <- lgb.train( params = params , data = dtrain , obj = "regression" , nrounds = 5L , verbose = -1L ) lgb.configure_fast_predict(model) x_single <- X[11L, , drop = FALSE] predict(model, x_single) # Will not use it if the prediction to be made # is different from what was configured predict(model, x_single, type = "leaf")
Attempts to prepare a clean dataset to put into an lgb.Dataset.
Factor, character, and logical columns are converted to integer. Missing values
in factors and characters will be filled with 0L. Missing values in logicals
will be filled with -1L.
This function returns and optionally takes in "rules" that describe exactly how to convert values in columns.
Columns that contain only NA values will be converted by this function but will not show up in the returned rules. NOTE: In previous releases of LightGBM, this function was called lgb.prepare_rules2.
lgb.convert_with_rules(data, rules = NULL)
data |
A data.frame or data.table to prepare. |
rules |
A set of rules from the data preparator, if already used. This should be an R list,
where names are column names in |
A list with the cleaned dataset (data) and the rules (rules). Note that the data must be converted to a matrix format (as.matrix) for input in lgb.Dataset.
data(iris) str(iris) new_iris <- lgb.convert_with_rules(data = iris) str(new_iris$data) data(iris) # Erase iris dataset iris$Species[1L] <- "NEW FACTOR" # Introduce junk factor (NA) # Use conversion using known rules # Unknown factors become 0, excellent for sparse datasets newer_iris <- lgb.convert_with_rules(data = iris, rules = new_iris$rules) # Unknown factor is now zero, perfect for sparse datasets newer_iris$data[1L, ] # Species became 0 as it is an unknown factor newer_iris$data[1L, 5L] <- 1.0 # Put back real initial value # Is the newly created dataset equal? YES! all.equal(new_iris$data, newer_iris$data) # Can we test our own rules? data(iris) # Erase iris dataset # We remapped values differently personal_rules <- list( Species = c( "setosa" = 3L , "versicolor" = 2L , "virginica" = 1L ) ) newest_iris <- lgb.convert_with_rules(data = iris, rules = personal_rules) str(newest_iris$data) # SUCCESS!
Cross validation logic used by LightGBM
lgb.cv( params = list(), data, nrounds = 100L, nfold = 3L, label = NULL, weight = NULL, obj = NULL, eval = NULL, verbose = 1L, record = TRUE, eval_freq = 1L, showsd = TRUE, stratified = TRUE, folds = NULL, init_model = NULL, colnames = NULL, categorical_feature = NULL, early_stopping_rounds = NULL, callbacks = list(), reset_data = FALSE, serializable = TRUE, eval_train_metric = FALSE )
params |
a list of parameters. See the "Parameters" section of the documentation for a list of parameters and valid values. |
data |
a |
nrounds |
number of training rounds |
nfold |
the original dataset is randomly partitioned into |
label |
Deprecated. See "Deprecated Arguments" section below. |
weight |
Deprecated. See "Deprecated Arguments" section below. |
obj |
objective function, can be character or custom objective function. Examples include
|
eval |
evaluation function(s). This can be a character vector, function, or list with a mixture of strings and functions.
|
verbose |
verbosity for output, if <= 0 and |
record |
Boolean, TRUE will record iteration message to |
eval_freq |
evaluation output frequency, only effective when verbose > 0 and |
showsd |
|
stratified |
a |
folds |
|
init_model |
path of model file or |
colnames |
Deprecated. See "Deprecated Arguments" section below. |
categorical_feature |
Deprecated. See "Deprecated Arguments" section below. |
early_stopping_rounds |
int. Activates early stopping. When this parameter is non-null,
training will stop if the evaluation of any metric on any validation set
fails to improve for |
callbacks |
List of callback functions that are applied at each iteration. |
reset_data |
Boolean, setting it to TRUE (not the default value) will transform the booster model into a predictor model which frees up memory and the original datasets |
serializable |
whether to make the resulting objects serializable through functions such as
|
eval_train_metric |
|
a trained model lgb.CVBooster.
A future release of lightgbm will require passing an lgb.Dataset to argument 'data'. It will also remove support for passing arguments 'categorical_feature', 'colnames', 'label', and 'weight'.
"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.
If multiple arguments are given to eval, their order will be preserved. If you enable early stopping by setting early_stopping_rounds in params, by default all metrics will be considered for early stopping. If you want to only consider the first metric for early stopping, pass first_metric_only = TRUE in params. Note that if you also specify metric in params, that metric will be considered the "first" one. If you omit metric, a default metric will be used based on your choice for the parameter obj (keyword argument) or objective (passed into params).
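For instance, a sketch of restricting early stopping to the first metric only, reusing the dtrain object from the example below (metric names are illustrative):
params <- list(
  objective = "binary"
  , metric = c("auc", "binary_logloss")  # "auc" is treated as the first metric
  , first_metric_only = TRUE
  , num_threads = 2L
)
cv_model <- lgb.cv(
  params = params
  , data = dtrain
  , nrounds = 50L
  , nfold = 3L
  , early_stopping_rounds = 5L
)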
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) params <- list( objective = "regression" , metric = "l2" , min_data = 1L , learning_rate = 1.0 , num_threads = 2L ) model <- lgb.cv( params = params , data = dtrain , nrounds = 5L , nfold = 3L )
Construct lgb.Dataset object. LightGBM does not train on raw data. It discretizes continuous features into histogram bins, tries to combine categorical features, and automatically handles missing and infinite values. The Dataset class handles that preprocessing, and holds that alternative representation of the input data.
lgb.Dataset( data, params = list(), reference = NULL, colnames = NULL, categorical_feature = NULL, free_raw_data = TRUE, label = NULL, weight = NULL, group = NULL, init_score = NULL )
data |
a |
params |
a list of parameters. See The "Dataset Parameters" section of the documentation for a list of parameters and valid values. |
reference |
reference dataset. When LightGBM creates a Dataset, it does some preprocessing like binning
continuous features into histograms. If you want to apply the same bin boundaries from an existing
dataset to new |
colnames |
names of columns |
categorical_feature |
categorical features. This can either be a character vector of feature
names or an integer vector with the indices of the features (e.g.
|
free_raw_data |
LightGBM constructs its data format, called a "Dataset", from tabular data.
By default, that Dataset object on the R side does not keep a copy of the raw data.
This reduces LightGBM's memory consumption, but it means that the Dataset object
cannot be changed after it has been constructed. If you'd prefer to be able to
change the Dataset object after construction, set |
label |
vector of labels to use as the target variable |
weight |
numeric vector of sample weights |
group |
used for learning-to-rank tasks. An integer vector describing how to
group rows together as ordered results from the same set of candidate results
to be ranked. For example, if you have a 100-document dataset with
|
init_score |
initial score is the base prediction lightgbm will boost from |
constructed dataset
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) data_file <- tempfile(fileext = ".data") lgb.Dataset.save(dtrain, data_file) dtrain <- lgb.Dataset(data_file) lgb.Dataset.construct(dtrain)
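As a sketch of the group argument for learning-to-rank data (sizes and relevance labels here are made up for illustration):
# 30 documents belonging to 3 queries of 10 candidate documents each
X_rank <- matrix(rnorm(30L * 5L), ncol = 5L)
relevance <- sample(0L:4L, size = 30L, replace = TRUE)
drank <- lgb.Dataset(
  data = X_rank
  , label = relevance
  , group = c(10L, 10L, 10L)
)
lgb.Dataset.construct(drank)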
Construct Dataset explicitly
lgb.Dataset.construct(dataset)
dataset |
Object of class |
constructed dataset
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) lgb.Dataset.construct(dtrain)
Construct validation data according to training data
lgb.Dataset.create.valid( dataset, data, label = NULL, weight = NULL, group = NULL, init_score = NULL, params = list() )
dataset |
|
data |
a |
label |
vector of labels to use as the target variable |
weight |
numeric vector of sample weights |
group |
used for learning-to-rank tasks. An integer vector describing how to
group rows together as ordered results from the same set of candidate results
to be ranked. For example, if you have a 100-document dataset with
|
init_score |
initial score is the base prediction lightgbm will boost from |
params |
a list of parameters. See
The "Dataset Parameters" section of the documentation for a list of parameters
and valid values. If this is an empty list (the default), the validation Dataset
will have the same parameters as the Dataset passed to argument |
constructed dataset
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) data(agaricus.test, package = "lightgbm") test <- agaricus.test dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label) # parameters can be changed between the training data and validation set, # for example to account for training data in a text file with a header row # and validation data in a text file without it train_file <- tempfile(pattern = "train_", fileext = ".csv") write.table( data.frame(y = rnorm(100L), x1 = rnorm(100L), x2 = rnorm(100L)) , file = train_file , sep = "," , col.names = TRUE , row.names = FALSE , quote = FALSE ) valid_file <- tempfile(pattern = "valid_", fileext = ".csv") write.table( data.frame(y = rnorm(100L), x1 = rnorm(100L), x2 = rnorm(100L)) , file = valid_file , sep = "," , col.names = FALSE , row.names = FALSE , quote = FALSE ) dtrain <- lgb.Dataset( data = train_file , params = list(has_header = TRUE) ) dtrain$construct() dvalid <- lgb.Dataset( data = valid_file , params = list(has_header = FALSE) ) dvalid$construct()
Save lgb.Dataset to a binary file. Please note that init_score is not saved in a binary file. If you need it, please set it again after loading the Dataset.
lgb.Dataset.save(dataset, fname)
dataset |
object of class |
fname |
object filename of output file |
the dataset you passed in
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) lgb.Dataset.save(dtrain, tempfile(fileext = ".bin"))
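Because init_score is dropped when saving, a sketch of re-attaching it after reloading the binary file (the scores used here are placeholders):
data_file <- tempfile(fileext = ".bin")
lgb.Dataset.save(dtrain, data_file)
dtrain2 <- lgb.Dataset(data_file)
lgb.Dataset.construct(dtrain2)
# init_score is not stored in the binary file, so set it again
set_field(dtrain2, "init_score", rep(0.0, nrow(dtrain2)))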
Set the categorical features of an lgb.Dataset
object. Use this function
to tell LightGBM which features should be treated as categorical.
lgb.Dataset.set.categorical(dataset, categorical_feature)
dataset |
object of class |
categorical_feature |
categorical features. This can either be a character vector of feature
names or an integer vector with the indices of the features (e.g.
|
the dataset you passed in
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) data_file <- tempfile(fileext = ".data") lgb.Dataset.save(dtrain, data_file) dtrain <- lgb.Dataset(data_file) lgb.Dataset.set.categorical(dtrain, 1L:2L)
Set reference of lgb.Dataset. If you want to use validation data, you should set its reference to the training data.
lgb.Dataset.set.reference(dataset, reference)
dataset |
object of class |
reference |
object of class |
the dataset you passed in
# create training Dataset data(agaricus.train, package ="lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) # create a validation Dataset, using dtrain as a reference data(agaricus.test, package = "lightgbm") test <- agaricus.test dtest <- lgb.Dataset(test$data, label = test$label) lgb.Dataset.set.reference(dtest, dtrain)
If a LightGBM model object was produced with argument 'serializable=TRUE', the R object will keep a copy of the underlying C++ object as raw bytes, which can be used to reconstruct such object after getting serialized and de-serialized, but at the cost of extra memory usage. If these raw bytes are not needed anymore, they can be dropped through this function in order to save memory. Note that the object will be modified in-place.
New in version 4.0.0
lgb.drop_serialized(model)
model |
|
lgb.Booster (the same 'model' object that was passed as input, invisibly).
lgb.restore_handle, lgb.make_serializable.
Dump LightGBM model to json
lgb.dump(booster, num_iteration = NULL, start_iteration = 1L)
booster |
Object of class |
num_iteration |
Number of iterations to be dumped. NULL or <= 0 means use best iteration |
start_iteration |
Index (1-based) of the first boosting round to dump.
For example, passing New in version 4.4.0 |
json format of model
library(lightgbm) data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) data(agaricus.test, package = "lightgbm") test <- agaricus.test dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label) params <- list( objective = "regression" , metric = "l2" , min_data = 1L , learning_rate = 1.0 , num_threads = 2L ) valids <- list(test = dtest) model <- lgb.train( params = params , data = dtrain , nrounds = 10L , valids = valids , early_stopping_rounds = 5L ) json_model <- lgb.dump(model)
Given a lgb.Booster, return evaluation results for a particular metric on a particular dataset.
lgb.get.eval.result( booster, data_name, eval_name, iters = NULL, is_err = FALSE )
booster |
Object of class |
data_name |
Name of the dataset to return evaluation results for. |
eval_name |
Name of the evaluation metric to return results for. |
iters |
An integer vector of iterations you want to get evaluation results for. If NULL (the default), evaluation results for all iterations will be returned. |
is_err |
TRUE will return evaluation error instead |
numeric vector of evaluation result
# train a regression model data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) data(agaricus.test, package = "lightgbm") test <- agaricus.test dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label) params <- list( objective = "regression" , metric = "l2" , min_data = 1L , learning_rate = 1.0 , num_threads = 2L ) valids <- list(test = dtest) model <- lgb.train( params = params , data = dtrain , nrounds = 5L , valids = valids ) # Examine valid data_name values print(setdiff(names(model$record_evals), "start_iter")) # Examine valid eval_name values for dataset "test" print(names(model$record_evals[["test"]])) # Get L2 values for "test" dataset lgb.get.eval.result(model, "test", "l2")
Creates a data.table
of feature importances in a model.
lgb.importance(model, percentage = TRUE)
model |
object of class |
percentage |
whether to show importance in relative percentage. |
For a tree model, a data.table with the following columns:
Feature: Feature names in the model.
Gain: The total gain of this feature's splits.
Cover: The number of observations related to this feature.
Frequency: The number of times a feature split in trees.
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) params <- list( objective = "binary" , learning_rate = 0.1 , max_depth = -1L , min_data_in_leaf = 1L , min_sum_hessian_in_leaf = 1.0 , num_threads = 2L ) model <- lgb.train( params = params , data = dtrain , nrounds = 5L ) tree_imp1 <- lgb.importance(model, percentage = TRUE) tree_imp2 <- lgb.importance(model, percentage = FALSE)
Computes feature contribution components of raw score prediction.
lgb.interprete(model, data, idxset, num_iteration = NULL)
model |
object of class |
data |
a matrix object or a dgCMatrix object. |
idxset |
an integer vector of indices of rows needed. |
num_iteration |
number of iterations to predict with; NULL or <= 0 means use best iteration. |
For regression, binary classification and lambdarank models, a list of data.table objects with the following columns:
Feature: Feature names in the model.
Contribution: The total contribution of this feature's splits.
For multiclass classification, a list of data.table objects with the Feature column and Contribution columns for each class.
Logit <- function(x) log(x / (1.0 - x)) data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) set_field( dataset = dtrain , field_name = "init_score" , data = rep(Logit(mean(train$label)), length(train$label)) ) data(agaricus.test, package = "lightgbm") test <- agaricus.test params <- list( objective = "binary" , learning_rate = 0.1 , max_depth = -1L , min_data_in_leaf = 1L , min_sum_hessian_in_leaf = 1.0 , num_threads = 2L ) model <- lgb.train( params = params , data = dtrain , nrounds = 3L ) tree_interpretation <- lgb.interprete(model, test$data, 1L:5L)
Load a LightGBM model. Takes in either a file path or a model string. If both are provided, it will default to loading from file.
lgb.load(filename = NULL, model_str = NULL)
filename |
path of model file |
model_str |
a str containing the model (as a |
lgb.Booster
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) data(agaricus.test, package = "lightgbm") test <- agaricus.test dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label) params <- list( objective = "regression" , metric = "l2" , min_data = 1L , learning_rate = 1.0 , num_threads = 2L ) valids <- list(test = dtest) model <- lgb.train( params = params , data = dtrain , nrounds = 5L , valids = valids , early_stopping_rounds = 3L ) model_file <- tempfile(fileext = ".txt") lgb.save(model, model_file) load_booster <- lgb.load(filename = model_file) model_string <- model$save_model_to_string(NULL) # saves best iteration load_booster_from_str <- lgb.load(model_str = model_string)
If a LightGBM model object was produced with argument 'serializable=FALSE', the R object will not
be serializable (e.g. cannot save and load with saveRDS
and readRDS
) as it will lack the raw bytes
needed to reconstruct its underlying C++ object. This function can be used to forcibly produce those serialized
raw bytes and make the object serializable. Note that the object will be modified in-place.
New in version 4.0.0
lgb.make_serializable(model)
model |
|
lgb.Booster (the same 'model' object that was passed as input, invisibly).
lgb.restore_handle, lgb.drop_serialized.
Parse a LightGBM model json dump into a data.table
structure.
lgb.model.dt.tree(model, num_iteration = NULL, start_iteration = 1L)
model |
object of class |
num_iteration |
Number of iterations to include. NULL or <= 0 means use best iteration. |
start_iteration |
Index (1-based) of the first boosting round to include in the output.
For example, passing New in version 4.4.0 |
A data.table with detailed information about the model's trees' nodes and leaves. The columns of the data.table are:
tree_index: ID of a tree in a model (integer)
split_index: ID of a node in a tree (integer)
split_feature: for a node, it's a feature name (character); for a leaf, it simply labels it as "NA"
node_parent: ID of the parent node for current node (integer)
leaf_index: ID of a leaf in a tree (integer)
leaf_parent: ID of the parent node for current leaf (integer)
split_gain: Split gain of a node
threshold: Splitting threshold value of a node
decision_type: Decision type of a node
default_left: Determines how to handle NA values: TRUE -> Left, FALSE -> Right
internal_value: Node value
internal_count: The number of observations collected by a node
leaf_value: Leaf value
leaf_count: The number of observations collected by a leaf
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) params <- list( objective = "binary" , learning_rate = 0.01 , num_leaves = 63L , max_depth = -1L , min_data_in_leaf = 1L , min_sum_hessian_in_leaf = 1.0 , num_threads = 2L ) model <- lgb.train(params, dtrain, 10L) tree_dt <- lgb.model.dt.tree(model)
Plot previously calculated feature importance: Gain, Cover and Frequency, as a bar graph.
lgb.plot.importance( tree_imp, top_n = 10L, measure = "Gain", left_margin = 10L, cex = NULL )
tree_imp |
a |
top_n |
maximal number of top features to include into the plot. |
measure |
the name of importance measure to plot, can be "Gain", "Cover" or "Frequency". |
left_margin |
(base R barplot) allows to adjust the left margin size to fit feature names. |
cex |
(base R barplot) passed as |
The graph represents each feature as a horizontal bar of length proportional to the defined importance of a feature. Features are shown ranked in a decreasing importance order.
The lgb.plot.importance
function creates a barplot
and silently returns a processed data.table with top_n
features sorted by defined importance.
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) params <- list( objective = "binary" , learning_rate = 0.1 , min_data_in_leaf = 1L , min_sum_hessian_in_leaf = 1.0 , num_threads = 2L ) model <- lgb.train( params = params , data = dtrain , nrounds = 5L ) tree_imp <- lgb.importance(model, percentage = TRUE) lgb.plot.importance(tree_imp, top_n = 5L, measure = "Gain")
Plot previously calculated feature contribution as a bar graph.
lgb.plot.interpretation( tree_interpretation_dt, top_n = 10L, cols = 1L, left_margin = 10L, cex = NULL )
tree_interpretation_dt |
a |
top_n |
maximal number of top features to include into the plot. |
cols |
the column numbers of layout, will be used only for multiclass classification feature contribution. |
left_margin |
(base R barplot) allows to adjust the left margin size to fit feature names. |
cex |
(base R barplot) passed as |
The graph represents each feature as a horizontal bar of length proportional to the defined contribution of a feature. Features are shown ranked in a decreasing contribution order.
The lgb.plot.interpretation function creates a barplot.
Logit <- function(x) { log(x / (1.0 - x)) } data(agaricus.train, package = "lightgbm") labels <- agaricus.train$label dtrain <- lgb.Dataset( agaricus.train$data , label = labels ) set_field( dataset = dtrain , field_name = "init_score" , data = rep(Logit(mean(labels)), length(labels)) ) data(agaricus.test, package = "lightgbm") params <- list( objective = "binary" , learning_rate = 0.1 , max_depth = -1L , min_data_in_leaf = 1L , min_sum_hessian_in_leaf = 1.0 , num_threads = 2L ) model <- lgb.train( params = params , data = dtrain , nrounds = 5L ) tree_interpretation <- lgb.interprete( model = model , data = agaricus.test$data , idxset = 1L:5L ) lgb.plot.interpretation( tree_interpretation_dt = tree_interpretation[[1L]] , top_n = 3L )
After a LightGBM model object is de-serialized through functions such as save
or
saveRDS
, its underlying C++ object will be blank and needs to be restored to be able to use it. Such
object is restored automatically when calling functions such as predict
, but this function can be
used to forcibly restore it beforehand. Note that the object will be modified in-place.
New in version 4.0.0
lgb.restore_handle(model)
model |
|
Be aware that fast single-row prediction configurations are not restored through this
function. If you wish to make fast single-row predictions using a lgb.Booster
loaded this way,
call lgb.configure_fast_predict on the loaded lgb.Booster
object.
lgb.Booster
(the same 'model' object that was passed as input, invisibly).
lgb.make_serializable, lgb.drop_serialized.
library(lightgbm) data("agaricus.train") model <- lightgbm( agaricus.train$data , agaricus.train$label , params = list(objective = "binary") , nrounds = 5L , verbose = 0 , num_threads = 2L ) fname <- tempfile(fileext="rds") saveRDS(model, fname) model_new <- readRDS(fname) model_new$check_null_handle() lgb.restore_handle(model_new) model_new$check_null_handle()
library(lightgbm) data("agaricus.train") model <- lightgbm( agaricus.train$data , agaricus.train$label , params = list(objective = "binary") , nrounds = 5L , verbose = 0 , num_threads = 2L ) fname <- tempfile(fileext="rds") saveRDS(model, fname) model_new <- readRDS(fname) model_new$check_null_handle() lgb.restore_handle(model_new) model_new$check_null_handle()
Save LightGBM model
lgb.save(booster, filename, num_iteration = NULL, start_iteration = 1L)
booster |
Object of class |
filename |
Saved filename |
num_iteration |
Number of iterations to save, NULL or <= 0 means use best iteration |
start_iteration |
Index (1-based) of the first boosting round to save.
For example, passing New in version 4.4.0 |
lgb.Booster
library(lightgbm) data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) data(agaricus.test, package = "lightgbm") test <- agaricus.test dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label) params <- list( objective = "regression" , metric = "l2" , min_data = 1L , learning_rate = 1.0 , num_threads = 2L ) valids <- list(test = dtest) model <- lgb.train( params = params , data = dtrain , nrounds = 10L , valids = valids , early_stopping_rounds = 5L ) lgb.save(model, tempfile(fileext = ".txt"))
Get a new lgb.Dataset containing the specified rows of the original lgb.Dataset object. Renamed from slice() in 4.4.0.
lgb.slice.Dataset(dataset, idxset)
dataset |
Object of class |
idxset |
an integer vector of indices of rows needed |
constructed sub dataset
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) dsub <- lgb.slice.Dataset(dtrain, seq_len(42L)) lgb.Dataset.construct(dsub) labels <- lightgbm::get_field(dsub, "label")
Low-level R interface to train a LightGBM model. Unlike lightgbm, this function is focused on performance (e.g. speed, memory efficiency). It is also less likely to have breaking API changes in new releases than lightgbm.
lgb.train( params = list(), data, nrounds = 100L, valids = list(), obj = NULL, eval = NULL, verbose = 1L, record = TRUE, eval_freq = 1L, init_model = NULL, colnames = NULL, categorical_feature = NULL, early_stopping_rounds = NULL, callbacks = list(), reset_data = FALSE, serializable = TRUE )
params |
a list of parameters. See the "Parameters" section of the documentation for a list of parameters and valid values. |
data |
a |
nrounds |
number of training rounds |
valids |
a list of |
obj |
objective function, can be character or custom objective function. Examples include
|
eval |
evaluation function(s). This can be a character vector, function, or list with a mixture of strings and functions.
|
verbose |
verbosity for output, if <= 0 and |
record |
Boolean, TRUE will record iteration message to |
eval_freq |
evaluation output frequency, only effective when verbose > 0 and |
init_model |
path of model file or |
colnames |
Deprecated. See "Deprecated Arguments" section below. |
categorical_feature |
Deprecated. See "Deprecated Arguments" section below. |
early_stopping_rounds |
int. Activates early stopping. When this parameter is non-null,
training will stop if the evaluation of any metric on any validation set
fails to improve for |
callbacks |
List of callback functions that are applied at each iteration. |
reset_data |
Boolean, setting it to TRUE (not the default value) will transform the booster model into a predictor model which frees up memory and the original datasets |
serializable |
whether to make the resulting objects serializable through functions such as
|
a trained booster model lgb.Booster.
A future release of lightgbm will remove support for passing arguments 'categorical_feature' and 'colnames'. Pass those things to lgb.Dataset instead.
"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.
If multiple arguments are given to eval, their order will be preserved. If you enable early stopping by setting early_stopping_rounds in params, by default all metrics will be considered for early stopping. If you want to only consider the first metric for early stopping, pass first_metric_only = TRUE in params. Note that if you also specify metric in params, that metric will be considered the "first" one. If you omit metric, a default metric will be used based on your choice for the parameter obj (keyword argument) or objective (passed into params).
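As an illustration of a custom eval function, here is a sketch that assumes the documented callback signature of function(preds, dtrain) returning a list with elements name, value, and higher_better (dtrain and dtest as in the example below):
mae_metric <- function(preds, dtrain) {
  actual <- get_field(dtrain, "label")
  list(name = "mae", value = mean(abs(preds - actual)), higher_better = FALSE)
}
model <- lgb.train(
  params = list(objective = "regression", num_threads = 2L)
  , data = dtrain
  , nrounds = 5L
  , valids = list(test = dtest)
  , eval = mae_metric
)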
data(agaricus.train, package = "lightgbm") train <- agaricus.train dtrain <- lgb.Dataset(train$data, label = train$label) data(agaricus.test, package = "lightgbm") test <- agaricus.test dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label) params <- list( objective = "regression" , metric = "l2" , min_data = 1L , learning_rate = 1.0 , num_threads = 2L ) valids <- list(test = dtest) model <- lgb.train( params = params , data = dtrain , nrounds = 5L , valids = valids , early_stopping_rounds = 3L )
High-level R interface to train a LightGBM model. Unlike lgb.train, this function is focused on compatibility with other statistics and machine learning interfaces in R. This focus on compatibility means that this interface may experience more frequent breaking API changes than lgb.train. For efficiency-sensitive applications, or for applications where breaking API changes across releases is very expensive, use lgb.train.
lightgbm( data, label = NULL, weights = NULL, params = list(), nrounds = 100L, verbose = 1L, eval_freq = 1L, early_stopping_rounds = NULL, init_model = NULL, callbacks = list(), serializable = TRUE, objective = "auto", init_score = NULL, num_threads = NULL, colnames = NULL, categorical_feature = NULL, ... )
data |
a |
label |
Vector of labels, used if |
weights |
Sample / observation weights for rows in the input data. If Changed from 'weight', in version 4.0.0 |
params |
a list of parameters. See the "Parameters" section of the documentation for a list of parameters and valid values. |
nrounds |
number of training rounds |
verbose |
verbosity for output, if <= 0 and |
eval_freq |
evaluation output frequency, only effective when verbose > 0 and |
early_stopping_rounds |
int. Activates early stopping. When this parameter is non-null,
training will stop if the evaluation of any metric on any validation set
fails to improve for |
init_model |
path of model file or |
callbacks |
List of callback functions that are applied at each iteration. |
serializable |
whether to make the resulting objects serializable through functions such as
|
objective |
Optimization objective (e.g. '"regression"', '"binary"', etc.). For a list of accepted objectives, see the "objective" item of the "Parameters" section of the documentation. If passing
New in version 4.0.0 |
init_score |
initial score is the base prediction lightgbm will boost from New in version 4.0.0 |
num_threads |
Number of parallel threads to use. For best speed, this should be set to the number of physical cores in the CPU - in a typical x86-64 machine, this corresponds to half the number of maximum threads. Be aware that using too many threads can result in speed degradation in smaller datasets (see the parameters documentation for more details). If passing zero, will use the default number of threads configured for OpenMP
(typically controlled through an environment variable | If passing | This parameter gets overridden by | New in version 4.0.0 |
colnames |
Character vector of features. Only used if |
categorical_feature |
categorical features. This can either be a character vector of feature
names or an integer vector with the indices of the features (e.g.
|
... |
Additional arguments passed to
|
a trained lgb.Booster
"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.
If multiple arguments are given to eval
, their order will be preserved. If you enable
early stopping by setting early_stopping_rounds
in params
, by default all
metrics will be considered for early stopping.
If you want to only consider the first metric for early stopping, pass
first_metric_only = TRUE
in params
. Note that if you also specify metric
in params
, that metric will be considered the "first" one. If you omit metric
,
a default metric will be used based on your choice for the parameter obj
(keyword argument)
or objective
(passed into params
).
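A minimal usage sketch of this high-level interface, mirroring the lower-level examples elsewhere in this manual:
library(lightgbm)
data(agaricus.train, package = "lightgbm")
model <- lightgbm(
  data = agaricus.train$data
  , label = agaricus.train$label
  , params = list(objective = "binary", num_threads = 2L)
  , nrounds = 5L
  , verbose = 0L
)
preds <- predict(model, agaricus.train$data)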
Predicted values based on class lgb.Booster
New in version 4.0.0
## S3 method for class 'lgb.Booster' predict( object, newdata, type = "response", start_iteration = NULL, num_iteration = NULL, header = FALSE, params = list(), ... )
object |
Object of class |
newdata |
a For sparse inputs, if predictions are only going to be made for a single row, it will be faster to
use CSR format, in which case the data may be passed as either a single-row CSR matrix (class
If single-row predictions are going to be performed frequently, it is recommended to pre-configure the model object for fast single-row sparse predictions through function lgb.configure_fast_predict. Changed from 'data', in version 4.0.0 |
type |
Type of prediction to output. Allowed types are: "response", "class", "raw", "leaf", and "contrib".
Note that, if using custom objectives, types "class" and "response" will not be available and will default towards using "raw" instead. If the model was fit through function lightgbm and it was passed a factor as labels,
passing the prediction type through New in version 4.0.0 |
start_iteration |
int or None, optional (default=None) Start index of the iteration to predict. If None or <= 0, starts from the first iteration. |
num_iteration |
int or None, optional (default=None) Limit number of iterations in the prediction. If None, if the best iteration exists and start_iteration is None or <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used. If <= 0, all iterations from start_iteration are used (no limits). |
header |
only used when predicting from a text file. Set to TRUE if the text file has a header row |
params |
a list of additional named parameters. See the "Predict Parameters" section of the documentation for a list of parameters and valid values. Where these conflict with the values of keyword arguments to this function, the values in params take precedence. |
... |
ignored |
If the model object has been configured for fast single-row predictions through lgb.configure_fast_predict, this function will use the prediction parameters that were configured for it - as such, extra prediction parameters should not be passed here, otherwise the configuration will be ignored and the slow route will be taken.
For prediction types that are meant to always return one output per observation (e.g. when predicting type="response" or type="raw" on a binary classification or regression objective), will return a vector with one element per row in newdata.

For prediction types that are meant to return more than one output per observation (e.g. when predicting type="response" or type="raw" on a multi-class objective, or when predicting type="leaf", regardless of objective), will return a matrix with one row per observation in newdata and one column per output.

For type="leaf" predictions, will return a matrix with one row per observation in newdata and one column per tree. Note that for multiclass objectives, LightGBM trains one tree per class at each boosting iteration. That means that, for example, for a multiclass model with 3 classes, the leaf predictions for the first class can be found in columns 1, 4, 7, 10, etc.

For type="contrib", will return a matrix of SHAP values with one row per observation in newdata and columns corresponding to features. For regression, ranking, cross-entropy, and binary classification objectives, this matrix contains one column per feature plus a final column containing the Shapley base value. For multiclass objectives, this matrix will represent num_classes such matrices, in the order "feature contributions for first class, feature contributions for second class, feature contributions for third class, etc.".

If the model was fit through function lightgbm and it was passed a factor as labels, predictions returned from this function will retain the factor levels (either as values for type="class", or as column names for type="response" and type="raw" for multi-class objectives). Note that passing the requested prediction type under params instead of through type might result in the factor levels not being present in the output.
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)
params <- list(
  objective = "regression"
  , metric = "l2"
  , min_data = 1L
  , learning_rate = 1.0
  , num_threads = 2L
)
valids <- list(test = dtest)
model <- lgb.train(
  params = params
  , data = dtrain
  , nrounds = 5L
  , valids = valids
)
preds <- predict(model, test$data)

# pass other prediction parameters
preds <- predict(
  model,
  test$data,
  params = list(
    predict_disable_shape_check = TRUE
  )
)
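As a hedged illustration of the output shapes described in the Details section, the sketch below reuses the model and test data from the example above and requests leaf-index and SHAP-value outputs. The dimensions in the comments assume the 5-round regression model and the 126-column agaricus test matrix.

# one column per tree: 5 columns for the 5-round model trained above
leaf_preds <- predict(model, test$data, type = "leaf")
dim(leaf_preds)

# one column per feature plus a final column with the Shapley base value:
# 127 columns for the 126-feature agaricus data
shap_preds <- predict(model, test$data, type = "contrib")
dim(shap_preds)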
Show summary information about a LightGBM model object (same as summary).
New in version 4.0.0
## S3 method for class 'lgb.Booster'
print(x, ...)
x |
Object of class lgb.Booster |
... |
Not used |
The same input x, returned as invisible.
lgb.Dataset object

Set one attribute of a lgb.Dataset
set_field(dataset, field_name, data)

## S3 method for class 'lgb.Dataset'
set_field(dataset, field_name, data)
dataset |
Object of class lgb.Dataset |
field_name |
String with the name of the attribute to set. One of the following: "label", "weight", "init_score", "group". |
data |
The data for the field. See examples. |
The lgb.Dataset you passed in.
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
lgb.Dataset.construct(dtrain)

labels <- lightgbm::get_field(dtrain, "label")
lightgbm::set_field(dtrain, "label", 1 - labels)

labels2 <- lightgbm::get_field(dtrain, "label")
stopifnot(all.equal(labels2, 1 - labels))
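The same mechanism works for the other supported fields. The short sketch below continues from the example above; the weight values are purely illustrative. It sets per-row observation weights and reads them back.

# set per-observation weights on the constructed Dataset and read them back
weights <- rep(1.0, length(labels))
lightgbm::set_field(dtrain, "weight", weights)
stopifnot(all.equal(lightgbm::get_field(dtrain, "weight"), weights))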
LightGBM attempts to speed up many operations by using multi-threading. The number of threads used in those operations can be controlled via the num_threads parameter passed through params to functions like lgb.train and lgb.Dataset. However, some operations (like materializing a model from a text file) are done via code paths that don't explicitly accept thread-control configuration.

Use this function to set the maximum number of threads LightGBM will use for such operations.

This function affects all LightGBM operations in the same process. So, for example, if you call setLGBMthreads(4), no other multi-threaded LightGBM operation in the same process will use more than 4 threads.

Call setLGBMthreads(-1) to remove this limitation.
setLGBMthreads(num_threads)
num_threads |
maximum number of threads to be used by LightGBM in multi-threaded operations |
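A minimal usage sketch follows; it assumes the companion function getLGBMthreads() is available in the package to query the current process-wide setting.

library(lightgbm)

# cap all multi-threaded LightGBM operations in this process at 2 threads
setLGBMthreads(2L)
getLGBMthreads()

# remove the limitation again
setLGBMthreads(-1L)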
Show summary information about a LightGBM model object (same as print).
New in version 4.0.0
## S3 method for class 'lgb.Booster'
summary(object, ...)
object |
Object of class lgb.Booster |
... |
Not used |
The same input object, returned as invisible.