Title: | A Tree and Forest Tool for Classification and Regression |
---|---|
Description: | Build decision trees and random forests for classification and regression. The implementation strikes a balance between minimizing computing effort and maximizing the expected predictive accuracy, and thus scales well to large data sets. Multi-threading is available through 'OpenMP' <https://gcc.gnu.org/wiki/openmp>. |
Authors: | Yanchao Liu [aut, cre] |
Maintainer: | Yanchao Liu <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.4.1 |
Built: | 2024-11-15 06:24:31 UTC |
Source: | CRAN |
Build decision trees and random forests for classification and regression. The implementation strikes a balance between minimizing computing effort and maximizing the expected predictive accuracy, and thus scales well to large data sets. Multi-threading is available through 'OpenMP'.
Use brif to build a random forest and (optionally) make predictions.
Use brifTree to build a single decision tree.
Use printRules to print out the decision rules of a tree.
Use predict.brif to make predictions using a brif model (tree or forest).
Yanchao Liu
Depending on the arguments supplied, the function brif.formula, brif.default or brif.trainpredict will be called.
brif(x, ...)
x |
a data frame or a |
... |
arguments passed on to |
a data frame, a vector or a list. If newdata is supplied, prediction results for newdata will be returned in a data frame or a vector, depending on the problem type (classification or regression) and the type argument; otherwise, an object of class "brif" is returned, which is to be used in the function predict.brif for making predictions. See brif.default for components of the "brif" object.
trainset <- sample(1:nrow(iris), 0.5*nrow(iris))
validset <- setdiff(1:nrow(iris), trainset)

# Train and predict at once
pred_scores <- brif(Species~., data = iris, subset = trainset,
                    newdata = iris[validset, 1:4], type = 'score')
pred_labels <- brif(Species~., data = iris, subset = trainset,
                    newdata = iris[validset, 1:4], type = 'class')

# Confusion matrix
table(pred_labels, iris[validset, 5])

# Accuracy
sum(pred_labels == iris[validset, 5])/length(validset)

# Train using the formula format
bf <- brif(Species~., data = iris, subset = trainset)

# Or equivalently, train using the data.frame format
bf <- brif(iris[trainset, c(5,1:4)])

# Make a prediction
pred_scores <- predict(bf, iris[validset, 1:4], type = 'score')
pred_labels <- predict(bf, iris[validset, 1:4], type = 'class')

# Regression
bf <- brif(mpg ~., data = mtcars)
pred <- predict(bf, mtcars[2:11])
plot(pred, mtcars$mpg)
abline(0, 1)

# Optionally, delete the model object to release memory
rm(list = c("bf"))
gc()
Write data set to file
brif_write_data(df, resp_col_num = 1, outfile = "data")
df |
a data frame |
resp_col_num |
an integer indicating the column number (in df) of the response variable. For test data without the response column, use 0 here. |
outfile |
a character string specifying the file name prefix of output files |
a list of four elements. n: number of rows, p: number of predictors, data_file: name of the data file, config_file: name of the configuration file
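A call might look like the following minimal sketch. It assumes the brif package is installed; the file prefix `iris_train` and the use of `tempdir()` are illustrative choices, not requirements of the function.

```r
# Sketch only: runs the call when the brif package is available.
if (requireNamespace("brif", quietly = TRUE)) {
  # Put the response (Species) in column 1, predictors after it
  train_df <- iris[, c(5, 1:4)]

  # Hypothetical output prefix inside the session's temporary directory
  prefix <- file.path(tempdir(), "iris_train")

  info <- brif::brif_write_data(train_df, resp_col_num = 1, outfile = prefix)

  # Inspect the returned list (documented elements: n, p, data_file, config_file)
  str(info)
}
```

For test data that has no response column, the same call would use resp_col_num = 0, as described in the arguments above.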
Build a model taking a data frame as input
## Default S3 method:
brif(
  x,
  n_numeric_cuts = 31,
  n_integer_cuts = 31,
  max_integer_classes = 20,
  max_depth = 20,
  min_node_size = 1,
  ntrees = 200,
  ps = 0,
  max_factor_levels = 30,
  seed = 0,
  bagging_method = 0,
  bagging_proportion = 0.9,
  split_search = 4,
  search_radius = 5,
  verbose = 0,
  nthreads = 2,
  ...
)
x |
a data frame containing the training data set. The first column is taken as the target variable and all other columns are used as predictors. |
n_numeric_cuts |
an integer value indicating the maximum number of split points to generate for each numeric variable. |
n_integer_cuts |
an integer value indicating the maximum number of split points to generate for each integer variable. |
max_integer_classes |
an integer value. If the target variable is integer and has more than max_integer_classes unique values in the training data, then the target variable will be grouped into max_integer_classes bins. If the target variable is numeric, then the number of bins created on the target variable is the smaller of max_integer_classes and the number of unique values, and the regression problem will be solved as a classification problem. |
max_depth |
an integer specifying the maximum depth of each tree. Maximum is 40. |
min_node_size |
an integer specifying the minimum number of training cases a leaf node must contain. |
ntrees |
an integer specifying the number of trees in the forest. |
ps |
an integer indicating the number of predictors to sample at each node split. Default is 0, meaning to use sqrt(p), where p is the number of predictors in the input. |
max_factor_levels |
an integer. If any factor variable has more than max_factor_levels levels, the program stops and prompts the user to increase the value of this parameter if the many-level factor is indeed intended. |
seed |
an integer specifying the seed used by the internal random number generator. Default is 0, meaning not to set a seed but to accept the set seed from the calling environment. |
bagging_method |
an integer indicating the bagging sampling method: 0 for sampling without replacement; 1 for sampling with replacement (bootstrapping). |
bagging_proportion |
a numeric scalar between 0 and 1, indicating the proportion of training observations to be used in each tree. |
split_search |
an integer indicating the choice of the split search method. 0: randomly pick a split point; 1: do a local search; 2: random pick subject to regulation; 3: local search subject to regulation; 4 or above: a mix of options 0 to 3. |
search_radius |
a positive integer indicating the split point search radius. This parameter takes effect only in the self-regulating local search (split_search = 2 or above). |
verbose |
an integer (0 or 1) specifying the verbose level. |
nthreads |
an integer specifying the number of threads used by the program. This parameter takes effect only on systems supporting OpenMP. |
... |
additional arguments. |
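The sampling-related arguments (bagging_method, bagging_proportion and ps) can be pictured with a small base R sketch. This is not the package's internal code: the helper `draw_tree_sample` is hypothetical, and the rounding of the sample size and of sqrt(p) are assumptions made for illustration.

```r
# Hypothetical helper: draw the row indices used to grow one tree.
# bagging_method 0 = without replacement, 1 = with replacement (bootstrap).
draw_tree_sample <- function(n, bagging_method = 0, bagging_proportion = 0.9) {
  size <- round(bagging_proportion * n)  # rounding rule is an assumption
  sample.int(n, size = size, replace = (bagging_method == 1))
}

set.seed(1)
n <- nrow(iris)
idx_no_repl <- draw_tree_sample(n, bagging_method = 0)
idx_boot    <- draw_tree_sample(n, bagging_method = 1)

# Without replacement, no row appears twice in the sample
anyDuplicated(idx_no_repl) == 0

# Predictor sampling at a node: ps = 0 is documented to mean sqrt(p)
p  <- ncol(iris) - 1   # four predictors in iris
ps <- 0
n_try <- if (ps == 0) max(1L, as.integer(round(sqrt(p)))) else as.integer(ps)
candidates <- sample(names(iris)[1:4], n_try)
candidates
```

With bagging_method = 0 each tree sees a distinct subset of rows, so bagging_proportion below 1 is what introduces variation between trees in that mode.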
an object of class brif, which is a list containing the following components. Note: this object is not intended for any use other than by the function predict.brif. Do not apply the str function to this object, because the output can be long and meaningless, especially when ntrees is large. Use summary to get a peek at its structure. Use printRules to print out the decision rules of a particular tree. Most of the data in the object is stored in the tree_leaves element of this list, which is itself a list of lists.
p |
an integer scalar, the number of variables (predictors) used in the model |
var_types |
a character vector of length (p+1) containing the variable types, including that of the target variable as its first element |
var_labels |
a character vector of length (p+1) containing the variable names, including the target variable name as its first element |
n_bcols |
an integer vector of length (p+1), containing the numbers of binary columns generated for each variable |
ntrees |
an integer scalar indicating the number of trees in the model |
index_in_group |
an integer vector specifying the internal index, for each variable, in its type group |
numeric_cuts |
a list containing split point information on numeric variables |
integer_cuts |
a list containing split point information on integer variables |
factor_cuts |
a list containing split point information on factor variables |
n_num_vars |
an integer scalar indicating the number of numeric variables in the model |
n_int_vars |
an integer scalar indicating the number of integer variables in the model |
n_fac_vars |
an integer scalar indicating the number of factor variables in the model |
tree_leaves |
a list containing all the leaves in the forest |
yc |
a list containing the target variable encoding scheme |
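A quick way to look at a fitted object without calling str is sketched below. It assumes the brif package is installed and uses only the components listed above; ntrees = 10 is an arbitrary small value chosen to keep the example fast.

```r
# Sketch only: runs when the brif package is available.
if (requireNamespace("brif", quietly = TRUE)) {
  bf <- brif::brif(Species ~ ., data = iris, ntrees = 10)

  # The result should carry the class "brif"
  print(inherits(bf, "brif"))

  # Scalar components documented above
  print(bf$p)        # number of predictors used in the model
  print(bf$ntrees)   # number of trees in the forest

  # summary() gives a compact overview of the list, unlike str()
  print(summary(bf))
}
```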
Build a model (and make predictions) with formula
## S3 method for class 'formula'
brif(
  formula,
  data,
  subset,
  na.action = stats::na.pass,
  newdata = NULL,
  type = c("score", "class"),
  ...
)
formula |
an object of class " |
data |
an optional data frame, list or environment (or object coercible by |
subset |
an optional vector specifying a subset (in terms of index numbers, not actual data) of observations to be used in the fitting process. |
na.action |
a function which indicates what should happen when the data contain NAs. |
newdata |
a data frame containing the data set for prediction. Default is NULL. If newdata is supplied, prediction results will be returned. |
type |
a character string specifying the prediction format, which takes effect only when |
... |
additional algorithmic parameters. See |
an object of class brif to be used by predict.brif.
bf <- brif(Species ~ ., data = iris)
pred <- predict(bf, iris[,1:4])
If the model is built to predict for just one test data set (newdata), then this function should be used instead of the brif and predict.brif pipeline. Transporting the model object between the training and prediction functions through saving and loading the brif object takes a substantial amount of time, and using the brif.trainpredict function eliminates such time-consuming operations. This function will be automatically invoked by the brif function when the newdata argument is supplied there.
If GPU is used for training (CUDA = 1 or 2), the total execution time of this function includes writing and reading temporary data files. To see the timing of different steps, use verbose = 1.
Note: Using GPU for training can improve training time only when the number of rows in the training data is extremely large, e.g., over 1 million. Even in such cases, CUDA = 2 (hybrid mode) is recommended over CUDA = 1 (force using GPU).
## S3 method for class 'trainpredict'
brif(
  x,
  newdata,
  type = c("score", "class"),
  n_numeric_cuts = 31,
  n_integer_cuts = 31,
  max_integer_classes = 20,
  max_depth = 20,
  min_node_size = 1,
  ntrees = 200,
  ps = 0,
  max_factor_levels = 30,
  seed = 0,
  bagging_method = 0,
  bagging_proportion = 0.9,
  vote_method = 1,
  split_search = 4,
  search_radius = 5,
  verbose = 0,
  nthreads = 2,
  CUDA = 0,
  CUDA_blocksize = 128,
  CUDA_n_lb_GPU = 20480,
  cubrif_main = "cubrif_main.exe",
  tmp_file_prefix = "cbf",
  ...
)
x |
a data frame containing the training data set. The first column is taken as the target variable and all other columns are used as predictors. |
newdata |
a data frame containing the new data to be predicted. All columns in x (except for the first column which is the target variable) must be present in newdata and the data types must match. |
type |
a character string specifying the prediction format. Available values include "score" and "class". Default is "score". |
n_numeric_cuts |
an integer value indicating the maximum number of split points to generate for each numeric variable. |
n_integer_cuts |
an integer value indicating the maximum number of split points to generate for each integer variable. |
max_integer_classes |
an integer value. If the target variable is integer and has more than max_integer_classes unique values in the training data, then the target variable will be grouped into max_integer_classes bins. If the target variable is numeric, then the number of bins created on the target variable is the smaller of max_integer_classes and the number of unique values, and the regression problem will be solved as a classification problem. |
max_depth |
an integer specifying the maximum depth of each tree. Maximum is 40. |
min_node_size |
an integer specifying the minimum number of training cases a leaf node must contain. |
ntrees |
an integer specifying the number of trees in the forest. |
ps |
an integer indicating the number of predictors to sample at each node split. Default is 0, meaning to use sqrt(p), where p is the number of predictors in the input. |
max_factor_levels |
an integer. If any factor variable has more than max_factor_levels levels, the program stops and prompts the user to increase the value of this parameter if the many-level factor is indeed intended. |
seed |
an integer specifying the seed used by the internal random number generator. Default is 0, meaning not to set a seed but to accept the set seed from the calling environment. |
bagging_method |
an integer indicating the bagging sampling method: 0 for sampling without replacement; 1 for sampling with replacement (bootstrapping). |
bagging_proportion |
a numeric scalar between 0 and 1, indicating the proportion of training observations to be used in each tree. |
vote_method |
an integer (0 or 1) specifying the voting method in prediction. 0: each leaf contributes the raw count and an average is taken on the sum over all leaves; 1: each leaf contributes an intra-node fraction which is then averaged over all leaves with equal weight. |
split_search |
an integer indicating the choice of the split search method. 0: randomly pick a split point; 1: do a local search; 2: random pick subject to regulation; 3: local search subject to regulation; 4 or above: a mix of options 0 to 3. |
search_radius |
a positive integer indicating the split point search radius. This parameter takes effect only in regulated search (split_search = 2 or above). |
verbose |
an integer (0 or 1) specifying the verbose level. |
nthreads |
an integer specifying the number of threads used by the program. This parameter takes effect only on systems supporting OpenMP. |
CUDA |
an integer (0, 1 or 2). 0: Do not use GPU. 1: Use GPU to build the forest. 2: Hybrid mode: Use GPU to split a node only when the node size is greater than CUDA_n_lb_GPU. |
CUDA_blocksize |
a positive integer specifying the CUDA thread block size, must be a multiple of 64 up to 1024. |
CUDA_n_lb_GPU |
a positive integer. The number of training cases in a node must be greater than this number to enable GPU computing when CUDA = 2. |
cubrif_main |
a string containing the path and name of the cubrif executable (see https://github.com/profyliu/cubrif for how to build it). |
tmp_file_prefix |
a string for the path and prefix of temporary files created when CUDA is used. |
... |
additional arguments. |
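Because newdata must contain every predictor column of x with a matching data type, a pre-flight check can catch mismatches before training starts. The helper below is a hypothetical base R sketch, not part of brif; it compares only the first class of each column, which is an assumption about what "data types must match" means.

```r
# Hypothetical helper: verify that newdata has all predictors of x
# (x's first column is the target) and that column classes agree.
check_newdata <- function(x, newdata) {
  predictors <- names(x)[-1]

  missing_cols <- setdiff(predictors, names(newdata))
  if (length(missing_cols) > 0) {
    stop("newdata is missing: ", paste(missing_cols, collapse = ", "))
  }

  first_class <- function(col) class(col)[1]
  train_types <- vapply(x[predictors], first_class, character(1))
  new_types   <- vapply(newdata[predictors], first_class, character(1))

  mismatched <- predictors[train_types != new_types]
  if (length(mismatched) > 0) {
    stop("type mismatch in: ", paste(mismatched, collapse = ", "))
  }
  invisible(TRUE)
}

# Usage: response in column 1 of the training frame, predictors only in newdata
train <- iris[1:100, c(5, 1:4)]
valid <- iris[101:150, 1:4]
check_newdata(train, valid)
```

A common source of mismatch is an integer column in the training data that is read as numeric in the new data, since the two types are treated separately (n_integer_cuts versus n_numeric_cuts).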
a data frame or a vector containing the prediction results. See predict.brif for details.
trainset <- sample(1:nrow(iris), 0.5*nrow(iris))
validset <- setdiff(1:nrow(iris), trainset)
pred_score <- brif.trainpredict(iris[trainset, c(5,1:4)], iris[validset, c(1:4)], type = 'score')
pred_label <- colnames(pred_score)[apply(pred_score, 1, which.max)]
This is a wrapper for brif to build a single tree of a given depth. See brifTree.default and brifTree.formula for details.
brifTree(x, ...)
x |
a data frame or a |
... |
arguments passed on to |
an object of class brif. See brif.default for details.
# Build a single tree
bt <- brifTree(Species ~., data = iris, depth = 3)

# Print out the decision rules
printRules(bt)

# Get the accuracy on the training set
sum(predict(bt, newdata = iris, type = 'class') == iris[,'Species'])/nrow(iris)
This function invokes brif.default with appropriately set parameters to generate a single tree with the maximum expected predictive accuracy.
## Default S3 method:
brifTree(
  x,
  depth = 3,
  n_cuts = 2047,
  max_integer_classes = 20,
  max_factor_levels = 30,
  seed = 0,
  ...
)
x |
a data frame containing the training data. The first column is treated as the target variable. |
depth |
a positive integer indicating the desired depth of the tree. |
n_cuts |
a positive integer indicating the maximum number of split points to generate on each numeric or integer variable. A large value is preferred for a single tree. |
max_integer_classes |
a positive integer. See |
max_factor_levels |
a positive integer. See |
seed |
a non-negative integer specifying the random number generator seed. |
... |
other relevant arguments. |
an object of class brif. See brif.default for details.
Build a single brif tree taking a formula as input
## S3 method for class 'formula'
brifTree(
  formula,
  data,
  subset,
  na.action = stats::na.pass,
  depth = 3,
  n_cuts = 2047,
  max_integer_classes = 20,
  max_factor_levels = 30,
  seed = 0,
  ...
)
formula |
an object of class " |
data |
an optional data frame, list or environment (or object coercible by |
subset |
an optional vector specifying a subset (in terms of index numbers, not actual data) of observations to be used in the fitting process. |
na.action |
a function which indicates what should happen when the data contain NAs. |
depth |
a positive integer indicating the desired depth of the tree. |
n_cuts |
a positive integer indicating the maximum number of split points to generate on each numeric or integer variable. A large value is preferred for a single tree. |
max_integer_classes |
a positive integer. See |
max_factor_levels |
a positive integer. See |
seed |
a non-negative integer specifying the random number generator seed. |
... |
other relevant arguments. |
an object of class brif to be used by predict.brif.
Make predictions for newdata using a brif model object.
## S3 method for class 'brif'
predict(
  object,
  newdata = NULL,
  type = c("score", "class"),
  vote_method = 1,
  nthreads = 2,
  ...
)
object |
an object of class "brif" as returned by the brif training function. |
newdata |
a data frame. The predictor column names and data types must match those supplied for training. The order of the predictor columns does not matter though. |
type |
a character string indicating the return content. For a classification problem, "score" means the by-class probabilities and "class" means the class labels (i.e., the target variable levels). For regression, the predicted values are returned. |
vote_method |
an integer (0 or 1) specifying the voting method in prediction. 0: each leaf contributes the raw count and an average is taken on the sum over all leaves; 1: each leaf contributes an intra-node fraction which is then averaged over all leaves with equal weight. |
nthreads |
an integer specifying the number of threads used by the program. This parameter only takes effect on systems supporting OpenMP. |
... |
additional arguments. |
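The difference between the two vote_method settings can be illustrated with a small worked example in base R. This follows the wording of the argument description only; it is a sketch under that reading, not the package's internal code, and the leaf counts are made-up numbers.

```r
# Hypothetical class counts in the leaf that one new case reaches in each of
# three trees (rows = trees, columns = classes).
leaf_counts <- matrix(c(8, 2,
                        1, 1,
                        0, 5),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(paste0("tree", 1:3), c("a", "b")))

# vote_method = 0 (as described): pool the raw counts, then normalize
score0 <- colSums(leaf_counts) / sum(leaf_counts)

# vote_method = 1 (as described): per-leaf fractions, averaged with equal weight
score1 <- colMeans(prop.table(leaf_counts, margin = 1))

score0
score1

# The winning class can differ between the two schemes
names(which.max(score0))
names(which.max(score1))
```

Under this reading, vote_method = 0 lets large leaves dominate the vote, while vote_method = 1 gives every tree the same influence regardless of leaf size.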
Note: If a model is built just for making predictions on one test set (i.e., there is no need to save the model object for future use), then brif.trainpredict should be used.
a data frame or a vector containing the prediction results. For regression, a numeric vector of predicted values will be returned. For classification, if type = "class", a character vector of the predicted class labels will be returned; if type = "score", a data frame will be returned, in which each column contains the probability of the new case being in the corresponding class.
# Predict using a model built by brif
pred_score <- predict(brif(Species ~ ., data = iris), iris, type = 'score')
pred_label <- predict(brif(Species ~ ., data = iris), iris, type = 'class')

# Equivalently and more efficiently:
pred_score <- brif(Species ~., data = iris, newdata = iris, type = 'score')
pred_label <- brif(Species ~., data = iris, newdata = iris, type = 'class')

# Or, retrieve predicted labels from the scores:
pred_label <- colnames(pred_score)[apply(pred_score, 1, which.max)]
Print the decision rules of a Brif tree
printBrifTree(rf, which_tree)
rf |
an object of class 'brif', as returned by rftrain. |
which_tree |
an integer indicating the tree number |
No return value. The function is intended for producing a side effect, which prints the decision rules to the standard output.
Print the decision rules of a brif tree
printRules(object, which_tree = 0)
object |
an object of class "brif" as returned by the brif training function. |
which_tree |
a nonnegative integer indicating the tree number (starting from 0) in the forest to be printed. |
No return value. The function is called for its side effect: the decision rules of the given tree are printed to the console. Users can use sink to direct the output to a file.
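Redirecting the rules to a file with sink might look like the sketch below. It assumes the brif package is installed; the temporary file name is an illustrative choice, and tryCatch is used so that the console is restored even if printing fails.

```r
# Sketch only: runs when the brif package is available.
if (requireNamespace("brif", quietly = TRUE)) {
  bt <- brif::brifTree(Species ~ ., data = iris, depth = 3)

  # Hypothetical destination file in the session's temporary directory
  rules_file <- tempfile(fileext = ".txt")

  sink(rules_file)                       # divert console output to the file
  tryCatch(brif::printRules(bt),         # print the rules of tree 0
           finally = sink())             # always restore console output

  # Read the rules back for inspection
  rules <- readLines(rules_file)
  print(head(rules))
}
```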
# Build a single tree
bt <- brifTree(Species ~., data = iris, depth = 3)

# Print out the decision rules
printRules(bt)

# Get the training accuracy
sum(predict(bt, newdata = iris, type = 'class') == iris[,'Species'])/nrow(iris)
This function is not intended for end users. Users should use the predict.brif function instead.
rfpredict(rf, rdf, vote_method, nthreads)
rf |
an object of class 'brif', as returned by rftrain. |
rdf |
a data frame containing the new cases to be predicted. |
vote_method |
an integer (0 or 1) indicating the voting mechanism among leaf predictions. |
nthreads |
an integer specifying the number of threads to be used in prediction. |
a data frame containing the predicted values.
This function is not intended for end users. Users should use the brif.formula or brif.default function.
rftrain(rdf, par)
rdf |
a data frame. The first column is treated as the target variable. |
par |
a list containing all parameters. |
a list, of class "brif", containing the trained random forest model.
This function is not intended for end users. Users should use the function brif or brif.trainpredict and supply the newdata argument thereof.
rftrainpredict(rdf, rdf_new, par)
rdf |
a data frame containing the training data. |
rdf_new |
a data frame containing new cases to be predicted. |
par |
a list containing all parameters. |
a data frame containing the predicted values.
Stratified permutation of rows by the first column
stratpar(x, stride)
x |
a data frame to be permuted by row |
stride |
an integer indicating how many rows are to be grouped in one block |
a data frame, which is a permutation of x