Package 'Rborist'

Title: Extensible, Parallelizable Implementation of the Random Forest Algorithm
Description: Scalable implementation of classification and regression forests, as described by Breiman (2001), <DOI:10.1023/A:1010933404324>.
Authors: Mark Seligman [aut, cre]
Maintainer: Mark Seligman <[email protected]>
License: MPL (>= 2) | GPL (>= 2) | file LICENSE
Version: 0.3-11
Built: 2025-02-03 06:54:42 UTC
Source: CRAN

Help Index

Expands forest values into front-end readable vectors.


Formats training output into a form suitable for illustration of feature contributions.


## Default S3 method:



an object of type rfTrain produced by training.


An object of type ExpandReg or ExpandCtg containing human-readable representations of the trained forest.


Mark Seligman at Suiji.


## Not run: 
    rb <- Rborist(iris[,-5], iris[,5])
    ffe <- expandfe(rb)

    # An rfTrain counterpart is not yet implemented.
## End(Not run)

Exportation Format for rfArb Training Output


Formats training output into a form suitable for illustration of feature contributions.


## Default S3 method:



an object of type Rborist produced by training.


An object of type Export.


Mark Seligman at Suiji.


## Not run: 
    rb <- Rborist(iris[,-5], iris[,5])
    ffe <- Export(rb)
## End(Not run)

Meinshausen forest weights


Normalized observation counts across a prediction set.


## Default S3 method:
forestWeight(objTrain, prediction, sampler=objTrain$sampler,
nThread=0, verbose = FALSE, ...)



an object of class rfArb, created from a previous invocation of the command Rborist or rfArb to train.


an object of class SummaryReg or SummaryCtg obtained from prediction using objTrain and argument indexing=TRUE.


an object of class Sampler, as documented for command of the same name.


specifies a preferred thread count.


whether to output progress of weighting.


not currently used.


a numeric matrix having rows equal to the Meinshausen weight of each new datum.


Mark Seligman at Suiji.


Meinshausen, N. (2006) Quantile Regression Forests. Journal of Machine Learning Research 7, 983-999.


## Not run: 
  # Regression example:
  nRow <- 5000
  x <- data.frame(replicate(6, rnorm(nRow)))
  y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.
  rb <- Rborist(x,y)

  newdata <- data.frame(replicate(6, rnorm(nRow)))

  # Performs separate prediction on new data, saving indices:
  pred <- predict(rb, newdata, indexing=TRUE)
  weights <- forestWeight(rb, pred)

  obsIdx <- 215 # Arbitrary observation index (one-based row number)

  # Inner product should equal prediction, modulo numerical vagaries:
  yPredApprox <- weights[obsIdx,] %*% y
  print((yPredApprox - pred$prediction$yPred[obsIdx])/yPredApprox)

## End(Not run)
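
The relationship between weights and prediction can be sketched in base R without training a forest. The leaf memberships, the response values and the helper toyWeight below are hypothetical and purely illustrative. The sketch ignores bagging and sample counts, so it is not a description of how forestWeight computes its result.

```r
# Hypothetical leaf membership of 8 training observations in 3 toy trees
# (illustrative values only; not produced by Rborist):
leafTrain <- cbind(c(1, 1, 1, 1, 2, 2, 2, 2),
                   c(1, 2, 1, 2, 1, 2, 1, 2),
                   c(1, 1, 2, 2, 2, 2, 1, 1))
y <- c(0.5, 1.5, -1.0, 2.0, 0.0, 3.0, -2.0, 1.0)

# Leaf reached by a single new observation in each tree (also hypothetical):
leafNew <- c(1, 2, 1)

# Weight of each training observation: its per-tree share of the leaf
# reached by the new observation, averaged over trees.
toyWeight <- function(leafTrain, leafNew) {
  nTree <- ncol(leafTrain)
  w <- numeric(nrow(leafTrain))
  for (tIdx in seq_len(nTree)) {
    inLeaf <- leafTrain[, tIdx] == leafNew[tIdx]
    w <- w + inLeaf / sum(inLeaf)
  }
  w / nTree
}

w <- toyWeight(leafTrain, leafNew)

# Toy forest prediction: the average of the per-tree leaf means.
leafMeans <- sapply(seq_len(ncol(leafTrain)), function(tIdx) {
  mean(y[leafTrain[, tIdx] == leafNew[tIdx]])
})
yPredToy <- mean(leafMeans)

print(sum(w))                 # the weights sum to one
print(sum(w * y) - yPredToy)  # zero, up to rounding
```

Under these assumptions the inner product of the weight vector with the training response reproduces the toy prediction, which is the identity that the example above checks on a trained forest.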

predict method for arbTrain result


Prediction and test using Rborist.


## S3 method for class 'arbTrain'
predict(object, newdata, sampler, yTest=NULL,
keyedFrame = FALSE, quantVec=numeric(0), quantiles = length(quantVec) > 0,
ctgCensus = "votes", indexing = FALSE, trapUnobserved = FALSE,
bagging = FALSE, nThread = 0, verbose = FALSE, ...)



an object of class arbTrain, created from a previous invocation of the command rfArb, Rborist or rfTrain to train.


a design frame or matrix containing new data, with the same signature of predictors as in the training command.


an object of class Sampler used in the command.


a response vector against which to test the new predictions.


whether the columns of newdata may appear in arbitrary order or as a superset of the predictors used to train.


a vector of quantiles to predict.


whether to predict quantiles.


whether/how to summarize per-category predictions. "votes" specifies the number of trees predicting a given class. "prob" specifies a normalized, probabilistic summary. "probSample" specifies sample-weighted probabilities, similar to quantile histogramming.


whether to record the final node index, typically terminal, of tree traversal.


reports score for nonterminal upon encountering values not observed during training, such as missing data.


whether prediction is restricted to out-of-bag samples.


suggests an OpenMP-style thread count. Zero denotes the default processor setting.


whether to output progress of prediction.


not currently used.


an object of one of two classes:

  • SummaryReg summarizing regression, consisting of:

    • prediction an object of class PredictReg consisting of:

      • yPred the estimated numerical response.

      • qPred quantiles of prediction, if requested.

      • qEst quantile of the estimate, if quantiles requested.

      • indices final index of prediction, if requested.

    • validation if validation requested, an object of class ValidReg consisting of:

      • mse the mean-squared error of the estimate.

      • rsq the r-squared statistic of the estimate.

      • mae the mean absolute error of the estimate.

    • importance if permutation importance requested, an object of class importanceReg, containing multiple instances of:

      • names the predictor names.

      • mse the per-predictor mean-squared error, under permutation.

  • SummaryCtg summarizing classification, consisting of:

    • PredictCtg consisting of:

      • yPred estimated categorical response.

      • census factor-valued matrix of the estimate, by category, if requested.

      • prob matrix of estimate probabilities, by category, if requested.

      • indices final index of prediction, if requested.

    • validation if validation requested, an object of class ValidCtg consisting of:

      • confusion the confusion matrix.

      • misprediction the misprediction rate.

      • oobError the out-of-bag error.

    • importance if permutation importance requested, an object of class importanceCtg, consisting of:

      • mispred the misprediction rate, by predictor.

      • oobErr the out-of-bag error, by predictor.
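
The regression validation statistics listed above can be illustrated in base R. This is a minimal sketch assuming the conventional definitions of mean-squared error, mean absolute error and r-squared. The vectors yTest and yPred are made-up values, and the package may compute its statistics differently, for example over out-of-bag samples only.

```r
# Made-up test response and predictions, for illustration only:
yTest <- c(1.0, 2.0, 3.0, 4.0)
yPred <- c(1.5, 2.0, 2.5, 5.0)

err <- yPred - yTest

# Conventional definitions (assumed, not taken from the package source):
mse <- mean(err^2)
mae <- mean(abs(err))
rsq <- 1 - sum(err^2) / sum((yTest - mean(yTest))^2)

print(c(mse = mse, mae = mae, rsq = rsq))
```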


Mark Seligman at Suiji.


## Not run: 
  # Regression example:
  nRow <- 5000
  x <- data.frame(replicate(6, rnorm(nRow)))
  y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.

  pf <- preformat(x)
  sp <- presample(y)
  rb <- arbTrain(pf, sp, y)

  # Performs separate prediction on new data:
  xx <- data.frame(replicate(6, rnorm(nRow)))
  pred <- predict(rb, xx)
  yPred <- pred$prediction$yPred

  rb <- Rborist(x,y)

  # Performs separate prediction on new data:
  xx <- data.frame(replicate(6, rnorm(nRow)))
  pred <- predict(rb, xx)
  yPred <- pred$prediction$yPred

  # As above, but also records final indices of each tree walk:
  pred <- predict(rb, xx, indexing=TRUE)
  print(pred$prediction$indices[c(1:2), ])

  # As above, but predicts over newdata with unobserved values.
  # In the case of numerical data, only missing values are considered
  # unobserved.  Missing values are encoded as NaN, which is
  # incomparable, precipitating false on every test.  Prediction
  # therefore takes the false branch when encountering missing
  # values:
  xxMissing <- xx
  xxMissing[c(15, 32, 87, 101), 6] <- NA
  pred <- predict(rb, xxMissing)

  # As above, but returns a nonterminal score upon encountering
  # unobserved values. Neither the true nor the false branch from the
  # testing node is taken.  Instead, the score returned is derived
  # from all leaf nodes (terminals) reached by the testing
  # (nonterminal) node.
  pred <- predict(rb, xxMissing, trapUnobserved = TRUE)

  # Performs separate prediction, using original response as test
  # vector:
  pred <- predict(rb, xx, yTest = y)
  mse <- pred$validation$mse
  rsq <- pred$validation$rsq

  # Performs separate prediction with (default) quantiles:
  pred <- predict(rb, xx, quantiles = TRUE)
  qPred <- pred$prediction$qPred

  # Performs separate prediction with deciles:
  pred <- predict(rb, xx, quantVec = seq(0.1, 1.0, by = 0.10))
  qPred <- pred$prediction$qPred

  # Classification examples:
  rb <- Rborist(iris[-5], iris[5])

  # Generic prediction using training set.
  # Census as (default) votes:
  pred <- predict(rb, iris[-5])
  yPred <- pred$prediction$yPred
  census <- pred$prediction$census

  # Using the keyedFrame option allows the columns of
  # newdata to appear in arbitrary order, so long as the
  # columns present during training appear as a subset:
  pred <- predict(rb, iris[c(2, 4, 3, 1)], keyedFrame=TRUE)

  # As above, but validation census to report class probabilities:
  pred <- predict(rb, iris[-5], ctgCensus="prob")
  prob <- pred$prediction$prob

  # As above, but with training response as test vector:
  pred <- predict(rb, iris[-5], yTest = iris[, 5], ctgCensus = "prob")
  prob <- pred$prediction$prob
  conf <- pred$validation$confusion
  misPred <- pred$validation$misprediction

  # As above, but predicts nonterminal when encountering categories
  # not observed during training.  That is, prediction returns a score
  # derived from all terminal nodes (leaves) reached from the
  # (nonterminal) testing node.
  # In this case, "unobserved" refers to categories not present in
  # the subpartition over which a splitting is performed.  As training
  # partitions the data into smaller and smaller regions, a given
  # category becomes less likely to appear in a region.
  # More generally, unobserved data can include missing predictors as
  # well as categories appearing in newdata which were not
  # present during training.
  pred <- predict(rb, iris[-5], trapUnobserved=TRUE)

## End(Not run)
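
The difference between the "votes" and "prob" census options can be sketched in base R. The per-tree predictions in treePred are hypothetical, and the normalization shown is only one plausible reading of "a normalized, probabilistic summary". The "probSample" option is not covered.

```r
classes <- c("setosa", "versicolor", "virginica")

# Hypothetical per-tree class predictions: one row per observation,
# one column per tree.
treePred <- matrix(c("setosa",    "setosa",     "versicolor",
                     "virginica", "versicolor", "versicolor"),
                   nrow = 2, byrow = TRUE)

# "votes"-style census: number of trees predicting each class.
votes <- t(vapply(seq_len(nrow(treePred)), function(i) {
  tabulate(match(treePred[i, ], classes), nbins = length(classes))
}, integer(length(classes))))
colnames(votes) <- classes

# "prob"-style census: votes normalized to sum to one per observation.
prob <- votes / rowSums(votes)

# Predicted class: the plurality winner, first class winning ties.
yPredToy <- classes[max.col(votes, ties.method = "first")]

print(votes)
print(prob)
print(yPredToy)
```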

predict method for rfArb result


Prediction and test using Rborist.


## S3 method for class 'rfArb'
predict(object, newdata, sampler, yTest=NULL,
keyedFrame = FALSE, quantVec=numeric(0), quantiles = length(quantVec) > 0,
ctgCensus = "votes", indexing = FALSE, trapUnobserved = FALSE,
bagging = FALSE, nThread = 0, verbose = FALSE, ...)



an object of class rfArb, created from a previous invocation of the command rfArb or Rborist to train.


a design frame or matrix containing new data, with the same signature of predictors as in the training command.


an object of class Sampler used in the command.


a response vector against which to test the new predictions.


whether the columns of newdata may appear in arbitrary order or as a superset of the predictors used to train.


a vector of quantiles to predict.


whether to predict quantiles.


whether/how to summarize per-category predictions. "votes" specifies the number of trees predicting a given class. "prob" specifies a normalized, probabilistic summary. "probSample" specifies sample-weighted probabilities, similar to quantile histogramming.


whether to record the final node index, typically terminal, of tree traversal.


reports score for nonterminal upon encountering values not observed during training, such as missing data.


whether prediction is restricted to out-of-bag samples.


suggests an OpenMP-style thread count. Zero denotes the default processor setting.


whether to output progress of prediction.


not currently used.


an object of one of two classes:

  • SummaryReg summarizing regression, consisting of:

    • prediction an object of class PredictReg consisting of:

      • yPred the estimated numerical response.

      • qPred quantiles of prediction, if requested.

      • qEst quantile of the estimate, if quantiles requested.

      • indices final index of prediction, if requested.

    • validation if validation requested, an object of class ValidReg consisting of:

      • mse the mean-squared error of the estimate.

      • rsq the r-squared statistic of the estimate.

      • mae the mean absolute error of the estimate.

    • importance if permutation importance requested, an object of class importanceReg, containing multiple instances of:

      • names the predictor names.

      • mse the per-predictor mean-squared error, under permutation.

  • SummaryCtg summarizing classification, consisting of:

    • PredictCtg consisting of:

      • yPred estimated categorical response.

      • census factor-valued matrix of the estimate, by category, if requested.

      • prob matrix of estimate probabilities, by category, if requested.

      • indices final index of prediction, if requested.

    • validation if validation requested, an object of class ValidCtg consisting of:

      • confusion the confusion matrix.

      • misprediction the misprediction rate.

      • oobError the out-of-bag error.

    • importance if permutation importance requested, an object of class importanceCtg, consisting of:

      • mispred the misprediction rate, by predictor.

      • oobErr the out-of-bag error, by predictor.
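
The classification validation items above can be illustrated in base R. This is a sketch using made-up factors yTest and yPred. It computes a single overall misprediction rate, which is an assumption, as the package may report the rate per category.

```r
# Made-up test response and predictions, for illustration only:
lv <- c("a", "b", "c")
yTest <- factor(c("a", "a", "b", "b", "c", "c"), levels = lv)
yPred <- factor(c("a", "b", "b", "b", "c", "a"), levels = lv)

# Confusion matrix: rows are actual classes, columns predicted classes.
confusion <- table(actual = yTest, predicted = yPred)

# Overall misprediction rate: share of off-diagonal counts.
misprediction <- 1 - sum(diag(confusion)) / sum(confusion)

print(confusion)
print(misprediction)
```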


Mark Seligman at Suiji.


## Not run: 
  # Regression example:
  nRow <- 5000
  x <- data.frame(replicate(6, rnorm(nRow)))
  y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.

  pf <- preformat(x)
  sp <- presample(y)
  rb <- rfArb(pf, sp, y)

  # Performs separate prediction on new data:
  xx <- data.frame(replicate(6, rnorm(nRow)))
  pred <- predict(rb, xx)
  yPred <- pred$prediction$yPred

  rb <- Rborist(x,y)

  # Performs separate prediction on new data:
  xx <- data.frame(replicate(6, rnorm(nRow)))
  pred <- predict(rb, xx)
  yPred <- pred$prediction$yPred

  # As above, but also records final indices of each tree walk:
  pred <- predict(rb, xx, indexing=TRUE)
  print(pred$prediction$indices[c(1:2), ])

  # As above, but predicts over newdata with unobserved values.
  # In the case of numerical data, only missing values are considered
  # unobserved.  Missing values are encoded as NaN, which is
  # incomparable, precipitating false on every test.  Prediction
  # therefore takes the false branch when encountering missing
  # values:
  xxMissing <- xx
  xxMissing[c(15, 32, 87, 101), 6] <- NA
  pred <- predict(rb, xxMissing)

  # As above, but returns a nonterminal score upon encountering
  # unobserved values. Neither the true nor the false branch from the
  # testing node is taken.  Instead, the score returned is derived
  # from all leaf nodes (terminals) reached by the testing
  # (nonterminal) node.
  pred <- predict(rb, xxMissing, trapUnobserved = TRUE)

  # Performs separate prediction, using original response as test
  # vector:
  pred <- predict(rb, xx, yTest = y)
  mse <- pred$validation$mse
  rsq <- pred$validation$rsq

  # Performs separate prediction with (default) quantiles:
  pred <- predict(rb, xx, quantiles = TRUE)
  qPred <- pred$prediction$qPred

  # Performs separate prediction with deciles:
  pred <- predict(rb, xx, quantVec = seq(0.1, 1.0, by = 0.10))
  qPred <- pred$prediction$qPred

  # Classification examples:
  rb <- Rborist(iris[-5], iris[5])

  # Generic prediction using training set.
  # Census as (default) votes:
  pred <- predict(rb, iris[-5])
  yPred <- pred$prediction$yPred
  census <- pred$prediction$census

  # Using the keyedFrame option allows the columns of
  # newdata to appear in arbitrary order, so long as the
  # columns present during training appear as a subset:
  pred <- predict(rb, iris[c(2, 4, 3, 1)], keyedFrame=TRUE)

  # As above, but validation census to report class probabilities:
  pred <- predict(rb, iris[-5], ctgCensus="prob")
  prob <- pred$prediction$prob

  # As above, but with training response as test vector:
  pred <- predict(rb, iris[-5], yTest = iris[, 5], ctgCensus = "prob")
  prob <- pred$prediction$prob
  conf <- pred$validation$confusion
  misPred <- pred$validation$misprediction

  # As above, but predicts nonterminal when encountering categories
  # not observed during training.  That is, prediction returns a score
  # derived from all terminal nodes (leaves) reached from the
  # (nonterminal) testing node.
  # In this case, "unobserved" refers to categories not present in
  # the subpartition over which a splitting is performed.  As training
  # partitions the data into smaller and smaller regions, a given
  # category becomes less likely to appear in a region.
  # More generally, unobserved data can include missing predictors as
  # well as categories appearing in newdata which were not
  # present during training.
  pred <- predict(rb, iris[-5], trapUnobserved=TRUE)

## End(Not run)
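
The handling of missing values described in the comments above can be mimicked with a toy single-split tree in base R. The function toyWalk, its cut point and its scores are hypothetical. In particular, the nonterminal score is taken here as the plain mean of the two leaf scores, which is only a stand-in for whatever score the package derives. Note that R itself yields NA when comparing against NaN, so isTRUE() is used to emulate a comparison that simply fails.

```r
# Toy tree with one nonterminal testing x <= cut and two leaves.
toyWalk <- function(x, cut = 0, scoreTrue = -1, scoreFalse = 1,
                    trapUnobserved = FALSE) {
  # Trapping: stop at the nonterminal and report a score derived from
  # the leaves beneath it (here, their unweighted mean).
  if (is.na(x) && trapUnobserved) {
    return(mean(c(scoreTrue, scoreFalse)))
  }
  # isTRUE() maps the NA result of a NaN comparison to FALSE, so a
  # missing value falls through to the false branch.
  if (isTRUE(x <= cut)) scoreTrue else scoreFalse
}

print(toyWalk(-0.5))                       # true branch
print(toyWalk(NaN))                        # false branch
print(toyWalk(NaN, trapUnobserved = TRUE)) # nonterminal score
```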

Preformatting for Training with Warm Starts


Presorts and formats training frame into a form suitable for subsequent training by rfArb caller or rfTrain command. Wraps this form to spare unnecessary recomputation when iteratively retraining, for example, under parameter sweep.


## Default S3 method:
		   nThread = 0,



the design frame expressed as either a data.frame object with numeric and/or factor columns or as a numeric or factor-valued matrix.


number of cores to run in parallel, if available.


indicates whether to output progress of preformatting.




an object of class Deframe consisting of:

  • rleFrame run-length encoded representation of class RLEFrame consisting of:

    • rankedFrame run-length encoded representation of class RankedFrame consisting of:

      • nRow the number of observations encoded.

      • runVal the run-length encoded values.

      • runRow the corresponding row indices.

      • rleHeight the number of encodings, per predictor.

      • topIdx the accumulated end index, per predictor.

    • numRanked packed representation of sorted numerical values of class NumRanked consisting of:

      • numVal distinct numerical values.

      • numHeight value offset per predictor.

    • facRanked packed representation of sorted factor values of class FacRanked consisting of:

      • facVal distinct factor values, zero-based.

      • facHeight value offset per predictor.

  • nRow the number of training observations.

  • signature an object of type Signature consisting of:

    • predForm predictor class names.

    • level per-predictor levels, regardless whether realized.

    • factor per-predictor realized levels.

    • colNames predictor names.

    • rowNames observation names.
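
The idea behind a presorted, run-length encoded predictor can be sketched with base R's rle(). This is only an analogy for fields such as runVal and runRow. The actual packed layout of the RLEFrame object is not reproduced here, and the column xCol is made up.

```r
# A made-up numeric predictor column with repeated values:
xCol <- c(3, 1, 1, 2, 3, 3, 1)

# Presort: row indices in increasing order of value.
rowOrder <- order(xCol)

# Run-length encode the sorted values: each distinct value is stored
# once, together with the length of its run.
enc <- rle(xCol[rowOrder])

print(enc$values)   # distinct values, in sorted order
print(enc$lengths)  # run length of each value
print(rowOrder)     # rows corresponding to the sorted positions

# The encoding is lossless: decoding recovers the sorted column.
decoded <- inverse.rle(enc)
```

Columns with few distinct values compress well under this scheme, which is presumably why a compression threshold such as autoCompress is offered at training time.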


Mark Seligman at Suiji.


## Not run: 
    pt <- preformat(iris[,-5])

    ppTry <- seq(0.2, 0.5, by= 0.3/10)
    nIter <- length(ppTry)
    rsq <- numeric(nIter)
    for (i in seq_len(nIter)) {
      rb <- Rborist(pt, iris[,5], predProb=ppTry[i])
      rsq[i] <- rb$validation$rsq
    }
## End(Not run)

Forest-wide Observation Sampling


Observations sampled for each tree to be trained. In the case of the Random Forest algorithm, this is the bag.


## Default S3 method:
                            samplingWeight = numeric(0),
                            nSamp = 0,
                            nRep = 500,
                            withRepl =  TRUE,
                            nHoldout = 0,
                            nFold = 1,
                            verbose = FALSE,
                            nTree = 0,



A vector to be sampled, typically the response.


Per-observation sampling weights. Default is uniform.


Size of each sample drawn. Defaults to the length of y.


Number of samples to draw. Replaces deprecated nTree.


true iff sampling is with replacement.


Number of observations to omit from sampling. Augmented by unobserved response values.


Number of collections into which to partition the response.


true iff tracing execution.


Number of samples to draw. Deprecated.


not currently used.


an object of class Sampler consisting of:

  • yTrain the sampled vector.

  • nSamp the sample sizes drawn.

  • nRep the number of independent samples.

  • nTree synonymous with nRep. Deprecated.

  • samples a packed data structure encoding the observation index and corresponding sample count.

  • hash a hashed digest of the data items.


Tille, Yves. Sampling algorithms. Springer New York, 2006.


## Not run: 
    y <- runif(1000)

    # Samples with replacement, 500 vectors of length 1000:
    ps <- presample(y)

    # Samples, as above, with 63 observations held out:
    ps <- presample(y, nHoldout = 63)

    # Samples without replacement, 250 vectors of length 500:
    ps2 <- presample(y, nRep=250, nSamp=500, withRepl = FALSE)

## End(Not run)
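
What a single sample (bag) amounts to can be sketched in base R. The variable names below are illustrative and the packed samples structure returned by presample is not reproduced. The sketch only shows sampling with and without replacement, the resulting per-observation sample counts, and the effect of holding observations out.

```r
set.seed(42)
nObs <- 10

# One bag drawn with replacement, of the same size as the data:
idx <- sample.int(nObs, nObs, replace = TRUE)

# Per-observation sample counts; zero marks an out-of-bag observation.
sampleCount <- tabulate(idx, nbins = nObs)
outOfBag <- which(sampleCount == 0)

# One sample of half the size, drawn without replacement:
idxNoRepl <- sample.int(nObs, 5, replace = FALSE)

# Holding out the first two observations: they are never drawn.
holdout <- c(1, 2)
eligible <- setdiff(seq_len(nObs), holdout)
idxHeld <- eligible[sample.int(length(eligible), nObs, replace = TRUE)]

print(sampleCount)
print(outOfBag)
```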

Rapid Decision Tree Construction and Evaluation


Legacy entry for accelerated implementation of the Random Forest (trademarked name) algorithm. Calls the suggested entry, rfArb.


## Default S3 method:



the design matrix expressed as a PreFormat object, as a data.frame object with numeric and/or factor columns or as a numeric matrix.


the response (outcome) vector, either numerical or categorical. Row count must conform with x.


specific to rfArb.


an object of class rfArb, as documented in command of the same name.


Mark Seligman at Suiji.


## Not run: 
  # Regression example:
  nRow <- 5000
  x <- data.frame(replicate(6, rnorm(nRow)))
  y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.

  # Classification example:

  # Generic invocation:
  rb <- Rborist(x, y)

## End(Not run)

NEWS Displayer for Rborist


Displays NEWS associated with Rborist releases.





Rapid Decision Tree Construction and Evaluation


Accelerated implementation of the Random Forest (trademarked name) algorithm. Tuned for multicore and GPU hardware. Bindable with most numerical front-end languages in addition to R. Invocation is similar to that provided by the randomForest package.


## Default S3 method:
                autoCompress = 0.25,              
                ctgCensus = "votes",
                classWeight = numeric(0),
                discardState = FALSE,
                impPermute = 0,
                indexing = FALSE,
                maxLeaf = 0,
                minInfo = 0.01,
                minNode = if (is.factor(y)) 2 else 3,
                nHoldout = 0,
                nLevel = 0,
                nSamp = 0,
                nThread = 0,
                nTree = 500,
                noValidate = FALSE,
                predFixed = 0,
                predProb = 0.0,
                predWeight = numeric(0),
                quantVec = numeric(0),
                quantiles = length(quantVec) > 0,
                regMono = numeric(0),
                rowWeight = numeric(0),
                samplingWeight = numeric(0),
                splitQuant = numeric(0),
                streamline = FALSE,
                thinLeaves = streamline || (is.factor(y) && !indexing),
                trapUnobserved = FALSE,
                treeBlock = 1,
                verbose = FALSE,
                withRepl = TRUE,



the design matrix expressed as a PreFormat object, as a data.frame object with numeric and/or factor columns or as a numeric matrix.


the response (outcome) vector, either numerical or categorical. Row count must conform with x.


plurality above which to compress predictor values.


report categorical validation by vote or by probability.


proportional weighting of classification categories.


minimizes storage by discarding primary training output. Useful for parameter sweeps and cross-validation, in which only validation may be of interest.


number of importance permutations: 0 or 1.


whether to report final index, typically terminal, of validation tree traversal.


maximum number of leaves in a tree. Zero denotes no limit.


information ratio with parent below which node does not split.


minimum number of distinct row references to split a node.


number of observations to omit from sampling. Augmented by missing response values.


maximum number of tree levels to train, including terminals (leaves). Zero denotes no limit.


number of rows to sample, per tree.


suggests an OpenMP-style thread count. Zero denotes the default processor setting.


the number of trees to train.


whether to train without validation.


number of trial predictors for a split (mtry).


probability of selecting individual predictor as trial splitter.


relative weighting of individual predictors as trial splitters.


quantile levels to validate.


whether to report quantiles at validation.


signed probability constraint for monotonic regression.


row weighting for initial sampling of tree. Deprecated.


row weighting for initial sampling of tree.


(sub)quantile at which to place cut point for numerical splits.



whether to streamline sampler contents to save space.


bypasses creation of leaf state in order to reduce storage footprint.


reports score for nonterminal upon encountering values not observed during training, such as missing data.


maximum number of trees to train during a single level (e.g., coprocessor computing).


indicates whether to output progress of training.


whether row sampling is by replacement.


not currently used.


an object of class rfArb, a collection consisting of the following items:

  • sampler an object of class Sampler, as described in the documentation for the presample command, that summarizes the bagging structure.

  • training a list summarizing the training task, consisting of the following fields:

    • call the calling invocation.

    • info a vector of forest-wide Gini (classification) or weighted variance (regression), by predictor.

    • version the version of the Rborist package used to train.

    • diag diagnostics accumulated over the training task.

    • samplerHash hash value of the Sampler object used to train. Recorded for consistency of subsequent commands.

  • prediction an object of class PredictReg or PredictCtg, as described by the documentation for command predict.

  • validation an object of class ValidReg or ValidCtg, as described by the documentation for command validate, if validation is requested.

  • importance an object of class ImportanceReg or ImportanceCtg, as described by the documentation for command predict, if permutation importance has been requested.


Mark Seligman at Suiji.


Breiman, L. (2001) Random Forests, Machine Learning 45(1), 5-32.


## Not run: 
  # Regression example:
  nRow <- 5000
  x <- data.frame(replicate(6, rnorm(nRow)))
  y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.

  # Classification example:

  # Generic invocation:
  rb <- rfArb(x, y)

  # Causes 300 trees to be trained:
  rb <- rfArb(x, y, nTree = 300)

  # Causes rows to be sampled without replacement:
  rb <- rfArb(x, y, withRepl=FALSE)

  # Causes validation census to report class probabilities:
  rb <- rfArb(iris[-5], iris[5], ctgCensus="prob")

  # Applies table-weighting to classification categories:
  rb <- rfArb(iris[-5], iris[5], classWeight = "balance")

  # Weights first category twice as heavily as remaining two:
  rb <- rfArb(iris[-5], iris[5], classWeight = c(2.0, 1.0, 1.0))

  # Does not split nodes when doing so yields less than a 2% gain in
  # information over the parent node:
  rb <- rfArb(x, y, minInfo=0.02)

  # Does not split nodes representing fewer than 10 unique samples:
  rb <- rfArb(x, y, minNode=10)

  # Trains a maximum of 20 levels:
  rb <- rfArb(x, y, nLevel = 20)

  # Trains, but does not perform subsequent validation:
  rb <- rfArb(x, y, noValidate=TRUE)

  # Chooses 500 rows (with replacement) to root each tree.
  rb <- rfArb(x, y, nSamp=500)

  # Chooses 2 predictors as splitting candidates at each node (or
  # fewer, when choices exhausted):
  rb <- rfArb(x, y, predFixed = 2)  

  # Causes each predictor to be selected as a splitting candidate with
  # distribution Bernoulli(0.3):
  rb <- rfArb(x, y, predProb = 0.3) 

  # Causes first three predictors to be selected as splitting candidates
  # twice as often as the other three:
  rb <- rfArb(x, y, predWeight=c(2.0, 2.0, 2.0, 1.0, 1.0, 1.0))

  # Causes (default) quantiles to be computed at validation:
  rb <- rfArb(x, y, quantiles=TRUE)
  qPred <- rb$validation$qPred

  # Causes specified quantiles (deciles) to be computed at validation:
  rb <- rfArb(x, y, quantVec = seq(0.1, 1.0, by = 0.10))
  qPred <- rb$validation$qPred

  # Constrains modelled response to be increasing with respect to X1
  # and decreasing with respect to X5.
  rb <- rfArb(x, y, regMono=c(1.0, 0, 0, 0, -1.0, 0))

  # Causes rows to be sampled with random weighting:
  rb <- rfArb(x, y, samplingWeight=runif(nRow))

  # Suppresses creation of detailed leaf information needed for
  # quantile prediction and external tools.
  rb <- rfArb(x, y, thinLeaves = TRUE)

  # Directs prediction to take a random branch on encountering
  # values not observed during training, such as NA or an
  # unrecognized category.

  predict(rb, trapUnobserved = FALSE)

  # Directs prediction to silently trap unobserved values, reporting a
  # score associated with the current nonterminal tree node.

  predict(rb, trapUnobserved = TRUE)

  # Sets splitting position for the first predictor to far left and the
  # second predictor to far right, others to default (median) position.

  spq <- rep(0.5, ncol(x))
  spq[1] <- 0.0
  spq[2] <- 1.0
  rb <- rfArb(x, y, splitQuant = spq)
## End(Not run)
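
The three ways of choosing splitting candidates illustrated above (predFixed, predProb and predWeight) can be contrasted in base R. This is a conceptual sketch with illustrative variable names. It makes no claim about how the package draws candidates internally, nor about how the options interact when combined.

```r
set.seed(7)
nPred <- 6

# predFixed-style: a fixed number of distinct candidates per split.
candFixed <- sample.int(nPred, 2)

# predProb-style: each predictor is included independently with
# probability 0.3, so the number of candidates varies and may be zero.
candProb <- which(runif(nPred) < 0.3)

# predWeight-style: the first three predictors are favored two to one.
predWeight <- c(2, 2, 2, 1, 1, 1)
candWeighted <- sample.int(nPred, 2, prob = predWeight)

print(candFixed)
print(candProb)
print(candWeighted)
```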

Rapid Decision Tree Training


Accelerated training using the Random Forest (trademarked name) algorithm. Tuned for multicore and GPU hardware. Bindable with most numerical front-end languages in addition to R.


## Default S3 method:
                autoCompress = 0.25,
                ctgCensus = "votes",
                classWeight = numeric(0),
                maxLeaf = 0,
                minInfo = 0.01,
                minNode = if (is.factor(y)) 2 else 3,
                nLevel = 0,
                nThread = 0,
                predFixed = 0,
                predProb = 0.0,
                predWeight = numeric(0),
                regMono = numeric(0),
                splitQuant = numeric(0),
                thinLeaves = FALSE,
                treeBlock = 1,
                verbose = FALSE,



the response (outcome) vector, either numerical or categorical.


Compressed, presorted representation of the predictor values. Row count must conform with y.


Compressed representation of the sampled response.


plurality above which to compress predictor values.


report categorical validation by vote or by probability.


proportional weighting of classification categories.


maximum number of leaves in a tree. Zero denotes no limit.


information ratio with parent below which node does not split.


minimum number of distinct row references to split a node.


maximum number of tree levels to train, including terminals (leaves). Zero denotes no limit.


suggests an OpenMP-style thread count. Zero denotes the default processor setting.


number of trial predictors for a split (mtry).


probability of selecting individual predictor as trial splitter.


relative weighting of individual predictors as trial splitters.


signed probability constraint for monotonic regression.


(sub)quantile at which to place cut point for numerical splits.



bypasses creation of leaf state in order to reduce memory footprint.


maximum number of trees to train during a single level (e.g., coprocessor computing).


indicates whether to output progress of training.


Not currently used.


an object of class arbTrain, containing:

  • version the version of the Rborist package used to train.

  • samplerHash hash value of the Sampler object used to train. Recorded for consistency of subsequent commands.

  • predInfo a vector of forest-wide Gini (classification) or weighted variance (regression), by predictor.

  • forest an object of class Forest containing:

    • nTree the number of trees trained.

    • node an object of class Node consisting of:

      • treeNode forest-wide vector of packed node representations.

      • extent per-tree node counts.

      • scores numeric vector of scores, for all terminals and nonterminals.

      • factor an object of class Factor consisting of:

        • facSplit forest-wide vector of packed factor bits.

        • extent per-tree extent of factor bits.

        • observed forest-wide vector of observed factor bits.

    • Leaf an object of class Leaf containing:

      • extent forest-wide vector of leaf populations, i.e., counts of unique samples.

      • index forest-wide vector of sample indices.

  • diag diagnostics accumulated over the training task.


Mark Seligman at Suiji.


## Not run: 
  # Regression example:
  nRow <- 5000
  x <- data.frame(replicate(6, rnorm(nRow)))
  y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.

  # Classification example:

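  # Preformats the predictors and samples the response for training:
  preFormat <- preformat(x)
  sampler <- presample(y)

  # Generic invocation:
  rt <- rfTrain(y, preFormat, sampler)

  # Causes 300 trees to be trained, one per sample drawn:
  rt <- rfTrain(y, preFormat, presample(y, nRep = 300))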

  # Causes validation census to report class probabilities:
  rt <- rfTrain(iris[-5], iris[, 5], ctgCensus="prob")

  # Applies table-weighting to classification categories:
  rt <- rfTrain(iris[-5], iris[, 5], classWeight = "balance")

  # Weights first category twice as heavily as remaining two:
  rt <- rfTrain(iris[-5], iris[, 5], classWeight = c(2.0, 1.0, 1.0))

  # Does not split nodes when doing so yields less than a 2% gain in
  # information over the parent node:
  rt <- rfTrain(y, preFormat, sampler, minInfo=0.02)

  # Does not split nodes representing fewer than 10 unique samples:
  rt <- rfTrain(y, preFormat, sampler, minNode=10)

  # Trains a maximum of 20 levels:
  rt <- rfTrain(y, preFormat, sampler, nLevel = 20)

  # Trains, but does not perform subsequent validation:
  rt <- rfTrain(y, preFormat, sampler, noValidate=TRUE)

  # Chooses 500 rows (with replacement) to root each tree.
  rt <- rfTrain(y, preFormat, sampler, nSamp=500)

  # Chooses 2 predictors as splitting candidates at each node (or
  # fewer, when choices exhausted):
  rt <- rfTrain(y, preFormat, sampler, predFixed = 2)  

  # Causes each predictor to be selected as a splitting candidate with
  # distribution Bernoulli(0.3):
  rt <- rfTrain(y, preFormat, sampler, predProb = 0.3) 

  # Causes first three predictors to be selected as splitting candidates
  # twice as often as the other three:
  rt <- rfTrain(y, preFormat, sampler, predWeight=c(2.0, 2.0, 2.0, 1.0, 1.0, 1.0))

  # Constrains modelled response to be increasing with respect to X1
  # and decreasing with respect to X5.
  rt <- rfTrain(x, y, preFormat, sampler, regMono=c(1.0, 0, 0, 0, -1.0, 0))

  # Suppresses creation of detailed leaf information needed for
  # quantile prediction and external tools.
  rt <- rfTrain(y, preFormat, sampler, thinLeaves = TRUE)

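  # Places the cut point for the first predictor at quantile 0.0, for
  # the second at quantile 1.0 and for the remainder at 0.5:
  spq <- rep(0.5, ncol(x))
  spq[1] <- 0.0
  spq[2] <- 1.0
  rt <- rfTrain(y, preFormat, sampler, splitQuant = spq)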
## End(Not run)
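
Per-predictor arguments such as splitQuant are ordinary R vectors, which are indexed from 1. The following base-R sketch shows that an assignment to position 0 is accepted silently but changes nothing, so the first predictor must be addressed at position 1. The predictor count is hypothetical.

```r
nPred <- 6                    # hypothetical predictor count
spq <- rep(0.5, nPred)

# Position 0 does not exist: the assignment is a silent no-op.
spq[0] <- 0.0
unchanged <- all(spq == 0.5)

# Positions 1 and 2 address the first and second predictors.
spq[1] <- 0.0
spq[2] <- 1.0
```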

Reducing Memory Footprint of Trained Decision Forest


Clears fields deemed no longer useful.


## S3 method for class 'rfArb'



Trained forest object of class rfArb.


an object of class rfArb with sample data cleared.


Mark Seligman at Suiji.


## Not run: 
    ## Trains.
    rs <- Rborist(x, y)
    ## Replaces trained object with streamlined copy.
    rs <- Streamline(rs)
## End(Not run)

Separate Validation of Trained Decision Forest


Permits trained decision forest to be validated separately from training.


## Default S3 method:
validate(train, preFormat, sampler = NULL, ctgCensus = "votes",
         impPermute = 0, quantVec = numeric(0),
         quantiles = length(quantVec) > 0, indexing = FALSE,
         trapUnobserved = FALSE, nThread = 0, verbose = FALSE, ...)



an object of class Rborist obtained from previous training.


summarizes the response and its per-tree sampling.


internal representation of the design matrix, of class PreFormat.


report categorical validation by vote or by probability.


specifies the number of importance permutations: 0 or 1.


quantile levels to validate.


whether to report quantiles at validation.


whether to report final index, typically terminal, of tree traversal.


indicates whether to return a nonterminal for values unobserved during training, such as missing data.


suggests an OpenMP-style thread count. Zero denotes the default processor setting.


indicates whether to output progress of validation.


not currently used.


either of two pairs of objects:

  • SummaryReg summarizing regression, as documented with the command predict.arbTrain.

  • validation an object of class ValidReg consisting of:

    • mse the mean-square error of the estimate.

    • rsq the r-squared statistic of the estimate.

    • mae the mean absolute error of the estimate.

  • SummaryCtg summarizing classification, as documented with the command predict.arbTrain.

  • validation an object of class ValidCtg consisting of:

    • confusion the confusion matrix.

    • misprediction the misprediction rate.

    • oobError the out-of-bag error.
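
The validation statistics listed above can be illustrated with their conventional textbook definitions. The sketch below uses base R and hypothetical data. Whether the package computes each field with exactly these formulas is an assumption, in particular the reading of misprediction as a per-class rate, and the predictions here are simply given rather than obtained out-of-bag.

```r
# Regression: hypothetical responses and predictions.
y     <- c(3.0, -0.5, 2.0, 7.0)
yPred <- c(2.5,  0.0, 2.0, 8.0)

mse <- mean((y - yPred)^2)                                 # mean-square error
mae <- mean(abs(y - yPred))                                # mean absolute error
rsq <- 1 - sum((y - yPred)^2) / sum((y - mean(y))^2)       # r-squared

# Classification: hypothetical true and predicted categories.
lv    <- c("a", "b", "c")
truth <- factor(c("a", "b", "a", "c", "b", "c"), levels = lv)
pred  <- factor(c("a", "b", "c", "c", "a", "c"), levels = lv)

confusion     <- table(truth, pred)                        # rows: truth
misprediction <- 1 - diag(confusion) / rowSums(confusion)  # per-class rate
errorRate     <- mean(truth != pred)                       # overall rate
```

For these data the mean-square error is 0.375, the mean absolute error is 0.5, and two of the six categorical predictions are wrong.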


Mark Seligman at Suiji.


## Not run: 
    ## Trains without validation.
    rb <- Rborist(x, y, noValidate=TRUE)
    ## Delayed validation using a preformatted object.
    pf <- preformat(x)
    v <- validate(rb, pf)
## End(Not run)