Title:  Multiple Imputation using Chained Random Forests 

Description:  An R package for multiple imputation using chained random forests. Implemented methods can handle missing data in mixed types of variables by using prediction-based or node-based conditional distributions constructed using random forests. For prediction-based imputation, two methods are provided for imputing continuous variables: one based on the empirical distribution of out-of-bag prediction errors of random forests, and one based on a normality assumption for those prediction errors. A method based on predicted probabilities is provided for imputing categorical variables. For node-based imputation, the package provides a method based on the conditional distribution formed by the predicting nodes of random forests and a method based on proximity measures of random forests. More details of the statistical methods can be found in Hong et al. (2020) <arXiv:2004.14823>.
Authors:  Shangzhi Hong [aut, cre], Henry S. Lynn [ths] 
Maintainer:  Shangzhi Hong <shangzhihong@hotmail.com> 
License:  GPL-3
Version:  2.1.8
Built:  2024-02-22 12:42:45 UTC
Source:  CRAN 
Convert variables to factors
conv.factor(data, convNames = NULL, exceptNames = NULL, uniqueNum = 5)
data 
Input data frame. 
convNames 
Names of variables to convert. The default is NULL.
exceptNames 
Names of variables to be excluded from conversion. The default is NULL.
uniqueNum 
Variables with at most this number of unique values will be converted to factors. The default is 5.
A data frame of converted variables.
nhanes.fix <- conv.factor(data = nhanes, convNames = c("age", "hyp"))
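The uniqueNum rule can be pictured with a short base R sketch. This is an illustration only and not the package's implementation: the helper name convert_by_unique is hypothetical, and ignoring missing values when counting unique values is an assumption.

```r
# Hypothetical helper (not part of RfEmpImp): convert columns that have
# at most `uniqueNum` distinct non-missing values into factors.
convert_by_unique <- function(data, uniqueNum = 5) {
  for (nm in names(data)) {
    # Count distinct values, ignoring NA (an assumption of this sketch)
    n.unique <- length(unique(stats::na.omit(data[[nm]])))
    if (n.unique <= uniqueNum) {
      data[[nm]] <- as.factor(data[[nm]])
    }
  }
  data
}

df <- data.frame(
  grp = c(1, 2, 1, 2, 3, 3),
  val = c(0.1, 0.5, 0.9, 1.3, 1.7, 2.1)
)
out <- convert_by_unique(df, uniqueNum = 5)
sapply(out, class)
```

Here grp has three distinct values and becomes a factor, while val has six and stays numeric.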
Generate missing (completely at random) cells in a data set
gen.mcar(df, prop.na = 0.2, warn.empty.row = TRUE, ...)
df 
Input data frame or matrix. 
prop.na 
Proportion of generated missing cells. The default is 0.2.
warn.empty.row 
Show a warning if empty rows were present in the output data set. 
... 
Other parameters (will be ignored). 
A data frame or matrix containing generated missing cells.
Shangzhi Hong
data("mtcars")
mtcars.mcar <- gen.mcar(mtcars, warn.empty.row = FALSE)
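For readers who want to see what "missing completely at random" means at the cell level, the following base R sketch sets a fixed proportion of cells to NA with every cell equally likely to be chosen. The helper make_mcar is hypothetical and is not how gen.mcar is necessarily implemented.

```r
# Hypothetical helper (not part of RfEmpImp): blank out a proportion of
# cells, chosen uniformly at random over the whole table.
make_mcar <- function(df, prop.na = 0.2) {
  n.cells <- nrow(df) * ncol(df)
  n.miss <- round(n.cells * prop.na)
  # Sample cell positions in column-major order
  idx <- sample.int(n.cells, n.miss)
  rows <- (idx - 1) %% nrow(df) + 1
  cols <- (idx - 1) %/% nrow(df) + 1
  for (k in seq_along(idx)) {
    df[rows[k], cols[k]] <- NA
  }
  df
}

set.seed(1)
mtcars.na <- make_mcar(mtcars, prop.na = 0.2)
# Observed proportion of missing cells
mean(is.na(mtcars.na))
```

Because cells are drawn without regard to rows, some rows can end up entirely empty, which is the situation the warn.empty.row argument refers to.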
The RfEmp multiple imputation method is for mixed types of variables, and calls corresponding functions based on variable types. Categorical variables should be of type factor or logical, etc.
RfPred.Emp is used for continuous variables, and RfPred.Cate is used for categorical variables.
imp.rfemp(
data,
num.imp = 5,
max.iter = 5,
num.trees = 10,
alpha.emp = 0,
sym.dist = TRUE,
pre.boot = TRUE,
num.trees.cont = NULL,
num.trees.cate = NULL,
num.threads = NULL,
print.flag = FALSE,
...
)
data 
A data frame or a matrix containing the incomplete data. Missing values should be coded as NA.
num.imp 
Number of multiple imputations. The default is 5.
max.iter 
Number of iterations. The default is 5.
num.trees 
Number of trees to build. The default is 10.
alpha.emp 
The "significance level" for the empirical distribution of out-of-bag prediction errors; it can be used to guard against outliers (helpful for highly skewed variables). For example, set alpha.emp = 0.05 to use the 95% confidence level. The default is 0.
sym.dist 
If 
pre.boot 
If 
num.trees.cont 
Number of trees to build for continuous variables. The default is NULL.
num.trees.cate 
Number of trees to build for categorical variables. The default is NULL.
num.threads 
Number of threads for parallel computing. The default is NULL.
print.flag 
If 
... 
Other arguments to pass down. 
For continuous variables, mice.impute.rfpred.emp is called, performing imputation based on the empirical distribution of out-of-bag prediction errors of random forests.
For categorical variables, mice.impute.rfpred.cate is called, performing imputation based on predicted probabilities.
An object of S3 class mids.
Shangzhi Hong
Hong, Shangzhi, et al. "Multiple imputation using chained random forests." Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
Zhang, Haozhe, et al. "Random Forest Prediction Intervals." The American Statistician (2019): 1-20.
Shah, Anoop D., et al. "Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study." American journal of epidemiology 179.6 (2014): 764-774.
Malley, James D., et al. "Probability machines." Methods of information in medicine 51.01 (2012): 74-81.
# Prepare data: convert categorical variables to factors
nhanes.fix <- nhanes
nhanes.fix[, c("age", "hyp")] <- lapply(nhanes[, c("age", "hyp")], as.factor)
# Perform imputation using imp.rfemp
imp <- imp.rfemp(nhanes.fix)
# Do repeated analyses
anl <- with(imp, lm(chl ~ bmi + hyp))
# Pool the results
pool <- pool(anl)
# Get pooled estimates
reg.ests(pool)
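The idea behind imputing a continuous variable from the empirical distribution of out-of-bag prediction errors can be sketched for a single variable as follows. This is a simplified illustration using the ranger package directly, not the package's sampler: it skips the chained iterations, the bootstrap step and the alpha.emp and sym.dist options, and it uses 100 trees (more than the default of 10) so that every training row is out-of-bag at least once.

```r
library(ranger)

set.seed(2020)
dat <- airquality[, c("Ozone", "Wind", "Temp")]  # only Ozone has missing values
obs <- !is.na(dat$Ozone)

# Fit a regression forest on the rows where Ozone is observed
rf <- ranger(Ozone ~ Wind + Temp, data = dat[obs, ], num.trees = 100)

# For training data, rf$predictions holds the out-of-bag predictions,
# so these are out-of-bag prediction errors
oob.err <- dat$Ozone[obs] - rf$predictions
oob.err <- oob.err[!is.na(oob.err)]

# Predict the missing rows, then add errors resampled from the
# empirical distribution of out-of-bag errors
pred.mis <- predict(rf, data = dat[!obs, c("Wind", "Temp")])$predictions
imputed <- pred.mis + sample(oob.err, size = length(pred.mis), replace = TRUE)

length(imputed)
```

Resampling errors rather than imputing the bare predictions is what restores variability to the imputed values.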
The RfNode.Cond multiple imputation method is for mixed types of variables, using the conditional distribution formed by the predicting nodes of random forests (out-of-bag observations will be excluded).
imp.rfnode.cond(
data,
num.imp = 5,
max.iter = 5,
num.trees = 10,
pre.boot = TRUE,
print.flag = FALSE,
...
)
data 
A data frame or a matrix containing the incomplete data. Missing values should be coded as NA.
num.imp 
Number of multiple imputations. The default is 5.
max.iter 
Number of iterations. The default is 5.
num.trees 
Number of trees to build. The default is 10.
pre.boot 
If 
print.flag 
If 
... 
Other arguments to pass down. 
During imputation using imp.rfnode.cond, for missing observations, the candidate non-missing observations will be found by the predicting nodes of random trees in the random forest model. Only the in-bag observations for each random tree will be used for imputation.
An object of S3 class mids.
Shangzhi Hong
Hong, Shangzhi, et al. "Multiple imputation using chained random forests." Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
Zhang, Haozhe, et al. "Random Forest Prediction Intervals." The American Statistician (2019): 1-20.
Shah, Anoop D., et al. "Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study." American journal of epidemiology 179.6 (2014): 764-774.
Malley, James D., et al. "Probability machines." Methods of information in medicine 51.01 (2012): 74-81.
# Prepare data: convert categorical variables to factors
nhanes.fix <- nhanes
nhanes.fix[, c("age", "hyp")] <- lapply(nhanes[, c("age", "hyp")], as.factor)
# Perform imputation using imp.rfnode.cond
imp <- imp.rfnode.cond(nhanes.fix)
# Do repeated analyses
anl <- with(imp, lm(chl ~ bmi + hyp))
# Pool the results
pool <- pool(anl)
# Get pooled estimates
reg.ests(pool)
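The node-based conditional distribution can be illustrated for one variable with ranger's terminal-node predictions. The sketch below is a simplified reading of the description above, not the package's code: for each missing row it picks one tree at random, finds the terminal node the row falls into, and draws a donor value from the in-bag observed rows in that node. Drawing donors with equal probability is an assumption of this sketch.

```r
library(ranger)

set.seed(1)
dat <- airquality[, c("Ozone", "Wind", "Temp")]
obs <- !is.na(dat$Ozone)
train <- dat[obs, ]
x.mis <- dat[!obs, c("Wind", "Temp")]

# keep.inbag = TRUE stores how often each row was sampled for each tree
rf <- ranger(Ozone ~ Wind + Temp, data = train,
             num.trees = 10, keep.inbag = TRUE)

# Terminal node IDs: one row per observation, one column per tree
nodes.obs <- predict(rf, data = train[, c("Wind", "Temp")],
                     type = "terminalNodes")$predictions
nodes.mis <- predict(rf, data = x.mis, type = "terminalNodes")$predictions

impute_one <- function(i) {
  tree <- sample.int(rf$num.trees, 1)
  in.bag <- rf$inbag.counts[[tree]] > 0
  # Donors: in-bag training rows sharing the terminal node in this tree
  donors <- which(in.bag & nodes.obs[, tree] == nodes.mis[i, tree])
  as.numeric(train$Ozone[donors[sample.int(length(donors), 1)]])
}
imputed <- vapply(seq_len(nrow(x.mis)), impute_one, numeric(1))
```

Every imputed value is an observed Ozone value, so imputations stay within the range of the data.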
The RfNodeProx multiple imputation method is for mixed types of variables, using conditional distributions formed by proximity measures of random forests (both in-bag and out-of-bag observations will be used for imputation).
imp.rfnode.prox(
data,
num.imp = 5,
max.iter = 5,
num.trees = 10,
pre.boot = TRUE,
print.flag = FALSE,
...
)
data 
A data frame or a matrix containing the incomplete data. Missing values should be coded as NA.
num.imp 
Number of multiple imputations. The default is 5.
max.iter 
Number of iterations. The default is 5.
num.trees 
Number of trees to build. The default is 10.
pre.boot 
If 
print.flag 
If 
... 
Other arguments to pass down. 
During imputation using imp.rfnode.prox, for missing observations, the candidate non-missing observations will be found by whether two observations can be retrieved from the same predicting node during prediction. The observations used for imputation may not necessarily be contained in the terminal node of the random forest model.
An object of S3 class mids.
Shangzhi Hong
Hong, Shangzhi, et al. "Multiple imputation using chained random forests." Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
Zhang, Haozhe, et al. "Random Forest Prediction Intervals." The American Statistician (2019): 1-20.
Shah, Anoop D., et al. "Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study." American journal of epidemiology 179.6 (2014): 764-774.
Malley, James D., et al. "Probability machines." Methods of information in medicine 51.01 (2012): 74-81.
# Prepare data: convert categorical variables to factors
nhanes.fix <- nhanes
nhanes.fix[, c("age", "hyp")] <- lapply(nhanes[, c("age", "hyp")], as.factor)
# Perform imputation using imp.rfnode.prox
imp <- imp.rfnode.prox(nhanes.fix)
# Do repeated analyses
anl <- with(imp, lm(chl ~ bmi + hyp))
# Pool the results
pool <- pool(anl)
# Get pooled estimates
reg.ests(pool)
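A proximity-based draw can be sketched in a similar way. Here the proximity between a missing row and an observed row is taken to be the share of trees in which both land in the same terminal node, and a donor is sampled with probability proportional to that proximity. This weighting is one plausible reading of the description above and is an assumption of the sketch, not a statement of how imp.rfnode.prox is implemented.

```r
library(ranger)

set.seed(1)
dat <- airquality[, c("Ozone", "Wind", "Temp")]
obs <- !is.na(dat$Ozone)
train <- dat[obs, ]
x.mis <- dat[!obs, c("Wind", "Temp")]

rf <- ranger(Ozone ~ Wind + Temp, data = train, num.trees = 10)

# Terminal node IDs: one row per observation, one column per tree
nodes.obs <- predict(rf, data = train[, c("Wind", "Temp")],
                     type = "terminalNodes")$predictions
nodes.mis <- predict(rf, data = x.mis, type = "terminalNodes")$predictions

impute_one <- function(i) {
  # Share of trees in which each training row shares a node with row i;
  # in-bag and out-of-bag training rows are treated alike
  prox <- colMeans(t(nodes.obs) == nodes.mis[i, ])
  donor <- sample.int(nrow(train), 1, prob = prox)
  as.numeric(train$Ozone[donor])
}
imputed <- vapply(seq_len(nrow(x.mis)), impute_one, numeric(1))
```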
Please note that functions with names starting with "mice.impute" are exported to be visible for the mice sampler functions. Please do not call these functions directly unless you know exactly what you are doing.
RfEmpImp multiple imputation method, adapter for mice samplers. These functions can be called by the mice sampler function. In the mice() function, set method = "rfemp" to use the RfEmp method.
mice.impute.rfemp is for mixed types of variables, and it calls corresponding functions according to variable types. Categorical variables should be of type factor or logical, etc.
For continuous variables, mice.impute.rfpred.emp is called, performing imputation based on the empirical distribution of out-of-bag prediction errors of random forests.
For categorical variables, mice.impute.rfpred.cate is called, performing imputation based on predicted probabilities.
mice.impute.rfemp(
y,
ry,
x,
wy = NULL,
num.trees = 10,
alpha.emp = 0,
sym.dist = TRUE,
pre.boot = TRUE,
num.trees.cont = NULL,
num.trees.cate = NULL,
...
)
y 
Vector to be imputed. 
ry 
Logical vector of length 
x 
Numeric design matrix with 
wy 
Logical vector of length 
num.trees 
Number of trees to build. The default is 10.
alpha.emp 
The "significance level" for the empirical distribution of prediction errors; it can be used to guard against outliers (useful for highly skewed variables). For example, set alpha.emp = 0.05 to use the 95% confidence level for the empirical distribution of prediction errors. The default is 0.
sym.dist 
If 
pre.boot 
Perform bootstrap prior to imputation to get 'proper'
multiple imputation, i.e. accommodating sampling variation in estimating
population regression parameters (see Shah et al. 2014).
It should be noted that if 
num.trees.cont 
Number of trees to build for continuous variables. The default is NULL.
num.trees.cate 
Number of trees to build for categorical variables. The default is NULL.
... 
Other arguments to pass down. 
RfEmpImp imputation sampler: mice.impute.rfemp calls mice.impute.rfpred.emp if is.numeric is TRUE for the variable; otherwise it calls mice.impute.rfpred.cate.
Vector with imputed data, same type as y, and of length sum(wy).
Shangzhi Hong
Hong, Shangzhi, et al. "Multiple imputation using chained random forests." Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
Zhang, Haozhe, et al. "Random Forest Prediction Intervals." The American Statistician (2019): 1-20.
Shah, Anoop D., et al. "Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study." American journal of epidemiology 179.6 (2014): 764-774.
Malley, James D., et al. "Probability machines." Methods of information in medicine 51.01 (2012): 74-81.
# Prepare data: convert categorical variables to factors
nhanes.fix <- conv.factor(nhanes, c("age", "hyp"))
# This function is exported to be visible to the mice sampler functions, and
# users can set method = "rfemp" in call to mice to use this function.
# Users are recommended to use the imp.rfemp function instead:
impObj <- mice(nhanes.fix, method = "rfemp", m = 5,
  maxit = 5, maxcor = 1.0, eps = 0,
  remove.collinear = FALSE, remove.constant = FALSE,
  printFlag = FALSE
)
Please note that functions with names starting with "mice.impute" are exported to be visible for the mice sampler functions. Please do not call these functions directly unless you know exactly what you are doing.
RfNode imputation methods, adapter for mice samplers. These functions can be called by the mice sampler functions.
mice.impute.rfnode.cond is for imputation using the conditional distribution formed by the predicting nodes of random forests. To use this function, set method = "rfnode.cond" in the mice function.
mice.impute.rfnode.prox is for imputation based on proximity measures from random forests, and provides functionality similar to mice.impute.rf. To use this function, set method = "rfnode.prox" in the mice function.
mice.impute.rfnode is the main function for performing imputation, and both mice.impute.rfnode.cond and mice.impute.rfnode.prox call this function. By default, mice.impute.rfnode works like mice.impute.rfnode.cond.
mice.impute.rfnode(
y,
ry,
x,
wy = NULL,
num.trees.node = 10,
pre.boot = TRUE,
use.node.cond.dist = TRUE,
obs.eq.prob = FALSE,
do.sample = TRUE,
num.threads = NULL,
...
)
mice.impute.rfnode.cond(
y,
ry,
x,
wy = NULL,
num.trees = 10,
pre.boot = TRUE,
obs.eq.prob = FALSE,
...
)
mice.impute.rfnode.prox(
y,
ry,
x,
wy = NULL,
num.trees = 10,
pre.boot = TRUE,
obs.eq.prob = FALSE,
...
)
y 
Vector to be imputed. 
ry 
Logical vector of length 
x 
Numeric design matrix with 
wy 
Logical vector of length 
num.trees.node 
Number of trees to build. The default is 10.
pre.boot 
Perform bootstrap prior to imputation to get 'proper' imputation, i.e. accommodating sampling variation in estimating population regression parameters (see Shah et al. 2014). 
use.node.cond.dist 
If 
obs.eq.prob 
If 
do.sample 
If 
num.threads 
Number of threads for parallel computing. The default is NULL.
... 
Other arguments to pass down. 
num.trees 
Number of trees to build. The default is 10.
Advanced users can get more flexibility from the mice.impute.rfnode function, as it provides more options than mice.impute.rfnode.cond or mice.impute.rfnode.prox.
Vector with imputed data, same type as y, and of length sum(wy).
Shangzhi Hong
Hong, Shangzhi, et al. "Multiple imputation using chained random forests." Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
Doove, Lisa L., Stef Van Buuren, and Elise Dusseldorp. "Recursive partitioning for missing data imputation in the presence of interaction effects." Computational Statistics & Data Analysis 72 (2014): 92-104.
# Prepare data: convert categorical variables to factors
nhanes.fix <- conv.factor(nhanes, c("age", "hyp"))
# Using "rfnode.cond" or "rfnode"
impRfNodeCond <- mice(nhanes.fix, method = "rfnode.cond", m = 5,
  maxit = 5, maxcor = 1.0, eps = 0, printFlag = FALSE)
# Using "rfnode.prox"
impRfNodeProx <- mice(nhanes.fix, method = "rfnode.prox", m = 5,
  maxit = 5, maxcor = 1.0, eps = 0,
  remove.collinear = FALSE, remove.constant = FALSE,
  printFlag = FALSE)
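The pre.boot argument described above refers to drawing a bootstrap sample of the observed rows before the model is fitted, so that each imputation reflects sampling variation in the fitted model. A minimal base R sketch of that resampling step, assuming a plain nonparametric bootstrap over rows, looks like this:

```r
set.seed(1)
# Rows with the target variable observed
dat <- airquality[!is.na(airquality$Ozone), c("Ozone", "Wind", "Temp")]

# Bootstrap: resample the observed rows with replacement
boot.idx <- sample.int(nrow(dat), replace = TRUE)
boot.dat <- dat[boot.idx, ]

# The imputation model would then be fitted to boot.dat instead of dat
nrow(boot.dat)
```

Each of the m imputations would use a different bootstrap sample, which is what makes the fitted models, and therefore the imputations, differ between completed data sets.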
Please note that functions with names starting with "mice.impute" are exported to be visible for the mice sampler functions. Please do not call these functions directly unless you know exactly what you are doing.
For categorical variables only.
Part of project RfEmpImp, the function mice.impute.rfpred.cate is for categorical variables, performing imputation based on predicted probabilities for the categories.
mice.impute.rfpred.cate(
y,
ry,
x,
wy = NULL,
num.trees.cate = 10,
use.pred.prob.cate = TRUE,
forest.vote.cate = FALSE,
pre.boot = TRUE,
num.threads = NULL,
...
)
y 
Vector to be imputed. 
ry 
Logical vector of length 
x 
Numeric design matrix with 
wy 
Logical vector of length 
num.trees.cate 
Number of trees to build for categorical variables. The default is 10.
use.pred.prob.cate 
Logical, 
forest.vote.cate 
Logical, 
pre.boot 
Perform bootstrap prior to imputation to get 'proper'
multiple imputation, i.e. accommodating sampling variation in estimating
population regression parameters (see Shah et al. 2014).
It should be noted that if 
num.threads 
Number of threads for parallel computing. The default is NULL.
... 
Other arguments to pass down. 
RfEmpImp imputation sampler for categorical variables, based on predicted probabilities.
Vector with imputed data, same type as y, and of length sum(wy).
Shangzhi Hong
Hong, Shangzhi, et al. "Multiple imputation using chained random forests." Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
Shah, Anoop D., et al. "Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study." American journal of epidemiology 179.6 (2014): 764-774.
Malley, James D., et al. "Probability machines." Methods of information in medicine 51.01 (2012): 74-81.
# Prepare data
mtcars.catmcar <- mtcars
mtcars.catmcar[, c("gear", "carb")] <-
  gen.mcar(mtcars.catmcar[, c("gear", "carb")], warn.empty.row = FALSE)
mtcars.catmcar <- conv.factor(mtcars.catmcar, c("gear", "carb"))
# Perform imputation
impObj <- mice(mtcars.catmcar, method = "rfpred.cate", m = 5, maxit = 5,
  maxcor = 1.0, eps = 0,
  remove.collinear = FALSE, remove.constant = FALSE,
  printFlag = FALSE)
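Imputation from predicted probabilities can be sketched for a single categorical variable using a ranger probability forest. The example below is an illustration under simplifying assumptions (one variable, no bootstrap, no chaining) and is not the package's sampler: it predicts class probabilities for the rows with a missing category and then draws one category per row from those probabilities.

```r
library(ranger)

set.seed(1)
dat <- iris
mis <- sample.int(nrow(dat), 20)
dat$Species[mis] <- NA          # make 20 categories missing
obs <- !is.na(dat$Species)

# probability = TRUE grows a probability forest
rf <- ranger(Species ~ ., data = dat[obs, ],
             num.trees = 10, probability = TRUE)

# One row per missing observation, one column per category
prob <- predict(rf, data = dat[!obs, names(dat) != "Species"])$predictions

# Draw one category per row according to its predicted probabilities
draw <- apply(prob, 1, function(p) sample(colnames(prob), 1, prob = p))
imputed <- factor(draw, levels = levels(dat$Species))
table(imputed)
```

Sampling from the probabilities, rather than taking the most likely category, keeps uncertainty in the imputed values.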
Please note that functions with names starting with "mice.impute" are exported to be visible for the mice sampler functions. Please do not call these functions directly unless you know exactly what you are doing.
For continuous variables only.
This function is for the RfPred.Emp multiple imputation method, adapter for mice samplers. In the mice() function, set method = "rfpred.emp" to call it.
The function performs multiple imputation based on the empirical distribution of out-of-bag prediction errors of random forests.
mice.impute.rfpred.emp(
y,
ry,
x,
wy = NULL,
num.trees.cont = 10,
sym.dist = TRUE,
alpha.emp = 0,
pre.boot = TRUE,
num.threads = NULL,
...
)
y 
Vector to be imputed. 
ry 
Logical vector of length 
x 
Numeric design matrix with 
wy 
Logical vector of length 
num.trees.cont 
Number of trees to build for continuous variables. The default is 10.
sym.dist 
If 
alpha.emp 
The "significance level" for the empirical distribution of out-of-bag prediction errors; it can be used to guard against outliers (useful for highly skewed variables). For example, set alpha.emp = 0.05 to use the 95% confidence level. The default is 0.
pre.boot 
If 
num.threads 
Number of threads for parallel computing. The default is NULL.
... 
Other arguments to pass down. 
num.trees 
Number of trees to build. The default is

RfPred.Emp imputation sampler.
Vector with imputed data, same type as y, and of length sum(wy).
Shangzhi Hong
Hong, Shangzhi, et al. "Multiple imputation using chained random forests." Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
Zhang, Haozhe, et al. "Random Forest Prediction Intervals." The American Statistician (2019): 1-20.
Shah, Anoop D., et al. "Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study." American journal of epidemiology 179.6 (2014): 764-774.
Malley, James D., et al. "Probability machines." Methods of information in medicine 51.01 (2012): 74-81.
# Users can set method = "rfpred.emp" in call to mice to use this method
data("airquality")
impObj <- mice(airquality, method = "rfpred.emp", m = 5,
  maxit = 5, maxcor = 1.0, eps = 0,
  remove.collinear = FALSE, remove.constant = FALSE,
  printFlag = FALSE)
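The role of alpha.emp can be pictured as trimming the tails of the error distribution before errors are resampled. The base R sketch below assumes a two-sided trim at the alpha/2 and 1 - alpha/2 quantiles; the helper trim_errors is hypothetical, and the package's exact rule (including its interaction with sym.dist) may differ.

```r
# Hypothetical helper (not part of RfEmpImp): keep only the errors
# inside the central (1 - alpha) range of their empirical distribution.
trim_errors <- function(err, alpha = 0.05) {
  if (alpha <= 0) return(err)   # alpha = 0 keeps every error
  bounds <- stats::quantile(err, probs = c(alpha / 2, 1 - alpha / 2))
  err[err >= bounds[1] & err <= bounds[2]]
}

set.seed(1)
err <- c(rnorm(200), 25, -30)   # simulated errors with two extreme values
kept <- trim_errors(err, alpha = 0.05)
range(kept)
```

With the trimmed set, the two extreme errors can no longer be added to a prediction, which is the protection against outliers that the argument description refers to.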
Please note that functions with names starting with "mice.impute" are exported to be visible for the mice sampler functions. Please do not call these functions directly unless you know exactly what you are doing.
For continuous variables only.
This function is for the RfPred.Norm multiple imputation method, adapter for mice samplers. In the mice() function, set method = "rfpred.norm" to call it.
The function performs multiple imputation based on a normality assumption, using the out-of-bag mean squared error as the estimate for the variance.
mice.impute.rfpred.norm(
y,
ry,
x,
wy = NULL,
num.trees.cont = 10,
norm.err.cont = TRUE,
alpha.oob = 0,
pre.boot = TRUE,
num.threads = NULL,
...
)
y 
Vector to be imputed. 
ry 
Logical vector of length 
x 
Numeric design matrix with 
wy 
Logical vector of length 
num.trees.cont 
Number of trees to build for continuous variables. The default is 10.
norm.err.cont 
Use normality assumption for prediction errors of random forests. The default is TRUE.
alpha.oob 
The "significance level" for individual out-of-bag prediction errors used in the calculation of the out-of-bag mean squared error; useful in the presence of extreme values. For example, set alpha.oob = 0.05 to use the 95% confidence level. The default is 0.
pre.boot 
If 
num.threads 
Number of threads for parallel computing. The default is NULL.
... 
Other arguments to pass down. 
RfPred.Norm imputation sampler.
Vector with imputed data, same type as y, and of length sum(wy).
Shangzhi Hong
Shah, Anoop D., et al. "Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study." American journal of epidemiology 179.6 (2014): 764-774.
# Users can set method = "rfpred.norm" in call to mice to use this method
data("airquality")
impObj <- mice(airquality, method = "rfpred.norm", m = 5,
  maxit = 5, maxcor = 1.0, eps = 0,
  remove.collinear = FALSE, remove.constant = FALSE,
  printFlag = FALSE)
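The normality-based variant can be sketched for one variable as follows: the forest's out-of-bag mean squared error serves as the variance of a normal error term that is added to each prediction. This is a simplified illustration with ranger, not the package's sampler; it omits the bootstrap and the alpha.oob trimming, and uses 100 trees so that the out-of-bag error is well defined for every row.

```r
library(ranger)

set.seed(1)
dat <- airquality[, c("Ozone", "Wind", "Temp")]
obs <- !is.na(dat$Ozone)

rf <- ranger(Ozone ~ Wind + Temp, data = dat[obs, ], num.trees = 100)

# For regression forests, ranger reports the out-of-bag MSE here
oob.mse <- rf$prediction.error

# Prediction plus a normal draw with variance equal to the OOB MSE
pred.mis <- predict(rf, data = dat[!obs, c("Wind", "Temp")])$predictions
imputed <- pred.mis + rnorm(length(pred.mis), mean = 0, sd = sqrt(oob.mse))
```

Unlike the empirical-error approach, this one can produce values outside the observed range (for example, negative Ozone), which is a consequence of the normality assumption.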
ranger
The observation indexes (row numbers) constituting the terminal node associated with each observation are queried using the ranger object and the training data.
The argument keep.inbag = TRUE should be set in the call to ranger.
query.rf.pred.idx(obj, data, id.name = FALSE, unique.by.id = FALSE, ...)
obj 
An R object of class 
data 
Input for training data. 
id.name 
Use the IDs of the terminal nodes as names for the lists. 
unique.by.id 
Only return results of unique terminal node IDs. 
... 
Other parameters (will be ignored). 
The observations are found based on terminal node IDs. It should be noted that the out-of-bag observations are not present in the indexes.
A nested list of length num.trees.
Shangzhi Hong
data(iris)
rfObj <- ranger(
  Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
  data = iris, num.trees = 5, keep.inbag = TRUE)
outList <- query.rf.pred.idx(rfObj, iris)
ranger
The observed values (for the response variable) constituting the terminal node associated with each observation are queried using the ranger object and the training data.
The argument keep.inbag = TRUE should be set in the call to ranger.
query.rf.pred.val(obj, data, id.name = FALSE, unique.by.id = FALSE, ...)
obj 
An R object of class 
data 
Input for training data. 
id.name 
Use the IDs of the terminal nodes as names for the lists. 
unique.by.id 
Only return results of unique terminal node IDs. 
... 
Other parameters (will be ignored). 
The observations are found based on terminal node IDs. It should be noted that the out-of-bag observations are not present in the indexes.
A nested list of length num.trees.
Shangzhi Hong
data(iris)
rfObj <- ranger(
  Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
  data = iris, num.trees = 5, keep.inbag = TRUE)
outList <- query.rf.pred.val(rfObj, iris)
ranger
This function serves as a workaround for the ranger function.
rangerCallerSafe(...)
... 
Parameters to pass down. 
Constructed ranger object.
Get the estimates with corresponding confidence intervals after pooling.
reg.ests(obj, ...)
obj 
Pooled object from function 
... 
Other parameters to pass down. 
A data frame containing coefficient estimates and corresponding confidence intervals.
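For context, pooled estimates and confidence intervals after multiple imputation are conventionally obtained with Rubin's rules. The base R sketch below applies those rules to a single coefficient; the helper pool_rubin is hypothetical, and it uses the classical degrees-of-freedom formula, so its intervals may differ slightly from what mice's pool() and reg.ests report.

```r
# Hypothetical helper (not part of RfEmpImp): pool m estimates of one
# coefficient and their standard errors using Rubin's rules.
pool_rubin <- function(est, se, level = 0.95) {
  m <- length(est)
  qbar <- mean(est)                 # pooled point estimate
  ubar <- mean(se^2)                # within-imputation variance
  b <- stats::var(est)              # between-imputation variance
  total <- ubar + (1 + 1 / m) * b   # total variance
  # Classical Rubin degrees of freedom
  df <- (m - 1) * (1 + ubar / ((1 + 1 / m) * b))^2
  half <- stats::qt(1 - (1 - level) / 2, df) * sqrt(total)
  c(estimate = qbar, lower = qbar - half, upper = qbar + half)
}

# Five imputed-data estimates of the same coefficient (made-up numbers)
res <- pool_rubin(est = c(1.9, 2.1, 2.0, 2.2, 1.8), se = rep(0.5, 5))
res
```

The interval is wider than one based on the within-imputation variance alone, because the between-imputation term accounts for the uncertainty due to missing data.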