---
title: "Introduction to RfEmpImp"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to RfEmpImp}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

An R package for random-forest-empowered imputation of missing data

```{r setup, include = FALSE}
suppressMessages(library(RfEmpImp))
```

## Random-forest-based multiple imputation evolved

`RfEmpImp` is an R package for multiple imputation using chained random forests (RF).
It provides prediction-based and node-based multiple imputation algorithms using random forests, and currently operates within the multiple imputation framework of [`mice`](https://CRAN.R-project.org/package=mice).
For more details on the implemented imputation algorithms, please refer to: [arXiv:2004.14823](https://arxiv.org/abs/2004.14823) (further updates soon).

## Installation

Users can install the released version of `RfEmpImp` from CRAN, or the latest development version from GitHub:

```r
# Install from CRAN
install.packages("RfEmpImp")
# Install the development version from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("shangzhi-hong/RfEmpImp")
# Install from a released source package
# (path_to_source_file is a placeholder for the path to the downloaded file)
install.packages(path_to_source_file, repos = NULL, type = "source")
# Attach
library(RfEmpImp)
```

## Prediction-based imputation

### For mixed types of variables

For data with mixed types of variables, users can call `imp.rfemp()` to use the `RfEmp` method, which applies the `RfPred.Emp` method to continuous variables and the `RfPred.Cate` method to categorical variables (of type `logical` or `factor`, etc.).
Starting with version `2.0.0`, the parameter names were further simplified; please refer to the documentation for details.

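The split between continuous and categorical variables described above can also be spelled out by hand when calling `mice()` directly. The following is a minimal sketch, not the package's recommended workflow: the per-variable `meth` vector is an assumption made for illustration, and the values of `m` and `maxit` are arbitrary small numbers. It relies on the method names `"rfpred.emp"` and `"rfpred.cate"` and on the `nhanes` data shipped with `mice`.

```r
library(mice)
library(RfEmpImp)

# Convert the categorical columns to factors
df <- conv.factor(nhanes, c("age", "hyp"))

# Assumed per-variable method vector (illustrative):
# "" skips `age`, which has no missing values
meth <- c(age = "", bmi = "rfpred.emp", hyp = "rfpred.cate", chl = "rfpred.emp")

# m and maxit are kept small here purely for illustration
imp <- mice(df, method = meth, m = 5, maxit = 5, printFlag = FALSE)

# Inspect which method was applied to each variable
imp$method
```

Calling `imp.rfemp(df)` should lead to the same assignment of methods by variable type without the need to write the vector manually.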
### Prediction-based imputation for continuous variables

For continuous variables, the `RfPred.Emp` method uses the empirical distribution of the random forest's out-of-bag prediction errors when constructing the conditional distribution of the variable under imputation, which gives conditional distributions of better quality.
Users can set `method = "rfpred.emp"` in the call to `mice` to use it.

The `RfPred.Norm` method instead assumes normality of the RF prediction errors, as proposed by Shah *et al.*, and users can set `method = "rfpred.norm"` in the call to `mice` to use it.

### Prediction-based imputation for categorical variables

For categorical variables, the `RfPred.Cate` method uses probability machine theory, and the imputed categories are drawn based on the predicted probabilities for each missing observation.
Users can set `method = "rfpred.cate"` in the call to `mice` to use it.

### Example for prediction-based imputation

```r
# Prepare data
df <- conv.factor(nhanes, c("age", "hyp"))
# Do imputation
imp <- imp.rfemp(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```

## Node-based imputation

For continuous or categorical variables, the observations under the predicting nodes of the random forest are used as candidates for imputation.
Two methods are currently available in the `RfNode` algorithm series.
Note that categorical variables should be of type `logical` or `factor`, etc.

### Node-based imputation using predicting nodes

Users can call `imp.rfnode.cond()` to use the `RfNode.Cond` method, which performs imputation using the conditional distribution formed by the predicting nodes.
The changes in observation weights caused by the bootstrapping of the random forest are taken into account, and only the "in-bag" observations are used as candidates for imputation.
Users can also set `method = "rfnode.cond"` in the call to `mice` to use it.

### Node-based imputation using proximities

Users can call `imp.rfnode.prox()` to use the `RfNode.Prox` method, which performs imputation using the proximity matrices of random forests.
All observations that fall under the same predicting nodes are used as candidates for imputation, including the out-of-bag ones.
Users can also set `method = "rfnode.prox"` in the call to `mice` to use it.

### Example for node-based imputation

```r
# Prepare data
df <- conv.factor(nhanes, c("age", "hyp"))
# Do imputation
imp <- imp.rfnode.cond(df)
# Or: imp <- imp.rfnode.prox(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
```

## Imputation functions

| Type                        | Imputation function | Univariate sampler        | Variable type |
|-----------------------------|---------------------|---------------------------|---------------|
| Prediction-based imputation | imp.rfemp()         | mice.impute.rfemp()       | Mixed         |
|                             | /                   | mice.impute.rfpred.emp()  | Continuous    |
|                             | /                   | mice.impute.rfpred.norm() | Continuous    |
|                             | /                   | mice.impute.rfpred.cate() | Categorical   |
| Node-based imputation       | imp.rfnode.cond()   | mice.impute.rfnode.cond() | Mixed         |
|                             | imp.rfnode.prox()   | mice.impute.rfnode.prox() | Mixed         |
|                             | /                   | mice.impute.rfnode()      | Mixed         |

## Package structure

The figure below shows how the imputation functions are organized in this R package.

![](../man/figures/package-structure.png){#id .class width=95% height=95%}

## Support for parallel computation

Random forests can be compute-intensive on their own. During multiple imputation, a random forest model is built for every variable containing missing data over a number of iterations (usually 5 to 10), and this whole process is repeated for each imputation performed (usually 5 to 20).
Thus, computational efficiency is of crucial importance for multiple imputation using chained random forests, especially for large data sets.
In `RfEmpImp`, the random forest model building process is therefore accelerated using parallel computation powered by [`ranger`](https://CRAN.R-project.org/package=ranger).
The `ranger` R package provides support for parallel computation using native C++.
In our simulations, parallel computation provided a substantial performance boost for the imputation process (about 4x faster on a quad-core laptop).

## References

1. Hong, Shangzhi, et al. "Multiple imputation using chained random forests." Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
2. Zhang, Haozhe, et al. "Random forest prediction intervals." The American Statistician (2019): 1-15.
3. Wright, Marvin N., and Andreas Ziegler. "ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R." Journal of Statistical Software 77.i01 (2017).
4. Shah, Anoop D., et al. "Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study." American Journal of Epidemiology 179.6 (2014): 764-774.
5. Doove, Lisa L., Stef Van Buuren, and Elise Dusseldorp. "Recursive partitioning for missing data imputation in the presence of interaction effects." Computational Statistics & Data Analysis 72 (2014): 92-104.
6. Malley, James D., et al. "Probability machines." Methods of Information in Medicine 51.01 (2012): 74-81.
7. Van Buuren, Stef, and Karin Groothuis-Oudshoorn. "mice: Multivariate Imputation by Chained Equations in R." Journal of Statistical Software 45.i03 (2011).