--- title: "Introduction to bartcs" author: "Yeonghoon Yoo" output: rmarkdown::html_vignette bibliography: REFERENCES.bib link-citations: true vignette: > %\VignetteIndexEntry{Introduction to bartcs} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} set.seed(42) library(bartcs) ``` The __bartcs__ package finds confounders and treatment effect with Bayesian Additive Regression Trees (BART). ## Tutorial with IHDP dataset This tutorial will use The Infant Health and Development Program (IHDP) dataset. IHDP is a randomized experiment from 1985 to 1988 which studied the effect of home visits on cognitive test scores for infants. Our version is a synthetic dataset generated by @louizos:2017 which provides true values to compare with. The dataset consists of 6 continuous and 19 binary covariates with simulated outcome which is a cognitive test score. ```{r load and fit} data(ihdp, package = "bartcs") fit <- single_bart( Y = ihdp$y_factual, trt = ihdp$treatment, X = ihdp[, 6:30], num_tree = 10, num_chain = 4, num_post_sample = 100, num_burn_in = 100, verbose = FALSE ) fit ``` You can get mean and 95% credible interval of average treatment effect (ATE) and possible outcome Y1 and Y0. ```{r compare result} ATE <- mean(ihdp$mu1 - ihdp$mu0) ATE mu1 <- mean(ihdp$mu1) mu1 mu0 <- mean(ihdp$mu0) mu0 mse <- mean((unlist(fit$mcmc_list[, "ATE"]) - ATE)^2) mse ``` True values of ATE, mu1 and mu0 all lies in 95% credible interval. Also, MSE between predicted and true values is 0.013. ## Result Both `separate_bart()` and `single_bart()` fits multiple MCMC chains. `summary()` provides result for each and aggregated chain. ```{r summary} summary(fit) ``` ## Data Visualization You can get posterior inclusion probability for each variables. ```{r pip plots} plot(fit, method = "pip") ``` Since `inclusion_plot()` is a wrapper function of `ggcharts::bar_chart()`, you can use its arguments for better plot. ```{r pip plots with options} plot(fit, method = "pip", top_n = 10) plot(fit, method = "pip", threshold = 0.5) ``` With `trace_plot()`, you can visually check trace of effects or other parameters. ```{r trace plots eff} plot(fit, method = "trace") ``` ## Connection to coda Package You can also use functions from `coda` package for components of `bartcs` object. Components with `mcmc_` prefix are `mcmc.list` object from `coda` package. ```{r coda} coda::gelman.diag(fit$mcmc_list[, "ATE"]) ``` ## Multi-threading ```{r count omp thread} count_omp_thread() ``` Check whether OpenMP is supported. You need more than 1 thread for multi-threading. Due to overhead of multi-threading, using parallelization will **NOT** be effective with small and moderate size datasets. For comparison purpose, I will create dataset with 40,000 rows by bootstrapping from IHDP dataset. Then, for fast computation, I will fit the model with most parameters set to 1. ```{r multi-threading performance} idx <- sample(nrow(ihdp), 4e4, TRUE) ihdp <- ihdp[idx, ] microbenchmark::microbenchmark( simple = single_bart( Y = ihdp$y_factual, trt = ihdp$treatment, X = ihdp[, 6:30], num_tree = 1, num_chain = 1, num_post_sample = 10, num_burn_in = 0, verbose = FALSE, parallel = FALSE ), parallel = single_bart( Y = ihdp$y_factual, trt = ihdp$treatment, X = ihdp[, 6:30], num_tree = 1, num_chain = 1, num_post_sample = 10, num_burn_in = 0, verbose = FALSE, parallel = TRUE ), times = 50 ) ``` Result show that parallelization reduces computation time. 
## References