--- title: "Introduction to bartcs" author: "Yeonghoon Yoo" output: rmarkdown::html_vignette bibliography: REFERENCES.bib link-citations: true vignette: > %\VignetteIndexEntry{Introduction to bartcs} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} set.seed(42) library(bartcs) ``` The __bartcs__ package finds confounders and treatment effect with Bayesian Additive Regression Trees (BART). ## Tutorial with IHDP dataset This tutorial will use The Infant Health and Development Program (IHDP) dataset. IHDP is a randomized experiment from 1985 to 1988 which studied the effect of home visits on cognitive test scores for infants. Our version is a synthetic dataset generated by @louizos:2017 which provides true values to compare with. The dataset consists of 6 continuous and 19 binary covariates with simulated outcome which is a cognitive test score. ```{r load and fit} data(ihdp, package = "bartcs") fit <- single_bart( Y = ihdp$y_factual, trt = ihdp$treatment, X = ihdp[, 6:30], num_tree = 10, num_chain = 4, num_post_sample = 100, num_burn_in = 100, verbose = FALSE ) fit ``` You can get mean and 95% credible interval of average treatment effect (ATE) and possible outcome Y1 and Y0. ```{r compare result} ATE <- mean(ihdp$mu1 - ihdp$mu0) ATE mu1 <- mean(ihdp$mu1) mu1 mu0 <- mean(ihdp$mu0) mu0 mse <- mean((unlist(fit$mcmc_list[, "ATE"]) - ATE)^2) mse ``` True values of ATE, mu1 and mu0 all lies in 95% credible interval. Also, MSE between predicted and true values is 0.013. ## Result Both `separate_bart()` and `single_bart()` fits multiple MCMC chains. `summary()` provides result for each and aggregated chain. ```{r summary} summary(fit) ``` ## Data Visualization You can get posterior inclusion probability for each variables. ```{r pip plots} plot(fit, method = "pip") ``` Since `inclusion_plot()` is a wrapper function of `ggcharts::bar_chart()`, you can use its arguments for better plot. ```{r pip plots with options} plot(fit, method = "pip", top_n = 10) plot(fit, method = "pip", threshold = 0.5) ``` With `trace_plot()`, you can visually check trace of effects or other parameters. ```{r trace plots eff} plot(fit, method = "trace") ``` ## Connection to coda Package You can also use functions from `coda` package for components of `bartcs` object. Components with `mcmc_` prefix are `mcmc.list` object from `coda` package. ```{r coda} coda::gelman.diag(fit$mcmc_list[, "ATE"]) ``` ## Multi-threading ```{r count omp thread} count_omp_thread() ``` Check whether OpenMP is supported. You need more than 1 thread for multi-threading. Due to overhead of multi-threading, using parallelization will **NOT** be effective with small and moderate size datasets. For comparison purpose, I will create dataset with 40,000 rows by bootstrapping from IHDP dataset. Then, for fast computation, I will fit the model with most parameters set to 1. ```{r multi-threading performance} idx <- sample(nrow(ihdp), 4e4, TRUE) ihdp <- ihdp[idx, ] microbenchmark::microbenchmark( simple = single_bart( Y = ihdp$y_factual, trt = ihdp$treatment, X = ihdp[, 6:30], num_tree = 1, num_chain = 1, num_post_sample = 10, num_burn_in = 0, verbose = FALSE, parallel = FALSE ), parallel = single_bart( Y = ihdp$y_factual, trt = ihdp$treatment, X = ihdp[, 6:30], num_tree = 1, num_chain = 1, num_post_sample = 10, num_burn_in = 0, verbose = FALSE, parallel = TRUE ), times = 50 ) ``` Result show that parallelization reduces computation time. 
## References