---
title: "binomialRF Feature Selection Vignette"
author: "Samir Rachid Zaim"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{"binomialRF Feature Selection Vignette"}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
  library('randomForest')
  library('data.table')
  library('stats')
  library('binomialRF')

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


The $\textit{binomialRF}$ is a $\textit{randomForest}$ feature selection wrapper (Zaim 2019) that treats the random forest as a binomial process where each tree represents an iid bernoulli random variable for the event of selecting $X_j$ as the main splitting variable at a given tree. The algorithm below describes the technical aspects of the algorithm.

\begin{algorithm} 
\caption{binomialRF Feature Selection Algorithm } 
\label{alg1} 
\begin{algorithmic}
  \For{i=1:N}
    \State Grow $T_{i}$
    \State $S_{ij} =   \left\{
\begin{array}{ll}
      1 & X_j \text{ is splitting variable at root node by } T_i \\
      0 & \text{otherwise} \\
 \end{array} 
\right. $
    \State $S \gets S_{ij}$
  \EndFor
  \State $S_j = \sum_{i=1}^N S_{ij}= \sum_{i=1}^N I[X_j \in root(T_i)]$
  \State $S_j \sim \text{binomial}(N,p_0), \hspace{6mm} \text{where } p_0=\bigg(1 - \prod_{i=1}^m\frac{L-i}{L-(i-1)} \bigg)\frac{1}{m}$.
  \State Test for Significance
  \State Adjust for Multiple Comparisons

\end{algorithmic}
\end{algorithm}

![binomialRF Algorithm.](algorithm.png)

# Simulating Data
Since $\textit{binomialRF}$ is a wrapper algorithm that internally calls and grows a randomForest object based on the inputted parameters. First we generate a simple simulated logistic data as follows: 

* $X_{10}\sim MNV(0, I_{10})$, 


* $p(x) = \frac{1}{1+e^{-X\beta}}$, and

* $y \sim Binom(10,p)$.

where $\beta$ is a vector of coefficients where the first 2 coefficients are set to 3, and the rest are 0. 

$$\beta = \begin{bmatrix} 3 & 3 & 0  & \cdots & 0  \end{bmatrix}^T$$

## Simulated Data

```{r echo=T, warning=F, message=F}
set.seed(324)

### Generate multivariate normal data in R10
X = matrix(rnorm(1000), ncol=10)

### let half of the coefficients be 0, the other be 10
trueBeta= c(rep(3,2), rep(0,8))

### do logistic transform and generate the labels
z = 1 + X %*% trueBeta    
pr = 1/(1+exp(-z))        
y = rbinom(100,1,pr)

```
To generate data looking like this: 
```{r, echo=FALSE, results='asis'}
knitr::kable(head(cbind(round(X,2),y), 10))
```

## Generating the Stable Correlated Binomial Distribution

Since the binomialRF requires a correlation adjustment to adjust for the tree-to-tree sampling correlation, we first generate the appropriately-parameterized stable correlated binomial distribution. Note, the correlbinom function call can take a while to execute for large number trials (i.e., trials > 1000). 

```{r echo=T, warning=F, message=F}
require(correlbinom)
rho = 0.33
ntrees = 250

cbinom = correlbinom(rho, successprob =  1/ncol(X), trials = ntrees, 
                                  precision = 1024, model = 'kuk')

```

## binomialRF Function Call

Then we can run the binomialRF function call as below:

```{r echo=T, warning=F, message=F}
binom.rf <- binomialRF::binomialRF(X,factor(y), fdr.threshold = .05,
                     ntrees = ntrees,percent_features = .6,
                     fdr.method = 'BY', user_cbinom_dist = cbinom, 
                     sampsize = round(nrow(X)*.33))
print(binom.rf)

```

# Tuning Parameters

## Percent_features

Note that since the binomial exact test is contingent on a test statistic measuring the likelihood of selecting a feature, if there is a dominant feature, then it will render all remaining 'important' features useless as it will always be selected as the splitting variable. So it is important to set the $percent_features$ parameter < 1. The results below show how setting the parameter to a fraction between .6 to 1 can allow other features to stand out as important. 

```{r echo=F, warning=F, message=F}
# set.seed(324)

binom.rf <- binomialRF::binomialRF(X,factor(y), fdr.threshold = .05,
                     ntrees = ntrees,percent_features = 1,
                     fdr.method = 'BY', user_cbinom_dist = cbinom, sampsize = round(nrow(X)*.33))

cat('\n\nbinomialRF 100%\n\n')
print(binom.rf)

binom.rf <- binomialRF::binomialRF(X,factor(y), fdr.threshold = .05,
                     ntrees = ntrees,percent_features = .8,
                     fdr.method = 'BY', user_cbinom_dist = cbinom, sampsize = round(nrow(X)*.33))

cat('\n\nbinomialRF 80%\n\n')
print(binom.rf)

binom.rf <- binomialRF::binomialRF(X,factor(y), fdr.threshold = .05,
                     ntrees = ntrees,percent_features = .6,
                     fdr.method = 'BY', user_cbinom_dist = cbinom, sampsize = round(nrow(X)*.33))

cat('\n\nbinomialRF 60%\n\n')
print(binom.rf)
```

## ntrees

We recommend growing at least 500 to 1,000 trees at a minimum so that the algorithm has a chance to stabilize, but also recommend choosing ntrees as a function of the number of features in your dataset. The ntrees tuning parameter must be set in conjunction with the percent_features as these two are inter-connectedm as well as the number of true features in the model. Since the correlbinom function call is slow to execute for ntrees > 1000, we recommend growing random forests with only 500-1000 trees. 


```{r echo=F, warning=F, message=F}
set.seed(324)

binom.rf1000 <- binomialRF::binomialRF(X,factor(y), fdr.threshold = .05,
                     ntrees = ntrees,percent_features = .5,
                     fdr.method = 'BY', user_cbinom_dist = cbinom, sampsize = round(nrow(X)*.33))


rho = 0.33
ntrees = 500

cbinom = correlbinom(rho, successprob =  1/ncol(X), trials = ntrees, precision = 1024, model = 'kuk')

binom.rf500 <- binomialRF::binomialRF(X,factor(y), fdr.threshold = .05,
                     ntrees = ntrees,percent_features = .5,
                     fdr.method = 'BY', user_cbinom_dist = cbinom, sampsize = round(nrow(X)*.33))


cat('\n\nbinomialRF 250 trees\n\n')
print(binom.rf500)

cat('\n\nbinomialRF 500 trees \n\n')
print(binom.rf1000)

```