--- title: "Projection pursuit classification random forest" author: "N. da Silva, D. Cook & E.K Lee " date: "`r Sys.Date()`" output: rmarkdown::html_vignette fig_caption: yes bibliography: biblio.bib nocite: | @devtools vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Projection pursuit classification random forest} \usepackage[utf8]{inputenc} --- ```{r libraries, cache = FALSE, echo = FALSE, message = FALSE, warning = FALSE} require(PPforest) require(dplyr) require(RColorBrewer) require(GGally) require(gridExtra) require(PPtreeViz) library(ggplot2) library(knitr) set.seed(310756) #reproducibility ``` ```{r hooks, echo = FALSE} knitr::opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE, autodep=TRUE, cache.lazy=FALSE ) opts_knit$set(eval.after = 'fig.cap') theme_set(theme_bw(base_family="serif")) ``` ## Introduction The `PPforest` package (projection pursuit random forest) contains functions to run a projection pursuit random forest for classification problems. This method utilize combinations of variables in each tree construction. In a random forest each split is based on a single variable, chosen from a subset of predictors. In the `PPforest`, each split is based on a linear combination of randomly chosen variables. The linear combination is computed by optimizing a projection pursuit index, to get a projection of the variables that best separates the classes. The `PPforest` uses the `PPtree` algorithm, which fits a single tree to the data. Utilizing linear combinations of variables to separate classes takes the correlation between variables into account, and can outperform the basic forest when separations between groups occurs on combinations of variables. Two projection pursuit indexes, LDA and PDA, are used for `PPforest`. To improve the speed performance `PPforest` package, `PPtree` algorithm was translated to Rcpp. `PPforest` package utilizes a number of R packages some of them included in "suggests" not to load them all at package start-up. You can install the package from CRAN: ```r install.package(PPforest) library(PPforest) ``` Or the development version of `PPforest` can be installed from github using: ```r library(devtools) install_github("natydasilva/PPforest") library(PPforest) ``` ##Projection pursuit classification forest In `PPforest`, projection pursuit classification trees are used as the individual model to be combined in the forest. The original algorithm is in `PPtreeViz` package, we translate the original tree algorithm into `Rcpp` to improve the speed performance to run the forest. One important characteristic of PPtree is that treats the data always as a two-class system, when the classes are more than two the algorithm uses a two step projection pursuits optimization in every node split. Let $(X_i,y_i)$ the data set, $X_i$ is a p-dimensional vector of explanatory variables and $y_i\in {1,2,\ldots G}$ represents class information with $i=1,\ldots n$. In the first step optimize a projection pursuit index to find an optimal one-dimension projection $\alpha^*$ for separating all classes in the current data. With the projected data redefine the problem in a two class problem by comparing means, and assign a new label $G1$ or $G2$ to each observation, a new variable $y_i^*$ is created. The new groups $G1$ and $G2$ can contain more than one original classes. Next step is to find an optimal one-dimensional projection $\alpha$, using $(X_i,y_i^*)$ to separate the two class problem $G1$ and $G2$. 
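To see the two-step splitting in action, a single projection pursuit tree can be fit directly with `PPtreeViz`. This is only a minimal sketch on the familiar `iris` data, assuming the `PPTreeclass()` interface of `PPtreeViz`; it is not part of the `PPforest` fit itself.

```r
# Minimal sketch: one projection pursuit classification tree,
# using the LDA projection pursuit index (assumes PPtreeViz is installed).
library(PPtreeViz)
pptree.iris <- PPTreeclass(Species ~ ., data = iris, PPmethod = "LDA")
pptree.iris # prints the tree structure and training error
```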
Projection pursuit random forest algorithm description (a matching call on real data is sketched after the list):

1. Let $N$ be the number of cases in the training set $\Theta = (X, Y)$. $B$ bootstrap samples of size $N$ are taken from the training set (sampling with replacement).
2. For each bootstrap sample a `PPtree` is grown to the largest extent possible, $h(x, \Theta_k)$, with no pruning. This tree is grown using the modification described in step 3.
3. Let $M$ be the number of input variables. At each node, $m < M$ variables are selected at random, and the split is based on the best linear combination of these $m$ variables, found by optimizing a projection pursuit index.
4. New cases are predicted by combining the predictions of the $B$ trees (majority vote).

## Example

We illustrate the method with the Australian crabs data, available as `crab` in `PPforest`. The four classes combine species (blue or orange) and sex. The settings below are illustrative: two thirds of the data for training, `m = 500` trees, half of the variables sampled at each split, and the LDA index; see `?PPforest` for the full interface.

```{r crab}
pprf.crab <- PPforest(data = crab, class = "Type", size.tr = 2/3,
                      m = 500, size.p = 0.5, PPmethod = "LDA")
pprf.crab
```

A parallel coordinate plot of the standardized crab data gives a first look at the class structure.

```{r parallel, fig.align = "center", fig.cap = capar, fig.show = 'hold', fig.width = 5, fig.height = 4, warning = FALSE, echo = FALSE}
myscale <- function(x) (x - mean(x)) / sd(x)

parallel <- function(ppf) {
  scale.dat <- ppf$train %>%
    dplyr::mutate_at(dplyr::vars(-matches(ppf$class.var)), dplyr::funs(myscale))

  scale.dat.melt <- scale.dat %>%
    dplyr::mutate(ids = 1:nrow(ppf$train)) %>%
    tidyr::gather(var, Value, -Type, -ids)

  scale.dat.melt$Variables <- as.numeric(as.factor(scale.dat.melt$var))
  colnames(scale.dat.melt)[1] <- "Class"

  ggplot2::ggplot(scale.dat.melt,
                  ggplot2::aes(x = Variables, y = Value, group = ids, key = ids,
                               colour = Class, var = var)) +
    ggplot2::geom_line(alpha = 0.3) +
    ggplot2::scale_x_discrete(limits = levels(as.factor(scale.dat.melt$var)),
                              expand = c(0.01, 0.01)) +
    ggplot2::ggtitle("Data parallel plot") +
    ggplot2::theme(legend.position = "none",
                   axis.text.x = element_text(angle = 90, vjust = 0.5)) +
    ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2")
}

capar <- "Parallel coordinate plot of crab data"
parallel(pprf.crab)
```

Because the `PPforest` is composed of many trees fit on subsets of the data, many statistics can be computed and analyzed as a separate data set, to better understand how the model is working. Diagnostics of interest include: variable importance, the OOB error rate, the vote matrix and the proximity matrix.

From the trees in the forest we can compute a proximity for every pair of observations. The proximity matrix is an $n \times n$ matrix: every time two cases $k_i$ and $k_j$ fall in the same terminal node of a tree, their proximity is increased by one; at the end, the proximities are normalized by dividing by the number of trees.
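For intuition, this accumulation can be sketched in a few lines of R. This is only an illustration, not the package's internal code; `leaf` is a hypothetical $n \times B$ matrix whose entry $(i, b)$ is the terminal node reached by case $i$ in tree $b$.

```r
# Illustrative sketch only (not PPforest internals).
# leaf: hypothetical n x B matrix, leaf[i, b] = terminal node of case i in tree b.
prox_from_leaves <- function(leaf) {
  n <- nrow(leaf)
  prox <- matrix(0, n, n)
  for (b in seq_len(ncol(leaf))) {
    # add 1 for every pair of cases that share a terminal node in tree b
    prox <- prox + outer(leaf[, b], leaf[, b], "==")
  }
  prox / ncol(leaf) # normalize by the number of trees
}
```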
To visualize the proximity matrix we use a scatterplot of coordinates obtained by multidimensional scaling. In this plot color indicates the true species and sex. For these data, two dimensions are enough to see the four groups separated quite well. Some crabs are clearly more similar to a different group, though, especially in examining the sex differences.

```{r mds, fig.align = "center", fig.cap = capmds, fig.show = 'hold', fig.width = 5, fig.height = 4, warning = FALSE, echo = FALSE}
mdspl2d <- function(ppf, lege = "bottom", siz = 3, k = 2) {
  d <- diag(nrow(ppf$train))
  d <- as.dist(d + 1 - ppf$proximity)
  rf.mds <- stats::cmdscale(d, eig = TRUE, k = k)
  colnames(rf.mds$points) <- paste("MDS", 1:k, sep = "")
  df <- data.frame(Class = ppf$train[, 1], rf.mds$points)

  mds <- ggplot2::ggplot(data = df) +
    ggplot2::geom_point(ggplot2::aes(x = MDS1, y = MDS2, color = Class),
                        size = I(siz), alpha = .5) +
    ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2", name = "Type") +
    ggplot2::theme(legend.position = lege, aspect.ratio = 1)
  mds
}

capmds <- "Multidimensional scaling plot to examine similarities between cases"
mdspl2d(ppf = pprf.crab)
```

The vote matrix ($n \times G$) contains the proportion of times each observation was classified to each class, computed over the trees where the observation was out-of-bag (oob). Two approaches to visualizing the vote matrix are shown: a side-by-side jittered dotplot and ternary plots. In the side-by-side jittered dotplot, class is displayed on one axis and proportion on the other. For each dotplot, the ideal arrangement is that points of observations in that class have values bigger than 0.5, and all other observations have less. These data are close to the ideal but not perfect; e.g. there are a few blue male crabs (orange) that are frequently predicted to be blue females (green), and a few blue female crabs predicted to be another class.

```{r side, fig.align = "center", fig.cap = capside, fig.show = 'hold', fig.width = 5, fig.height = 5, warning = FALSE, echo = FALSE}
side <- function(ppf, ang = 0, lege = "bottom", siz = 3, ttl = "") {
  voteinf <- data.frame(ids = 1:length(ppf$train[, 1]),
                        Type = ppf$train[, 1],
                        ppf$votes,
                        pred = ppf$prediction.oob) %>%
    tidyr::gather(Class, Probability, -pred, -ids, -Type)

  ggplot2::ggplot(data = voteinf, ggplot2::aes(Class, Probability, color = Type)) +
    ggplot2::geom_jitter(height = 0, size = I(siz), alpha = .5) +
    ggtitle(ttl) +
    ylab("Proportion") +
    ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2") +
    ggplot2::theme(legend.position = lege,
                   legend.text = ggplot2::element_text(angle = ang)) +
    ggplot2::labs(colour = "Class")
}

capside <- "Vote matrix representation by a jittered side-by-side dotplot. Each dotplot shows the proportion of times the case was predicted into the group, with 1 indicating that the case was always predicted to the group and 0 never."
side(pprf.crab)
```

A ternary plot is a triangular diagram that displays compositional data with three components, drawn using barycentric coordinates; the three proportions sum to a constant. More generally, compositional data with $p$ components is constrained to a $(p-1)$-D simplex in $p$-space, so a ternary plot is well defined for three classes and the idea needs to be generalized when there are more. @sutherland2000orca suggest that the best approach to visualizing such data is to project it into the $(p-1)$-D space (the ternary diagram in $2$-D). This is the approach used here to visualize the vote matrix information; the small sketch below illustrates the barycentric mapping for the three-class case.
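This sketch is only an illustration of barycentric coordinates, not a `PPforest` function: each three-part composition is mapped to 2-D by weighting the vertices of an equilateral triangle.

```r
# Illustrative sketch: map 3-part compositions (rows summing to 1)
# to 2-D ternary coordinates via barycentric weights.
ternary_xy <- function(comp) {
  vertices <- rbind(c(0, 0), c(1, 0), c(0.5, sqrt(3) / 2)) # triangle corners
  comp %*% vertices
}
ternary_xy(matrix(c(0.2, 0.3, 0.5), nrow = 1)) # a point inside the triangle
```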
The vote matrix is an example of compositional data, with $G$ components. `ternary_str` is an auxiliary function in `PPforest` that returns the data structure needed to draw a ternary plot, or a generalized ternary plot when there are more than three classes; it is used to build the plots below.

```{r ternary, fig.align = "center", fig.cap = capter, fig.show = 'hold', fig.width = 7, fig.height = 4, warning = FALSE, echo = FALSE}
pl_ter <- function(dat, dx, dy) {
  dat[[1]] %>%
    dplyr::filter(pair %in% paste(dx, dy, sep = "-")) %>%
    dplyr::select(Class, x, y) %>%
    ggplot2::ggplot(aes(x, y, color = Class)) +
    ggplot2::geom_segment(data = dat[[2]],
                          aes(x = x1, xend = x2, y = y1, yend = y2),
                          color = "black") +
    ggplot2::geom_point(size = I(3), alpha = .5) +
    ggplot2::labs(x = paste0("T", dx), y = paste0("T", dy)) +
    ggplot2::theme(legend.position = "none", aspect.ratio = 1) +
    ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2")
}

p1 <- pl_ter(ternary_str(pprf.crab, id = c(1, 2, 3), sp = 3, dx = 1, dy = 2), 1, 2)
p2 <- pl_ter(ternary_str(pprf.crab, id = c(1, 2, 3), sp = 3, dx = 1, dy = 3), 1, 3)
p3 <- pl_ter(ternary_str(pprf.crab, id = c(1, 2, 3), sp = 3, dx = 2, dy = 3), 2, 3)
gridExtra::grid.arrange(p1, p2, p3, ncol = 3)

capter <- "Generalized ternary plot representation of the vote matrix for four classes. The tetrahedron is shown pairwise. Each point corresponds to one observation and color is the true class."
```

For a complete description of how to visualize a `PPforest` object, see Interactive Graphics for Visually Diagnosing Forest Classifiers in R [@da2017interactive].

## REFERENCES