--- title: "Using AssocBin to map complex data" author: "Chris Salahub" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using AssocBin on complex data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) set.seed(72604204) ``` The `AssocBin` vignette gives some guidance on how to control the core algorithm at the heart of the `AssocBin` package for applications to a single pair, but the functionality of `AssocBin` does not stop there. The package also includes higher level functionality which can apply recursive binning automatically to a complex data set. This vignette outlines howhttp://127.0.0.1:22179/graphics/681bfdce-bb25-484e-afd9-1fe303480217.png to run, interpret, and customize the `pariwiseAssociation` function that operates this high level functionality. ```{r} library(AssocBin) ``` Let's take a look at the heart disease data from the UCI machine learning data repository (https://archive.ics.uci.edu/dataset/45/heart+disease): ```{r} data(heart) summary(heart) ``` This example data has some missing values, with variables like `slope`, `ca`, and `thal` particularly troublesome. Dropping these variables which are mostly missing and considering complete cases only: ```{r} heartClean <- heart heartClean$thal <- NULL heartClean$ca <- NULL heartClean$slope <- NULL heartClean <- na.omit(heartClean) str(heartClean) ``` This leaves 740 observations over 12 variables to be binned. Running the binning process is very straightforward. First, stop criteria are defined. Next, the function `inDep` is called. This function performs pairwise binning on all variable pairs in this data set and returns an `inDep` object, which contains these binnings alongside data about them that allow us to assess the relationships between these pairs. ```{r} stopCrits <- makeCriteria(depth >= 6, n < 1, expn <= 10) assocs <- inDep(heartClean, stopCriteria = stopCrits) ``` The resulting object can be explored using `plot` and `summary` methods. ```{r} summary(assocs) ``` ```{r, fig.width = 5, fig.height = 8} plot(assocs) # by default this displays the 5 strongest relationships ``` ```{r, fig.width = 5, fig.height = 8} ## by specifying which indices should be displayed, others can be plotted ## the binnings are returned in increasing order of p-value, so the indices ## chosen give the rank of the strength a particular pair's relationship plot(assocs, which = 6:10) # like the next 5 strongest plot(assocs, which = 62:66) # or the 5 weakest relationships ``` In the first two columns, points are jittered within categorical variable levels to reduce overplotting. The different categories are separated by dashed lines for visual clarity. Space is added in the first column for additional separation of groups, but this is condensed in the second column to reflect the appearance of the point configuration under ranks with random tie-breaking. The shadings in the final column of each of these plots highlight the cells with individually significant Pearson residuals under the null hypothesis. Asymptotically, the Pearson residuals are normally distributed, and so values above the 0.975 standard normal or below the 0.025 standard normal quantile are shaded. The cells below the lower quantile (with significantly fewer observations than expected) are shaded blue while those above (with significantly more observations than expected) are shaded red. The $p$-values reported are computed assuming the $\chi^2$ statistics for each pair are $\chi^2_{df}$-distributed with $df$ given by the number of cells less the number of constraints in estimation. By default, binning is therefore done at random to ensure this approximation is accurate. When other splitting methods are used these default $p$-values are no longer accurate. Of course, such splitting methods may still be desirable for better plotting qualities and so we can customize the bins by changing the splitting logic. ```{r} maxCatCon <- function(bn) uniMaxScoreSplit(bn, chiScores) maxConCon <- function(bn) maxScoreSplit(bn, chiScores) maxPairs <- inDep(data = heartClean, stopCriteria = stopCrits, catCon = maxCatCon, conCon = maxCatCon) summary(maxPairs) ``` The drastic change in the significance levels in the summary demonstrates how maximized $\chi^2$ splitting, for example, leads to inflated significance of pairs compared to the more accurate random splitting. ```{r, fig.width = 5, fig.height = 8} plot(maxPairs) plot(maxPairs, which = 6:10) ``` Despite this change, both methods agree on the most significant relationships in the data. For each pair, we can get a sense of its importance by comparing its rank under both random and maximized binning in a single plot. ```{r, fig.width = 5, fig.height = 5} randOrd <- match(assocs$pairs, assocs$pairs) maxOrd <- match(assocs$pairs, maxPairs$pairs) plot(randOrd, maxOrd, xlab = "Random rank", ylab = "Max chi rank", main = "Rankings of pair significance between methods") ``` Both methods agree on the most important relationships in the data, but tend to disagree about the less strongly related variables. Note that this plot was produced by matching the pairs ranked by $p$-values, which are only accurate for random binning. Noting the strong impact of study on these results, we might repeat this analysis with the Cleveland data alone. ```{r} heartCleve <- heartClean[heartClean$study == "cleveland", ] heartCleve$study <- NULL cleveAssoc <- inDep(heartCleve, stopCriteria = stopCrits, catCon = maxCatCon, conCon = maxConCon) summary(cleveAssoc) ``` ```{r, fig.width = 5, fig.height = 8} plot(cleveAssoc) plot(cleveAssoc, which = 6:10) plot(cleveAssoc, which = 11:15) ```