The AssocBin
vignette gives some guidance on
how to control the core algorithm at the heart of the
AssocBin
package for applications to a single pair, but the
functionality of AssocBin
does not stop there. The package
also includes higher level functionality which can apply recursive
binning automatically to a complex data set. This vignette outlines
howhttp://127.0.0.1:22179/graphics/681bfdce-bb25-484e-afd9-1fe303480217.png
to run, interpret, and customize the pariwiseAssociation
function that operates this high level functionality.
Let’s take a look at the heart disease data from the UCI machine learning data repository (https://archive.ics.uci.edu/dataset/45/heart+disease):
## age sex cp trestbps chol
## Min. :28.00 female:194 atypical :174 Min. : 0.0 Min. : 0.0
## 1st Qu.:47.00 male :726 non-angina:204 1st Qu.:120.0 1st Qu.:175.0
## Median :54.00 none :496 Median :130.0 Median :223.0
## Mean :53.51 typical : 46 Mean :132.1 Mean :199.1
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:268.0
## Max. :77.00 Max. :200.0 Max. :603.0
## NA's :59 NA's :30
## fbs restecg thalach exang
## Mode :logical hypertrophy:188 Min. : 60.0 Mode :logical
## FALSE:692 normal :551 1st Qu.:120.0 FALSE:528
## TRUE :138 stt :179 Median :140.0 TRUE :337
## NA's :90 NA's : 2 Mean :137.5 NA's :55
## 3rd Qu.:157.0
## Max. :202.0
## NA's :55
## oldpeak slope ca thal num
## Min. :-2.6000 down: 63 0 :181 normal :196 0:411
## 1st Qu.: 0.0000 flat:345 1 : 67 fixed : 46 1:265
## Median : 0.5000 up :203 2 : 41 reversable:192 2:109
## Mean : 0.8788 NA's:309 3 : 20 NA's :486 3:107
## 3rd Qu.: 1.5000 NA's:611 4: 28
## Max. : 6.2000
## NA's :62
## study
## Length:920
## Class :character
## Mode :character
##
##
##
##
This example data has some missing values, with variables like
slope
, ca
, and thal
particularly
troublesome. Dropping these variables which are mostly missing and
considering complete cases only:
heartClean <- heart
heartClean$thal <- NULL
heartClean$ca <- NULL
heartClean$slope <- NULL
heartClean <- na.omit(heartClean)
str(heartClean)
## 'data.frame': 740 obs. of 12 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "atypical","non-angina",..: 4 3 3 2 1 1 3 3 3 3 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
## $ restecg : Factor w/ 3 levels "hypertrophy",..: 1 1 1 2 1 2 1 2 1 1 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : logi FALSE TRUE TRUE FALSE FALSE FALSE ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ num : Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 1 4 1 3 2 ...
## $ study : chr "cleveland" "cleveland" "cleveland" "cleveland" ...
## - attr(*, "na.action")= 'omit' Named int [1:180] 306 331 335 338 348 369 376 379 385 390 ...
## ..- attr(*, "names")= chr [1:180] "306" "331" "335" "338" ...
This leaves 740 observations over 12 variables to be binned. Running
the binning process is very straightforward. First, stop criteria are
defined. Next, the function inDep
is called. This function
performs pairwise binning on all variable pairs in this data set and
returns an inDep
object, which contains these binnings
alongside data about them that allow us to assess the relationships
between these pairs.
stopCrits <- makeCriteria(depth >= 6, n < 1, expn <= 10)
assocs <- inDep(heartClean, stopCriteria = stopCrits)
The resulting object can be explored using plot
and
summary
methods.
## All 66 pairs in heartClean recursively binned with type distribution:
##
## factor:factor factor:numeric numeric:numeric
## 21 35 10
## 35 pairs are significant at 5% and 33 pairs are significant at 1%
## by specifying which indices should be displayed, others can be plotted
## the binnings are returned in increasing order of p-value, so the indices
## chosen give the rank of the strength a particular pair's relationship
plot(assocs, which = 6:10) # like the next 5 strongest
In the first two columns, points are jittered within categorical variable levels to reduce overplotting. The different categories are separated by dashed lines for visual clarity. Space is added in the first column for additional separation of groups, but this is condensed in the second column to reflect the appearance of the point configuration under ranks with random tie-breaking.
The shadings in the final column of each of these plots highlight the cells with individually significant Pearson residuals under the null hypothesis. Asymptotically, the Pearson residuals are normally distributed, and so values above the 0.975 standard normal or below the 0.025 standard normal quantile are shaded. The cells below the lower quantile (with significantly fewer observations than expected) are shaded blue while those above (with significantly more observations than expected) are shaded red.
The p-values reported are computed assuming the χ2 statistics for each pair are χdf2-distributed with df given by the number of cells less the number of constraints in estimation. By default, binning is therefore done at random to ensure this approximation is accurate. When other splitting methods are used these default p-values are no longer accurate. Of course, such splitting methods may still be desirable for better plotting qualities and so we can customize the bins by changing the splitting logic.
maxCatCon <- function(bn) uniMaxScoreSplit(bn, chiScores)
maxConCon <- function(bn) maxScoreSplit(bn, chiScores)
maxPairs <- inDep(data = heartClean, stopCriteria = stopCrits,
catCon = maxCatCon, conCon = maxCatCon)
summary(maxPairs)
## All 66 pairs in heartClean recursively binned with type distribution:
##
## factor:factor factor:numeric numeric:numeric
## 21 35 10
## 54 pairs are significant at 5% and 50 pairs are significant at 1%
The drastic change in the significance levels in the summary demonstrates how maximized χ2 splitting, for example, leads to inflated significance of pairs compared to the more accurate random splitting.
Despite this change, both methods agree on the most significant relationships in the data. For each pair, we can get a sense of its importance by comparing its rank under both random and maximized binning in a single plot.
randOrd <- match(assocs$pairs, assocs$pairs)
maxOrd <- match(assocs$pairs, maxPairs$pairs)
plot(randOrd, maxOrd, xlab = "Random rank", ylab = "Max chi rank",
main = "Rankings of pair significance between methods")
Both methods agree on the most important relationships in the data, but tend to disagree about the less strongly related variables. Note that this plot was produced by matching the pairs ranked by p-values, which are only accurate for random binning.
Noting the strong impact of study on these results, we might repeat this analysis with the Cleveland data alone.
heartCleve <- heartClean[heartClean$study == "cleveland", ]
heartCleve$study <- NULL
cleveAssoc <- inDep(heartCleve, stopCriteria = stopCrits, catCon = maxCatCon,
conCon = maxConCon)
summary(cleveAssoc)
## All 55 pairs in heartCleve recursively binned with type distribution:
##
## factor:factor factor:numeric numeric:numeric
## 15 30 10
## 28 pairs are significant at 5% and 21 pairs are significant at 1%