---
title: 'cdparcoord: Categorical and Discrete Parallel Coordinates'
author: "Norm Matloff, Vincent Yang and Harrison Nguyen"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
---

# cdparcoord: Categorical and Discrete Parallel Coordinates

# Table of Contents

* [Motivation](#motivation)
* [Quickstart](#quickstart)
* [Further Examples](#further-examples)
* [Key Functions](#key-functions)
* [Tips](#tips)
* [Authors](#authors)

# Motivation

The [*parallel coordinates* approach](https://en.wikipedia.org/wiki/Parallel_coordinates) is a popular method for graphing multivariate data. However, for large data sets, the method suffers from a "black screen problem" -- the jumble of lines fills the screen and it is difficult if not impossible to discern any relationships in the data.

Consider the package dataset **mlb**, consisting of data on Major League Baseball players, courtesy of the UCLA Stat Dept.

```R
data(mlb)
# extract height, weight, age
m <- mlb[,4:6]
# ordinary parallel coordinates
library(MASS)
parcoord(m)
```

Each polygonal line in the graph represents one player, connecting his height, weight and age. If a player had his (height,weight,age) tuple as, say, (73,192,25), his line would have height 73 on the Height axis, height 192 on the Weight axis and height 25 on the Age axis. But since there are so many lines (actually only about 1000), the graph is useless.

Our solution is to graph only the most frequent lines. Our [**freqparcoord** package](https://cran.r-project.org/package=freqparcoord) does this for continuous variables, with line frequency defined in terms of estimated multivariate density. The current package, [**cdparcoord**](https://github.com/matloff/cdparcoord), covers the case of categorical variables, with frequency defined as actual tuple count. (In a mixed continuous-categorical setting, the continuous variables are discretized.)

# Quickstart

## Installation

#### CRAN

```r
install.packages("cdparcoord")
```

#### GitHub (development version)

```r
install.packages("devtools")
devtools::install_github("matloff/cdparcoord")
```

Here we give a quick view of the package operations. It is assumed that the user has already executed

```R
library(cdparcoord)
```

## Example: Gender pay difference

This example involves data from the 2000 U.S. census on programmers and engineers in Silicon Valley, a dataset included with the package. Suppose our interest is exploring whether women encounter wage discrimination. Of course we won't settle such a complex question here, but it will serve as a good example of the use of the package as an exploratory tool.

We first load the data, and select some of the columns for display. (In this tutorial, the terms *column* and *variable* will be used interchangeably, and will have the same meaning as *feature*.) We also remove some very high wages (at least in the context of the year 2000) to make the display easier.
```R
data(prgeng)
pe <- prgeng[,c(1,3,5,7:9)]
pe25 <- pe[pe$wageinc < 250000,]
```

The resulting data has just under 20,000 rows.

As mentioned, a key feature of the package is discretization of continuous variables, so that the tuple frequency counts will have meaning. We will do this via the package's **discretize()** function, which we will apply to the numeric variables.

However, in this particular data set, there are variables that seem numeric but are in essence factors, as they are codes. For the **educ** variable, for instance, the number 14 codes a master's degree. (A code list is available at the [Census Bureau site](https://www.census.gov/prod/cen2000/doc/pums.pdf).)

So, let's change the coded variables to factors, and then discretize:

```R
pe25 <- makeFactor(pe25,c('educ','occ','sex'))
pe25disc <- discretize(pe25,nlevels=5)
```

Each of the numeric variables here is discretized into 5 levels. Now display:

```R
discparcoord(pe25disc,k=150,saveCounts=FALSE)
```

Here we are having **cdparcoord** display the 150 most frequent tuple patterns. The result is

For example, there is a blue line corresponding to the tuple

(age=35,educ=14,occ=102,sex=1,wageinc=100000,wkswrkd=52)

The frequencies are color-coded according to the legend at the right. So the above tuple occurred something like 60 times in the data.

What about the difference between males and females (coded 1 and 2)? One interesting point is that there seems to be greater range in the men's salaries. At the high end, though, men seem to have the edge.

Now, can that edge be explained by differences between the two groups in other variables? For instance, do women tend to be in lower-paying occupations?

To investigate that, let's move the **occ** column to be adjacent to **wageinc**. This is accomplished by a mouse operation that is provided by **plotly**, the graphical package on top of which **cdparcoord** is built. Specifically, we can use the mouse to drag the **occ** label to the right, releasing the mouse when the column is near the **wageinc** column. The result is

As noted, there is a range for each occupation. However, looking at the more-frequent lines, occupation 102 seems to be rather lucrative (and possibly 140 and 141). And if so, that seems to be bad news for the women, as occupation 102 seems more populated by men. On the other hand, this might help explain the high-end salary gender gap found above.

Another possible piece of evidence in this direction is that the graph seems to say the men tend to have higher levels of education.

To obtain a somewhat finer look, we can use another feature supplied by **plotly**, a form of *brushing*. Here we highlight the women's lines by using the mouse to drag the top of the **sex** column down slightly. This causes the men's lines to go to light gray, while the women's lines stay in color:

The fact that we requested brushing for **sex** = 2 is confirmed in the graph by the appearance of a short magenta-colored line segment just below the 2 tick mark.

Multiple columns can be brushed together. To turn off brushing, click on a non-magenta portion of the axis.

# Further Examples

The following examples will illustrate other features of the package, as well as different applications.

## Example: Node reordering, advanced brushing

Here we try the [Stanford WordBank data](http://wordbank.stanford.edu/analyses?name=instrument_data) on vocabulary acquisition in young children. The file used was **English.csv**, from which we have a data frame **wb**, consisting of about 5500 rows. (There are many NA values, though, and only about 2800 complete cases.) Variables are age, birth order, sex, mother's education and vocabulary size.

```R
wb <- wb[,c(2,5,7,8,10)]
wb <- discretize(wb,nlevels=5)
discparcoord(wb,k=100,saveCounts=FALSE)
```

We again asked for 5 levels for each variable. As noted in the [Tips section](#tips) below, though, **cdparcoord**, like any graphical exploratory tool, is best used by trying a number of different parameter combinations, e.g. varying **nlevels** here.

This produces

Nice -- but, presuming that **mom\_ed** has an ordinal relation with vocabulary, the ordering of the labels here is not what we would like. We can use **reOrder** to remedy that:

```R
wb <- reOrder(wb,'mom_ed',
   c('Secondary','Some College','College','Some Graduate','Graduate'))
discparcoord(wb,k=100,saveCounts=FALSE)
```
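To see why level order matters, here is a minimal base-R sketch of the same idea, using a small made-up vector rather than the WordBank data. It is not a description of how **reOrder** works internally; it only illustrates that R sorts factor levels alphabetically by default and that an explicit `levels` argument imposes the ordinal order we want. The names `mom_ed`, `ord` and `mom_ed_ord` are hypothetical.

```r
# Hypothetical toy data, not the WordBank file
mom_ed <- factor(c('College', 'Secondary', 'Graduate', 'Some College', 'College'))

# Default level order is alphabetical, which is not the ordinal order
levels(mom_ed)

# Impose the intended ordinal order explicitly
ord <- c('Secondary', 'Some College', 'College', 'Graduate')
mom_ed_ord <- factor(mom_ed, levels = ord)

# The integer codes now follow the ordinal order,
# which is what determines tick placement on an axis
levels(mom_ed_ord)
as.integer(mom_ed_ord)
```

The data values themselves are unchanged; only the mapping from labels to positions differs.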

By the way, there were further levels in the **mom\_ed** variable, 'Primary' and 'Some Secondary', but they didn't show up here, as we plotted only the top 100 lines. (Or we set **nlevels** at a higher value than might be effective for this data.) There was a similar issue with missing levels on **birth\_order**. Speaking of the latter, the earlier-born children seem to be at an advantage, at least in the two orders that show up here.

Now suppose we wish to study girls with mothers having at least a college education. Again we can use brushing, this time with two variables, **sex** and **mom\_ed**, together, and with several values brushed in the latter variable:

The magenta highlights show that **sex** and **mom\_ed** were brushed, and in the latter case, specifically the levels 'College', 'Some Graduate' and 'Graduate'. Lines with all other combinations now appear in light gray.

## Example: Advanced use of discretize()

We return to the baseball data, and show a more advanced usage of **discretize()**. The dataset is probably too small to discretize -- some frequencies of interesting tuples will be very small -- but it is a good example of usage of lists in **discretize()**.

The key argument is **input**, which will be an R list of lists. In the outer list, there will be one inner list for each variable to be specified for discretization.

```R
# one inner list per variable: name, number of partitions, labels
inp1 <- list("name" = "Height",
             "partitions"=3,
             "labels"=c("short", "med", "tall"))
inp2 <- list("name" = "Weight",
             "partitions"=3,
             "labels"=c("light", "med", "heavy"))
inp3 <- list("name" = "Age",
             "partitions"=2,
             "labels"=c("young", "old"))
# the outer list collects the per-variable specifications
discreteinput <- list(inp1, inp2, inp3)
discretizedmlb <- discretize(m, discreteinput)
discparcoord(discretizedmlb, name="MLB", k=100,saveCounts=FALSE)
```
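For readers who want a feel for what a three-way partition with labels amounts to, the following base-R sketch uses `cut()` on a small made-up vector. It assumes equal-width intervals; the text above does not say which partitioning scheme **discretize()** uses, so treat this as an analogy rather than the package's method. The names `height` and `height_cat` are hypothetical.

```r
# Hypothetical heights in inches, not the mlb data
height <- c(67, 70, 72, 74, 75, 78, 80)

# Split the observed range into 3 equal-width intervals and label them
height_cat <- cut(height, breaks = 3, labels = c('short', 'med', 'tall'))

# Each numeric value is now a category, so identical tuples can be counted
table(height_cat)
```

With fewer partitions, more observations share a category, which raises tuple counts at the cost of resolution.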

Had we wanted to handle **Height** separately, we could have called **discretize()** twice:

```R
inpt <- list(inp2,inp3)
m1 <- discretize(m,inpt)
m2 <- discretize(m1)
discparcoord(m2,k=150,saveCounts=FALSE)
```

## Example: Outlier hunting

As with **freqparcoord** for the continuous-variable case, **cdparcoord** may be used for identifying outliers. This is accomplished by setting a negative value for the argument **k** in **discparcoord()**. Here we took **k** = -10, i.e. asked to plot the 10 LEAST frequent tuples:

```R
library(mlbench)  # provides the PimaIndiansDiabetes data
data(PimaIndiansDiabetes)
pima <- PimaIndiansDiabetes
discparcoord(pima,k=-10,saveCounts=FALSE)
showCounts()
```

There is an interesting case of a woman who has high values on almost all the risk factors, and does indeed have diabetes. That one may be a correct data point (though possibly a candidate for exclusion in formal statistical analysis), but the case in which skin thickness, insulin and BMI (**triceps**, **insulin** and **mass**) are all 0s is clearly an error.

Let's investigate this further, by examining the actual tuples:

```
> showCounts()
   pregnant glucose pressure triceps insulin mass pedigree age diabetes freq
1         6     148       72      35       0 33.6    0.627  50      pos    1
2         1      85       66      29       0 26.6    0.351  31      neg    1
3         8     183       64       0       0 23.3    0.672  32      pos    1
4         1      89       66      23      94 28.1    0.167  21      neg    1
5         0     137       40      35     168 43.1    2.288  33      pos    1
6         5     116       74       0       0 25.6    0.201  30      neg    1
7         3      78       50      32      88 31.0    0.248  26      pos    1
8        10     115        0       0       0 35.3    0.134  29      neg    1
9         2     197       70      45     543 30.5    0.158  53      pos    1
10        8     125       96       0       0  0.0    0.232  54      pos    1
```

So, in addition to the triple-0 case, there are several with double 0s and one with a single 0. There are probably more in the rest of the data. We see that there is serious need for data cleaning here.

## Example: Time series

Consider the [ozone data](https://archive.ics.uci.edu/ml/datasets/Ozone+Level+Detection) in the UCI Machine Learning Data Repository. We will use the file **onehr.data**, which gives hourly readings of ozone and other values over the course of a day. There is one row per day, during the time period 1998-2004.

The missing values are coded as question marks in the dataset, so we'll remove any case with such a value. Also, to make the display easier to view, we'll look only at readings every four hours. We will also include the midday temperature:

```R
oz <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/ozone/onehr.data',header=FALSE)
# keep only the rows with no '?' (missing value) entries
noq <- function(rw) !any(rw == '?')
goodrows <- apply(oz,1,noq)
ozc <- oz[goodrows,]
# columns read as character/factor due to the '?' codes; convert to numeric
for(i in 1:74) ozc[,i] <- as.numeric(as.character(ozc[,i]))
# readings every four hours, plus the midday temperature
ozc4 <- ozc[,c(seq(2,25,4),39)]
ozc4d <- discretize(ozc4,nlevels=3)
discparcoord(ozc4d,k=25,saveCounts=FALSE)
```

The result is

Apparently on hot days, a high ozone level at the start of a day will persist as the day goes on. However, on cooler days, there may be some oscillation. As is often the case, we would need a domain expert to interpret this, but the point is that we have discovered something for him/her to investigate.

## Example: Classification problems

The package can be used in an exploratory manner for feature selection in classification problems. This can be useful, for instance, for identifying sets of good predictors.

Let's look at the Letter Recognition dataset from the UCI Machine Learning Repository. Images were generated for the 26 capital letters in English, various fonts, some randomly-generated distortion. Various physical and statistical numerics were recorded for each image.

One might expect the letters O and Q to be more difficult to tell apart. Let's view that using **cdparcoord**. By the way, no need to discretize here, as all values are integers.

```R
ltr <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data',
   header=FALSE)
oq <- ltr[ltr[,1]=='O' | ltr[,1]=='Q',]
discparcoord(oq,k=100,saveCounts=FALSE)
```

A little daunting! Yet even for these two similar letters, it appears for instance that variables V2 through V6 may have some predictive power to distinguish between the two letters. Brushing the Q node makes this a little clearer (not shown here).

## Example: Rare subgroups

Since (for positive **k**) we only display the most-frequent tuples, we may miss small but important subgroups. This may especially be a problem in classification applications, with one or more small classes. For that reason, **cdparcoord** provides a mechanism with which the user can specify that a given subgroup be given extra weighting in the frequency counting.

Let's use the Census data to illustrate this. If in the previous graph above for this dataset one brushes both **educ** = 16 and **sex** = 2, i.e. female PhDs, one finds that all lines are gray! There were actually 102 people of that type in the data, but that wasn't popular enough to make the cut. Instead, one can run

```R
discparcoord(pe25disc,k=150,
   accentuate='with(pe25disc,educ==16 & sex==2)',saveCounts=FALSE)
```

which specifies to give extra weight (default value 100) to this group.

## Example: Comparison to full parallel coordinates

Finally, let's look at the Diamonds example bundled with the **ggplot2** package (in turn bundled by **plotly**, which is loaded by **cdparcoord**). We'll borrow from a Luke Tierney course example, which uses the **lattice** graphics library.

First, the data prep. The example "thins out" the data by taking a subsample:

```R
library(lattice)
ds <- diamonds[sample(nrow(diamonds), 5000),]
parallelplot(~ds, group = cut, data = ds,
   horizontal.axis = FALSE, auto.key = TRUE)
```

It's a pretty picture, all right, but hard to follow, say for the Premium diamonds. Let's try **cdparcoord**, using the full data set of course, though as usual only the most-frequent tuples:

```R
dd <- discretize(diamonds,nlevels=4)
discparcoord(dd,k=2500,saveCounts=FALSE)
```

It is definitely easier to discern the patterns here than above. Look for instance at the blue lines for the Premium quality.

# Key Functions

#### `discparcoord()`

The main function is `discparcoord()`, which may optionally be used with `discretize()`. `discparcoord()` accounts for partial values and drawing.

#### `discretize()`

`discretize()` takes a dataset and a list of lists. It discretizes the dataset's values such that `plot()` may chart categorical variables. The inner list should contain the following variables: `int partitions`, `string vector labels`, `vector lower bounds`, `vector upper bounds`. The last three are optional.

#### `discparcoord()` details

Encompassed in **discparcoord**, we provide 3 key functions -- `tupleFreqs()`, `grpcategory()`, and `interactivedraw()`.

1. The call `tupleFreqs(dataset,n)` inputs a dataset and returns a new dataset consisting of the **n** most frequent patterns with an added column, the frequency of each pattern. This dataset contains no NA values, as all of the rows previously containing NA values have now been eliminated. By default, `tupleFreqs` returns the 5 most frequent tuples.

2. The `grpcategory` option allows you to create multiple plots, one for each category. If a field has 4 possible values, then `discparcoord()` with the `grpcategory` option will create a plot for each category, where each plot has the specific category's attributes. For example, if a field "Weight" has the values "Heavyweight" and "Lightweight", then this will create one plot where all tuples are heavyweights, then one more where all tuples are lightweights.

3. `interactivedraw()` takes a dataset and draws a parallel coordinates plot that opens in your browser. It has movable columns, brushing, and the ability to save your plots. You can also choose to toggle labels on and off. For more information, type `?interactivedraw` into the console.
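The tuple-counting idea behind item 1 can be sketched in a few lines of base R. This is only an illustration on made-up data, not the code of `tupleFreqs()`; in particular it simply drops rows with NA values, and the names `df`, `complete`, `counts` and `top2` are hypothetical.

```r
# Hypothetical toy data with one incomplete row
df <- data.frame(
  sex  = c('F', 'M', 'M', 'F', 'M', 'M', NA),
  educ = c(14, 16, 16, 14, 16, 13, 14),
  stringsAsFactors = FALSE
)

# Drop rows containing NA values
complete <- df[complete.cases(df), ]

# Count how often each distinct (sex, educ) tuple occurs
counts <- aggregate(freq ~ sex + educ,
                    data = cbind(complete, freq = 1), FUN = sum)

# Sort by decreasing frequency and keep the 2 most frequent tuples
counts <- counts[order(-counts$freq), ]
top2 <- head(counts, 2)
top2

# The least frequent tuples, the analogue of a negative k, sit at the tail
tail(counts, 1)
```

Each retained row then becomes one line in the plot, colored by its `freq` value.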
# Tips

* Like any exploratory graphical tool, **cdparcoord** is best used by trying various parameter values, e.g. different values of **k**, **nlevels** and so on, rather than just one combination of settings.

* Pay close attention to the frequency color-coding. For instance, if the lines are all green, this means the frequencies are all about the same, likely 1. In such a case, you may wish to use **discretize()** with a small value of **nlevels**.

* On the other hand, there are situations in which one may not wish to discretize, such as:

    - When searching for outliers (negative **k**), the cases of interest will typically have frequency 1 and in any event the frequency may not be of interest here.

    - With a large number of variables, even a value of 2 for **nlevels** will result in frequencies of 1, so it may be better not to discretize in the first place. Actually, **freqparcoord** may be more useful here.

* Sometimes labels greatly hinder the visibility and clarity of the plot. This can be circumvented by opting to remove labels in the plot.

* Sometimes two lines will coincide in one or more segments. Brushing may help separate them.

# Accounting for NA Values (EXPERIMENTAL, MAY BE VERY SLOW)

R and R packages typically leave out any rows with NA values. Unfortunately, for data sets with high NA counts, this may have drastic effects, such as low counts and possible bias. [`cdparcoord`](https://github.com/matloff/cdparcoord) addresses this issue by allowing these rows to partially contribute to overall counts. See the **NAexp** argument of **tupleFreqs()**.

# Authors

Norm Matloff, Vincent Yang, Harrison Nguyen