Title: | Top Frequency-Based Parallel Coordinates |
---|---|
Description: | Parallel coordinate plotting with resolutions for large data sets and missing values. |
Authors: | Norm Matloff <[email protected]> and Vincent Yang <[email protected]> and Harrison Nguyen <[email protected]> |
Maintainer: | Norm Matloff <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.1 |
Built: | 2024-11-18 06:39:57 UTC |
Source: | CRAN |
A novel approach to the parallel coordinates method for visualization of multiple variables at once, focused on discrete and categorical variables.
(a) Addresses the screen-clutter problem in parallel coordinates, by only plotting the "most typical" cases. These are the tuples with the highest occurrence rates.
(b) Provides a novel approach to NA values by allowing tuples with NA values to partially contribute to complete tuples rather than eliminating missing values.
Type ?quickstart for a quick start.
Norm Matloff <[email protected]>, Vincent Yang <[email protected]>, and Harrison Nguyen <[email protected]>
The functions tupleFreqs and discparcoord are the workhorse functions in the package, calculating frequency counts to be used in the graphs and displaying them.
tupleFreqs(dataset, k=5, NAexp=1.0, countNAs=FALSE, saveCounts=FALSE,
    minFreq=NULL, accentuate=NULL, accval=100)

clsTupleFreqs(cls=NULL, dataset, k=5, NAexp=1, countNAs=FALSE)

discparcoord(data, k=5, grpcategory=NULL, permute=FALSE,
    interactive=TRUE, save=FALSE, name="Parcoords", labelsOff=TRUE,
    NAexp=1.0, countNAs=FALSE, accentuate=NULL, accval=100,
    inParallel=FALSE, cls=NULL, differentiate=FALSE, saveCounts=FALSE,
    minFreq=NULL)
data |
The data, in data frame or matrix form. |
k |
The number of tuples to return. These will be the k most frequent tuples, or the least frequent ones if k is negative. |
grpcategory |
Grouping column/variable. |
permute |
If TRUE, randomly permute the columns before plotting. |
interactive |
If TRUE, use interactive plotting, allowing for interactively readjusting column order and scrubbing/brushing. |
save |
If this is TRUE and interactive mode is on, saved plot will be available from the browser. |
name |
The name for the plot. |
labelsOff |
If TRUE, labels are off. This only comes into effect when interactive=FALSE. |
NAexp |
Scale for NA counts. |
countNAs |
If TRUE, count NA values. |
accentuate |
Character expression specifying the property to accentuate. |
accval |
Value to accentuate. |
inParallel |
If TRUE, calculate tuple frequencies in parallel. |
differentiate |
If TRUE, randomize coloring to differentiate overlapping lines. |
saveCounts |
If TRUE, save the tuple counts to the file ‘tupleCounts’. |
minFreq |
The smallest frequency to be displayed. |
dataset |
The dataset to process, a data frame or data.table. |
cls |
Cluster to be used if inParallel is TRUE. |
Tuple tabulation is performed by tupleFreqs, or in large cases, in parallel by clsTupleFreqs. The display is done by discparcoord.
The k most- or least-frequent tuples will be reported, with the latter specified via negative k. Optionally, tuples with NA values will count less, but will contribute partial counts toward every complete tuple that agrees with them in their non-NA components.
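The selection of the k most- or least-frequent tuples can be sketched in base R as follows. This is only an illustration of the idea on complete rows; topTuples is a hypothetical helper, not a function of the package, and it ignores the NA weighting that tupleFreqs can apply.

```r
# Sketch only: count distinct tuples (rows) and keep the k most or
# least frequent. topTuples is a hypothetical helper, not package code.
topTuples <- function(d, k = 5) {
  # one row per distinct tuple, with its count in column 'freq'
  # (rows containing NA are dropped by aggregate's default na.action)
  counts <- aggregate(freq ~ ., data = cbind(d, freq = 1), FUN = sum)
  # positive k: most frequent first; negative k: least frequent first
  ord <- order(counts$freq, decreasing = (k > 0))
  head(counts[ord, , drop = FALSE], abs(k))
}

d <- data.frame(
  sex   = c("f", "m", "f", "f", "m", "f"),
  class = c("1st", "2nd", "1st", "3rd", "2nd", "1st")
)
top    <- topTuples(d, k = 2)   # the two most common (sex, class) pairs
bottom <- topTuples(d, k = -1)  # the single rarest pair
print(top)
print(bottom)
```

Using the sign of k to pick the sort direction mirrors the convention described above, where a negative k requests the least-frequent tuples.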
If continuous variables are present, then in most cases, either convert them to discrete form using discretize or use freqparcoord.
The data will be converted into a data.table if it is not already in that form. For this and other reasons, it is advantageous to have the data in that form to begin with, say by using data.table::fread to read the data.
Optionally, tuples that partially match a full tuple pattern except for NA values will add a partial count to the frequency count for the full pattern. If for instance the data consist of 8-tuples and a row in the data matches a given 8-tuple pattern in 7 of 8 components, this row would add a count of 7/8 to the frequency for that pattern. To reduce this weight, use a value greater than 1.0 for NAexp. If that value is 2, for example, the 7/8 increment will be 7/8 squared.
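The arithmetic of the partial count can be checked with a short base R sketch. It assumes the weight is the matched fraction raised to the power NAexp, which agrees with the two cases stated above (NAexp of 1 and 2); partialWeight is a hypothetical helper, not part of the package.

```r
# Sketch: weight added by a row that matches a full pattern in 'matched'
# of 'p' components, the remaining components being NA.
# Assumption: weight = (matched / p)^NAexp. partialWeight is hypothetical.
partialWeight <- function(matched, p, NAexp = 1.0) {
  (matched / p)^NAexp
}

w1 <- partialWeight(7, 8)             # 7/8
w2 <- partialWeight(7, 8, NAexp = 2)  # (7/8)^2 = 49/64
print(c(w1, w2))
```

Since the matched fraction is at most 1, any exponent above 1.0 shrinks the contribution of incomplete rows.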
The functions tupleFreqs and clsTupleFreqs return an object of class c('pna','data.frame'), with each row consisting of a tuple and its count. In addition the object will have attributes k and minFreq.
The function discparcoord returns an object of class c('plotly','htmlwidget'). Printing the object causes display of the graph.
## Not run: 
data(Titanic)
# Find frequencies in parallel
discparcoord(Titanic, inParallel=TRUE)

## End(Not run)

## Not run: 
data(hrdata)
input1 = list("name" = "average_montly_hours",
              "partitions" = 3,
              "labels" = c("low", "med", "high"))
input = list(input1)
# this will discretize the data by partitioning average monthly
# hours into 3 parts called low, med, and high
hrdata = discretize(hrdata, input)
print('first few discretized tuples')
# first line should be 0.38,0.53,2,low,3,0,1,00,sales,low
head(hrdata)
print('first few most-frequent tuples')
# first line should be 0.40,0.46,2,...,11
tupleFreqs(hrdata,saveCounts=FALSE)
# account for NA values and plot with parallel coordinates
discparcoord(hrdata)
# same as above, but with scrambled columns
discparcoord(hrdata, permute=TRUE)
# same as above, but show top k values
discparcoord(hrdata, k=8)
# same as above, but group according to profession
discparcoord(hrdata, grpcategory="sales")

## End(Not run)
Converts continuous columns to discrete.
discretize(dataset, input = NULL, ndigs=2, nlevels=10, presumedFactor=FALSE)
dataset |
Dataset to discretize, data frame/table. |
input |
Optional specification for partitioning, giving the number of partitions and labels for each partition. List of lists, one list per column to be converted. The outermost list indicates the columns to be converted, and each inner list holds the name of the column, the number of partitions, and a list of labels for each partition. |
ndigs |
Number of digits to retain in forming labels/values for the discretized data, if |
nlevels |
Number of partitions to form for each variable, if |
presumedFactor |
If TRUE, any numeric variable having fewer than nlevels distinct values is presumed to be an informal R factor and is not converted. |
If input is not specified, each numeric column in the data will be discretized, with one exception: If a column is numeric but has fewer distinct values than nlevels, and if presumedFactor is TRUE, it is presumed to be an informal R factor and will not be converted. However, it is best to use makeFactor on such variables.
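The default rule described above can be sketched in base R. This is a rough illustration, not the package's code: discretizeSketch is a hypothetical name, and it assumes equal-width intervals produced by cut(), whereas the real discretize may choose breaks and labels differently.

```r
# Sketch, not the package's implementation: bin every numeric column,
# except presumed informal factors when presumedFactor is TRUE.
# Assumption: equal-width bins via cut(); discretize() may differ.
discretizeSketch <- function(d, nlevels = 10, presumedFactor = FALSE) {
  for (nm in names(d)) {
    x <- d[[nm]]
    if (!is.numeric(x)) next                # leave non-numeric columns alone
    few <- length(unique(x)) < nlevels      # fewer distinct values than nlevels
    if (few && presumedFactor) next         # presumed informal factor: skip
    d[[nm]] <- cut(x, breaks = nlevels)     # nlevels equal-width intervals
  }
  d
}

d <- data.frame(age  = c(21, 35, 48, 62, 29, 54, 40, 33),
                educ = c(1, 2, 2, 3, 1, 3, 2, 1))
out <- discretizeSketch(d, nlevels = 4, presumedFactor = TRUE)
str(out)  # age becomes a 4-level factor; educ (3 distinct values) stays numeric
```

With presumedFactor left at FALSE, educ would be binned as well, which is why converting such columns explicitly beforehand is the safer route.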
data(prgeng)
pe <- prgeng[,c(1,3,5,7:9)]  # extract vars of interest
pe25 <- pe[pe$wageinc < 250000,]  # delete extreme values
pe25disc <- discretize(pe25)  # age, wageinc and wkswrkd discretized

data(mlb)
# extract the height, weight, age, and position of players
m <- mlb[,4:7]
inp1 <- list("name" = "Height", "partitions"=4,
             "labels"=c("short", "shortmid", "tallmid", "tall"))
inp2 <- list("name" = "Weight", "partitions"=3,
             "labels"=c("light", "med", "heavy"))
inp3 <- list("name" = "Age", "partitions"=3,
             "labels"=c("young", "med", "old"))
# create one list to pass everything to discretize()
discreteinput <- list(inp1, inp2, inp3)
head(discreteinput)
# at this point, all of the data has been discretized
discretizedmlb <- discretize(m, discreteinput)
head(discretizedmlb)
A small fictional dataset from Kaggle. For each employee it includes satisfaction level, the result of the last evaluation, number of projects, average monthly hours, time spent at the company, whether the employee has had a work accident, whether the employee has had a promotion in the last 5 years, department, salary, and finally whether the employee has left the company. Each row represents a single employee.
Change numeric variables that are specified in varnames to factors so that discretize won't partition them.
makeFactor(df, varnames)
df |
Input data frame. |
varnames |
Names of variables to be converted to factors. |
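For readers who want to see the underlying idea without the package, the conversion can be approximated in base R. This is a minimal sketch under the assumption that a plain as.factor on each named column is an adequate stand-in; makeFactorSketch is a hypothetical name and the toy data frame is made up for illustration.

```r
# Base-R sketch of the idea; makeFactorSketch is a hypothetical stand-in
# for makeFactor and may not match its behavior in every detail.
makeFactorSketch <- function(df, varnames) {
  for (nm in varnames) df[[nm]] <- as.factor(df[[nm]])
  df
}

# toy data, invented for this illustration
d  <- data.frame(educ = c(12L, 16L, 12L), wageinc = c(50000, 72000, 61000))
d2 <- makeFactorSketch(d, "educ")
class(d2$educ)     # factor
class(d2$wageinc)  # numeric, untouched
```

Once a column is a factor it is no longer numeric, so a later discretization pass that only touches numeric columns leaves it alone.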
data(prgeng)
pe <- prgeng[,c(1,3,5,7:9)]
class(pe$educ)  # integer
pe <- makeFactor(pe,c('educ','occ','sex'))
class(pe$educ)  # factor
# nice to give levels names
levels(pe$sex) <- c('male','female')
head(pe$sex)
Quick introduction to the package.
# programmer/engineer info from 2000 Census
data(prgeng)
# select some columns of interest
pe <- prgeng[,c(1,3,5,7:9)]
# remove some extreme values
pe25 <- pe[pe$wageinc < 250000,]
# some numeric variables are really factors
pe25 <- makeFactor(pe25,c('educ','occ','sex'))
# convert the continuous variables to discrete
pe25disc <- discretize(pe25,nlevels=5)

## Not run: 
# display
discparcoord(pe25disc,k=150)
# then possibly brush, etc.

## End(Not run)
Use to order the levels of a factor in a desired sequence.
reOrder(dataset, colName, levelNames)
dataset |
Dataset to reorder. |
colName |
Column name. |
levelNames |
Names of the reordered levels |
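As a point of comparison, the analogous operation in base R is to rebuild the factor with an explicit levels argument. The sketch below shows only that base R idiom; it is not how reOrder itself works, and reOrder may record the order differently (the package example mentions a 'categoryorder' attribute).

```r
# Base-R analogue only: impose a desired level order on a factor.
sl <- c("primary", "college", "hs", "middle", "hs")
f  <- factor(sl)   # default level order is alphabetical
f2 <- factor(f, levels = c("primary", "middle", "hs", "college"))

levels(f)          # "college" "hs" "middle" "primary"
levels(f2)         # "primary" "middle" "hs" "college"
as.integer(f2)     # 1 4 3 2 3
```

The integer codes follow the new level order, which is what determines the position of each category along an axis in most plotting code.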
sl <- c('primary','college','hs','middle','hs')
z <- data.frame(
   schlvl = factor(x=sl, levels=c('college','hs','middle','primary'))
)
z
z <- reOrder(z,'schlvl',c('primary','middle','hs','college'))
str(z)  # shows the desired label order in the 'categoryorder' attribute
Used with saveCounts=TRUE in tupleFreqs etc. to recover the tuple counts.
showCounts(nshow=NULL)
nshow |
Dataset to show. |
A small fictional dataset with a variety of values and NAs, intended to illustrate tupleFreqs and the frequency-based calculations in cdparcoord.