Package 'cdparcoord' reference manual

Title:	Top Frequency-Based Parallel Coordinates
Description:	Parallel coordinate plotting with resolutions for large data sets and missing values.
Authors:	Norm Matloff, Vincent Yang, and Harrison Nguyen
Maintainer:	Norm Matloff
License:	GPL (>= 2)
Version:	1.0.1
Built:	2025-02-16 06:57:45 UTC
Source:	CRAN

Top-frequency parallel coordinates plots.

Description

A novel approach to the parallel coordinates method for visualization of multiple variables at once, focused on discrete and categorical variables.

(a) Addresses the screen-clutter problem in parallel coordinates, by only plotting the "most typical" cases. These are the tuples with the highest occurrence rates.

(b) Provides a novel approach to NA values: tuples containing NA values contribute partially to the counts of matching complete tuples, rather than being discarded.

Type ?quickstart for a quick start.

Author(s)

Norm Matloff, Vincent Yang, and Harrison Nguyen

Compute/display tuple frequency counts, and optionally account for NA values

Description

The functions tupleFreqs and discparcoord are the workhorse functions in the package, calculating frequency counts to be used in the graphs and displaying them.

Usage

    tupleFreqs(dataset,k=5,NAexp=1.0,countNAs=FALSE,saveCounts=FALSE, 
       minFreq=NULL,accentuate=NULL,accval=100) 
    clsTupleFreqs(cls=NULL, dataset, k=5, NAexp=1, countNAs=FALSE)
    discparcoord(data, k=5, grpcategory=NULL, permute=FALSE,
        interactive = TRUE, save=FALSE, name="Parcoords", labelsOff=TRUE,
        NAexp=1.0,countNAs=FALSE, accentuate=NULL, accval=100, inParallel=FALSE,
        cls=NULL, differentiate=FALSE, saveCounts=FALSE, minFreq=NULL)

Arguments

`data`	The data, in data frame or matrix form.
`k`	The number of tuples to return. These will be the `k` most frequent tuples, unless `k` is negative, in which case the least-frequent tuples will be returned. The latter is useful for hunting for outliers.
`grpcategory`	Grouping column/variable.
`permute`	If TRUE, randomly permute the columns before plotting.
`interactive`	If TRUE, use interactive plotting, allowing for interactively readjusting column order and scrubbing/brushing.
`save`	If TRUE and interactive mode is on, the saved plot will be available from the browser.
`name`	The name for the plot.
`labelsOff`	If TRUE, labels are off. This only comes into effect when interactive=FALSE.
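`NAexp`	Exponent applied to the partial count contributed by tuples with NA values. Values greater than 1.0 reduce that contribution; see Details.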
`countNAs`	If TRUE, count NA values.
`accentuate`	Character expression specifying the property to accentuate.
`accval`	Value to accentuate.
`inParallel`	If TRUE, calculate tuple frequencies in parallel.
`differentiate`	If TRUE, randomize coloring to differentiate overlapping lines.
`saveCounts`	If TRUE, save the tuple counts to the file ‘tupleCounts’.
`minFreq`	The smallest frequency to be displayed.
`dataset`	The dataset to process, a data frame or data.table.
`cls`	Cluster to be used if `inParallel` is TRUE. If `inParallel` is TRUE and `cls` is not supplied, the default is a cluster sized to the detected number of cores on the calling machine.

Details

Tuple tabulation is performed by tupleFreqs or, for large datasets, in parallel by clsTupleFreqs. The display is done by discparcoord.

The k most- or least-frequent tuples will be reported, with the latter specified via negative k. Optionally, tuples with NA values will count less, contributing a partial weight to each complete tuple that agrees with them in all non-NA components.
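The role of the sign of k can be illustrated with base R alone. The following is only a minimal sketch, not the package's implementation: the helper topTuples and the data frame toy are hypothetical names introduced here, and NA handling is ignored entirely.

```r
# Hypothetical helper (not part of cdparcoord): count distinct rows (tuples)
# and keep the k most frequent ones, or the |k| least frequent if k < 0.
topTuples <- function(df, k = 5) {
  # aggregate() needs something to sum, so add a column of ones
  df$freq <- 1
  counts <- aggregate(freq ~ ., data = df, FUN = sum)
  # decreasing frequency for positive k, increasing for negative k
  ord <- order(counts$freq, decreasing = (k > 0))
  head(counts[ord, , drop = FALSE], abs(k))
}

# Toy data, invented for illustration
toy <- data.frame(
  sex   = c("m", "f", "f", "m", "f", "f"),
  class = c("1st", "2nd", "2nd", "1st", "2nd", "3rd"),
  stringsAsFactors = TRUE
)

top  <- topTuples(toy, k = 1)    # the single most frequent tuple
rare <- topTuples(toy, k = -1)   # the single least frequent tuple
print(top)
print(rare)
```

With a negative k the same tabulation is simply read from the other end, which is why it is convenient for outlier hunting.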

If continuous variables are present, then in most cases, either convert to discrete using discretize or use freqparcoord.

The data will be converted into a data.table if it is not already in that form. For this and other reasons, it is advantageous to have the data in that form to begin with, say by using data.table::fread to read the data.

Optionally, tuples that partially match a full tuple pattern except for NA values will add a partial count to the frequency count for the full pattern. If for instance the data consist of 8-tuples and a row in the data matches a given 8-tuple pattern in 7 of 8 components, this row would add a count of 7/8 to the frequency for that pattern. To reduce this weight, use a value greater than 1.0 for NAexp. If that value is 2, for example, the 7/8 increment will be 7/8 squared.
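The arithmetic in the 8-tuple example can be checked with a short base R sketch. The helper partialWeight is a hypothetical name introduced here, and the rule that a row disagreeing with the pattern on any observed component contributes nothing is an assumption, not something the package documentation states.

```r
# Hypothetical helper (not part of cdparcoord): weight that a row with NAs
# adds to a full tuple pattern it matches in all of its non-NA components.
partialWeight <- function(row, pattern, NAexp = 1.0) {
  present <- !is.na(row)
  # assumption: any mismatch on an observed component means no contribution
  if (!all(row[present] == pattern[present])) return(0)
  (sum(present) / length(pattern))^NAexp
}

pattern <- c("a", "b", "c", "d", "e", "f", "g", "h")  # a full 8-tuple
row     <- c("a", "b", "c", NA,  "e", "f", "g", "h")  # matches in 7 of 8

w1 <- partialWeight(row, pattern)             # 7/8
w2 <- partialWeight(row, pattern, NAexp = 2)  # (7/8)^2
print(c(w1, w2))
```

Because the base fraction is at most 1, any exponent above 1.0 can only shrink the contribution of incomplete rows.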

Value

The functions tupleFreqs and clsTupleFreqs return an object of class c('pna','data.frame'), with each row consisting of a tuple and its count. In addition the object will have attributes k and minFreq.

The function discparcoord returns an object of class c('plotly','htmlwidget'). Printing the object causes display of the graph.

Author(s)

Norm Matloff, Vincent Yang, and Harrison Nguyen

Examples


   ## Not run: 
       data(Titanic)
       # Find frequencies in parallel
       discparcoord(Titanic, inParallel=TRUE)
    
## End(Not run)

    ## Not run: 
       data(hrdata)
       input1 = list("name" = "average_montly_hours",
                     "partitions" = 3, "labels" = c("low", "med", "high"))
       input = list(input1)
       # this will discretize the data by partitioning average monthly 
       # hours into 3 parts called low, med, and high
       hrdata = discretize(hrdata, input)
       print('first few discretized tuples')
       # first line should be 0.38,0.53,2,low,3,0,1,00,sales,low
       head(hrdata)
       print('first few most-frequent tuples')
       # first line should be 0.40,0.46,2,...,11
       tupleFreqs(hrdata,saveCounts=FALSE)
       # account for NA values and plot with parallel coordinates
       discparcoord(hrdata)
       # same as above, but with scrambled columns
       discparcoord(hrdata, permute=TRUE)
       # same as above, but show the 8 most frequent tuples
       discparcoord(hrdata, k=8)
       # same as above, but group by department (the column named "sales")
       discparcoord(hrdata, grpcategory="sales")
    
## End(Not run)

Demographic statistics by ZIP Code.

Description

Useful for embeddings. Source is catalog.data.gov.

Discretize continuous data.

Description

Converts continuous columns to discrete.

Usage

    discretize(dataset, input = NULL, ndigs=2, nlevels=10, presumedFactor=FALSE)

Arguments

`dataset`	Dataset to discretize, data frame/table.
`input`	Optional specification for partitioning, giving the number of partitions and labels for each partition. List of lists, one list per column to be converted. The outermost list indicates the columns to be converted, and each inner list holds the name of the column, the number of partitions, and a list of labels for each partition.
`ndigs`	Number of digits to retain in forming labels/values for the discretized data, if `input` is not supplied. E.g. if `ndigs` is 2 and the original datum is 38.12, it becomes 38.
`nlevels`	Number of partitions to form for each variable, if `input` is NULL.
`presumedFactor`	If TRUE, any variable having fewer than `nlevels` levels will be presumed to be an informal factor, and thus will not be discretized.

Details

If input is not specified, each numeric column in the data will be discretized, with one exception: If a column is numeric but has fewer distinct values than nlevels, and if presumedFactor is TRUE, it is presumed to be an informal R factor and will not be converted. However, it is best to use makeFactor on such variables.
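The default behavior described above can be sketched in base R. This is an illustration only: discretizeSketch is a hypothetical helper, and the use of equal-width bins via cut() is an assumption made here, since this page does not say how discretize chooses its partitions.

```r
# Hypothetical helper (not part of cdparcoord): bin every numeric column
# into nlevels intervals, optionally skipping presumed informal factors.
discretizeSketch <- function(df, nlevels = 10, presumedFactor = FALSE) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (!is.numeric(x)) next
    fewValues <- length(unique(x)) < nlevels
    if (presumedFactor && fewValues) next  # presumed informal factor
    df[[nm]] <- cut(x, breaks = nlevels)   # equal-width bins (assumption)
  }
  df
}

# Toy data, invented for illustration
toy <- data.frame(
  age  = c(23, 35, 41, 52, 29, 60, 33, 47, 38, 55, 26, 44),
  educ = c(12, 16, 12, 14, 16, 12, 14, 16, 12, 14, 16, 12)
)

# age has 12 distinct values and is binned; educ has only 3 and is skipped
d1 <- discretizeSketch(toy, nlevels = 4, presumedFactor = TRUE)
str(d1)
```

Relying on the distinct-value count is a heuristic, which is one reason to mark such columns explicitly with makeFactor instead.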

Author(s)

Norm Matloff, Vincent Yang, and Harrison Nguyen

Examples


    data(prgeng)
    pe <- prgeng[,c(1,3,5,7:9)]  # extract vars of interest
    pe25 <- pe[pe$wageinc < 250000,]  # delete extreme values
    pe25disc <- discretize(pe25)  # age, wageinc and wkswrkd discretized

    data(mlb)
    # extract the height, weight, age, and position of players
    m <- mlb[,4:7]

    inp1 <- list("name" = "Height",
                 "partitions"=4,
                 "labels"=c("short", "shortmid", "tallmid", "tall"))

    inp2 <- list("name" = "Weight",
                 "partitions"=3,
                 "labels"=c("light", "med", "heavy"))

    inp3 <- list("name" = "Age",
                 "partitions"=3,
                 "labels"=c("young", "med", "old"))

    # create one list to pass everything to discretize()
    discreteinput <- list(inp1, inp2, inp3)
    head(discreteinput)

    # at this point, all of the data has been discretized
    discretizedmlb <- discretize(m, discreteinput)
    head(discretizedmlb)



A human resources simulated dataset.

Description

A small fictional dataset from Kaggle. Each row represents a single employee and records satisfaction level, the result of the last evaluation, number of projects, average monthly hours, time spent at the company, whether the employee has had a work accident, whether the employee has been promoted in the last 5 years, department, salary, and whether the employee has left the company.

Author(s)

Norm Matloff, Vincent Yang, and Harrison Nguyen

Change numeric variables to factors.

Description

Change the numeric variables specified in varnames to factors, so that discretize won't partition them.

Usage

    makeFactor(df, varnames)

Arguments

`df`	Input data frame.
`varnames`	Names of variables to be converted to factors.
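For readers who want to see the effect without the package loaded, the conversion can be approximated in base R. The helper toFactors below is a hypothetical stand-in introduced here, and it is only assumed to behave similarly to makeFactor.

```r
# Hypothetical stand-in for makeFactor() (not the package's code):
# convert the named columns of a data frame to factors.
toFactors <- function(df, varnames) {
  df[varnames] <- lapply(df[varnames], as.factor)
  df
}

# Toy data, invented for illustration
toy <- data.frame(educ = c(12L, 16L, 12L), wage = c(50000, 72000, 61000))
toy2 <- toFactors(toy, "educ")
class(toy2$educ)  # "factor"
class(toy2$wage)  # "numeric"
```

Once a column is a factor it is no longer numeric, so a discretization pass over numeric columns leaves it alone.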

Author(s)

Norm Matloff, Vincent Yang, and Harrison Nguyen

Examples

data(prgeng)
pe <- prgeng[,c(1,3,5,7:9)]
class(pe$educ)  # integer
pe <- makeFactor(pe,c('educ','occ','sex'))
class(pe$educ)  # factor
# nice to give levels names
levels(pe$sex) <- c('male','female')
head(pe$sex)

cdparcoord: Quick start

Description

Quick introduction to the package.

Examples

   # programmer/engineer info from 2000 Census
   data(prgeng)
   # select some columns of interest
   pe <- prgeng[,c(1,3,5,7:9)]
   # remove some extreme values
   pe25 <- pe[pe$wageinc < 250000,]
   # some numeric variables are really factors
   pe25 <- makeFactor(pe25,c('educ','occ','sex'))
   # convert the continuous variables to discrete
   pe25disc <- discretize(pe25,nlevels=5)
   ## Not run: 
      # display
      discparcoord(pe25disc,k=150)
      # then possibly brush, etc. 
   
## End(Not run)

Re-order levels of a factor, according to some desired ordinal form.

Description

Use to order the levels of a factor in a desired sequence.

Usage

    reOrder(dataset, colName, levelNames)

Arguments

`dataset`	Dataset to reorder.
`colName`	Column name.
`levelNames`	Names of the levels, in the desired order.
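The underlying idea of giving factor levels an explicit ordinal sequence can be shown in base R. This sketch uses factor() directly and is not what reOrder does internally; the variable names are illustrative only.

```r
# Base R illustration (not reOrder itself): set an explicit level order.
sl <- c("primary", "college", "hs", "middle", "hs")

f <- factor(sl)   # default level order is alphabetical
levels(f)         # "college" "hs" "middle" "primary"

# Re-create the factor with the levels in the desired ordinal sequence
f2 <- factor(sl, levels = c("primary", "middle", "hs", "college"))
levels(f2)        # "primary" "middle" "hs" "college"
as.integer(f2)    # 1 4 3 2 3
```

The integer codes follow the stated level order, which is what determines the position of each category along an axis in many plotting functions.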

Author(s)

Norm Matloff, Vincent Yang, and Harrison Nguyen

Examples

   sl <- c('primary','college','hs','middle','hs')
   z <- data.frame(
          schlvl = factor(x=sl,
             levels=c('college','hs','middle','primary'))
          )
   z
   z <- reOrder(z,'schlvl',c('primary','middle','hs','college'))
   str(z)  # shows the desired label order in the 'categoryorder' attribute

Show tuple counts for the most recent saved counting operation.

Description

Used after a call to tupleFreqs, etc., with saveCounts=TRUE, to recover the saved tuple counts.

Usage

    showCounts(nshow=NULL)

Arguments

`nshow`	Dataset to show.

Author(s)

Norm Matloff, Vincent Yang, and Harrison Nguyen

A small dataset for showing how tupleFreqs works in cdparcoord

Description

A small fictional dataset with different values and NA's to emphasize tupleFreqs and frequency based calculations with cdparcoord.

Author(s)

Norm Matloff, Vincent Yang, and Harrison Nguyen

Titanic passengers

Description

Famous dataset, source various.
