Title: Flexible Cluster Algorithms
Description: The main function kcca implements a general framework for k-centroids cluster analysis supporting arbitrary distance measures and centroid computation. Further cluster methods include hard competitive learning, neural gas, and QT clustering. There are numerous visualization methods for cluster results (neighborhood graphs, convex cluster hulls, barcharts of centroids, ...), and bootstrap methods for the analysis of cluster stability.
Authors: Friedrich Leisch [aut] (<https://orcid.org/0000-0001-7278-1983>, maintainer up to 2024), Evgenia Dimitriadou [ctb], Bettina Grün [ctb, cre]
Maintainer: Bettina Grün <[email protected]>
License: GPL-2
Version: 1.4-2
Built: 2024-10-25 06:35:59 UTC
Source: CRAN
Measurements at the beginning of the 4th grade (when the national average is 4.0) and of the 6th grade in 25 schools in New Haven.
data(achieve)
A data frame with 25 observations on the following 4 variables.
read4
4th grade reading.
arith4
4th grade arithmetic.
read6
6th grade reading.
arith6
6th grade arithmetic.
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
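A minimal usage sketch (the clustering step and the choice of k=3 are illustrative additions, not part of the original examples):

data(achieve)
summary(achieve)
## group the 25 schools by reading and arithmetic scores
cl <- cclust(achieve, k=3)
parameters(cl)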
A German manufacturer of premium cars asked customers approximately 3 months after a car purchase which characteristics of the car were most important for the decision to buy the car. The survey was done in 1983 and the data set contains all responses without missing values.
data(auto)
A data frame with 793 observations on the following 46 variables.
model
A factor with levels A, B, C, or D; model bought by the customer.
gear
A factor with levels 4 gears, 5 econo, 5 sport, or automatic.
leasing
A logical vector, was leasing used to finance the car?
usage
A factor with levels private, both, business.
previous_model
A factor describing which type of car was owned directly before the purchase.
other_consider
A factor with levels same manuf, other manuf, both, or none.
test_drive
A logical vector, did you do a test drive?
info_adv
A logical vector, was advertising an important source of information?
info_exp
A logical vector, was experience an important source of information?
info_rec
A logical vector, were recommendations an important source of information?
ch_clarity
A logical vector.
ch_economy
A logical vector.
ch_driving_properties
A logical vector.
ch_service
A logical vector.
ch_interior
A logical vector.
ch_quality
A logical vector.
ch_technology
A logical vector.
ch_model_continuity
A logical vector.
ch_comfort
A logical vector.
ch_reliability
A logical vector.
ch_handling
A logical vector.
ch_reputation
A logical vector.
ch_concept
A logical vector.
ch_character
A logical vector.
ch_power
A logical vector.
ch_resale_value
A logical vector.
ch_styling
A logical vector.
ch_safety
A logical vector.
ch_sporty
A logical vector.
ch_consumption
A logical vector.
ch_space
A logical vector.
satisfaction
A numeric vector describing overall satisfaction (1=very good, 10=very bad).
good1
Conception, styling, dimensions.
good2
Auto body.
good3
Driving and coupled axles.
good4
Engine.
good5
Electronics.
good6
Financing and customer service.
good7
Other.
sporty
What do you think about the balance of sportiness and comfort? (good, more sport, more comfort).
drive_char
Driving characteristics (gentle < speedy < powerfull < extreme).
tempo
Which average speed do you prefer on the German Autobahn in km/h? (< 130 < 130-150 < 150-180 < > 180)
consumption
An ordered factor with levels low < ok < high < too high.
gender
A factor with levels male and female.
occupation
A factor with levels self-employed, freelance, and employee.
household
Size of household, an ordered factor with levels 1-2 < >=3.
The original German data are in the public domain and available from LMU Munich (doi:10.5282/ubm/data.14). The variable names and help page were translated to English and converted into Rd format by Friedrich Leisch.
Open Data LMU (1983): Umfrage unter Kunden einer Automobilfirma, doi:10.5282/ubm/data.14
data(auto)
summary(auto)
Barplot of cluster centers or other cluster statistics.
## S4 method for signature 'kcca'
barplot(height, bycluster = TRUE, oneplot = TRUE,
        data = NULL, FUN = colMeans, main = deparse(substitute(height)),
        which = 1:height@k, names.arg = NULL, oma = par("oma"),
        col = NULL, mcol = "darkred", srt = 45, ...)

## S4 method for signature 'kcca'
barchart(x, data, xlab = "", strip.labels = NULL, strip.prefix = "Cluster ",
         col = NULL, mcol = "darkred", mlcol = mcol, which = NULL,
         legend = FALSE, shade = FALSE, diff = NULL, byvar = FALSE,
         clusters = 1:x@k, ...)

## S4 method for signature 'hclust'
barchart(x, data, xlab = "", strip.labels = NULL, strip.prefix = "Cluster ",
         col = NULL, mcol = "darkred", mlcol = mcol, which = NULL,
         shade = FALSE, diff = NULL, byvar = FALSE, k = 2, ...)

## S4 method for signature 'bclust'
barchart(x, data, xlab = "", strip.labels = NULL, strip.prefix = "Cluster ",
         col = NULL, mcol = "darkred", mlcol = mcol, which = NULL,
         legend = FALSE, shade = FALSE, diff = NULL, byvar = FALSE,
         k = x@k, clusters = 1:k, ...)
height, x: An object of class "kcca".
bycluster: If TRUE, make one barplot for each cluster; if FALSE, make one barplot for each input variable.
oneplot: If TRUE, all barplots are placed on one page.
data: If not NULL, cluster membership is predicted for the new data and used for the plots; by default the values from the training data are used.
FUN: The function to be applied to each cluster for calculating the bar heights. Only used if data is not NULL.
which: For barplot, index numbers of the clusters to plot; for barchart, index numbers or names of the variables to plot.
names.arg: A vector of names to be plotted below each bar.
main, oma, xlab, ...: Graphical parameters.
col: Vector of colors for the clusters.
mcol, mlcol: If not NULL, the overall mean of each variable is added to the panels using these colors (markers and lines, respectively).
srt: Number between 0 and 90, rotation of the x-axis labels.
strip.labels: Vector of strings for the strips of the Trellis display.
strip.prefix: Prefix string for the strips of the Trellis display.
legend: If TRUE, a legend explaining the shading is added to the plot.
shade: If TRUE, only bars deviating markedly from the overall mean are drawn in full color; see diff.
diff: A numerical vector of length two with absolute and relative deviations used for shading.
byvar: If TRUE, a panel is plotted for each variable; by default one panel per cluster is used.
clusters: Integer vector of clusters to plot.
k: Integer specifying the desired number of clusters.
The flexclust barchart method uses a horizontal arrangement of the bars and sorts them from top to bottom. Default barcharts in lattice are the other way round (bottom to top). See the examples below for how this affects, e.g., manual labels for the y-axis.
The barplot method is legacy code and is only maintained to keep up with changes in R; all active development is done on barchart.
Friedrich Leisch
Sara Dolnicar and Friedrich Leisch. Using graphical statistics to better understand market segmentation solutions. International Journal of Market Research, 56(2), 97-120, 2014.
cl <- cclust(iris[,-5], k=3)
barplot(cl)
barplot(cl, bycluster=FALSE)
## plot the maximum instead of mean value per cluster:
barplot(cl, bycluster=FALSE, data=iris[,-5],
        FUN=function(x) apply(x,2,max))
## use lattice for plotting:
barchart(cl)
## automatic abbreviation of labels
barchart(cl, scales=list(abbreviate=TRUE))
## origin of bars at zero
barchart(cl, scales=list(abbreviate=TRUE), origin=0)
## Use manual labels. Note that the flexclust barchart orders bars
## from top to bottom (the default does it the other way round), hence
## we have to rev() the labels:
LAB <- c("SL", "SW", "PL", "PW")
barchart(cl, scales=list(y=list(labels=rev(LAB))), origin=0)
## deviation of each cluster center from the population means
barchart(cl, origin=rev(cl@xcent), mlcol=NULL)
## use shading to highlight large deviations from population mean
barchart(cl, shade=TRUE)
## use smaller deviation limit than default and add a legend
barchart(cl, shade=TRUE, diff=0.2, legend=TRUE)
Cluster the data in x using the bagged clustering algorithm. A partitioning cluster algorithm such as cclust is run repeatedly on bootstrap samples from the original data. The resulting cluster centers are then combined using the hierarchical cluster algorithm hclust.

bclust(x, k = 2, base.iter = 10, base.k = 20, minsize = 0,
       dist.method = "euclidian", hclust.method = "average",
       FUN = "cclust", verbose = TRUE, final.cclust = FALSE,
       resample = TRUE, weights = NULL, maxcluster = base.k, ...)
## S4 method for signature 'bclust,missing'
plot(x, y, maxcluster = x@maxcluster, main = "", ...)
## S4 method for signature 'bclust,missing'
clusters(object, newdata, k, ...)
## S4 method for signature 'bclust'
parameters(object, k)
x: Matrix of inputs (or object of class "bclust" for the plot method).
k: Number of clusters.
base.iter: Number of runs of the base cluster algorithm.
base.k: Number of centers used in each repetition of the base method.
minsize: Minimum number of points in a base cluster.
dist.method: Distance method used for the hierarchical clustering, see dist.
hclust.method: Linkage method used for the hierarchical clustering, see hclust.
FUN: Partitioning cluster method used as base algorithm.
verbose: Output status messages.
final.cclust: If TRUE, a final cclust step is performed using the output of the bagged clustering as initialization.
resample: Logical, if TRUE the base method is run on bootstrap samples of x, else directly on x.
weights: Vector of length nrow(x), weights for the resampling of observations.
maxcluster: Maximum number of clusters memberships are to be computed for.
object: Object of class "bclust".
main: Main title of the plot.
...: Optional arguments to be passed to the base method in bclust, ignored in plot.
y: Missing.
newdata: An optional data matrix with the same number of columns as the cluster centers. If omitted, the fitted values are used.
First, base.iter bootstrap samples of the original data in x are created by drawing with replacement. The base cluster method is run on each of these samples with base.k centers. FUN must be the name of a partitioning cluster function returning an object with the same slots as the return value of cclust.

This results in a collection of base.iter * base.k centers, which are subsequently clustered using the hierarchical method hclust. Base centers with less than minsize points in their respective partitions are removed before the hierarchical clustering. The resulting dendrogram is then cut to produce k clusters.
bclust returns objects of class "bclust" including the slots

hclust: Return value of the hierarchical clustering of the collection of base centers (object of class "hclust").
cluster: Vector with indices of the clusters the inputs are assigned to.
centers: Matrix of centers of the final clusters. Only useful if the hierarchical clustering method produces convex clusters.
allcenters: Matrix of all base.iter * base.k base centers.
Friedrich Leisch
Friedrich Leisch. Bagged clustering. Working Paper 51, SFB “Adaptive Information Systems and Modeling in Economics and Management Science”, August 1999. https://epub.wu.ac.at/1272/1/document.pdf
Sara Dolnicar and Friedrich Leisch. Winter tourist segments in Austria: Identifying stable vacation styles using bagged clustering techniques. Journal of Travel Research, 41(3):281-292, 2003.
data(iris)
bc1 <- bclust(iris[,1:4], 3, base.k=5)
plot(bc1)
table(clusters(bc1, k=3))
parameters(bc1, k=3)
Birth and death rates for 70 countries.
data(birth)
A data frame with 70 observations on the following 2 variables.
birth
Birth rate (in percent).
death
Death rate (in percent).
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
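A minimal usage sketch (the clustering step and k=3 are illustrative additions, not part of the original examples):

data(birth)
plot(birth)
## cclust returns a full "kcca" object by default, so image() works
cl <- cclust(birth, k=3)
image(cl)
points(birth)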
Runs clustering algorithms repeatedly for different numbers of clusters on bootstrap replicates of the original data and returns the corresponding cluster assignments, centroids and (adjusted) Rand indices comparing pairs of partitions.
bootFlexclust(x, k, nboot=100, correct=TRUE, seed=NULL,
              multicore=TRUE, verbose=FALSE, ...)
## S4 method for signature 'bootFlexclust'
summary(object)
## S4 method for signature 'bootFlexclust,missing'
plot(x, y, ...)
## S4 method for signature 'bootFlexclust'
boxplot(x, ...)
## S4 method for signature 'bootFlexclust'
densityplot(x, data, ...)
x, k, ...: Passed on to the clustering function (see the FUN argument in the examples below).
nboot: Number of bootstrap pairs of partitions.
correct: Logical, correct the Rand index for agreement by chance (also called adjusted Rand index)?
seed: If not NULL, the random number generator is seeded with this value for reproducibility.
multicore: If TRUE, bootstrap samples are processed in parallel; may also be a cluster object created by package parallel, see the examples below.
verbose: If TRUE, progress information is shown during the computations.
y, data: Not used.
object: An object of class "bootFlexclust".
Availability of multicore is checked when flexclust is loaded. This information is stored and can be obtained using getOption("flexclust")$have_multicore. Set to FALSE for debugging and more sensible error messages in case something goes wrong.
Friedrich Leisch
## Not run:
## data uniform on unit square
x <- matrix(runif(400), ncol=2)
cl <- FALSE
## to run bootstrap replications on a workstation cluster do the following:
library("parallel")
cl <- makeCluster(2, type = "PSOCK")
clusterCall(cl, function() require("flexclust"))
## 50 bootstrap replicates for speed in example,
## use more for real applications
bcl <- bootFlexclust(x, k=2:7, nboot=50, FUN=cclust, multicore=cl)
bcl
summary(bcl)
## splitting the square into four quadrants should be the most stable
## solution (increase nboot if not)
plot(bcl)
densityplot(bcl, from=0)
## End(Not run)
Results of the elections 2002, 2005 or 2009 for the German Bundestag, the first chamber of the German parliament.
data(btw2002)
data(btw2005)
data(btw2009)
bundestag(year, second=TRUE, percent=TRUE, nazero=TRUE, state=FALSE)
year: Numeric or character, year of the election.
second: Logical, return second or first votes?
percent: Logical, return percentages or absolute numbers?
nazero: Logical, convert missing values to zero?
state: Logical or character. If TRUE, the federal state of each constituency is returned instead of the votes; a character string is used as a regular expression to select a subset of the states, see the examples below.
btw200x are data frames with 299 rows (corresponding to constituencies) and 17 columns. All columns except state are numeric.
state
Factor, the 16 German federal states.
eligible
Number of citizens eligible to vote.
votes
Number of eligible citizens who did vote.
invalid1, invalid2
Number of invalid first and second votes (see details below).
valid1, valid2
Number of valid first and second votes.
SPD1, SPD2
Number of first and second votes for the Social Democrats.
UNION1, UNION2
Number of first and second votes for CDU/CSU, the conservative Christian Democrats.
GRUENE1, GRUENE2
Number of first and second votes for the Green Party.
FDP1, FDP2
Number of first and second votes for the Liberal Party.
LINKE1, LINKE2
Number of first and second votes for the Left Party (PDS in 2002).
Missing values indicate that a party did not field a candidate in the corresponding constituency.
btw200x are the original data sets. bundestag() is a helper function which extracts first or second votes, calculates percentages (number of votes for a party divided by number of valid votes), replaces missing values by zero, and converts the result from a data frame to a matrix. By default it returns the percentage of second votes for each party, which determines the number of seats each party gets in parliament.
Half of the Members of the German Bundestag are elected directly from Germany's 299 constituencies, the other half on the parties' state lists. Accordingly, each voter has two votes in the elections to the German Bundestag. The first vote, allowing voters to elect their local representatives to the Bundestag, decides which candidates are sent to Parliament from the constituencies.
The second vote is cast for a party list. And it is this second vote that determines the relative strengths of the parties represented in the Bundestag. At least 598 Members of the German Bundestag are elected in this way. In addition to this, there are certain circumstances in which some candidates win what are known as “overhang mandates” when the seats are being distributed.
Homepage of the Bundestag: https://www.bundestag.de
p02 <- bundestag(2002)
pairs(p02)
p05 <- bundestag(2005)
pairs(p05)
p09 <- bundestag(2009)
pairs(p09)
state <- bundestag(2002, state=TRUE)
table(state)
start.with.b <- bundestag(2002, state="^B")
table(start.with.b)
pairs(p09, col=2-(state=="Bayern"))
Separate boxplots of the variables in each cluster, in comparison with boxplots for the complete sample.
## S4 method for signature 'kcca'
bwplot(x, data, xlab="", strip.labels=NULL, strip.prefix="Cluster ",
       col=NULL, shade=!is.null(shadefun), shadefun=NULL, byvar=FALSE, ...)
## S4 method for signature 'bclust'
bwplot(x, k=x@k, xlab="", strip.labels=NULL, strip.prefix="Cluster ",
       clusters=1:k, ...)
x: An object of class "kcca" or "bclust".
data: If not NULL, cluster membership is predicted for the new data and used for the plots; by default the values from the training data are used.
xlab, ...: Graphical parameters.
col: Vector of colors for the clusters.
strip.labels: Vector of strings for the strips of the Trellis display.
strip.prefix: Prefix string for the strips of the Trellis display.
shade: If TRUE, boxes deviating from the overall distribution are filled with color, see shadefun.
shadefun: A function or name of a function to compute which boxes are shaded, e.g. "medianInside" or "boxOverlap" (see the examples below).
byvar: If TRUE, a panel is plotted for each variable; by default one panel per cluster is used.
k: Number of clusters.
clusters: Integer vector of clusters to plot.
set.seed(1)
cl <- cclust(iris[,-5], k=3, save.data=TRUE)
bwplot(cl)
bwplot(cl, byvar=TRUE)
## fill only boxes with color which do not contain the overall median
## (grey dot of background box)
bwplot(cl, shade=TRUE)
## fill only boxes with color which do not overlap with the box of the
## complete sample (grey background box)
bwplot(cl, shadefun="boxOverlap")
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
cclust(x, k, dist = "euclidean", method = "kmeans", weights=NULL,
       control=NULL, group=NULL, simple=FALSE, save.data=FALSE)
x: A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
k: Either the number of clusters, or a vector of cluster assignments, or a matrix of initial (distinct) cluster centroids. If a number, a random set of (distinct) rows in x is chosen as the initial centroids.
dist: Distance measure, one of "euclidean" or "manhattan", see details below.
method: Clustering algorithm, one of "kmeans", "hardcl" or "neuralgas", see details below.
weights: An optional vector of weights for the observations (rows of the data matrix).
control: An object of class "cclustControl"; named lists are coerced to that class.
group: Currently ignored.
simple: Return an object of class "kccasimple"?
save.data: Save a copy of x in the returned object?
This function uses the same computational engine as the earlier function of the same name from package 'cclust'. The main difference is that it returns an S4 object of class "kcca", hence all available methods for "kcca" objects can be used. By default kcca and cclust use exactly the same algorithm, but cclust will usually be much faster because it uses compiled code.

If dist is "euclidean", the distance between the cluster center and the data points is the Euclidean distance (ordinary k-means algorithm), and cluster means are used as centroids. If "manhattan", the distance between the cluster center and the data points is the sum of the absolute values of the coordinate-wise differences, and the column-wise cluster medians are used as centroids.
If method is "kmeans", the classic k-means algorithm as given by MacQueen (1967) is used, which works by repeatedly moving all cluster centers to the mean of their respective Voronoi sets. If "hardcl", on-line updates are used (also known as hard competitive learning), which work by randomly drawing an observation from x and moving the closest center towards that point (e.g., Ripley 1996). If "neuralgas", the neural gas algorithm by Martinetz et al. (1993) is used. It is similar to hard competitive learning, but in addition to the closest centroid also the second-closest centroid is moved in each iteration.

The function returns an object of class "kcca".
Evgenia Dimitriadou and Friedrich Leisch
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
Martinetz T., Berkovich S., and Schulten K (1993). ‘Neural-Gas’ Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Transactions on Neural Networks, 4 (4), pp. 558–569.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge.
## a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd=0.3), ncol=2),
           matrix(rnorm(100, mean=1, sd=0.3), ncol=2))
cl <- cclust(x, 2)
plot(x, col=predict(cl))
points(cl@centers, pch="x", cex=2, col=3)

## a 3-dimensional example
x <- rbind(matrix(rnorm(150, sd=0.3), ncol=3),
           matrix(rnorm(150, mean=2, sd=0.3), ncol=3),
           matrix(rnorm(150, mean=4, sd=0.3), ncol=3))
cl <- cclust(x, 6, method="neuralgas", save.data=TRUE)
pairs(x, col=predict(cl))
plot(cl)
Returns a matrix of cluster similarities. Currently two methods for computing similarities of clusters are implemented, see details below.
## S4 method for signature 'kcca'
clusterSim(object, data=NULL, method=c("shadow", "centers"),
           symmetric=FALSE, ...)
## S4 method for signature 'kccasimple'
clusterSim(object, data=NULL, method=c("shadow", "centers"),
           symmetric=FALSE, ...)
object: Fitted object.
data: Data to use for computation of the shadow values. If the cluster object was created with save.data=TRUE, then these are used by default.
method: Type of similarities, see details below.
symmetric: Compute symmetric or asymmetric shadow values? Ignored if method="centers".
...: Currently not used.
If method="shadow" (the default), then the similarity of two clusters is proportional to the number of points in a cluster for which the centroid of the other cluster is second-closest. See Leisch (2006, 2008) for detailed formulas.

If method="centers", then first the pairwise distances between all centroids are computed and rescaled to [0,1]. The similarity between two clusters is then simply 1 minus the rescaled distance.
Friedrich Leisch
Friedrich Leisch. A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51 (2), 526–544, 2006.
Friedrich Leisch. Visualizing cluster analysis and finite mixture models. In Chun houh Chen, Wolfgang Haerdle, and Antony Unwin, editors, Handbook of Data Visualization, Springer Handbooks of Computational Statistics. Springer Verlag, 2008.
example(Nclus)
clusterSim(cl)
clusterSim(cl, symmetric=TRUE)
## should have similar structure but will be numerically different:
clusterSim(cl, symmetric=TRUE, data=Nclus[sample(1:550, 200),])
## different concept of cluster similarity
clusterSim(cl, method="centers")
These functions can be used to convert the results from cluster functions like kmeans or pam to objects of class "kcca" and vice versa.

as.kcca(object, ...)
## S3 method for class 'hclust'
as.kcca(object, data, k, family=NULL, save.data=FALSE, ...)
## S3 method for class 'kmeans'
as.kcca(object, data, save.data=FALSE, ...)
## S3 method for class 'partition'
as.kcca(object, data=NULL, save.data=FALSE, ...)
## S3 method for class 'skmeans'
as.kcca(object, data, save.data=FALSE, ...)
## S4 method for signature 'kccasimple,kmeans'
coerce(from, to="kmeans", strict=TRUE)
Cutree(tree, k=NULL, h=NULL)
object: Fitted object.
data: Data which were used to obtain the clustering. Optional for "partition" objects, which can contain the data.
save.data: Save a copy of the data in the return object?
k: Number of clusters.
family: Object of class "kccaFamily".
...: Currently not used.
from, to, strict: Usual arguments for coercion with as().
tree: A tree as produced by hclust.
h: Numeric scalar or vector with heights where the tree should be cut.
The standard cutree function orders clusters such that observation one is in cluster one, the first observation (as ordered in the data set) not in cluster one is in cluster two, etc. Cutree orders clusters as shown in the dendrogram from left to right such that similar clusters have similar numbers. The latter is used when converting to kcca.

For hierarchical clustering the cluster memberships of the converted object can be different from the result of Cutree, because one KCCA iteration has to be performed in order to obtain a valid kcca object. In this case a warning is issued.
Friedrich Leisch
data(Nclus)
cl1 <- kmeans(Nclus, 4)
cl1
cl1a <- as.kcca(cl1, Nclus)
cl1a
cl1b <- as(cl1a, "kmeans")
library("cluster")
cl2 <- pam(Nclus, 4)
cl2
cl2a <- as.kcca(cl2)
cl2a
## the same
cl2b <- as.kcca(cl2, Nclus)
cl2b
## hierarchical clustering
hc <- hclust(dist(USArrests))
plot(hc)
rect.hclust(hc, k=3)
c3 <- Cutree(hc, k=3)
k3 <- as.kcca(hc, USArrests, k=3)
barchart(k3)
table(c3, clusters(k3))
Mammals' teeth counts, divided into four groups: incisors, canines, premolars and molars.
data(dentitio)
A data frame with 66 observations on the following 8 variables.
top.inc
Top incisors.
bot.inc
Bottom incisors.
top.can
Top canines.
bot.can
Bottom canines.
top.pre
Top premolars.
bot.pre
Bottom premolars.
top.mol
Top molars.
bot.mol
Bottom molars.
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
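A minimal usage sketch (the k-medians clustering is an illustrative choice, not part of the original examples):

data(dentitio)
summary(dentitio)
## tooth counts are integers, so k-medians is a natural choice here
cl <- kcca(dentitio, k=4, family=kccaFamily("kmedians"))
barchart(cl, data=dentitio)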
Computes and returns the matrix of pairwise distances between the rows of two data matrices, using the specified distance measure.
dist2(x, y, method = "euclidean", p=2)
x: A data matrix.
y: A vector or second data matrix.
method: The distance measure to be used, with the same choices as for dist; any unambiguous substring can be given.
p: The power of the Minkowski distance.
This is a two-data-set equivalent of the standard function dist. It returns a matrix of all pairwise distances between rows in x and y. The current implementation is efficient only if y does not have too many rows (the code is vectorized in x but not in y).
The definition of Canberra distance was wrong for negative data prior to version 1.3-5.
Friedrich Leisch
x <- matrix(rnorm(20), ncol=4)
rownames(x) = paste("X", 1:nrow(x), sep=".")
y <- matrix(rnorm(12), ncol=4)
rownames(y) = paste("Y", 1:nrow(y), sep=".")
dist2(x, y)
dist2(x, y, "man")
data(milk)
dist2(milk[1:5,], milk[4:6,])
Helper functions to create kccaFamily objects.

distAngle(x, centers)
distCanberra(x, centers)
distCor(x, centers)
distEuclidean(x, centers)
distJaccard(x, centers)
distManhattan(x, centers)
distMax(x, centers)
distMinkowski(x, centers, p=2)
centAngle(x)
centMean(x)
centMedian(x)
centOptim(x, dist)
centOptim01(x, dist)

x: A data matrix.
centers: A matrix of centroids.
p: The power of the Minkowski distance.
dist: A distance function.
Friedrich Leisch
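A short sketch of how these helpers fit together; rebuilding the k-means family by hand as below is purely illustrative:

x <- matrix(rnorm(20), ncol=4)
centers <- matrix(rnorm(8), ncol=4)
distEuclidean(x, centers)   ## 5 x 2 matrix of point-to-centroid distances
centMean(x)                 ## the k-means centroid of x
## combine a distance and a centroid function into a custom family
fam <- kccaFamily(dist=distEuclidean, cent=centMean, name="my kmeans")
cl <- kcca(x, k=2, family=fam)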
Hyperparameters for cluster algorithms.
Objects can be created by calls of the form new("flexclustControl", ...). In addition, named lists can be coerced to flexclustControl objects; names are completed if unique (see examples).
Objects of class "flexclustControl" have the following slots:

iter.max: Maximum number of iterations.
tolerance: The algorithm is stopped when the (relative) change of the optimization criterion is smaller than tolerance.
verbose: If a positive integer, then progress is reported every verbose iterations. If 0, no output is generated during model fitting.
classify: Character string, one of "auto", "weighted", "hard" or "simann".
initcent: Character string, name of the function for initial centroids; currently "randomcent" (the default) and "kmeanspp" are available.
gamma: Gamma value for weighted hard competitive learning.
simann: Parameters for simulated annealing optimization (only used when classify="simann").
ntry: Number of trials per iteration for QT clustering.
min.size: Clusters smaller than this value are treated as outliers.
Objects of class "cclustControl" inherit from "flexclustControl" and have the following additional slots:

method: Learning rate for hard competitive learning, one of "polynomial" or "exponential".
pol.rate: Positive number a for a polynomial learning rate of the form 1/iter^a.
exp.rate: Vector of length 2 with parameters (r1, r2) for an exponential learning rate of the form r1 * (r2/r1)^(iter/iter.max).
ng.rate: Vector of length 4 with parameters for neural gas, see details below.
The neural gas algorithm uses updates of the form

  c_new = c_old + e * exp(-k/r) * (x - c_old)

for every centroid, where k is the order (minus 1) of the centroid with respect to its distance to the data point x (0 = closest, 1 = second closest, ...). The parameters e and r decrease over the iterations and are given by

  e = ng.rate[1] * (ng.rate[2]/ng.rate[1])^(iter/iter.max)
  r = ng.rate[3] * (ng.rate[4]/ng.rate[3])^(iter/iter.max)
See Martinetz et al (1993) for details of the algorithm, and the examples section on how to obtain default values.
Friedrich Leisch
Martinetz T., Berkovich S., and Schulten K. (1993). "Neural-Gas Network for Vector Quantization and its Application to Time-Series Prediction." IEEE Transactions on Neural Networks, 4 (4), pp. 558–569.
Arthur D. and Vassilvitskii S. (2007). "k-means++: the advantages of careful seeding". Proceedings of the 18th annual ACM-SIAM symposium on Discrete algorithms. pp. 1027-1035.
## have a look at the defaults
new("flexclustControl")
## coerce a list
mycont <- list(iter=500, tol=0.001, class="w")
as(mycont, "flexclustControl")
## some additional slots
as(mycont, "cclustControl")
## default values for ng.rate
new("cclustControl")@ng.rate
Create and access palettes for the plot methods.
flxColors(n=1:8, color=c("full","medium", "light","dark"), grey=FALSE)
flxPalette(n, ...)
n: Index numbers of the colors to return (1 to 8) for flxColors; length of the palette for flxPalette.
color: Type of color, see details.
grey: Return the grey values corresponding to the palette?
...: Passed on to flxColors.
This function creates color palettes in HCL space for up to 8 colors. All palettes have constant chroma and luminance, only the hue of the colors change within a palette.
Palettes "full" and "dark" have the same luminance, and palettes "medium" and "light" have the same luminance.
Friedrich Leisch
opar <- par(c("mfrow", "mar", "yaxt"))
par(mfrow=c(2, 2), mar=c(0, 0, 2, 0), yaxt="n")
x <- rep(1, 8)
barplot(x, col = flxColors(color="full"), main="full")
barplot(x, col = flxColors(color="dark"), main="dark")
barplot(x, col = flxColors(color="medium"), main="medium")
barplot(x, col = flxColors(color="light"), main="light")
par(opar)
Plot a histogram of the similarity of each observation to each cluster.
## S4 method for signature 'kccasimple,missing'
histogram(x, data, xlab="", ...)
## S4 method for signature 'kccasimple,data.frame'
histogram(x, data, xlab="", ...)
## S4 method for signature 'kccasimple,matrix'
histogram(x, data, xlab="Similarity", power=1, ...)
x: An object of class "kccasimple".
data: If not missing, the distance and thus similarity between observations and cluster centers is determined for the new data and used for the plots. By default the values from the training data are used.
xlab: Label for the x-axis.
power: Numeric indicating how similarities are transformed; for more details see Dolnicar et al. (2018).
...: Additional arguments passed on to the lattice function histogram.
Friedrich Leisch
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
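A minimal sketch (not part of the original manual; save.data=TRUE is used so that the training data are available for the default plot):

cl <- cclust(iris[,-5], k=3, save.data=TRUE)
histogram(cl)                    ## similarities on the training data
histogram(cl, data=iris[,-5])    ## same data passed explicitly
histogram(cl, data=as.matrix(iris[,-5]), power=2)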
Image plot of cluster segments overlaid by neighbourhood graph.
## S4 method for signature 'kcca'
image(x, which = 1:2, npoints = 100, xlab = "", ylab = "",
      fastcol = TRUE, col=NULL, clwd=0, graph=TRUE, ...)
x: An object of class "kcca".
which: Index numbers of the two dimensions of the input space to plot.
npoints: Number of grid points for the image.
fastcol: If TRUE, a greedy algorithm is used for the background colors of the segments, which may result in neighboring segments getting the same color.
col: Vector of background colors for the segments.
clwd: Line width of contour lines at cluster boundaries; use larger values for fastcol=TRUE.
graph: Logical, add a neighborhood graph to the plot?
xlab, ylab, ...: Graphical parameters.
This works only for "kcca" objects; no method is available for "kccasimple" objects.
Friedrich Leisch
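A minimal sketch along the lines of the kcca examples elsewhere in this manual:

data(Nclus)
cl <- cclust(Nclus, k=4)   ## returns a full "kcca" object by default
image(cl)
points(Nclus)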
Returns descriptive information about fitted flexclust objects like cluster sizes or sum of within-cluster distances.
## S4 method for signature 'flexclust,character'
info(object, which, drop=TRUE, ...)
object: Fitted object.
which: Which information to get. Use which="help" to list the information available for the object.
drop: Logical. If TRUE, the result is simplified to a vector if possible.
...: Passed to methods.
Function info can be used to access slots of fitted flexclust objects in a portable way, and in addition computes some meta-information like the sum of within-cluster distances.

Function infoCheck returns a logical value that is TRUE if the requested information can be computed from the object.
Friedrich Leisch
data("Nclus") plot(Nclus) cl1 <- cclust(Nclus, k=4) summary(cl1) ## these two are the same info(cl1) info(cl1, "help") ## cluster sizes i1 <- info(cl1, "size") i1 ## average within cluster distances i2 <- info(cl1, "av_dist") i2 ## the sum of all within-cluster distances i3 <- info(cl1, "distsum") i3 ## sum(i1*i2) must of course be the same as i3 stopifnot(all.equal(sum(i1*i2), i3)) ## This should return TRUE infoCheck(cl1, "size") ## and this FALSE infoCheck(cl1, "Homer Simpson") ## both combined i4 <- infoCheck(cl1, c("size", "Homer Simpson")) i4 stopifnot(all.equal(i4, c(TRUE, FALSE)))
data("Nclus") plot(Nclus) cl1 <- cclust(Nclus, k=4) summary(cl1) ## these two are the same info(cl1) info(cl1, "help") ## cluster sizes i1 <- info(cl1, "size") i1 ## average within cluster distances i2 <- info(cl1, "av_dist") i2 ## the sum of all within-cluster distances i3 <- info(cl1, "distsum") i3 ## sum(i1*i2) must of course be the same as i3 stopifnot(all.equal(sum(i1*i2), i3)) ## This should return TRUE infoCheck(cl1, "size") ## and this FALSE infoCheck(cl1, "Homer Simpson") ## both combined i4 <- infoCheck(cl1, c("size", "Homer Simpson")) i4 stopifnot(all.equal(i4, c(TRUE, FALSE)))
Perform k-centroids clustering on a data matrix.
kcca(x, k, family=kccaFamily("kmeans"), weights=NULL, group=NULL,
     control=NULL, simple=FALSE, save.data=FALSE)

kccaFamily(which=NULL, dist=NULL, cent=NULL, name=which,
           preproc = NULL, trim=0, groupFun = "minSumClusters")

## S4 method for signature 'kccasimple'
summary(object)
x: A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
k: Either the number of clusters, or a vector of cluster assignments, or a matrix of initial (distinct) cluster centroids. If a number, a random set of (distinct) rows in x is chosen as the initial centroids.
family: Object of class "kccaFamily".
weights: An optional vector of weights to be used in the clustering process, cannot be combined with all families.
group: An optional grouping vector for the data, see details below.
control: An object of class "flexclustControl"; named lists are coerced to that class.
simple: Return an object of class "kccasimple"?
save.data: Save a copy of x in the return object?
which: One of "kmeans", "kmedians", "angle", "jaccard" or "ejaccard", see details below.
name: Optional long name for the family, used only for show methods.
dist: A function for distance computation, ignored if which is specified.
cent: A function for centroid computation, ignored if which is specified.
preproc: Function for data preprocessing.
trim: A number in between 0 and 0.5; if non-zero, trimmed means are used for the "kmeans" family, ignored by all other families.
groupFun: Function or name of function to obtain clusters for grouped data, see details below.
object: Object of class "kccasimple".
See the paper A Toolbox for K-Centroids Cluster Analysis referenced below for details.
Function kcca returns objects of class "kcca" or "kccasimple" depending on the value of argument simple. The simpler objects contain fewer slots and hence are faster to compute, but contain no auxiliary information used by the plotting methods. Most plot methods for "kccasimple" objects do nothing and return a warning. If only centroids, cluster membership or prediction for new data are of interest, then the simple objects are sufficient.

Function kccaFamily() currently has the following predefined families (distance / centroid):

kmeans: Euclidean distance / mean
kmedians: Manhattan distance / median
angle: angle between observation and centroid / standardized mean
jaccard: Jaccard distance / numeric optimization
ejaccard: Jaccard distance / mean
See Leisch (2006) for details on all combinations.
If group is not NULL, then observations from the same group are restricted to belong to the same cluster (must-link constraint) or different clusters (cannot-link constraint) during the fitting process. If groupFun = "minSumClusters", then all group members are assigned to the cluster where the center has minimal average distance to the group members. If groupFun = "majorityClusters", then all group members are assigned to the cluster the majority would belong to without a constraint.

groupFun = "differentClusters" implements a cannot-link constraint, i.e., members of one group are not allowed to belong to the same cluster. The optimal allocation for each group is found by solving a linear sum assignment problem using solve_LSAP. Obviously the group sizes must be smaller than the number of clusters in this case.

Ties are broken at random in all cases.

Note that at the moment not all methods for fitted "kcca" objects respect the grouping information, most importantly the plot method when a data argument is specified.
Friedrich Leisch
Friedrich Leisch. A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51 (2), 526–544, 2006.
Friedrich Leisch and Bettina Gruen. Extending standard cluster algorithms to allow for group constraints. In Alfredo Rizzi and Maurizio Vichi, editors, Compstat 2006-Proceedings in Computational Statistics, pages 885-892. Physica Verlag, Heidelberg, Germany, 2006.
stepFlexclust, cclust, distances
data("Nclus") plot(Nclus) ## try kmeans cl1 <- kcca(Nclus, k=4) cl1 image(cl1) points(Nclus) ## A barplot of the centroids barplot(cl1) ## now use k-medians and kmeans++ initialization, cluster centroids ## should be similar... cl2 <- kcca(Nclus, k=4, family=kccaFamily("kmedians"), control=list(initcent="kmeanspp")) cl2 ## ... but the boundaries of the partitions have a different shape image(cl2) points(Nclus)
data("Nclus") plot(Nclus) ## try kmeans cl1 <- kcca(Nclus, k=4) cl1 image(cl1) points(Nclus) ## A barplot of the centroids barplot(cl1) ## now use k-medians and kmeans++ initialization, cluster centroids ## should be similar... cl2 <- kcca(Nclus, k=4, family=kccaFamily("kmedians"), control=list(initcent="kmeanspp")) cl2 ## ... but the boundaries of the partitions have a different shape image(cl2) points(Nclus)
Convert an object of class "kcca" to a data frame in long format.

kcca2df(object, data)
object: Object of class "kcca".
data: Optional data if not saved in the object.

The function returns a data.frame with columns value, variable and group.
c.iris <- cclust(iris[,-5], 3, save.data=TRUE)
df.c.iris <- kcca2df(c.iris)
summary(df.c.iris)
densityplot(~value|variable+group, data=df.c.iris)
The data set contains the ingredients of mammal's milk of 25 animals.
data(milk)
A data frame with 25 observations on the following 5 variables (all in percent).
water
Water.
protein
Protein.
fat
Fat.
lactose
Lactose.
ash
Ash.
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
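A minimal usage sketch (the hierarchical clustering step is purely illustrative, not part of the original examples):

data(milk)
summary(milk)
## group the 25 animals by the composition of their milk
hc <- hclust(dist(milk))
plot(hc)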
A simple artificial regression example with 4 clusters, all of them having a Gaussian distribution.
data(Nclus)
The Nclus data set can be re-created by loading package flexmix and running ExNclus(100) with set.seed(2602). It has been saved as a data set for simplicity of examples only.
data(Nclus)
cl <- cclust(Nclus, k=4, simple=FALSE, save.data=TRUE)
plot(cl)
The data set contains the measurements of nutrients in several types of meat, fish and fowl.
data(nutrient)
A data frame with 27 observations on the following 5 variables.
energy
Food energy (calories).
protein
Protein (grams).
fat
Fat (grams).
calcium
Calcium (milligrams).
iron
Iron (milligrams).
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
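A minimal usage sketch (standardizing and k=4 are illustrative choices, not part of the original examples):

data(nutrient)
summary(nutrient)
## the variables have very different scales, so standardize first
cl <- cclust(scale(nutrient), k=4)
info(cl, "size")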
Plot a matrix of neighbourhood graphs.
## S4 method for signature 'kcca'
pairs(x, which=NULL, project=NULL, oma=NULL, ...)
x: An object of class "kcca".
which: Index numbers of dimensions of (projected) input space to plot, default is to plot all dimensions.
project: Projection object for which a predict method exists, e.g., the result of prcomp.
oma: Outer margin.
...: Passed on to the plot method.
This works only for "kcca" objects; no method is available for "kccasimple" objects.
Friedrich Leisch
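A minimal sketch using the iris measurements as input (not part of the original manual):

cl <- cclust(iris[,-5], k=3)
pairs(cl)              ## all pairs of the four input dimensions
pairs(cl, which=1:3)   ## only the first three variables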
Returns the matrix of centroids of a fitted object of class "kcca".

## S4 method for signature 'kccasimple'
parameters(object, ...)
object: Fitted object.
...: Currently not used.
Friedrich Leisch
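A minimal sketch (the iris example is illustrative):

cl <- cclust(iris[,-5], k=3)
parameters(cl)   ## one row per cluster, one column per variable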
Plot the neighbourhood graph of a cluster solution together with projected data points.
## S4 method for signature 'kcca,missing'
plot(x, y, which=1:2, project=NULL, data=NULL, points=TRUE,
     hull=TRUE, hull.args=NULL, number = TRUE, simlines=TRUE,
     lwd=1, maxlwd=8*lwd, cex=1.5, numcol=FALSE, nodes=16,
     add=FALSE, xlab="", ylab="", xlim = NULL, ylim = NULL,
     pch=NULL, col=NULL, ...)
x: An object of class "kcca".
y: Not used.
which: Index numbers of dimensions of (projected) input space to plot.
project: Projection object for which a predict method exists, e.g., the result of prcomp.
data: Data to include in the plot. If the cluster object was created with save.data=TRUE, then these are used by default.
points: Logical, shall data points be plotted (if available)?
hull: If TRUE, a hull is drawn around each cluster.
hull.args: A list of arguments for the hull function.
number: Logical, plot number labels in nodes of graph?
numcol, cex: Color and size of number labels in nodes of the graph. If numcol is TRUE, the cluster colors are used.
nodes: Plotting symbol to use for nodes if no numbers are drawn.
simlines: Logical, plot edges of graph?
lwd, maxlwd: Numerical, thickness of lines.
add: Logical, add to existing plot?
xlab, ylab: Axis labels.
xlim, ylim: Axis range.
pch, col, ...: Plotting symbols and colors for data points.
This works only for "kcca" objects; no method is available for "kccasimple" objects.
Friedrich Leisch
Friedrich Leisch. Visualizing cluster analysis and finite mixture models. In Chun houh Chen, Wolfgang Haerdle, and Antony Unwin, editors, Handbook of Data Visualization, Springer Handbooks of Computational Statistics. Springer Verlag, 2008.
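A minimal sketch (the projection onto principal components is an illustrative use of the project argument):

data(Nclus)
cl <- cclust(Nclus, k=4, save.data=TRUE)
plot(cl)                          ## neighbourhood graph with cluster hulls
plot(cl, project=prcomp(Nclus))   ## the same plot in principal component space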
Return either the cluster membership of training data or predict for new data.
## S4 method for signature 'kccasimple'
predict(object, newdata, ...)
## S4 method for signature 'flexclust,ANY'
clusters(object, newdata, ...)
object: Object of a class inheriting from "flexclust".
newdata: An optional data matrix with the same number of columns as the cluster centers. If omitted, the fitted values are used.
...: Currently not used.
clusters can be used on any object of class "flexclust" and returns the cluster memberships of the training data.

predict can be used only on objects of class "kcca" (which inherit from "flexclust"). If no newdata argument is specified, the function is identical to clusters; if newdata is specified, then cluster memberships for the new data are predicted. clusters(object, newdata, ...) is an alias for predict(object, newdata, ...).
Friedrich Leisch
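A minimal sketch of both uses (the call to cclust and the toy newdata matrix are illustrative):

data(Nclus)
cl <- cclust(Nclus, k=4)
table(clusters(cl))           ## cluster sizes on the training data
newx <- matrix(rnorm(10), ncol=2)
predict(cl, newdata=newx)     ## memberships for 5 new observations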
Simple artificial 2-dimensional data to demonstrate clustering for market segmentation. One dimension is the hypothetical feature sophistication (or performance or quality, etc) of a product, the second dimension the price customers are willing to pay for the product.
priceFeature(n, which=c("2clust", "3clust", "3clustold", "5clust",
                        "ellipse", "triangle", "circle", "square",
                        "largesmall"))
n: Sample size.
which: Shape of data set.
Sara Dolnicar and Friedrich Leisch. Evaluation of structure and reproducibility of cluster solutions using the bootstrap. Marketing Letters, 21:83-101, 2010.
plot(priceFeature(200, "2clust"))
plot(priceFeature(200, "3clust"))
plot(priceFeature(200, "3clustold"))
plot(priceFeature(200, "5clust"))
plot(priceFeature(200, "ell"))
plot(priceFeature(200, "tri"))
plot(priceFeature(200, "circ"))
plot(priceFeature(200, "square"))
plot(priceFeature(200, "largesmall"))
Adds arrows for original coordinate axes to a projection plot.
projAxes(object, which=1:2, center=NULL, col="red", radius=NULL,
         minradius=0.1, textargs=list(col=col),
         col.names=getColnames(object), which.names="",
         group = NULL, groupFun = colMeans, plot=TRUE, ...)

placeLabels(object)
## S4 method for signature 'projAxes'
placeLabels(object)
object: Return value of a projection method like prcomp.
which: Index numbers of the dimensions of the (projected) input space that have been plotted.
center: Center of the coordinate system to use in projected space. Default is the center of the plotting region.
col: Color of arrows.
radius: Relative size of the arrows.
minradius: Minimum radius of arrows to include (relative to arrow size).
textargs: List of arguments for the text labels of the arrows.
col.names: Variable names of the original data.
which.names: A regular expression specifying which variable names to include in the plot.
group: An optional grouping variable for the original coordinates.
groupFun: Function used to aggregate the projected coordinates if group is specified.
plot: Logical; if TRUE, the axes are added to the current plot.
...: Passed on to the plotting functions.
projAxes invisibly returns an object of class "projAxes", which can be added to an existing plot by its plot method.
Friedrich Leisch
data(milk)
milk.pca <- prcomp(milk, scale=TRUE)
## create a biplot step by step
plot(predict(milk.pca), type="n")
text(predict(milk.pca), rownames(milk), col="green", cex=0.8)
projAxes(milk.pca)
## the same, but arrows are blue, centered at origin and all arrows are
## plotted
plot(predict(milk.pca), type="n")
text(predict(milk.pca), rownames(milk), col="green", cex=0.8)
projAxes(milk.pca, col="blue", center=0, minradius=0)
## use points instead of text, plot PC2 and PC3, manual radius
## specification, store result
plot(predict(milk.pca)[,c(2,3)])
arr <- projAxes(milk.pca, which=c(2,3), radius=1.2, plot=FALSE)
plot(arr)
## Not run:
## manually try to find new places for the labels: each arrow is marked
## active in turn, use the left mouse button to find a better location
## for the label. Use the right mouse button to go on to the next
## variable.
arr1 <- placeLabels(arr)
## now do the plot again:
plot(predict(milk.pca)[,c(2,3)])
plot(arr1)
## End(Not run)
Split a binary or numeric matrix by a grouping variable, run a series of tests on all variables, adjust for multiple testing and graphically represent results.
propBarchart(x, g, alpha=0.05, correct="holm", test="prop.test",
             sort=FALSE, strip.prefix="", strip.labels=NULL,
             which=NULL, byvar=FALSE, ...)
## S4 method for signature 'propBarchart'
summary(object, ...)

groupBWplot(x, g, alpha=0.05, correct="holm", xlab="", col=NULL,
            shade=!is.null(shadefun), shadefun=NULL, strip.prefix="",
            strip.labels=NULL, which=NULL, byvar=FALSE, ...)
x: A binary data matrix for propBarchart, a general numeric matrix for groupBWplot.
g: A factor specifying the groups.
alpha: Significance level for the tests of group differences.
correct: Correction method for multiple testing, passed to p.adjust.
test: Test to use for detecting significant differences in proportions.
sort: Logical, sort variables by total sample mean?
strip.prefix: Character string prepended to the strips of the Trellis display.
strip.labels: Character vector of labels to use for the strips of the Trellis display.
which: Index numbers or names of variables to plot.
byvar: If TRUE, a panel is plotted for each variable; by default one panel per group is used.
...: Passed on to the plotting functions.
object: Return value of propBarchart.
xlab: A title for the x-axis: see title.
col: Vector of colors for the panels.
shade: If TRUE, boxes deviating from the overall distribution are filled with color, see shadefun.
shadefun: A function or name of a function to compute which boxes are shaded, e.g. "medianInside" or "boxOverlap".
Function propBarchart splits a binary data matrix into subgroups, computes the percentage of ones in each column, and compares the proportions in the groups using prop.test. The p-values for all variables are adjusted for multiple testing and a barchart of group percentages is drawn, highlighting variables with significant differences in proportion. The summary method can be used to create a corresponding table for publications.
Function groupBWplot takes a general numeric matrix, also splits it into subgroups, and uses boxes instead of bars. By default kruskal.test is used to compute significant differences in location; in addition, the heuristics from bwplot,kcca-method can be used. Boxes of the complete sample are used as reference in the background.
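The following base-R sketch mirrors these computations (illustrative only, not the package's actual implementation): group-wise column proportions, one prop.test per variable, and Holm adjustment of the p-values.

## sketch of the computations behind propBarchart (assumed equivalent,
## not the actual implementation)
x <- apply(iris[,-5], 2, function(z) z > median(z))
g <- iris$Species
apply(x, 2, function(z) tapply(z, g, mean))       # group percentages
pvals <- apply(x, 2, function(z) prop.test(table(g, z))$p.value)
p.adjust(pvals, method="holm")                    # adjusted p-values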
Friedrich Leisch
barplot-methods, bwplot,kcca-method
## create a binary matrix from the iris data plus a random noise column
x <- apply(iris[,-5], 2, function(z) z>median(z))
x <- cbind(x, Noise=sample(0:1, 150, replace=TRUE))

## There are significant differences in all 4 original variables, Noise
## has most likely no significant difference (of course the difference
## will be significant in alpha percent of all random samples).
p <- propBarchart(x, iris$Species)
p
summary(p)
propBarchart(x, iris$Species, byvar=TRUE)

x <- iris[,-5]
x <- cbind(x, Noise=rnorm(150, mean=3))
groupBWplot(x, iris$Species)
groupBWplot(x, iris$Species, shade=TRUE)
groupBWplot(x, iris$Species, shadefun="medianInside")
groupBWplot(x, iris$Species, shade=TRUE, byvar=TRUE)
Perform stochastic QT clustering on a data matrix.
qtclust(x, radius, family = kccaFamily("kmeans"), control = NULL,
        save.data = FALSE, kcca = FALSE)
x
A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
radius
Maximum radius of clusters.
family
Object of class "kccaFamily".
control
An object of class "flexclustControl".
save.data
Save a copy of x in the return object?
kcca
Run kcca() afterwards, initialized on the QT cluster solution?
This function implements a variation of the QT clustering algorithm by Heyer et al. (1999), see Scharl and Leisch (2006). The main difference is that in each iteration not all possible cluster start points are considered, but only a random sample of size control@ntry. In addition, only points which have at least one other point within a circle of radius radius are considered as initial centers. In most cases the resulting solutions are almost the same at a considerable speed increase; in some cases even better solutions are obtained than with the original algorithm. If control@ntry is set to the size of the data set, an algorithm similar to the original algorithm proposed by Heyer et al. (1999) is obtained.
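As a hedged sketch (assuming control may be given as a named list, as elsewhere in the package), the following contrasts the stochastic default with a run where ntry equals the number of data points, which approximates the original algorithm:

x <- matrix(10*runif(1000), ncol=2)
cl.fast <- qtclust(x, radius=3)                    # stochastic default
cl.full <- qtclust(x, radius=3,
                   control=list(ntry=nrow(x)))     # close to original QT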
Function qtclust by default returns objects of class "kccasimple". If argument kcca is TRUE, function kcca() is run afterwards (initialized on the QT cluster solution). Data points not clustered by the QT cluster algorithm are omitted from the kcca() iterations, but filled back into the return object. All plot methods defined for objects of class "kcca" can be used.
Friedrich Leisch
Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research 9, 1106–1115.
Theresa Scharl and Friedrich Leisch. The stochastic QT-clust algorithm: evaluation of stability and variance on time-course microarray data. In Alfredo Rizzi and Maurizio Vichi, editors, Compstat 2006 – Proceedings in Computational Statistics, pages 1015-1022. Physica Verlag, Heidelberg, Germany, 2006.
x <- matrix(10*runif(1000), ncol=2)

## maximum distance of point to cluster center is 3
cl1 <- qtclust(x, radius=3)

## maximum distance of point to cluster center is 1
## -> more clusters, longer runtime
cl2 <- qtclust(x, radius=1)

opar <- par(c("mfrow","mar"))
par(mfrow=c(2,1), mar=c(2.1,2.1,1,1))
plot(x, col=predict(cl1), xlab="", ylab="")
plot(x, col=predict(cl2), xlab="", ylab="")
par(opar)
Compute the (adjusted) Rand, Jaccard and Fowlkes-Mallows index for agreement of two partitions.
comPart(x, y, type=c("ARI","RI","J","FM"))
## S4 method for signature 'flexclust,flexclust'
comPart(x, y, type)
## S4 method for signature 'numeric,numeric'
comPart(x, y, type)
## S4 method for signature 'flexclust,numeric'
comPart(x, y, type)
## S4 method for signature 'numeric,flexclust'
comPart(x, y, type)

randIndex(x, y, correct=TRUE, original=!correct)
## S4 method for signature 'table,missing'
randIndex(x, y, correct=TRUE, original=!correct)
## S4 method for signature 'ANY,ANY'
randIndex(x, y, correct=TRUE, original=!correct)
x
Either a 2-dimensional cross-tabulation of cluster assignments (for randIndex only), an object inheriting from class "flexclust", or an integer vector of cluster memberships.
y
An object inheriting from class "flexclust", or an integer vector of cluster memberships.
type
Character vector of abbreviations of indices to compute.
correct, original
Logical, correct the Rand index for agreement by chance?
A vector of indices.
Let A denote the number of all pairs of data points which are either put into the same cluster by both partitions or put into different clusters by both partitions. Conversely, let D denote the number of all pairs of data points that are put into one cluster in one partition, but into different clusters by the other partition. The partitions disagree for all pairs D and agree for all pairs A. We can measure the agreement by the Rand index A/(A+D), which is invariant with respect to permutations of cluster labels.
The index has to be corrected for agreement by chance if the sizes of the clusters are not uniform (which is usually the case), or if there are many clusters, see Hubert & Arabie (1985) for details.
If the number of clusters is very large, then usually the vast majority of pairs of points will not be in the same cluster. The Jaccard index tries to account for this by using only pairs of points that are in the same cluster in the definition of A.
Let A again denote the pairs of points that are in the same cluster in both partitions. Fowlkes-Mallows divides this number by the geometric mean of the sums of the number of pairs in each cluster of the two partitions. This gives the probability that a pair of points which are in the same cluster in one partition are also in the same cluster in the other partition.
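To make the pair-counting definitions concrete, the helper below (pairCounts is a hypothetical name, not part of the package) computes all four indices directly from a cross-tabulation; compare its output with comPart:

pairCounts <- function(tab){
    n <- sum(tab)
    total  <- choose(n, 2)                  # all pairs of points
    same.b <- sum(choose(tab, 2))           # same cluster in both partitions
    same.x <- sum(choose(rowSums(tab), 2))  # same cluster in partition x
    same.y <- sum(choose(colSums(tab), 2))  # same cluster in partition y
    A <- total - same.x - same.y + 2*same.b # agreements
    E <- same.x * same.y / total            # expected value under chance
    c(RI  = A/total,
      ARI = (same.b - E) / ((same.x + same.y)/2 - E),
      J   = same.b / (same.x + same.y - same.b),
      FM  = same.b / sqrt(same.x * same.y))
}
g1 <- sample(1:5, 100, replace=TRUE)
g2 <- sample(1:5, 100, replace=TRUE)
pairCounts(table(g1, g2))   # compare with comPart(g1, g2)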
Friedrich Leisch
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2, 193–218, 1985.
Marina Meila. Comparing clusterings - an axiomatic view. In Stefan Wrobel and Luc De Raedt, editors, Proceedings of the International Machine Learning Conference (ICML). ACM Press, 2005.
## no class correlations: corrected Rand almost zero
g1 <- sample(1:5, size=1000, replace=TRUE)
g2 <- sample(1:5, size=1000, replace=TRUE)
tab <- table(g1, g2)
randIndex(tab)

## uncorrected version will be large, because there are many points
## which are assigned to different clusters in both cases
randIndex(tab, correct=FALSE)
comPart(g1, g2)

## let pairs (g1=1,g2=1) and (g1=3,g2=3) agree better
k <- sample(1:1000, size=200)
g1[k] <- 1
g2[k] <- 1
k <- sample(1:1000, size=200)
g1[k] <- 3
g2[k] <- 3
tab <- table(g1, g2)

## the index should be larger than before
randIndex(tab, correct=TRUE, original=TRUE)
comPart(g1, g2)
Create a series of projection plots corresponding to a random tour through the data.
randomTour(object, ...)
## S4 method for signature 'ANY'
randomTour(object, ...)
## S4 method for signature 'matrix'
randomTour(object, ...)
## S4 method for signature 'flexclust'
randomTour(object, data=NULL, col=NULL, ...)

randomTourMatrix(x, directions=10, steps=100, sec=4, sleep=sec/steps,
                 axiscol=2, axislab=colnames(x), center=NULL,
                 radius=1, minradius=0.01, asp=1, ...)
object, x
A matrix or an object of class "flexclust".
data
Data to include in plot.
col
Plotting colors for data points.
directions
Integer value, how many different directions are toured.
steps
Integer, number of steps in each direction.
sec
Numerical, lower bound for the number of seconds each direction takes.
sleep
Numerical, sleep for as many seconds after each picture has been plotted.
axiscol
If not NULL, the projections of the original coordinate axes are drawn in these colors.
axislab
Optional labels for the projected axes.
center
Center of the coordinate system to use in projected space. Default is the center of the plotting region.
radius
Relative size of the arrows.
minradius
Minimum radius of arrows to include.
asp, ...
Passed on to plot.
Two random locations are chosen, and the data are projected onto hyperplanes which are orthogonal to step vectors interpolating the two locations. The first two coordinates of the projected data are plotted. If directions is larger than one, then after the first steps plots one more random location is chosen, and the procedure is repeated from the current position to the new location, and so on.
The whole procedure is similar to a grand tour, but no attempt is made to optimize subsequent directions; randomTour simply chooses a random direction in each iteration. Use rggobi for the real thing.
Obviously the function needs a reasonably fast computer and graphics device to give a smooth impression; for x11 it may be necessary to use type="Xlib" rather than cairo.
Friedrich Leisch
if(interactive()){
  par(ask=FALSE)
  randomTour(iris[,1:4], axiscol=2:5)
  randomTour(iris[,1:4], col=as.numeric(iris$Species), axiscol=4)

  x <- matrix(runif(300), ncol=3)
  x <- rbind(x, x+1, x+2)
  cl <- cclust(x, k=3, save.data=TRUE)
  randomTour(cl, center=0, axiscol="black")

  ## now use predicted cluster membership for new data as colors
  randomTour(cl, center=0, axiscol="black",
             data=matrix(rnorm(3000, mean=1, sd=2), ncol=3))
}
The clusters are relabelled to obtain a unique labeling.
relabel(object, by, ...)
## S4 method for signature 'kccasimple,character'
relabel(object, by, which = NULL, ...)
## S4 method for signature 'kccasimple,integer'
relabel(object, by, ...)
## S4 method for signature 'kccasimple,missing'
relabel(object, by, ...)
## S4 method for signature 'stepFlexclust,integer'
relabel(object, by = "series", ...)
## S4 method for signature 'stepFlexclust,missing'
relabel(object, by, ...)
object
An object of class "kccasimple" or "stepFlexclust".
by
If a character vector, it needs to be one of "mean", "median", "manual", "variable", "centers", "shadow", "symmshadow" or "series", see details.
which
Either an integer vector indicating the ordering or a vector of length one indicating the variable used for ordering.
...
Currently not used.
If by is a character vector with value "mean" or "median", the clusters are ordered by the mean or median values over all variables for each cluster. If by = "manual", which needs to be a vector indicating the ordering. If by = "variable", which needs to indicate the variable used to determine the ordering. If by is "centers", "shadow" or "symmshadow", cluster similarities are calculated using clusterSim and used to determine an ordering using seriate from package seriation. If by = "series", the relabeling is performed over a series of clusterings to minimize the misclassification.
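A brief usage sketch of some of the by methods (the manual ordering shown is arbitrary):

data("Nclus")
cl4 <- cclust(Nclus, k=4)
relabel(cl4, by="mean")                      # order by mean over variables
relabel(cl4, by="manual", which=c(3,1,4,2))  # explicit ordering
relabel(cl4, by="variable", which=1)         # order by the first variable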
Friedrich Leisch
Compute and plot shadows and silhouettes.
## S4 method for signature 'kccasimple'
shadow(object, ...)
## S4 method for signature 'kcca'
Silhouette(object, data=NULL, ...)
object
An object of class "kccasimple" (shadow) or "kcca" (Silhouette).
data
Data to compute silhouette values for. If the cluster object was created with save.data=TRUE, the stored data are used by default.
...
Currently not used.
The shadow value of each data point is defined as twice the distance to the closest centroid divided by the sum of distances to the closest and second-closest centroid. If the shadow value of a point is close to 0, then the point is close to its cluster centroid. If the shadow value is close to 1, it is almost equidistant to the two centroids. Thus, a cluster that is well separated from all other clusters should have many points with small shadow values.
The silhouette value of a data point is defined as the scaled difference between the average dissimilarity of a point to all points in its own cluster and the smallest average dissimilarity to the points of a different cluster. Large silhouette values indicate good separation.
The main difference between silhouette values and shadow values is that we replace average dissimilarities to points in a cluster by dissimilarities to point averages (=centroids). See Leisch (2009) for details.
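The shadow definition is short enough to state in code; a minimal sketch (shadowValues is a hypothetical helper), assuming d is a matrix of distances from each point (row) to each centroid (column):

## shadow value: 2*d1/(d1+d2), with d1 and d2 the distances to the
## closest and second-closest centroid (hypothetical helper)
shadowValues <- function(d){
    s <- t(apply(d, 1, sort))
    2 * s[,1] / (s[,1] + s[,2])
}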
Friedrich Leisch
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 20(4), 457–469, 2010.
data(Nclus)
set.seed(1)
c5 <- cclust(Nclus, 5, save.data=TRUE)
c5
plot(c5)

## high shadow values indicate clusters with *bad* separation
shadow(c5)
plot(shadow(c5))

## high Silhouette values indicate clusters with *good* separation
Silhouette(c5)
plot(Silhouette(c5))
Shadow star plots and corresponding panel functions.
shadowStars(object, which=1:2, project=NULL, width=1, varwidth=FALSE,
            panel=panelShadowStripes, box=NULL, col=NULL, add=FALSE, ...)

panelShadowStripes(x, col, ...)
panelShadowViolin(x, ...)
panelShadowBP(x, ...)
panelShadowSkeleton(x, ...)
object
An object of class "kcca".
which
Index numbers of dimensions of (projected) input space to plot.
project
Projection object for which a predict method exists, e.g., the result of prcomp.
width
Width of vertices connecting the cluster centroids.
varwidth
Logical, shall all vertices have the same width or should the width be proportional to the number of points shown on the vertex?
panel
Function used to draw vertices.
box
Color of rectangle drawn around each vertex.
col
A vector of colors for the clusters.
add
Logical, start a new plot?
...
Passed on to the panel function.
x
Shadow values of data points corresponding to the vertex.
The shadow value of each data point is defined as twice the distance to the closest centroid divided by the sum of distances to the closest and second-closest centroid. If the shadow value of a point is close to 0, then the point is close to its cluster centroid. If the shadow value is close to 1, it is almost equidistant to the two centroids. Thus, a cluster that is well separated from all other clusters should have many points with small shadow values.
The neighborhood graph of a cluster solution connects two centroids by a vertex if at least one data point has the two centroids as closest and second closest. The width of the vertex is proportional to the sum of shadow values of all points having these two as closest and second closest. A shadow star depicts the distribution of shadow values on the vertex, see Leisch (2009) for details.
Currently four panel functions are available:

panelShadowStripes:
line segment for each shadow value.
panelShadowViolin:
violin plot of shadow values.
panelShadowBP:
box-percentile plot of shadow values.
panelShadowSkeleton:
average shadow value.
Friedrich Leisch
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 20(4), 457–469, 2010.
data(Nclus)
set.seed(1)
c5 <- cclust(Nclus, 5, save.data=TRUE)
c5
plot(c5)

shadowStars(c5)
shadowStars(c5, varwidth=TRUE)
shadowStars(c5, panel=panelShadowViolin)
shadowStars(c5, panel=panelShadowBP)

## always use varwidth=TRUE with panelShadowSkeleton, otherwise a few
## large shadow values can lead to misleading results:
shadowStars(c5, panel=panelShadowSkeleton)
shadowStars(c5, panel=panelShadowSkeleton, varwidth=TRUE)
Create a segment level stability across solutions plot, possibly using an additional variable for coloring the nodes.
slsaplot(object, nodecol = NULL, ...)
object
An object returned by stepFlexclust.
nodecol
A numeric vector of length equal to the number of observations clustered in object, used to color the nodes.
...
Additional graphical parameters to modify the plot.
For more details see Dolnicar and Leisch (2017) and Dolnicar et al. (2018).
A list of length equal to the number of different cluster solutions minus one, containing numeric vectors of the entropy values used by default to color the nodes.
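A hedged usage sketch for nodecol, coloring the nodes by an external numeric variable instead of the default entropy values:

data("Nclus")
cl25 <- stepFlexclust(Nclus, k=2:5, multicore=FALSE)
slsaplot(cl25)                      # default: entropy coloring
slsaplot(cl25, nodecol=Nclus[,1])   # color nodes by the first variable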
Friedrich Leisch
Dolnicar S. and Leisch F. (2017) "Using Segment Level Stability to Select Target Segments in Data-Driven Market Segmentation Studies" Marketing Letters, 28 (3), pp. 423–436.
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
stepFlexclust, relabel, slswFlexclust
data("Nclus") cl25 <- stepFlexclust(Nclus, k=2:5) slsaplot(cl25) cl25 <- relabel(cl25) slsaplot(cl25)
data("Nclus") cl25 <- stepFlexclust(Nclus, k=2:5) slsaplot(cl25) cl25 <- relabel(cl25) slsaplot(cl25)
Assess segment level stability within solution.
slswFlexclust(x, object, ...)
## S4 method for signature 'resampleFlexclust,missing'
plot(x, y, ...)
## S4 method for signature 'resampleFlexclust'
boxplot(x, which=1, ylab=NULL, ...)
## S4 method for signature 'resampleFlexclust'
densityplot(x, data, which=1, ...)
## S4 method for signature 'resampleFlexclust'
summary(object)
x
A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns), passed to the clustering procedure; for the plot methods, an object of class "resampleFlexclust".
object
An object of class "kcca" (for slswFlexclust) or "resampleFlexclust" (for summary).
y
Missing.
which
Integer or character indicating which validation measure is used for plotting.
ylab
Axis label.
data
Not used.
...
Additional arguments; for details see below.
Function slswFlexclust accepts several additional arguments: nsamp is by default equal to 100 and sets the number of bootstrap pairs drawn; seed allows to set a random seed; multicore is by default TRUE and indicates if bootstrap samples should be drawn in parallel; verbose is by default FALSE and, if TRUE, progress information is shown during computations.
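For example (a sketch exercising the arguments described above):

data("Nclus")
cl3 <- kcca(Nclus, k=3)
slsw.cl3 <- slswFlexclust(Nclus, cl3, nsamp=50, seed=123,
                          multicore=FALSE, verbose=TRUE)
summary(slsw.cl3)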
There are plotting as well as printing and summary methods implemented for objects of class "resampleFlexclust". In addition to a standard plot method, methods for densityplot and boxplot are provided.
For more details see Dolnicar and Leisch (2017) and Dolnicar et al. (2018).
An object of class "resampleFlexclust".
Friedrich Leisch
Dolnicar S. and Leisch F. (2017) "Using Segment Level Stability to Select Target Segments in Data-Driven Market Segmentation Studies" Marketing Letters, 28 (3), pp. 423–436.
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
data("Nclus") cl3 <- kcca(Nclus, k = 3) slsw.cl3 <- slswFlexclust(Nclus, cl3, nsamp = 20) plot(Nclus, col = clusters(cl3)) plot(slsw.cl3) densityplot(slsw.cl3) boxplot(slsw.cl3)
data("Nclus") cl3 <- kcca(Nclus, k = 3) slsw.cl3 <- slswFlexclust(Nclus, cl3, nsamp = 20) plot(Nclus, col = clusters(cl3)) plot(slsw.cl3) densityplot(slsw.cl3) boxplot(slsw.cl3)
Runs clustering algorithms repeatedly for different numbers of clusters and returns the minimum within-cluster distance solution for each.
stepFlexclust(x, k, nrep=3, verbose=TRUE, FUN=kcca, drop=TRUE,
              group=NULL, simple=FALSE, save.data=FALSE, seed=NULL,
              multicore=TRUE, ...)

stepcclust(...)

## S4 method for signature 'stepFlexclust,missing'
plot(x, y, type=c("barplot", "lines"), totaldist=NULL,
     xlab=NULL, ylab=NULL, ...)

## S4 method for signature 'stepFlexclust'
getModel(object, which=1)
x, ...
Passed to FUN.
k
A vector of integers passed in turn to the k argument of FUN.
nrep
For each value of k, run FUN nrep times and keep only the best solution.
FUN
Cluster function to use, typically kcca or cclust.
verbose
If TRUE, show progress information during computations.
drop
If TRUE and k is of length 1, then a single cluster object is returned instead of a "stepFlexclust" object.
group
An optional grouping vector for the data, see kcca for details.
simple
Return an object of class "kccasimple"?
save.data
Save a copy of x in the return object?
seed
If not NULL, set the random seed before running the algorithms.
multicore
If TRUE, the repeated runs are computed in parallel.
y
Not used.
type
Create a barplot or lines plot.
totaldist
Include value for 1-cluster solution in plot? Default is TRUE if a 2-cluster solution is contained in the object.
xlab, ylab
Graphical parameters.
object
Object of class "stepFlexclust".
which
Number of model to get. If character, interpreted as number of clusters.
stepcclust is a simple wrapper for stepFlexclust(..., FUN=cclust).
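The two calls below should therefore be equivalent (a hedged sketch; seed fixes the random initialization):

data("Nclus")
s1 <- stepcclust(Nclus, k=2:4, seed=1, multicore=FALSE)
s2 <- stepFlexclust(Nclus, k=2:4, FUN=cclust, seed=1, multicore=FALSE)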
Friedrich Leisch
data("Nclus") plot(Nclus) ## multicore off for CRAN checks cl1 <- stepFlexclust(Nclus, k=2:7, FUN=cclust, multicore=FALSE) cl1 plot(cl1) # two ways to do the same: getModel(cl1, 4) cl1[[4]] opar <- par("mfrow") par(mfrow=c(2, 2)) for(k in 3:6){ image(getModel(cl1, as.character(k)), data=Nclus) title(main=paste(k, "clusters")) } par(opar)
data("Nclus") plot(Nclus) ## multicore off for CRAN checks cl1 <- stepFlexclust(Nclus, k=2:7, FUN=cclust, multicore=FALSE) cl1 plot(cl1) # two ways to do the same: getModel(cl1, 4) cl1[[4]] opar <- par("mfrow") par(mfrow=c(2, 2)) for(k in 3:6){ image(getModel(cl1, as.character(k)), data=Nclus) title(main=paste(k, "clusters")) } par(opar)
Plot distance of data points to cluster centroids using stripes.
stripes(object, groups=NULL, type=c("first", "second", "all"),
        beside=(type!="first"), col=NULL, gp.line=NULL, gp.bar=NULL,
        gp.bar2=NULL, number=TRUE, legend=!is.null(groups), ylim=NULL,
        ylab="distance from centroid", margins=c(2,5,3,2), ...)
object
An object of class "kcca".
groups
Grouping variable to color-code the stripes. By default cluster membership is used as groups.
type
Plot distance to closest, closest and second-closest, or to all centroids?
beside
Logical, make different stripes for different clusters?
col
Vector of colors for clusters or groups.
gp.line, gp.bar, gp.bar2
Graphical parameters for horizontal lines and background rectangular areas, see gpar.
number
Logical, write cluster numbers on x-axis?
legend
Logical, plot a legend for the groups?
ylim, ylab
Graphical parameters for the y-axis.
margins
Margins of the plot.
...
Further graphical parameters.
A simple, yet very effective plot for visualizing the distance of each point from its closest and second-closest cluster centroids is a stripes plot. For each of the k clusters we have a rectangular area, which we optionally vertically divide into k smaller rectangles (beside=TRUE). Then we draw a horizontal line segment for each data point, marking the distance of the data point from the corresponding centroid.
Friedrich Leisch
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 20(4), 457–469, 2010.
bw05 <- bundestag(2005)
bavaria <- bundestag(2005, state="Bayern")

set.seed(1)
c4 <- cclust(bw05, k=4, save.data=TRUE)
plot(c4)

stripes(c4)
stripes(c4, beside=TRUE)
stripes(c4, type="sec")
stripes(c4, type="sec", beside=FALSE)
stripes(c4, type="all")
stripes(c4, groups=bavaria)

## ugly, but shows how colors of all parts can be changed
library("grid")
stripes(c4, type="all",
        gp.bar=gpar(col="red", lwd=3, fill="white"),
        gp.bar2=gpar(col="green", lwd=3, fill="black"))
In 2006 a sample of 1000 respondents representative of the adult Australian population was asked about their environmental behaviour when on vacation. In addition, the survey included a list of statements about vacation motives like "I want to rest and relax," "I use my holiday for the health and beauty of my body," and "Cultural offers and sights are a crucial factor." Answers are binary ("applies", "does not apply").
data(vacmot)
Data frame vacmot has 1000 observations on 20 binary variables on travel motives. Data frame vacmotdesc has 1000 observations on sociodemographic descriptor variables, mean moral obligation to protect the environment score, mean NEP score, and mean environmental behaviour score, see Dolnicar & Leisch (2008) for details.
In addition, integer vector vacmot6 contains the 6-cluster partition presented in Dolnicar & Leisch (2008).
The data set was collected by the Institute for Innovation in Business and Social Research, University of Wollongong (NSW, Australia).
Sara Dolnicar and Friedrich Leisch. An investigation of tourists' patterns of obligation to protect the environment. Journal of Travel Research, 46:381-391, 2008.
Sara Dolnicar and Friedrich Leisch. Using graphical statistics to better understand market segmentation solutions. International Journal of Market Research, 56(2):97-120, 2014.
data(vacmot)
summary(vacmotdesc)
dotchart(sort(colMeans(vacmot)))

## reproduce Figure 6 from Dolnicar & Leisch (2008)
cl6 <- kcca(vacmot, k=vacmot6, control=list(iter=0))
barchart(cl6)
Part of an Australian survey on motivation of volunteers to work for non-profit organisations like Red Cross, State Emergency Service, Rural Fire Service, Surf Life Saving, Rotary, Parents and Citizens Associations, etc.
data(volunteers)
A data frame with 1415 observations on the following 21 variables: age and gender of respondents plus 19 binary motivation items (1 applies/ 0 does not apply).
GENDER
Gender of respondent.
AGEG
Age group, a factor with categorized age of respondents.
meet.people
I can meet different types of people.
no.one.else
There is no-one else to do the work.
example
It sets a good example for others.
socialise
I can socialise with people who are like me.
help.others
It gives me the chance to help others.
give.back
I can give something back to society.
career
It will help my career prospects.
lonely
It makes me feel less lonely.
active
It keeps me active.
community
It will improve my community.
cause
I can support an important cause.
faith
I can put faith into action.
services
I want to maintain services that I may use one day.
children
My children are involved with the organisation.
good.job
I feel like I am doing a good job.
benefited
I know someone who has benefited from the organisation.
network
I can build a network of contacts.
recognition
I can gain recognition within the community.
mind.off
It takes my mind off other things.
The volunteering data was collected by the Institute for Innovation in Business and Social Research, University of Wollongong (NSW, Australia), using funding from Bushcare Wollongong and the Australian Research Council under the ARC Linkage Grant scheme (LP0453682).
Melanie Randle and Sara Dolnicar. Not Just Any Volunteers: Segmenting the Market to Attract the High-Contributors. Journal of Non-profit and Public Sector Marketing, 21(3), 271-282, 2009.
Melanie Randle and Sara Dolnicar. Self-congruity and volunteering: A multi-organisation comparison. European Journal of Marketing, 45(5), 739-758, 2011.
Melanie Randle, Friedrich Leisch, and Sara Dolnicar. Competition or collaboration? The effect of non-profit brand image on volunteer recruitment strategy. Journal of Brand Management, 20(8):689-704, 2013.