| Title: | Flexible Cluster Algorithms |
|---|---|
| Description: | The main function kcca implements a general framework for k-centroids cluster analysis supporting arbitrary distance measures and centroid computation. Further cluster methods include hard competitive learning, neural gas, and QT clustering. There are numerous visualization methods for cluster results (neighborhood graphs, convex cluster hulls, barcharts of centroids, ...), and bootstrap methods for the analysis of cluster stability. |
| Authors: | Friedrich Leisch [aut] (<https://orcid.org/0000-0001-7278-1983>, maintainer up to 2024), Evgenia Dimitriadou [ctb], Lena Ortega Menjivar [ctb] |
| Maintainer: | Bettina Grün <[email protected]> |
| License: | GPL-2 |
| Version: | 1.5.0 |
| Built: | 2026-05-09 08:48:34 UTC |
| Source: | https://github.com/cran/flexclust |
Measurements at the beginning of the 4th grade (when the national average is 4.0) and of the 6th grade in 25 schools in New Haven.
    data(achieve)
A data frame with 25 observations on the following 4 variables.
read4: 4th grade reading.
arith4: 4th grade arithmetic.
read6: 6th grade reading.
arith6: 6th grade arithmetic.
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
A German manufacturer of premium cars asked customers approximately 3 months after a car purchase which characteristics of the car were most important for the decision to buy the car. The survey was done in 1983 and the data set contains all responses without missing values.
    data(auto)
A data frame with 793 observations on the following 46 variables.
model: A factor with levels A, B, C, or D; model bought by the customer.
gear: A factor with levels 4 gears, 5 econo, 5 sport, or automatic.
leasing: A logical vector, was leasing used to finance the car?
usage: A factor with levels private, both, business.
previous_model: A factor describing which type of car was owned directly before the purchase.
other_consider: A factor with levels same manuf, other manuf, both, or none.
test_drive: A logical vector, did you do a test drive?
info_adv: A logical vector, was advertising an important source of information?
info_exp: A logical vector, was experience an important source of information?
info_rec: A logical vector, were recommendations an important source of information?
ch_clarity: A logical vector.
ch_economy: A logical vector.
ch_driving_properties: A logical vector.
ch_service: A logical vector.
ch_interior: A logical vector.
ch_quality: A logical vector.
ch_technology: A logical vector.
ch_model_continuity: A logical vector.
ch_comfort: A logical vector.
ch_reliability: A logical vector.
ch_handling: A logical vector.
ch_reputation: A logical vector.
ch_concept: A logical vector.
ch_character: A logical vector.
ch_power: A logical vector.
ch_resale_value: A logical vector.
ch_styling: A logical vector.
ch_safety: A logical vector.
ch_sporty: A logical vector.
ch_consumption: A logical vector.
ch_space: A logical vector.
satisfaction: A numeric vector describing overall satisfaction (1=very good, 10=very bad).
good1: Conception, styling, dimensions.
good2: Auto body.
good3: Driving and coupled axles.
good4: Engine.
good5: Electronics.
good6: Financing and customer service.
good7: Other.
sporty: What do you think about the balance of sportiness and comfort? (good, more sport, more comfort).
drive_char: Driving characteristics (gentle < speedy < powerfull < extreme).
tempo: Which average speed do you prefer on the German Autobahn, in km/h? (ordered levels: < 130, 130-150, 150-180, > 180)
consumption: An ordered factor with levels low < ok < high < too high.
gender: A factor with levels male and female.
occupation: A factor with levels self-employed, freelance, and employee.
household: Size of household, an ordered factor with levels 1-2 < >=3.
The original German data are in the public domain and available from LMU Munich (doi:10.5282/ubm/data.14). The variable names and help page were translated to English and converted into Rd format by Friedrich Leisch.
Open Data LMU (1983): Umfrage unter Kunden einer Automobilfirma, doi:10.5282/ubm/data.14
    data(auto)
    summary(auto)
Barplot of cluster centers or other cluster statistics.
    ## S4 method for signature 'kcca'
    barplot(height, bycluster = TRUE, oneplot = TRUE,
            data = NULL, FUN = colMeans, main = deparse(substitute(height)),
            which = 1:height@k, names.arg = NULL,
            oma = par("oma"), col = NULL, mcol = "darkred", srt = 45, ...)

    ## S4 method for signature 'kcca'
    barchart(x, data, xlab="", strip.labels=NULL,
             strip.prefix="Cluster ", col=NULL, mcol="darkred", mlcol=mcol,
             which=NULL, legend=FALSE, shade=FALSE, diff=NULL, byvar=FALSE,
             clusters=1:x@k, ...)

    ## S4 method for signature 'hclust'
    barchart(x, data, xlab="", strip.labels=NULL,
             strip.prefix="Cluster ", col=NULL, mcol="darkred", mlcol=mcol,
             which=NULL, shade=FALSE, diff=NULL, byvar=FALSE, k=2, ...)

    ## S4 method for signature 'bclust'
    barchart(x, data, xlab="", strip.labels=NULL,
             strip.prefix="Cluster ", col=NULL, mcol="darkred", mlcol=mcol,
             which=NULL, legend=FALSE, shade=FALSE, diff=NULL, byvar=FALSE,
             k=x@k, clusters=1:k, ...)
height, x
|
An object of class |
bycluster |
If |
oneplot |
If |
data |
If not |
FUN |
The function to be applied to each cluster for calculating
the bar heights. Only used, if |
which |
For |
names.arg |
A vector of names to be plotted below each bar. |
main, oma, xlab, ...
|
Graphical parameters. |
col |
Vector of colors for the clusters. |
mcol, mlcol
|
If not |
srt |
Number between 0 and 90, rotation of the x-axis labels. |
strip.labels |
Vector of strings for the strips of the Trellis display. |
strip.prefix |
Prefix string for the strips of the Trellis display. |
legend |
If |
shade |
If |
diff |
A numerical vector of length two with absolute and
relative deviations for shading, default is |
byvar |
If |
clusters |
Integer vector of clusters to plot. |
k |
Integer specifying the desired number of clusters. |
The flexclust barchart method uses a horizontal arrangement of the bars and sorts them from top to bottom. Default barcharts in lattice are the other way round (bottom to top). See the examples below for how this affects, e.g., manual labels for the y axis.
The barplot method is legacy code and only maintained to keep up
with changes in R; all active development is done on barchart.
Friedrich Leisch
Sara Dolnicar and Friedrich Leisch. Using graphical statistics to better understand market segmentation solutions. International Journal of Market Research, 56(2), 97-120, 2014.
    cl <- cclust(iris[,-5], k=3)
    barplot(cl)
    barplot(cl, bycluster=FALSE)

    ## plot the maximum instead of mean value per cluster:
    barplot(cl, bycluster=FALSE, data=iris[,-5],
            FUN=function(x) apply(x,2,max))

    ## use lattice for plotting:
    barchart(cl)
    ## automatic abbreviation of labels
    barchart(cl, scales=list(abbreviate=TRUE))
    ## origin of bars at zero
    barchart(cl, scales=list(abbreviate=TRUE), origin=0)

    ## Use manual labels. Note that the flexclust barchart orders bars
    ## from top to bottom (the default does it the other way round), hence
    ## we have to rev() the labels:
    LAB <- c("SL", "SW", "PL", "PW")
    barchart(cl, scales=list(y=list(labels=rev(LAB))), origin=0)

    ## deviation of each cluster center from the population means
    barchart(cl, origin=rev(cl@xcent), mlcol=NULL)

    ## use shading to highlight large deviations from population mean
    barchart(cl, shade=TRUE)

    ## use smaller deviation limit than default and add a legend
    barchart(cl, shade=TRUE, diff=0.2, legend=TRUE)
Cluster the data in x using the bagged clustering
algorithm. A partitioning cluster algorithm such as
cclust is run repeatedly on bootstrap samples from the
original data. The resulting cluster centers are then combined using
the hierarchical cluster algorithm hclust.
    bclust(x, k = 2, base.iter = 10, base.k = 20, minsize = 0,
           dist.method = "euclidian", hclust.method = "average",
           FUN = "cclust", verbose = TRUE, final.cclust = FALSE,
           resample = TRUE, weights = NULL, maxcluster = base.k, ...)

    ## S4 method for signature 'bclust,missing'
    plot(x, y, maxcluster = x@maxcluster, main = "", ...)

    ## S4 method for signature 'bclust,missing'
    clusters(object, newdata, k, ...)

    ## S4 method for signature 'bclust'
    parameters(object, k)
x |
Matrix of inputs (or object of class |
k |
Number of clusters. |
base.iter |
Number of runs of the base cluster algorithm. |
base.k |
Number of centers used in each repetition of the base method. |
minsize |
Minimum number of points in a base cluster. |
dist.method |
Distance method used for the hierarchical
clustering, see |
hclust.method |
Linkage method used for the hierarchical
clustering, see |
FUN |
Partitioning cluster method used as base algorithm. |
verbose |
Output status messages. |
final.cclust |
If |
resample |
Logical, if |
weights |
Vector of length |
maxcluster |
Maximum number of clusters memberships are to be computed for. |
object |
Object of class |
main |
Main title of the plot. |
... |
Optional arguments to be passed to the base method
in |
y |
Missing. |
newdata |
An optional data matrix with the same number of columns as the cluster centers. If omitted, the fitted values are used. |
First, base.iter bootstrap samples of the original data in
x are created by drawing with replacement. The base cluster
method is run on each of these samples with base.k
centers. The base method (argument FUN) must be the name of a partitioning
cluster function returning an object with the same slots as the
return value of cclust.
This results in a collection of base.iter * base.k
centers, which are subsequently clustered using the hierarchical
method hclust. Base centers with fewer than
minsize points in their respective partitions are removed
before the hierarchical clustering. The resulting dendrogram is
then cut to produce k clusters.
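The procedure described above can be sketched in base R. This is only an illustration, not the code used by bclust: it substitutes kmeans for the base method, skips the minsize filter, and assigns each observation to the final cluster of its closest base center. The names base_iter, base_k, all_centers and bagged_cluster are made up for this sketch.

```r
## Sketch of bagged clustering with base R only (not flexclust's implementation).
set.seed(42)
x <- as.matrix(iris[, 1:4])

base_iter <- 10   # number of bootstrap samples
base_k    <- 5    # centers per base run
k         <- 3    # final number of clusters

## 1. Run the base method (here: kmeans) on bootstrap samples and
##    collect all base centers in one matrix.
all_centers <- do.call(rbind, lapply(seq_len(base_iter), function(i) {
  idx <- sample(nrow(x), replace = TRUE)
  kmeans(x[idx, ], centers = base_k, nstart = 2)$centers
}))

## 2. Cluster the base centers hierarchically and cut the dendrogram.
hc <- hclust(dist(all_centers), method = "average")
center_cluster <- cutree(hc, k = k)

## 3. Assign each observation to the final cluster of its closest base center.
n <- nrow(x)
d <- as.matrix(dist(rbind(x, all_centers)))[seq_len(n),
                                            n + seq_len(nrow(all_centers))]
bagged_cluster <- center_cluster[apply(d, 1, which.min)]
table(bagged_cluster)
```

With 10 bootstrap samples and 5 centers each, the hierarchical step works on 50 base centers rather than on the 150 observations.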
bclust returns objects of class
"bclust" including the slots
hclust |
Return value of the hierarchical clustering of the
collection of base centers (Object of class |
cluster |
Vector with indices of the clusters the inputs are assigned to. |
centers |
Matrix of centers of the final clusters. Only useful, if the hierarchical clustering method produces convex clusters. |
allcenters |
Matrix of all |
Friedrich Leisch
Friedrich Leisch. Bagged clustering. Working Paper 51, SFB “Adaptive Information Systems and Modeling in Economics and Management Science”, August 1999. doi:10.57938/9b129f95-b53b-44ce-a129-5b7a1168d832
Sara Dolnicar and Friedrich Leisch. Winter tourist segments in Austria: Identifying stable vacation styles using bagged clustering techniques. Journal of Travel Research, 41(3):281-292, 2003.
    data(iris)
    bc1 <- bclust(iris[,1:4], 3, base.k=5)
    plot(bc1)

    table(clusters(bc1, k=3))
    parameters(bc1, k=3)
Birth and death rates for 70 countries.
    data(birth)
A data frame with 70 observations on the following 2 variables.
birth: Birth rate (in percent).
death: Death rate (in percent).
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
Runs clustering algorithms repeatedly for different numbers of clusters on bootstrap replicates of the original data and returns the corresponding cluster assignments, centroids and (adjusted) Rand indices comparing pairs of partitions.
    bootFlexclust(x, k, nboot=100, correct=TRUE, seed=NULL,
                  multicore=TRUE, verbose=FALSE, ...)

    ## S4 method for signature 'bootFlexclust'
    summary(object)

    ## S4 method for signature 'bootFlexclust,missing'
    plot(x, y, ...)

    ## S4 method for signature 'bootFlexclust'
    boxplot(x, ...)

    ## S4 method for signature 'bootFlexclust'
    densityplot(x, data, ...)
x, k, ...
|
Passed to |
nboot |
Number of bootstrap pairs of partitions. |
correct |
Logical, correct the Rand index for agreement by chance (also called adjusted Rand index)? |
seed |
If not |
multicore |
If |
verbose |
If |
y, data
|
Not used. |
object |
An object of class |
Availability of multicore is checked
when flexclust is loaded. This information is stored and can be
obtained using
getOption("flexclust")$have_multicore. Set to FALSE
for debugging and more sensible error messages in case something
goes wrong.
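To make the quantity computed for each bootstrap pair more concrete, here is a base R sketch of the adjusted Rand index together with one way a pair of partitions could be obtained. It assumes that both bootstrap solutions are compared by assigning the original data to their centers; that reading, kmeans as the base method, and the helper names adj_rand, nearest and one_pair are assumptions of this sketch, not flexclust's code.

```r
## Adjusted Rand index of two partitions from their contingency table.
adj_rand <- function(a, b) {
  tab <- table(a, b)
  n   <- sum(tab)
  sum_ij <- sum(choose(tab, 2))              # pairs together in both partitions
  sum_a  <- sum(choose(rowSums(tab), 2))     # pairs together in partition a
  sum_b  <- sum(choose(colSums(tab), 2))     # pairs together in partition b
  expected <- sum_a * sum_b / choose(n, 2)   # agreement expected by chance
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

## Index of the closest center (squared Euclidean distance) for each row of x.
nearest <- function(x, centers) {
  d <- sapply(seq_len(nrow(centers)), function(j)
    rowSums(sweep(x, 2, centers[j, ])^2))
  max.col(-d, ties.method = "first")
}

set.seed(1)
x <- matrix(runif(400), ncol = 2)            # data uniform on the unit square

## One bootstrap pair: cluster two bootstrap samples, then compare the
## partitions both solutions induce on the original data.
one_pair <- function(k) {
  c1 <- kmeans(x[sample(nrow(x), replace = TRUE), ], k, nstart = 2)$centers
  c2 <- kmeans(x[sample(nrow(x), replace = TRUE), ], k, nstart = 2)$centers
  adj_rand(nearest(x, c1), nearest(x, c2))
}
ari <- replicate(20, one_pair(4))
summary(ari)
```

Values close to 1 indicate that the two bootstrap solutions induce nearly the same partition of the data.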
Friedrich Leisch
    ## Not run:
    ## data uniform on unit square
    x <- matrix(runif(400), ncol=2)

    cl <- FALSE

    ## to run bootstrap replications on a workstation cluster do the following:
    library("parallel")
    cl <- makeCluster(2, type = "PSOCK")
    clusterCall(cl, function() require("flexclust"))

    ## 50 bootstrap replicates for speed in example,
    ## use more for real applications
    bcl <- bootFlexclust(x, k=2:7, nboot=50, FUN=cclust, multicore=cl)

    bcl
    summary(bcl)

    ## splitting the square into four quadrants should be the most stable
    ## solution (increase nboot if not)
    plot(bcl)
    densityplot(bcl, from=0)

    ## End(Not run)
Results of the elections 2002, 2005 or 2009 for the German Bundestag, the first chamber of the German parliament.
    data(btw2002)
    data(btw2005)
    data(btw2009)
    bundestag(year, second=TRUE, percent=TRUE, nazero=TRUE, state=FALSE)
year |
Numeric or character, year of the election. |
second |
Logical, return second or first votes? |
percent |
Logical, return percentages or absolute numbers? |
nazero |
Logical, convert |
state |
Logical or character. If |
btw200x are data frames with 299 rows
(corresponding to constituencies) and 17 columns. All columns except
state are numeric.
state: Factor, the 16 German federal states.
eligible: Number of citizens eligible to vote.
votes: Number of eligible citizens who did vote.
invalid1, invalid2: Number of invalid first and second votes (see details below).
valid1, valid2: Number of valid first and second votes.
SPD1, SPD2: Number of first and second votes for the Social Democrats.
UNION1, UNION2: Number of first and second votes for CDU/CSU, the conservative Christian Democrats.
GRUENE1, GRUENE2: Number of first and second votes for the Green Party.
FDP1, FDP2: Number of first and second votes for the Liberal Party.
LINKE1, LINKE2: Number of first and second votes for the Left Party (PDS in 2002).
Missing values indicate that a party did not field a candidate in the corresponding constituency.
btw200x are the original data sets.
bundestag() is a helper function which extracts first
or second votes, calculates percentages (number of votes for a party divided by
number of valid votes), replaces missing values by zero, and converts
the result from a data frame to a matrix. By default
it returns the percentage of second votes for each party, which
determines the number of seats each party gets in parliament.
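The steps performed by the helper can be mimicked on a toy data frame. The numbers below are invented and the data frame toy is hypothetical; only the column naming follows the format described above. Whether bundestag() reports shares on a 0-1 or a 0-100 scale is not stated here, so the 0-1 scale is an assumption of this sketch.

```r
## Hypothetical toy data in the layout described above (numbers invented).
toy <- data.frame(
  state  = factor(c("Bayern", "Berlin", "Bremen")),
  valid2 = c(1000, 800, 500),
  SPD2   = c(300, 320, 250),
  UNION2 = c(550, 200, 100),
  LINKE2 = c(NA, 120, 50)    # NA: party did not run in this constituency
)

parties <- c("SPD2", "UNION2", "LINKE2")

## Votes for a party divided by the number of valid votes, row by row.
pct <- as.matrix(toy[, parties]) / toy$valid2

## Replace missing values by zero (the role of nazero = TRUE).
pct[is.na(pct)] <- 0

## Drop the "2" suffix so that columns are named after the parties.
colnames(pct) <- sub("2$", "", colnames(pct))
pct
```

The result is a numeric matrix with one row per constituency, which is the shape most clustering functions expect.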
Half of the Members of the German Bundestag are elected directly from Germany's 299 constituencies, the other half on the parties' state lists. Accordingly, each voter has two votes in the elections to the German Bundestag. The first vote, allowing voters to elect their local representatives to the Bundestag, decides which candidates are sent to Parliament from the constituencies.
The second vote is cast for a party list. And it is this second vote that determines the relative strengths of the parties represented in the Bundestag. At least 598 Members of the German Bundestag are elected in this way. In addition to this, there are certain circumstances in which some candidates win what are known as “overhang mandates” when the seats are being distributed.
Homepage of the Bundestag: https://www.bundestag.de
    p02 <- bundestag(2002)
    pairs(p02)
    p05 <- bundestag(2005)
    pairs(p05)
    p09 <- bundestag(2009)
    pairs(p09)

    state <- bundestag(2002, state=TRUE)
    table(state)

    start.with.b <- bundestag(2002, state="^B")
    table(start.with.b)

    pairs(p09, col=2-(state=="Bayern"))
Separate boxplots of the variables in each cluster, compared with the boxplot for the complete sample.
    ## S4 method for signature 'kcca'
    bwplot(x, data, xlab="",
           strip.labels=NULL, strip.prefix="Cluster ",
           col=NULL, shade=!is.null(shadefun), shadefun=NULL, byvar=FALSE, ...)

    ## S4 method for signature 'bclust'
    bwplot(x, k=x@k, xlab="", strip.labels=NULL,
           strip.prefix="Cluster ", clusters=1:k, ...)
x |
An object of class |
data |
If not |
xlab, ...
|
Graphical parameters. |
col |
Vector of colors for the clusters. |
strip.labels |
Vector of strings for the strips of the Trellis display. |
strip.prefix |
Prefix string for the strips of the Trellis display. |
shade |
If |
shadefun |
A function or name of a function to compute which
boxes are shaded, e.g. |
byvar |
If |
k |
Number of clusters. |
clusters |
Integer vector of clusters to plot. |
    set.seed(1)
    cl <- cclust(iris[,-5], k=3, save.data=TRUE)
    bwplot(cl)
    bwplot(cl, byvar=TRUE)

    ## fill only boxes with color which do not contain the overall median
    ## (grey dot of background box)
    bwplot(cl, shade=TRUE)

    ## fill only boxes with color which do not overlap with the box of the
    ## complete sample (grey background box)
    bwplot(cl, shadefun="boxOverlap")
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
    cclust(x, k, dist = "euclidean", method = "kmeans",
           weights=NULL, control=NULL, group=NULL, simple=FALSE,
           save.data=FALSE)
x |
A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns). |
k |
Either the number of clusters, or a vector of cluster
assignments, or a matrix of initial
(distinct) cluster centroids. If a number, a random set of (distinct)
rows in |
dist |
Distance measure, one of |
method |
Clustering algorithm: one of |
weights |
An optional vector of weights for the observations
(rows of the |
control |
An object of class |
group |
Currently ignored. |
simple |
Return an object of class |
save.data |
Save a copy of |
This function uses the same computational engine as the earlier
function of the same name from package ‘cclust’. The main difference
is that it returns an S4 object of class "kcca", hence all
available methods for "kcca" objects can be used. By default
kcca and cclust use exactly the same algorithm,
but cclust will usually be much faster because it uses compiled
code.
If dist is "euclidean", the distance between the cluster
center and the data points is the Euclidean distance (ordinary kmeans
algorithm), and cluster means are used as centroids.
If "manhattan", the distance between the cluster
center and the data points is the sum of the absolute values of the
distances, and the column-wise cluster medians are used as centroids.
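The two distance/centroid pairings described above can be illustrated with a single k-centroids iteration in base R. This is a sketch only; the helper names dist_euclid, dist_manhattan and kcentroids_step are hypothetical and not part of the package, and a real fit would repeat the step until the assignments stop changing.

```r
## Distances between all rows of x and all rows of centers (n x k matrix).
dist_euclid <- function(x, centers)
  sapply(seq_len(nrow(centers)), function(j)
    sqrt(rowSums(sweep(x, 2, centers[j, ])^2)))

dist_manhattan <- function(x, centers)
  sapply(seq_len(nrow(centers)), function(j)
    rowSums(abs(sweep(x, 2, centers[j, ]))))

## One iteration: assign each point to its closest center, then recompute
## each centroid from the points assigned to it.
kcentroids_step <- function(x, centers, dist_fun, cent_fun) {
  cluster <- max.col(-dist_fun(x, centers), ties.method = "first")
  for (j in seq_len(nrow(centers))) {
    if (any(cluster == j))
      centers[j, ] <- cent_fun(x[cluster == j, , drop = FALSE])
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
x <- as.matrix(iris[, 1:4])
init <- x[sample(nrow(x), 3), ]

## Euclidean distance with cluster means (k-means style update).
km <- kcentroids_step(x, init, dist_euclid, colMeans)

## Manhattan distance with column-wise medians (k-medians style update).
kmd <- kcentroids_step(x, init, dist_manhattan,
                       function(z) apply(z, 2, median))
km$centers
kmd$centers
```

Pairing each distance with the centroid that minimizes it within a cluster (mean for squared Euclidean, median for Manhattan) is what keeps the criterion from increasing between iterations.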
If method is "kmeans", the classic kmeans algorithm as
given by MacQueen (1967) is
used, which works by repeatedly moving all cluster
centers to the mean of their respective Voronoi sets. If
"hardcl",
on-line updates are used (AKA hard competitive learning), which work by
randomly drawing an observation from x and moving the closest
center towards that point (e.g., Ripley 1996). If
"neuralgas" then the neural gas algorithm by Martinetz et al
(1993) is used. It is similar to hard competitive learning, but in
addition to the closest centroid also the second closest centroid is
moved in each iteration.
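The on-line update used by hard competitive learning can be sketched in a few lines of base R. The constant learning rate of 0.05 and the number of iterations are assumptions chosen for illustration; the package controls the learning rate through its control object, which is not reproduced here.

```r
## Hard competitive learning: move only the closest center towards a
## randomly drawn observation (base R sketch, constant learning rate).
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

centers <- x[sample(nrow(x), 2), ]   # two random rows as initial centers
rate    <- 0.05                      # assumed constant learning rate
n_iter  <- 500

for (iter in seq_len(n_iter)) {
  xi <- x[sample(nrow(x), 1), ]                 # draw one observation
  d  <- rowSums(sweep(centers, 2, xi)^2)        # squared distances to centers
  j  <- which.min(d)                            # winning (closest) center
  centers[j, ] <- centers[j, ] + rate * (xi - centers[j, ])
}
centers
```

Unlike the batch k-means update, each step touches one center and one observation, so the result depends on the order in which observations are drawn.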
An object of class "kcca".
Evgenia Dimitriadou and Friedrich Leisch
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
Martinetz T., Berkovich S., and Schulten K (1993). ‘Neural-Gas’ Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Transactions on Neural Networks, 4 (4), pp. 558–569.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge.
    ## a 2-dimensional example
    x <- rbind(matrix(rnorm(100, sd=0.3), ncol=2),
               matrix(rnorm(100, mean=1, sd=0.3), ncol=2))
    cl <- cclust(x,2)
    plot(x, col=predict(cl))
    points(cl@centers, pch="x", cex=2, col=3)

    ## a 3-dimensional example
    x <- rbind(matrix(rnorm(150, sd=0.3), ncol=3),
               matrix(rnorm(150, mean=2, sd=0.3), ncol=3),
               matrix(rnorm(150, mean=4, sd=0.3), ncol=3))
    cl <- cclust(x, 6, method="neuralgas", save.data=TRUE)
    pairs(x, col=predict(cl))
    plot(cl)
Returns a matrix of cluster similarities. Currently two methods for computing similarities of clusters are implemented, see details below.
    ## S4 method for signature 'kcca'
    clusterSim(object, data=NULL, method=c("shadow", "centers"),
               symmetric=FALSE, ...)

    ## S4 method for signature 'kccasimple'
    clusterSim(object, data=NULL, method=c("shadow", "centers"),
               symmetric=FALSE, ...)
object |
Fitted object. |
data |
Data to use for computation of the shadow values. If
the cluster object |
method |
Type of similarities, see details below. |
symmetric |
Compute symmetric or asymmetric shadow values?
Ignored if |
... |
Currently not used. |
If method="shadow" (the default), then the similarity of two
clusters is proportional to the number of points in a cluster, where
the centroid of the other cluster is second-closest. See Leisch (2006,
2008) for detailed formulas.
If method="centers", then first the pairwise distances between
all centroids are computed and rescaled to [0,1]. The similarity
between two clusters is then simply 1 minus the rescaled distance.
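The centroid-based similarity can be worked through by hand on three hypothetical centroids. The text does not say how distances are rescaled to [0,1]; dividing by the largest pairwise distance is an assumption made for this sketch.

```r
## Three hypothetical centroids in two dimensions.
centers <- rbind(c(0, 0), c(1, 0), c(4, 3))

d   <- as.matrix(dist(centers))   # pairwise Euclidean distances
sim <- 1 - d / max(d)             # assumed rescaling: divide by the maximum
round(sim, 2)
```

Under this rescaling the two centroids that are farthest apart get similarity 0, and every cluster has similarity 1 with itself.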
Friedrich Leisch
Friedrich Leisch. A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51 (2), 526–544, 2006.
Friedrich Leisch. Visualizing cluster analysis and finite mixture models. In Chun-houh Chen, Wolfgang Haerdle, and Antony Unwin, editors, Handbook of Data Visualization, Springer Handbooks of Computational Statistics. Springer Verlag, 2008.
    example(Nclus)

    clusterSim(cl)
    clusterSim(cl, symmetric=TRUE)

    ## should have similar structure but will be numerically different:
    clusterSim(cl, symmetric=TRUE, data=Nclus[sample(1:550, 200),])

    ## different concept of cluster similarity
    clusterSim(cl, method="centers")
These functions can be used to convert the results from cluster
functions like
kmeans or pam to objects
of class "kcca" and vice versa.
    as.kcca(object, ...)

    ## S3 method for class 'hclust'
    as.kcca(object, data, k, family=NULL, save.data=FALSE, ...)

    ## S3 method for class 'kmeans'
    as.kcca(object, data, save.data=FALSE, ...)

    ## S3 method for class 'partition'
    as.kcca(object, data=NULL, save.data=FALSE, ...)

    ## S3 method for class 'skmeans'
    as.kcca(object, data, save.data=FALSE, ...)

    ## S4 method for signature 'kccasimple,kmeans'
    coerce(from, to="kmeans", strict=TRUE)

    Cutree(tree, k=NULL, h=NULL)
object |
Fitted object. |
data |
Data which were used to obtain the clustering. For
|
save.data |
Save a copy of the data in the return object? |
k |
Number of clusters. |
family |
Object of class |
... |
Currently not used. |
from, to, strict
|
Usual arguments for |
tree |
A tree as produced by |
h |
Numeric scalar or vector with heights where the tree should be cut. |
The standard cutree function orders clusters such that
observation one is in cluster one, the first observation (as ordered
in the data set) not in cluster one is in cluster two,
etc. Cutree orders clusters as shown in the dendrogram from
left to right such that similar clusters have similar numbers. The
latter is used when converting to kcca.
For hierarchical clustering the cluster memberships of the converted
object can be different from the result of Cutree,
because one KCCA-iteration has to be performed in order to obtain a
valid kcca object. In this case a warning is issued.
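The difference between the two numbering schemes can be reproduced with base R alone. The relabeling below is a sketch of the idea (number clusters in the order they appear along the dendrogram) and is not claimed to be how Cutree is implemented; the variable names are made up.

```r
hc <- hclust(dist(USArrests))

## Standard cutree: clusters are numbered by first appearance in the data,
## so the first observation is always in cluster 1.
ct <- cutree(hc, k = 3)

## Cluster labels in the order they are met when reading the dendrogram
## from left to right (hc$order gives the leaf order of the plot).
left_to_right <- unique(ct[hc$order])

## Relabel so that 1 is the leftmost cluster in the plot, 2 the next, ...
relabeled <- match(ct, left_to_right)
names(relabeled) <- names(ct)

table(cutree = ct, dendrogram_order = relabeled)
```

The cross-table has exactly one non-zero entry per row, since the two results differ only in how the same clusters are numbered.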
Friedrich Leisch
    data(Nclus)

    cl1 <- kmeans(Nclus, 4)
    cl1
    cl1a <- as.kcca(cl1, Nclus)
    cl1a
    cl1b <- as(cl1a, "kmeans")

    library("cluster")
    cl2 <- pam(Nclus, 4)
    cl2
    cl2a <- as.kcca(cl2)
    cl2a
    ## the same
    cl2b <- as.kcca(cl2, Nclus)
    cl2b

    ## hierarchical clustering
    hc <- hclust(dist(USArrests))
    plot(hc)
    rect.hclust(hc, k=3)
    c3 <- Cutree(hc, k=3)
    k3 <- as.kcca(hc, USArrests, k=3)
    barchart(k3)
    table(c3, clusters(k3))
Mammals' teeth divided into the 4 groups: incisors, canines, premolars and molars.
    data(dentitio)
A data frame with 66 observations on the following 8 variables.
top.inc: Top incisors.
bot.inc: Bottom incisors.
top.can: Top canines.
bot.can: Bottom canines.
top.pre: Top premolars.
bot.pre: Bottom premolars.
top.mol: Top molars.
bot.mol: Bottom molars.
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
Computes and returns the matrix of pairwise distances between the rows of two data matrices, using the specified distance measure.
    dist2(x, y, method = "euclidean", p=2)
x |
A data matrix. |
y |
A vector or second data matrix. |
method |
the distance measure to be used. This must be one of
|
p |
The power of the Minkowski distance. |
This is a two-data-set equivalent of the standard function
dist. It returns a matrix of all pairwise
distances between rows in x and y. The current
implementation is efficient only if y does not have too many
rows (the code is vectorized in x but not in y).
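The remark on vectorization can be illustrated with a base R sketch that loops over the rows of y and treats all rows of x at once. The function pairwise_dist is a hypothetical stand-in written for this example; it is not the package's dist2 and supports only two distance measures.

```r
## Pairwise distances between rows of x and rows of y (base R sketch).
pairwise_dist <- function(x, y, method = c("euclidean", "manhattan")) {
  method <- match.arg(method)
  x <- as.matrix(x)
  ## a plain vector y is treated as a single row
  y <- if (is.matrix(y) || is.data.frame(y)) as.matrix(y) else matrix(y, nrow = 1)
  out <- matrix(0, nrow(x), nrow(y),
                dimnames = list(rownames(x), rownames(y)))
  for (j in seq_len(nrow(y))) {          # explicit loop over rows of y ...
    delta <- sweep(x, 2, y[j, ])         # ... vectorized over rows of x
    out[, j] <- switch(method,
      euclidean = sqrt(rowSums(delta^2)),
      manhattan = rowSums(abs(delta)))
  }
  out
}

x <- matrix(c(0, 0,
              3, 4), ncol = 2, byrow = TRUE)
y <- matrix(c(0, 0,
              0, 4), ncol = 2, byrow = TRUE)
pairwise_dist(x, y)
pairwise_dist(x, y, "manhattan")
```

The cost of the loop grows with the number of rows in y, which is why a layout with many rows in x and few in y is the favourable case.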
The definition of Canberra distance was wrong for negative data prior to version 1.3-5.
Friedrich Leisch
    x <- matrix(rnorm(20), ncol=4)
    rownames(x) = paste("X", 1:nrow(x), sep=".")
    y <- matrix(rnorm(12), ncol=4)
    rownames(y) = paste("Y", 1:nrow(y), sep=".")

    dist2(x, y)
    dist2(x, y, "man")

    data(milk)
    dist2(milk[1:5,], milk[4:6,])
Helper functions to create kccaFamily objects.
    distAngle(x, centers)
    distCanberra(x, centers)
    distCor(x, centers)
    distEuclidean(x, centers)
    distJaccard(x, centers)
    distManhattan(x, centers)
    distMax(x, centers)
    distMinkowski(x, centers, p=2)

    centAngle(x)
    centMean(x)
    centMedian(x)

    centOptim(x, dist)
    centOptim01(x, dist)
| Argument | Description |
|---|---|
| x | A data matrix. |
| centers | A matrix of centroids. |
| p | The power of the Minkowski distance. |
| dist | A distance function. |
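To make the argument layout concrete, here is a minimal base R sketch of a distance helper and a centroid helper that follow the same `(x, centers)` and `(x)` conventions. The names `my_dist_manhattan` and `my_cent_median` are hypothetical, and the return shapes (one column per centroid, one centroid vector per cluster) are assumptions of this illustration rather than a statement about the package internals.

```r
# Hypothetical distance helper: one row per observation, one column per centroid.
my_dist_manhattan <- function(x, centers) {
  z <- matrix(0, nrow = nrow(x), ncol = nrow(centers))
  for (k in seq_len(nrow(centers))) {
    z[, k] <- rowSums(abs(sweep(x, 2, centers[k, ])))
  }
  z
}

# Hypothetical centroid helper: column-wise median of the cluster members.
my_cent_median <- function(x) apply(x, 2, median)

x <- rbind(c(0, 0), c(2, 1), c(4, 5))
centers <- rbind(c(0, 0), c(4, 4))
my_dist_manhattan(x, centers)
my_cent_median(x)
```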
Friedrich Leisch
Hyperparameters for cluster algorithms.
Objects can be created by calls of the form
new("flexclustControl", ...). In addition, named lists can be
coerced to flexclustControl
objects; names are completed if unique (see examples).
Objects of class "flexclustControl" have the following slots:
iter.max: Maximum number of iterations.
tolerance: The algorithm is stopped when the (relative) change of the optimization criterion is smaller than tolerance.
verbose: If a positive integer, then progress is reported every verbose iterations. If 0, no output is generated during model fitting.
classify: Character string, one of "auto", "weighted", "hard" or "simann".
initcent: Character string, name of function for initial centroids, currently "randomcent" (the default) and "kmeanspp" are available.
gamma: Gamma value for weighted hard competitive learning.
simann: Parameters for simulated annealing optimization (only used when classify="simann").
ntry: Number of trials per iteration for QT clustering.
min.size: Clusters smaller than this value are treated as outliers.
Objects of class "cclustControl" inherit from
"flexclustControl" and have the following additional slots:
method: Learning rate for hard competitive learning, one of "polynomial" or "exponential".
pol.rate: Positive number for polynomial learning rate of form $1/\mathrm{iter}^{\mathrm{par}}$.
exp.rate: Vector of length 2 with parameters for exponential learning rate of form $\mathrm{par1} \cdot (\mathrm{par2}/\mathrm{par1})^{\mathrm{iter}/\mathrm{iter.max}}$.
ng.rate: Vector of length 4 with parameters for neural gas, see details below.
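The neural gas algorithm uses updates of form
$$c_{\mathrm{new}} = c_{\mathrm{old}} + e \, \exp(-m/l) \, (x - c_{\mathrm{old}})$$
for every centroid, where $m$ is the order (minus 1) of the
centroid with respect to distance to data point $x$ (0=closest, 1=second,
...). The parameters $e$ and $l$ are given by
$$e = \mathrm{par1} \cdot (\mathrm{par2}/\mathrm{par1})^{\mathrm{iter}/\mathrm{iter.max}}, \qquad l = \mathrm{par3} \cdot (\mathrm{par4}/\mathrm{par3})^{\mathrm{iter}/\mathrm{iter.max}}.$$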
See Martinetz et al (1993) for details of the algorithm, and the examples section on how to obtain default values.
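A single neural gas update can be sketched in base R as follows. This is an illustration under stated assumptions, not the package code: the function name `ng_step` is hypothetical, the step size `e` and neighbourhood range `l` are assumed to decay exponentially from `par1` to `par2` and from `par3` to `par4` over the iterations (in the spirit of Martinetz et al., 1993), and the default `rate` vector is an illustrative choice.

```r
# Hypothetical sketch of one neural gas update for a single data point x.
# rate = c(par1, par2, par3, par4); the default values are illustrative only.
ng_step <- function(centers, x, iter, iter.max, rate = c(0.5, 0.005, 10, 0.01)) {
  frac <- iter / iter.max
  e <- rate[1] * (rate[2] / rate[1])^frac   # step size, decays from par1 to par2
  l <- rate[3] * (rate[4] / rate[3])^frac   # neighbourhood range, decays from par3 to par4
  d <- sqrt(rowSums(sweep(centers, 2, x)^2))
  m <- rank(d, ties.method = "first") - 1   # 0 = closest centroid, 1 = second, ...
  xmat <- matrix(x, nrow(centers), ncol(centers), byrow = TRUE)
  # every centroid moves towards x, the closest one the most
  centers + e * exp(-m / l) * (xmat - centers)
}

centers <- rbind(c(0, 0), c(10, 10))
res <- ng_step(centers, x = c(1, 1), iter = 0, iter.max = 100)
res
```

All centroids are updated in every step; the factor `exp(-m / l)` shrinks the move for centroids that rank further away from the data point.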
Friedrich Leisch
Martinetz T., Berkovich S., and Schulten K. (1993). "Neural-Gas Network for Vector Quantization and its Application to Time-Series Prediction." IEEE Transactions on Neural Networks, 4 (4), pp. 558–569.
Arthur D. and Vassilvitskii S. (2007). "k-means++: the advantages of careful seeding". Proceedings of the 18th annual ACM-SIAM symposium on Discrete algorithms. pp. 1027-1035.
## have a look at the defaults
new("flexclustControl")
## coerce a list
mycont <- list(iter=500, tol=0.001, class="w")
as(mycont, "flexclustControl")
## some additional slots
as(mycont, "cclustControl")
## default values for ng.rate
new("cclustControl")@ng.rate
Create and access palettes for the plot methods.
flxColors(n=1:8, color=c("full","medium", "light","dark"), grey=FALSE)
flxPalette(n, ...)
| Argument | Description |
|---|---|
| n | Index number of color to return (1 to 8) for |
| color | Type of color, see details. |
| grey | Return grey value corresponding to palette. |
| ... | Passed on to |
This function creates color palettes in HCL space for up to 8 colors. All palettes have constant chroma and luminance; only the hue changes within a palette.
Palettes "full" and "dark" have the same luminance, and
palettes "medium" and "light" have the same luminance.
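The idea of varying only the hue at fixed chroma and luminance can be reproduced with the base function `hcl()`. The chroma and luminance values below are assumptions chosen for illustration; they are not the values used by `flxColors`.

```r
# Illustrative only: the chroma (c) and luminance (l) values are assumptions.
hues <- seq(0, 360, length.out = 9)[-9]      # 8 equally spaced hues
pal_strong <- hcl(h = hues, c = 80, l = 60)  # more saturated, darker palette
pal_soft   <- hcl(h = hues, c = 30, l = 90)  # less saturated, lighter palette

x <- rep(1, 8)
barplot(x, col = pal_strong, main = "constant chroma 80, luminance 60")
barplot(x, col = pal_soft, main = "constant chroma 30, luminance 90")
```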
Friedrich Leisch
opar <- par(c("mfrow", "mar", "yaxt"))
par(mfrow=c(2, 2), mar=c(0, 0, 2, 0), yaxt="n")
x <- rep(1, 8)
barplot(x, col = flxColors(color="full"), main="full")
barplot(x, col = flxColors(color="dark"), main="dark")
barplot(x, col = flxColors(color="medium"), main="medium")
barplot(x, col = flxColors(color="light"), main="light")
par(opar)
Plot a histogram of the similarity of each observation to each cluster.
## S4 method for signature 'kccasimple,missing'
histogram(x, data, xlab="", ...)
## S4 method for signature 'kccasimple,data.frame'
histogram(x, data, xlab="", ...)
## S4 method for signature 'kccasimple,matrix'
histogram(x, data, xlab="Similarity", power=1, ...)
| Argument | Description |
|---|---|
| x | An object of class |
| data | If not missing, the distance and thus similarity between observations and cluster centers is determined for the new data and used for the plots. By default the values from the training data are used. |
| xlab | Label for the x-axis. |
| power | Numeric indicating how similarities are transformed, for more details see Dolnicar et al. (2018). |
| ... | Additional arguments passed to |
Friedrich Leisch
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
Image plot of cluster segments overlaid by neighbourhood graph.
## S4 method for signature 'kcca'
image(x, which = 1:2, npoints = 100, xlab = "", ylab = "", fastcol = TRUE, col=NULL, clwd=0, graph=TRUE, ...)
| Argument | Description |
|---|---|
| x | An object of class |
| which | Index number of dimensions of input space to plot. |
| npoints | Number of grid points for image. |
| fastcol | If |
| col | Vector of background colors for the segments. |
| clwd | Line width of contour lines at cluster boundaries, use larger values for |
| graph | Logical, add a neighborhood graph to the plot? |
| xlab, ylab, ... | Graphical parameters. |
This works only for "kcca" objects, no method is available for
"kccasimple" objects.
Friedrich Leisch
Returns descriptive information about fitted flexclust objects like cluster sizes or sum of within-cluster distances.
## S4 method for signature 'flexclust,character'
info(object, which, drop=TRUE, ...)
| Argument | Description |
|---|---|
| object | Fitted object. |
| which | Which information to get. Use |
| drop | Logical. If |
| ... | Passed to methods. |
Function info can be used to access slots of fitted flexclust
objects in a portable way, and in addition computes some
meta-information like sum of within-cluster distances.
Function infoCheck returns a logical value that is TRUE
if the requested information can be computed from the object.
Friedrich Leisch
data("Nclus")
plot(Nclus)
cl1 <- cclust(Nclus, k=4)
summary(cl1)
## these two are the same
info(cl1)
info(cl1, "help")
## cluster sizes
i1 <- info(cl1, "size")
i1
## average within cluster distances
i2 <- info(cl1, "av_dist")
i2
## the sum of all within-cluster distances
i3 <- info(cl1, "distsum")
i3
## sum(i1*i2) must of course be the same as i3
stopifnot(all.equal(sum(i1*i2), i3))
## This should return TRUE
modeltools::infoCheck(cl1, "size")
## and this FALSE
modeltools::infoCheck(cl1, "Homer Simpson")
## both combined
i4 <- modeltools::infoCheck(cl1, c("size", "Homer Simpson"))
i4
stopifnot(all.equal(i4, c(TRUE, FALSE)))
Perform k-centroids clustering on a data matrix.
kcca(x, k, family=kccaFamily("kmeans"), weights=NULL, group=NULL, control=NULL, simple=FALSE, save.data=FALSE)
kccaFamily(which=NULL, dist=NULL, cent=NULL, name=which, preproc = NULL, genDist=NULL, trim=0, groupFun = "minSumClusters")
## S4 method for signature 'kccasimple'
summary(object)
| Argument | Description |
|---|---|
| x | A numeric matrix of data, or an object that can be coerced to such a matrix using data.matrix. |
| k | Either the number of clusters, or a vector of cluster assignments, or a matrix of initial (distinct) cluster centroids. If a number, a random set of (distinct) rows in |
| family | Object of class |
| weights | An optional vector of weights to be used in the clustering process, cannot be combined with all families. |
| group | An optional grouping vector for the data, see details below. |
| control | An object of class |
| simple | Return an object of class |
| save.data | Save a copy of |
| which | One of |
| name | Optional long name for family, used only for show methods. |
| dist | A function for distance computation, ignored if |
| cent | A function for centroid computation, ignored if |
| preproc | Function for data preprocessing. Defaults to |
| genDist | Function for updating the family object based on |
| trim | A number in between 0 and 0.5, if non-zero then trimmed means are used for the |
| groupFun | Function or name of function to obtain clusters for grouped data, see details below. |
| object | Object of class |
See the paper A Toolbox for K-Centroids Cluster Analysis referenced below for details.
Function kcca returns objects of class "kcca" or
"kccasimple" depending on the value of argument
simple. The simpler objects contain fewer slots and hence are
faster to compute, but contain no auxiliary information used by the
plotting methods. Most plot methods for "kccasimple" objects do
nothing and return a warning. If only centroids, cluster membership or
prediction for new data are of interest, then the simple objects are
sufficient.
Function kccaFamily() currently has the following predefined
families (distance / centroid):
kmeans: Euclidean distance / mean
kmedians: Manhattan distance / median
angle: angle between observation and centroid / standardized mean
jaccard: Jaccard distance / numeric optimization
ejaccard: Jaccard distance / mean
See Leisch (2006) for details on all combinations.
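The distance / centroid pairing above suggests a generic loop that alternates between assigning observations to the nearest centroid and recomputing centroids. The following base R code is a minimal sketch of that idea, assuming a distance function returning an observations-by-centroids matrix and a centroid function applied to each cluster; the names `dist_euclid` and `kcentroids_sketch` are hypothetical, and the code omits weights, groups, tolerance checks and everything else `kcca` handles.

```r
# Hypothetical Euclidean distance: rows = observations, columns = centroids.
dist_euclid <- function(x, centers) {
  z <- matrix(0, nrow(x), nrow(centers))
  for (k in seq_len(nrow(centers))) {
    z[, k] <- sqrt(rowSums(sweep(x, 2, centers[k, ])^2))
  }
  z
}

# Generic k-centroids loop: any (dist, cent) pair can be plugged in.
kcentroids_sketch <- function(x, centers, dist = dist_euclid, cent = colMeans,
                              iter.max = 100) {
  x <- as.matrix(x)
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(iter.max)) {
    # assignment step: index of the closest centroid for each observation
    new_cluster <- max.col(-dist(x, centers), ties.method = "first")
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # update step: recompute each non-empty cluster's centroid
    for (k in seq_len(nrow(centers))) {
      if (any(cluster == k)) {
        centers[k, ] <- cent(x[cluster == k, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
fit <- kcentroids_sketch(x, centers = x[c(1, 4), ])
fit$cluster
fit$centers
```

Swapping in a Manhattan distance and a column-wise median for `dist` and `cent` would give a k-medians-style variant of the same loop.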
If group is not NULL, then observations from the same
group are restricted to belong to the same cluster (must-link
constraint) or different clusters (cannot-link constraint) during the
fitting process. If groupFun = "minSumClusters", then all group
members are
assigned to the cluster where the center has minimal average distance to
the group members. If groupFun = "majorityClusters", then all
group members are assigned to the cluster the majority would belong to
without a constraint.
groupFun = "differentClusters" implements a cannot-link
constraint, i.e., members of one group are not allowed to belong to
the same cluster. The optimal allocation for each group is found by
solving a linear sum assignment problem using
solve_LSAP. Obviously the group sizes must be smaller
than the number of clusters in this case.
Ties are broken at random in all cases.
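The two must-link rules can be illustrated on a plain distance matrix. This is a hypothetical base R sketch, not the package's group functions: the name `assign_groups` and its `how` argument are made up for the example, and ties are resolved by taking the first candidate rather than at random.

```r
# d: distances (rows = observations, columns = cluster centers)
# group: grouping vector; all members of a group end up in the same cluster
assign_groups <- function(d, group, how = c("minSum", "majority")) {
  how <- match.arg(how)
  cluster <- integer(nrow(d))
  for (g in unique(group)) {
    idx <- which(group == g)
    dg <- d[idx, , drop = FALSE]
    if (how == "minSum") {
      # cluster whose center has minimal average distance to the group members
      k <- which.min(colMeans(dg))
    } else {
      # cluster that most members would choose without the constraint
      votes <- tabulate(max.col(-dg, ties.method = "first"), nbins = ncol(d))
      k <- which.max(votes)
    }
    cluster[idx] <- k
  }
  cluster
}

d <- rbind(c(1, 2), c(1, 2), c(10, 1), c(5, 1), c(6, 2))
g <- c("a", "a", "a", "b", "b")
assign_groups(d, g, "minSum")
assign_groups(d, g, "majority")
```

In this toy example group "a" is assigned differently by the two rules: one distant member pulls the average distance towards the second center, while the majority still prefers the first.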
Note that at the moment not all methods for fitted
"kcca" objects respect the grouping information, most
importantly the plot method when a data argument is specified.
Friedrich Leisch
Friedrich Leisch. A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51 (2), 526–544, 2006.
Friedrich Leisch and Bettina Gruen. Extending standard cluster algorithms to allow for group constraints. In Alfredo Rizzi and Maurizio Vichi, editors, Compstat 2006-Proceedings in Computational Statistics, pages 885-892. Physica Verlag, Heidelberg, Germany, 2006.
stepFlexclust, cclust,
distances
data("Nclus")
plot(Nclus)
## try kmeans
cl1 <- kcca(Nclus, k=4)
cl1
image(cl1)
points(Nclus)
## A barplot of the centroids
barplot(cl1)
## now use k-medians and kmeans++ initialization, cluster centroids
## should be similar...
cl2 <- kcca(Nclus, k=4, family=kccaFamily("kmedians"), control=list(initcent="kmeanspp"))
cl2
## ... but the boundaries of the partitions have a different shape
image(cl2)
points(Nclus)
Convert object of class "kcca" to a data frame in long format.
kcca2df(object, data)
| Argument | Description |
|---|---|
| object | Object of class |
| data | Optional data if not saved in |
A data.frame with columns value, variable and
group.
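The long format described above (one row per observation and variable) can be mimicked in base R. The helper `to_long` below is a hypothetical sketch that takes a data matrix and a vector of cluster labels; the exact column types and ordering produced by `kcca2df` may differ.

```r
# Hypothetical helper: stack a data matrix into value / variable / group columns.
to_long <- function(data, cluster) {
  data <- as.matrix(data)
  data.frame(
    value    = as.vector(data),   # column-major: all rows of variable 1, then variable 2, ...
    variable = factor(rep(colnames(data), each = nrow(data)), levels = colnames(data)),
    group    = factor(rep(cluster, times = ncol(data)))
  )
}

# Species codes stand in for cluster memberships in this illustration.
df <- to_long(iris[, -5], as.integer(iris$Species))
head(df)
```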
c.iris <- cclust(iris[,-5], 3, save.data=TRUE)
df.c.iris <- kcca2df(c.iris)
summary(df.c.iris)
densityplot(~value|variable+group, data=df.c.iris)
The data set contains the ingredients of the milk of 25 mammals.
data(milk)
A data frame with 25 observations on the following 5 variables (all in percent).
water: Water.
protein: Protein.
fat: Fat.
lactose: Lactose.
ash: Ash.
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
A simple artificial regression example with 4 clusters, all of them having a Gaussian distribution.
data(Nclus)
The Nclus data set can be re-created by loading package
flexmix and running ExNclus(100)
using set.seed(2602). It has been saved as a data set for
simplicity of examples only.
data(Nclus)
cl <- cclust(Nclus, k=4, simple=FALSE, save.data=TRUE)
plot(cl)
The data set contains the measurements of nutrients in several types of meat, fish and fowl.
data(nutrient)
A data frame with 27 observations on the following 5 variables.
energy: Food energy (calories).
protein: Protein (grams).
fat: Fat (grams).
calcium: Calcium (milligrams).
iron: Iron (milligrams).
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
Plot a matrix of neighbourhood graphs.
## S4 method for signature 'kcca'
pairs(x, which=NULL, project=NULL, oma=NULL, ...)
| Argument | Description |
|---|---|
| x | An object of class |
| which | Index numbers of dimensions of (projected) input space to plot, default is to plot all dimensions. |
| project | Projection object for which a |
| oma | Outer margin. |
| ... | Passed to the |
This works only for "kcca" objects, no method is available for
"kccasimple" objects.
Friedrich Leisch
Returns the matrix of centroids of a fitted object of class "kcca".
## S4 method for signature 'kccasimple'
parameters(object, ...)
| Argument | Description |
|---|---|
| object | Fitted object. |
| ... | Currently not used. |
Friedrich Leisch
Plot the neighbourhood graph of a cluster solution together with projected data points.
## S4 method for signature 'kcca,missing'
plot(x, y, which=1:2, project=NULL, data=NULL, points=TRUE, hull=TRUE, hull.args=NULL, number = TRUE, simlines=TRUE, lwd=1, maxlwd=8*lwd, cex=1.5, numcol=FALSE, nodes=16, add=FALSE, xlab="", ylab="", xlim = NULL, ylim = NULL, pch=NULL, col=NULL, ...)
| Argument | Description |
|---|---|
| x | An object of class |
| y | Not used |
| which | Index numbers of dimensions of (projected) input space to plot. |
| project | Projection object for which a |
| data | Data to include in plot. If the cluster object |
| points | Logical, shall data points be plotted (if available)? |
| hull | If |
| hull.args | A list of arguments for the hull function. |
| number | Logical, plot number labels in nodes of graph? |
| numcol, cex | Color and size of number labels in nodes of graph. If |
| nodes | Plotting symbol to use for nodes if no numbers are drawn. |
| simlines | Logical, plot edges of graph? |
| lwd, maxlwd | Numerical, thickness of lines. |
| add | Logical, add to existing plot? |
| xlab, ylab | Axis labels. |
| xlim, ylim | Axis range. |
| pch, col, ... | Plotting symbols and colors for data points. |
This works only for "kcca" objects, no method is available for
"kccasimple" objects.
Friedrich Leisch
Friedrich Leisch. Visualizing cluster analysis and finite mixture models. In Chun houh Chen, Wolfgang Haerdle, and Antony Unwin, editors, Handbook of Data Visualization, Springer Handbooks of Computational Statistics. Springer Verlag, 2008.
Return either the cluster membership of training data or predict for new data.
## S4 method for signature 'kccasimple'
predict(object, newdata, ...)
## S4 method for signature 'flexclust,ANY'
clusters(object, newdata, ...)
| Argument | Description |
|---|---|
| object | Object of class inheriting from |
| newdata | An optional data matrix with the same number of columns as the cluster centers. If omitted, the fitted values are used. |
| ... | Currently not used. |
clusters can be used on any object of class "flexclust"
and returns the cluster memberships of the training data.
predict can be used only on objects of class "kcca"
(which inherit from "flexclust"). If no newdata argument
is specified, the function is identical to clusters, if
newdata is specified, then cluster memberships for the new data
are predicted. clusters(object, newdata, ...) is an alias for
predict(object, newdata, ...).
Friedrich Leisch
Simple artificial 2-dimensional data to demonstrate clustering for market segmentation. One dimension is the hypothetical feature sophistication (or performance or quality, etc) of a product, the second dimension the price customers are willing to pay for the product.
priceFeature(n, which=c("2clust", "3clust", "3clustold", "5clust", "ellipse", "triangle", "circle", "square", "largesmall"))
| Argument | Description |
|---|---|
| n | Sample size. |
| which | Shape of data set. |
Sara Dolnicar and Friedrich Leisch. Evaluation of structure and reproducibility of cluster solutions using the bootstrap. Marketing Letters, 21:83-101, 2010.
plot(priceFeature(200, "2clust"))
plot(priceFeature(200, "3clust"))
plot(priceFeature(200, "3clustold"))
plot(priceFeature(200, "5clust"))
plot(priceFeature(200, "ell"))
plot(priceFeature(200, "tri"))
plot(priceFeature(200, "circ"))
plot(priceFeature(200, "square"))
plot(priceFeature(200, "largesmall"))
Adds arrows for original coordinate axes to a projection plot.
projAxes(object, which=1:2, center=NULL, col="red", radius=NULL, minradius=0.1, textargs=list(col=col), col.names=getColnames(object), which.names="", group = NULL, groupFun = colMeans, plot=TRUE, ...)
placeLabels(object)
## S4 method for signature 'projAxes'
placeLabels(object)
| Argument | Description |
|---|---|
| object | Return value of a projection method like |
| which | Index number of dimensions of (projected) input space that have been plotted. |
| center | Center of the coordinate system to use in projected space. Default is the center of the plotting region. |
| col | Color of arrows. |
| radius | Relative size of the arrows. |
| minradius | Minimum radius of arrows to include (relative to arrow size). |
| textargs | List of arguments for |
| col.names | Variable names of the original data. |
| which.names | A regular expression which variable names to include in the plot. |
| group | An optional grouping variable for the original coordinates. Coordinates with group |
| groupFun | Function used to aggregate the projected coordinates if |
| plot | Logical, if |
| ... | Passed to |
projAxes invisibly returns an object of class
"projAxes", which can be
added to an existing plot by its plot
method.
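The geometry behind such arrows can be sketched with base R alone: for a principal component projection, row j of the rotation matrix gives the direction of the j-th original (scaled) variable in the projected space. The code below is an illustration on the `iris` data and does not reproduce the scaling, centering or label placement that `projAxes` performs.

```r
pca <- prcomp(iris[, -5], scale. = TRUE)
scores <- predict(pca)[, 1:2]

# Row j: image of the unit vector of original variable j in the PC1/PC2 plane.
axes <- pca$rotation[, 1:2]

plot(scores, col = "grey", xlab = "PC1", ylab = "PC2")
arrows(0, 0, axes[, 1], axes[, 2], col = "red", length = 0.1)
text(axes, labels = rownames(axes), col = "red", pos = 3)
```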
Friedrich Leisch
data(milk)
milk.pca <- prcomp(milk, scale=TRUE)
## create a biplot step by step
plot(predict(milk.pca), type="n")
text(predict(milk.pca), rownames(milk), col="green", cex=0.8)
projAxes(milk.pca)
## the same, but arrows are blue, centered at origin and all arrows are
## plotted
plot(predict(milk.pca), type="n")
text(predict(milk.pca), rownames(milk), col="green", cex=0.8)
projAxes(milk.pca, col="blue", center=0, minradius=0)
## use points instead of text, plot PC2 and PC3, manual radius
## specification, store result
plot(predict(milk.pca)[,c(2,3)])
arr <- projAxes(milk.pca, which=c(2,3), radius=1.2, plot=FALSE)
plot(arr)
## Not run:
## manually try to find new places for the labels: each arrow is marked
## active in turn, use the left mouse button to find a better location
## for the label. Use the right mouse button to go on to the next
## variable.
arr1 <- placeLabels(arr)
## now do the plot again:
plot(predict(milk.pca)[,c(2,3)])
plot(arr1)
## End(Not run)
Split a binary or numeric matrix by a grouping variable, run a series of tests on all variables, adjust for multiple testing and graphically represent results.
propBarchart(x, g, alpha=0.05, correct="holm", test="prop.test", sort=FALSE, strip.prefix="", strip.labels=NULL, which=NULL, byvar=FALSE, ...)
## S4 method for signature 'propBarchart'
summary(object, ...)
groupBWplot(x, g, alpha=0.05, correct="holm", xlab="", col=NULL, shade=!is.null(shadefun), shadefun=NULL, strip.prefix="", strip.labels=NULL, which=NULL, byvar=FALSE, ...)
| Argument | Description |
|---|---|
| x | A binary data matrix. |
| g | A factor specifying the groups. |
| alpha | Significance level for test of differences in proportions. |
| correct | Correction method for multiple testing, passed to |
| test | Test to use for detecting significant differences in proportions. |
| sort | Logical, sort variables by total sample mean? |
| strip.prefix | Character string prepended to strips of the |
| strip.labels | Character vector of labels to use for strips of |
| which | Index numbers or names of variables to plot. |
| byvar | If |
| ... | |
| object | Return value of |
| xlab | A title for the x-axis: see |
| col | Vector of colors for the panels. |
| shade | If |
| shadefun | A function or name of a function to compute which boxes are shaded, e.g. |
Function propBarchart splits a binary data matrix into
subgroups, computes the percentage of ones in each column and compares
the proportions in the groups using prop.test. The
p-values for all variables are adjusted for multiple testing and a
barchart of group percentages is drawn highlighting variables with
significant differences in proportion. The summary method can
be used to create a corresponding table for publications.
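The numerical part of this procedure (group percentages, one test per variable, adjustment for multiple testing) can be traced in base R. The sketch below uses `prop.test` and `p.adjust` directly on a binarized version of the `iris` data; it is an illustration of the steps, and the variable names are chosen for the example.

```r
x <- sapply(iris[, -5], function(z) z > median(z))   # binary (logical) matrix
g <- iris$Species

# percentage of ones per group (rows) and variable (columns)
perc <- 100 * apply(x, 2, function(col) tapply(col, g, mean))

# test equality of proportions across groups, separately for each variable
n <- as.vector(table(g))
pval <- apply(x, 2, function(col) {
  prop.test(as.vector(tapply(col, g, sum)), n)$p.value
})

# adjust for multiple testing and flag variables with significant differences
padj <- p.adjust(pval, method = "holm")
signif_vars <- names(padj)[padj < 0.05]

round(perc, 1)
padj
```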
Function groupBWplot takes a general numeric matrix, also
splits it into subgroups and uses boxes instead of bars. By default
kruskal.test is used to compute significant differences
in location; in addition the heuristics from
bwplot,kcca-method can be used. Boxes of the complete sample
are used as reference in the background.
Friedrich Leisch
barplot-methods,
bwplot,kcca-method
## create a binary matrix from the iris data plus a random noise column
x <- apply(iris[,-5], 2, function(z) z>median(z))
x <- cbind(x, Noise=sample(0:1, 150, replace=TRUE))
## There are significant differences in all 4 original variables, Noise
## has most likely no significant difference (of course the difference
## will be significant in alpha percent of all random samples).
p <- propBarchart(x, iris$Species)
p
summary(p)
propBarchart(x, iris$Species, byvar=TRUE)
x <- iris[,-5]
x <- cbind(x, Noise=rnorm(150, mean=3))
groupBWplot(x, iris$Species)
groupBWplot(x, iris$Species, shade=TRUE)
groupBWplot(x, iris$Species, shadefun="medianInside")
groupBWplot(x, iris$Species, shade=TRUE, byvar=TRUE)
Perform stochastic QT clustering on a data matrix.
qtclust(x, radius, family = kccaFamily("kmeans"), control = NULL, save.data=FALSE, kcca=FALSE)
| Argument | Description |
|---|---|
| x | A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns). |
| radius | Maximum radius of clusters. |
| family | Object of class |
| control | An object of class |
| save.data | Save a copy of |
| kcca | Run |
This function implements a variation of the QT clustering algorithm by
Heyer et al. (1999), see Scharl and Leisch (2006). The main difference
is that in each iteration not
all possible cluster start points are considered, but only a random
sample of size control@ntry. We also consider only points as initial
centers where at least one other point is within a circle with radius
radius. In most cases the resulting
solutions are almost
the same at a considerable speed increase; in some cases even better
solutions are obtained than with the original algorithm. If
control@ntry is set to the size of the data set, an algorithm
similar to the original algorithm as proposed by Heyer et al. (1999)
is obtained.
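The candidate-sampling idea can be sketched in a few lines of base R. This is a deliberately simplified, hypothetical illustration (the function `qt_sketch` and its arguments are made up): in each round it draws up to `ntry` unclustered points as candidate centers, keeps the candidate with the most unclustered points within `radius`, and stops as soon as the sampled candidates yield no cluster of at least `min.size` points. It uses Euclidean distance only and is not the algorithm as implemented in `qtclust`.

```r
qt_sketch <- function(x, radius, ntry = 5, min.size = 2) {
  x <- as.matrix(x)
  cluster <- rep(NA_integer_, nrow(x))   # NA = not (yet) clustered
  k <- 0L
  repeat {
    free <- which(is.na(cluster))
    if (length(free) < min.size) break
    # random sample of candidate centers among the unclustered points
    cand <- free[sample.int(length(free), min(ntry, length(free)))]
    best <- integer(0)
    for (i in cand) {
      d <- sqrt(rowSums(sweep(x[free, , drop = FALSE], 2, x[i, ])^2))
      members <- free[d <= radius]
      if (length(members) > length(best)) best <- members
    }
    # a candidate needs at least one other point within the radius
    if (length(best) < min.size) break
    k <- k + 1L
    cluster[best] <- k
  }
  cluster
}

x <- rbind(c(0, 0), c(0, 0.1), c(0.1, 0), c(5, 5), c(5, 5.1))
cl <- qt_sketch(x, radius = 0.5, ntry = 5)
cl
```

With `ntry` equal to the number of rows every unclustered point is tried in each round, which corresponds to the exhaustive variant mentioned above; smaller values trade some solution quality for speed.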
Function qtclust by default returns objects of class
"kccasimple". If argument kcca is TRUE, function
kcca() is run afterwards (initialized on the QT cluster
solution). Data points
not clustered by the QT cluster algorithm are omitted from the
kcca() iterations, but filled back into the return
object. All plot methods defined for objects of class "kcca"
can be used.
Friedrich Leisch
Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research 9, 1106–1115.
Theresa Scharl and Friedrich Leisch. The stochastic QT-clust algorithm: evaluation of stability and variance on time-course microarray data. In Alfredo Rizzi and Maurizio Vichi, editors, Compstat 2006 – Proceedings in Computational Statistics, pages 1015-1022. Physica Verlag, Heidelberg, Germany, 2006.
x <- matrix(10*runif(1000), ncol=2)
## maximum distance of point to cluster center is 3
cl1 <- qtclust(x, radius=3)
## maximum distance of point to cluster center is 1
## -> more clusters, longer runtime
cl2 <- qtclust(x, radius=1)
opar <- par(c("mfrow","mar"))
par(mfrow=c(2,1), mar=c(2.1,2.1,1,1))
plot(x, col=predict(cl1), xlab="", ylab="")
plot(x, col=predict(cl2), xlab="", ylab="")
par(opar)
Compute the (adjusted) Rand, Jaccard and Fowlkes-Mallows index for agreement of two partitions.
comPart(x, y, type=c("ARI","RI","J","FM"))
## S4 method for signature 'flexclust,flexclust'
comPart(x, y, type)
## S4 method for signature 'numeric,numeric'
comPart(x, y, type)
## S4 method for signature 'flexclust,numeric'
comPart(x, y, type)
## S4 method for signature 'numeric,flexclust'
comPart(x, y, type)
randIndex(x, y, correct=TRUE, original=!correct)
## S4 method for signature 'table,missing'
randIndex(x, y, correct=TRUE, original=!correct)
## S4 method for signature 'ANY,ANY'
randIndex(x, y, correct=TRUE, original=!correct)
| Argument | Description |
|---|---|
| x | Either a 2-dimensional cross-tabulation of cluster assignments (for |
| y | An object inheriting from class |
| type | character vector of abbreviations of indices to compute. |
| correct, original | Logical, correct the Rand index for agreement by chance? |
A vector of indices.
Let $A$ denote the number of all pairs of data
points which are either put into the same cluster by both partitions or
put into different clusters by both partitions. Conversely, let $D$
denote the number of all pairs of data points that are put into one
cluster in one partition, but into different clusters by the other
partition. The partitions disagree for all pairs $D$ and
agree for all pairs $A$. We can measure the agreement by the Rand
index $A/(A+D)$, which is invariant with respect to permutations of
cluster labels.
The index has to be corrected for agreement by chance if the sizes of the clusters are not uniform (which is usually the case), or if there are many clusters, see Hubert & Arabie (1985) for details.
If the number of clusters is very large, then usually the vast
majority of pairs of points will not be in the same cluster. The
Jaccard index tries to account for this by using only pairs of points
that are in the same cluster in the definition of $A$.
Let again $A$ be the pairs of points that
are in the same cluster in both partitions. Fowlkes-Mallows divides
this number by the geometric mean of the sums of the number of pairs in each
cluster of the two partitions. This gives the probability that a pair
of points which are in the same cluster in one partition are also in the
same cluster in the other partition.
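The pair counts described above can be reproduced with base R. The following is a minimal sketch, not the code used by comPart or randIndex; the helper name pair_indices is hypothetical, and the adjusted Rand index follows the usual Hubert & Arabie form. Here A counts the pairs on which both partitions agree and D the pairs on which they disagree.

```r
## Sketch only: pair-counting agreement indices from two label vectors.
## pair_indices() is a hypothetical helper, not part of flexclust.
pair_indices <- function(g1, g2) {
  tab <- table(g1, g2)
  n_pairs <- function(m) sum(m * (m - 1) / 2)  # pairs within each count
  total <- n_pairs(sum(tab))                   # all pairs of data points
  both  <- n_pairs(tab)                        # same cluster in both partitions
  rows  <- n_pairs(rowSums(tab))               # same cluster in partition 1
  cols  <- n_pairs(colSums(tab))               # same cluster in partition 2
  D <- rows + cols - 2 * both                  # pairs where partitions disagree
  A <- total - D                               # pairs where partitions agree
  expected <- rows * cols / total              # chance level of 'both'
  c(RI  = A / (A + D),
    ARI = (both - expected) / ((rows + cols) / 2 - expected),
    J   = both / (both + D),
    FM  = both / sqrt(rows * cols))
}

set.seed(1)
g1 <- sample(1:5, size = 1000, replace = TRUE)
g2 <- sample(1:5, size = 1000, replace = TRUE)
pair_indices(g1, g2)
```

For two unrelated random labelings the uncorrected Rand index should be clearly above zero while the corrected version should be close to zero.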
Friedrich Leisch
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2, 193–218, 1985.
Marina Meila. Comparing clusterings - an axiomatic view. In Stefan Wrobel and Luc De Raedt, editors, Proceedings of the International Machine Learning Conference (ICML). ACM Press, 2005.
## no class correlations: corrected Rand almost zero
g1 <- sample(1:5, size=1000, replace=TRUE)
g2 <- sample(1:5, size=1000, replace=TRUE)
tab <- table(g1, g2)
randIndex(tab)
## uncorrected version will be large, because there are many points
## which are assigned to different clusters in both cases
randIndex(tab, correct=FALSE)
comPart(g1, g2)
## let pairs (g1=1,g2=1) and (g1=3,g2=3) agree better
k <- sample(1:1000, size=200)
g1[k] <- 1
g2[k] <- 1
k <- sample(1:1000, size=200)
g1[k] <- 3
g2[k] <- 3
tab <- table(g1, g2)
## the index should be larger than before
randIndex(tab, correct=TRUE, original=TRUE)
comPart(g1, g2)
Create a series of projection plots corresponding to a random tour through the data.
randomTour(object, ...)
## S4 method for signature 'ANY'
randomTour(object, ...)
## S4 method for signature 'matrix'
randomTour(object, ...)
## S4 method for signature 'flexclust'
randomTour(object, data=NULL, col=NULL, ...)
randomTourMatrix(x, directions=10, steps=100, sec=4, sleep = sec/steps,
                 axiscol=2, axislab=colnames(x), center=NULL,
                 radius=1, minradius=0.01, asp=1, ...)
object, x
|
A matrix or an object of class |
data |
Data to include in plot. |
col |
Plotting colors for data points. |
directions |
Integer value, how many different directions are toured. |
steps |
Integer, number of steps in each direction. |
sec |
Numerical, lower bound for the number of seconds each direction takes. |
sleep |
Numerical, sleep for as many seconds after each picture has been plotted. |
axiscol |
If not |
axislab |
Optional labels for the projected axes. |
center |
Center of the coordinate system to use in projected space. Default is the center of the plotting region. |
radius |
Relative size of the arrows. |
minradius |
Minimum radius of arrows to include. |
asp, ...
|
Passed on to |
Two random locations are chosen, and the data are then projected onto
hyperplanes which are orthogonal to step vectors interpolating
the two locations. The first two coordinates of the projected data are
plotted. If argument directions is larger than one, then after the
number of plots given by argument steps one more random location is
chosen, and the procedure is repeated from the current position to the
new location, and so on.
The whole procedure is similar to a grand tour, but no attempt is made
to optimize subsequent directions; randomTour simply chooses a random
direction in each iteration. Use rggobi for the real thing.
The function needs a reasonably fast computer and graphics
device to give a smooth impression; for x11 it may be
necessary to use type="Xlib" rather than cairo.
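The projection step described above can be sketched in base R. This is only an illustration under assumptions, not the code used by randomTour: the helper name tour_projections is hypothetical, the two random locations are drawn from a standard normal distribution, and the basis of each hyperplane is taken from a QR decomposition, so the choice of the "first two coordinates" is arbitrary.

```r
## Sketch only: not the code used by randomTour().
## tour_projections() is a hypothetical helper; needs at least 3 columns.
tour_projections <- function(x, steps = 10) {
  x <- as.matrix(x)
  p <- ncol(x)
  from <- rnorm(p)                      # first random location
  to   <- rnorm(p)                      # second random location
  lapply(seq(0, 1, length.out = steps), function(t) {
    v <- (1 - t) * from + t * to        # step vector between the locations
    ## full orthonormal basis whose first column is parallel to v
    q <- qr.Q(qr(matrix(v, ncol = 1)), complete = TRUE)
    basis <- q[, -1, drop = FALSE]      # spans the hyperplane orthogonal to v
    (x %*% basis)[, 1:2, drop = FALSE]  # first two projected coordinates
  })
}

set.seed(1)
proj <- tour_projections(iris[, 1:4], steps = 5)
## each element is a 2-column matrix that could be passed to plot()
plot(proj[[1]], col = as.numeric(iris$Species))
```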
Friedrich Leisch
if(interactive()){
  par(ask=FALSE)
  randomTour(iris[,1:4], axiscol=2:5)
  randomTour(iris[,1:4], col=as.numeric(iris$Species), axiscol=4)
  x <- matrix(runif(300), ncol=3)
  x <- rbind(x, x+1, x+2)
  cl <- cclust(x, k=3, save.data=TRUE)
  randomTour(cl, center=0, axiscol="black")
  ## now use predicted cluster membership for new data as colors
  randomTour(cl, center=0, axiscol="black",
             data=matrix(rnorm(3000, mean=1, sd=2), ncol=3))
}
The clusters are relabelled to obtain a unique labeling.
relabel(object, by, ...)
## S4 method for signature 'kccasimple,character'
relabel(object, by, which = NULL, ...)
## S4 method for signature 'kccasimple,integer'
relabel(object, by, ...)
## S4 method for signature 'kccasimple,missing'
relabel(object, by, ...)
## S4 method for signature 'stepFlexclust,integer'
relabel(object, by = "series", ...)
## S4 method for signature 'stepFlexclust,missing'
relabel(object, by, ...)
object |
An object of class |
by |
If a character vector, it needs to be one of |
which |
Either an integer vector indicating the ordering or a vector of length one indicating the variable used for ordering. |
... |
Currently not used. |
If by is a character vector with value "mean" or
"median", the clusters are ordered by the mean or median values
over all variables for each cluster. If by = "manual", the argument
which needs to be a vector indicating the ordering. If
by = "variable", the argument which needs to indicate the variable
which is used to determine the ordering. If by is
"centers", "shadow" or "symmshadow", cluster
similarities are calculated using clusterSim and used to
determine an ordering using seriate from package
seriation.
If by = "series", the relabeling is performed over a series of
clusterings to minimize the misclassification.
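The ordering by mean can be illustrated with base R. This is a minimal sketch using kmeans rather than a flexclust object, and it assumes that clusters are numbered in increasing order of their mean over all variables; the direction used by relabel itself is not stated above.

```r
## Sketch only: relabel k-means clusters by their mean over all variables.
## Increasing order is an assumption made for this illustration.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 6), ncol = 2),
           matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
old <- kmeans(x, centers = 3)$cluster

## mean over all variables for each cluster (all rows have the same
## number of variables, so averaging the row means is equivalent)
cluster_means <- as.vector(tapply(rowMeans(x), old, mean))

new_order <- order(cluster_means)  # old labels sorted by cluster mean
new <- match(old, new_order)       # new label = rank of the old cluster

tapply(rowMeans(x), new, mean)     # now increasing in the cluster label
```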
Friedrich Leisch
Compute and plot shadows and silhouettes.
## S4 method for signature 'kccasimple'
shadow(object, ...)
## S4 method for signature 'kcca'
Silhouette(object, data=NULL, ...)
object |
An object of class |
data |
Data to compute silhouette values for. If the cluster
|
... |
Currently not used. |
The shadow value of each data point is defined as twice the distance to the closest centroid divided by the sum of distances to closest and second-closest centroid. If the shadow value of a point is close to 0, then the point is close to its cluster centroid. If the shadow value is close to 1, it is almost equidistant to the two centroids. Thus, a cluster that is well separated from all other clusters should have many points with small shadow values.
The silhouette value of a data point is defined as the scaled difference between the average dissimilarity of a point to all points in its own cluster and the smallest average dissimilarity to the points of a different cluster. Large silhouette values indicate good separation.
The main difference between silhouette values and shadow values is that we replace average dissimilarities to points in a cluster by dissimilarities to point averages (=centroids). See Leisch (2009) for details.
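The shadow value definition can be written out in a few lines of base R. This is a sketch under assumptions (k-means centroids and Euclidean distance), not the shadow method of flexclust.

```r
## Sketch only: shadow values from k-means centroids, Euclidean distance assumed.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
k <- 3
centers <- kmeans(x, centers = k)$centers

## distance of every data point (rows) to every centroid (columns)
d <- as.matrix(dist(rbind(centers, x)))[-seq_len(k), seq_len(k)]

## distances to the closest and second-closest centroid for each point
d_sorted <- t(apply(d, 1, sort))
shadow <- 2 * d_sorted[, 1] / (d_sorted[, 1] + d_sorted[, 2])

## mean shadow value per cluster: small values suggest good separation
closest <- apply(d, 1, which.min)
tapply(shadow, closest, mean)
```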
Friedrich Leisch
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 2009. Accepted for publication on 2009-06-16.
data(Nclus)
set.seed(1)
c5 <- cclust(Nclus, 5, save.data=TRUE)
c5
plot(c5)
## high shadow values indicate clusters with *bad* separation
shadow(c5)
plot(shadow(c5))
## high Silhouette values indicate clusters with *good* separation
Silhouette(c5)
plot(Silhouette(c5))
Shadow star plots and corresponding panel functions.
shadowStars(object, which=1:2, project=NULL, width=1, varwidth=FALSE,
            panel=panelShadowStripes, box=NULL, col=NULL, add=FALSE, ...)
panelShadowStripes(x, col, ...)
panelShadowViolin(x, ...)
panelShadowBP(x, ...)
panelShadowSkeleton(x, ...)
object |
An object of class |
which |
Index numbers of dimensions of (projected) input space to plot. |
project |
Projection object for which a |
width |
Width of vertices connecting the cluster centroids. |
varwidth |
Logical, shall all vertices have the same width or should the width be proportional to number of points shown on the vertex? |
panel |
Function used to draw vertices. |
box |
Color of rectangle drawn around each vertex. |
col |
A vector of colors for the clusters. |
add |
Logical, start a new plot? |
... |
Passed on to panel function. |
x |
Shadow values of data points corresponding to the vertex. |
The shadow value of each data point is defined as twice the distance to the closest centroid divided by the sum of distances to closest and second-closest centroid. If the shadow value of a point is close to 0, then the point is close to its cluster centroid. If the shadow value is close to 1, it is almost equidistant to the two centroids. Thus, a cluster that is well separated from all other clusters should have many points with small shadow values.
The neighborhood graph of a cluster solution connects two centroids by a vertex if at least one data point has the two centroids as closest and second closest. The width of the vertex is proportional to the sum of shadow values of all points having these two as closest and second closest. A shadow star depicts the distribution of shadow values on the vertex, see Leisch (2009) for details.
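The connections of the neighborhood graph and their widths can be sketched with base R. The code below assumes k-means centroids and Euclidean distance and is not the implementation behind shadowStars; the matrix name edges is chosen for this illustration only.

```r
## Sketch only: sums of shadow values for each (closest, second-closest) pair.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
k <- 3
centers <- kmeans(x, centers = k)$centers

## distance of every data point (rows) to every centroid (columns)
d <- as.matrix(dist(rbind(centers, x)))[-seq_len(k), seq_len(k)]

ord    <- t(apply(d, 1, order))   # centroid indices sorted by distance
first  <- ord[, 1]                # closest centroid
second <- ord[, 2]                # second-closest centroid
idx <- seq_len(nrow(d))
shadow <- 2 * d[cbind(idx, first)] /
  (d[cbind(idx, first)] + d[cbind(idx, second)])

## entry [i, j]: sum of shadow values of points with closest centroid i and
## second-closest centroid j; NA means no such point, i.e. no connection
edges <- tapply(shadow,
                list(closest = factor(first, levels = 1:k),
                     second  = factor(second, levels = 1:k)),
                sum)
edges
```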
Currently four panel functions are available:
panelShadowStripes: line segment for each shadow value.
panelShadowViolin: violin plot of shadow values.
panelShadowBP: box-percentile plot of shadow values.
panelShadowSkeleton: average shadow value.
Friedrich Leisch
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 2009. Accepted for publication on 2009-06-16.
data(Nclus)
set.seed(1)
c5 <- cclust(Nclus, 5, save.data=TRUE)
c5
plot(c5)
shadowStars(c5)
shadowStars(c5, varwidth=TRUE)
shadowStars(c5, panel=panelShadowViolin)
shadowStars(c5, panel=panelShadowBP)
## always use varwidth=TRUE with panelShadowSkeleton, otherwise a few
## large shadow values can lead to misleading results:
shadowStars(c5, panel=panelShadowSkeleton)
shadowStars(c5, panel=panelShadowSkeleton, varwidth=TRUE)
Create a segment level stability across solutions plot, possibly using an additional variable for coloring the nodes.
slsaplot(object, nodecol = NULL, ...)
object |
An object returned by |
nodecol |
A numeric vector of length equal to the number of
observations clustered in |
... |
Additional graphical parameters to modify the plot. |
For more details see Dolnicar and Leisch (2017) and Dolnicar et al. (2018).
List of length equal to the number of different cluster solutions minus one containing numeric vectors of the entropy values used by default to color the nodes.
Friedrich Leisch
Dolnicar S. and Leisch F. (2017) "Using Segment Level Stability to Select Target Segments in Data-Driven Market Segmentation Studies" Marketing Letters, 28 (3), pp. 423–436.
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
stepFlexclust, relabel, slswFlexclust
data("Nclus")
cl25 <- stepFlexclust(Nclus, k=2:5)
slsaplot(cl25)
cl25 <- relabel(cl25)
slsaplot(cl25)
Assess segment level stability within solution.
slswFlexclust(x, object, ...)
## S4 method for signature 'resampleFlexclust,missing'
plot(x, y, ...)
## S4 method for signature 'resampleFlexclust'
boxplot(x, which=1, ylab=NULL, ...)
## S4 method for signature 'resampleFlexclust'
densityplot(x, data, which=1, ...)
## S4 method for signature 'resampleFlexclust'
summary(object)
x |
A numeric matrix of data, or an object that can be coerced to
such a matrix (such as a numeric vector or a data frame with all
numeric columns) passed to |
object |
Object of class |
y |
Missing. |
which |
Integer or character indicating which validation measure is used for plotting. |
ylab |
Axis label. |
data |
Not used. |
... |
Additional arguments; for details see below. |
Additional arguments in slswFlexclust are: nsamp (default 100),
the number of bootstrap pairs drawn; seed, which allows setting a
random seed; multicore (default TRUE), which indicates whether
bootstrap samples should be drawn in parallel; and verbose
(default FALSE), which, if TRUE, shows progress information during
computations.
There are plotting as well as printing and summary methods implemented
for objects of class "resampleFlexclust". In addition to a
standard plot method also methods for densityplot and
boxplot are provided.
For more details see Dolnicar and Leisch (2017) and Dolnicar et al. (2018).
An object of class "resampleFlexclust".
Friedrich Leisch
Dolnicar S. and Leisch F. (2017) "Using Segment Level Stability to Select Target Segments in Data-Driven Market Segmentation Studies" Marketing Letters, 28 (3), pp. 423–436.
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
data("Nclus")
cl3 <- kcca(Nclus, k = 3)
slsw.cl3 <- slswFlexclust(Nclus, cl3, nsamp = 20)
plot(Nclus, col = clusters(cl3))
plot(slsw.cl3)
densityplot(slsw.cl3)
boxplot(slsw.cl3)
Runs clustering algorithms repeatedly for different numbers of clusters and returns the minimum within cluster distance solution for each.
stepFlexclust(x, k, nrep=3, verbose=TRUE, FUN = kcca, drop=TRUE,
              group=NULL, simple=FALSE, save.data=FALSE, seed=NULL,
              multicore=TRUE, ...)
stepcclust(...)
## S4 method for signature 'stepFlexclust,missing'
plot(x, y, type=c("barplot", "lines"), totaldist=NULL,
     xlab=NULL, ylab=NULL, ...)
## S4 method for signature 'stepFlexclust'
getModel(object, which=1)
x, ...
|
|
k |
A vector of integers passed in turn to the |
nrep |
For each value of |
FUN |
Cluster function to use, typically |
verbose |
If |
drop |
If |
group |
An optional grouping vector for the data, see
|
simple |
Return an object of class |
save.data |
Save a copy of |
seed |
If not |
multicore |
If |
y |
Not used. |
type |
Create a barplot or lines plot. |
totaldist |
Include value for 1-cluster solution in plot? Default
is |
xlab, ylab
|
Graphical parameters. |
object |
Object of class |
which |
Number of model to get. If character, interpreted as number of clusters. |
stepcclust is a simple wrapper for
stepFlexclust(...,FUN=cclust).
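The idea of running a cluster algorithm repeatedly for each number of clusters and keeping the best run can be sketched with base R. The code below is only an illustration: it uses kmeans and its total within-cluster sum of squares as the criterion, whereas stepFlexclust works with the cluster function given in FUN. The helper name best_of is hypothetical.

```r
## Sketch only: keep the best of nrep k-means runs for each k.
## best_of() is a hypothetical helper, not part of flexclust.
best_of <- function(x, k, nrep = 3) {
  runs <- lapply(seq_len(nrep), function(i) kmeans(x, centers = k))
  crit <- sapply(runs, function(r) r$tot.withinss)
  runs[[which.min(crit)]]          # run with the smallest criterion value
}

set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))
ks <- 2:5
fits <- lapply(ks, function(k) best_of(x, k))
names(fits) <- ks

## criterion value of the retained solution for each number of clusters
sapply(fits, function(f) f$tot.withinss)
```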
Friedrich Leisch
data("Nclus")
plot(Nclus)
## multicore off for CRAN checks
cl1 <- stepFlexclust(Nclus, k=2:7, FUN=cclust, multicore=FALSE)
cl1
plot(cl1)
# two ways to do the same:
getModel(cl1, 4)
cl1[[4]]
## set a 2x2 layout and keep the previous setting so it can be restored
opar <- par(mfrow=c(2, 2))
for(k in 3:6){
  image(getModel(cl1, as.character(k)), data=Nclus)
  title(main=paste(k, "clusters"))
}
par(opar)
Plot distance of data points to cluster centroids using stripes.
stripes(object, groups=NULL, type=c("first", "second", "all"),
        beside=(type!="first"), col=NULL, gp.line=NULL, gp.bar=NULL,
        gp.bar2=NULL, number=TRUE, legend=!is.null(groups),
        ylim=NULL, ylab="distance from centroid", margins=c(2,5,3,2),
        ...)
object |
An object of class |
groups |
Grouping variable to color-code the stripes. By default
cluster membership is used as |
type |
Plot distance to closest, closest and second-closest or to all centroids? |
beside |
Logical, make different stripes for different clusters? |
col |
Vector of colors for clusters or groups. |
gp.line, gp.bar, gp.bar2
|
Graphical parameters for horizontal
lines and background rectangular areas, see
|
number |
Logical, write cluster numbers on x-axis? |
legend |
Logical, plot a legend for the groups? |
ylim, ylab
|
Graphical parameters for y-axis. |
margins |
Margin of the plot. |
... |
Further graphical parameters. |
A simple, yet very effective plot for visualizing the distance of each
point from its closest and second-closest cluster centroids is a
stripes plot. For each of the k clusters we have a rectangular area,
which we optionally vertically
divide into k smaller rectangles (beside=TRUE). Then we draw a
horizontal line segment for each data point marking the distance of
the data point from the corresponding centroid.
Friedrich Leisch
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 20(4), 457–469, 2010.
bw05 <- bundestag(2005)
bavaria <- bundestag(2005, state="Bayern")
set.seed(1)
c4 <- cclust(bw05, k=4, save.data=TRUE)
plot(c4)
stripes(c4)
stripes(c4, beside=TRUE)
stripes(c4, type="sec")
stripes(c4, type="sec", beside=FALSE)
stripes(c4, type="all")
stripes(c4, groups=bavaria)
## ugly, but shows how colors of all parts can be changed
library("grid")
stripes(c4, type="all",
        gp.bar=gpar(col="red", lwd=3, fill="white"),
        gp.bar2=gpar(col="green", lwd=3, fill="black"))
In 2006 a sample of 1000 respondents representative of the adult Australian population was asked about their environmental behaviour when on vacation. In addition, the survey included a list of statements about vacation motives like "I want to rest and relax", "I use my holiday for the health and beauty of my body", and "Cultural offers and sights are a crucial factor". Answers are binary ("applies", "does not apply").
data(vacmot)
Data frame vacmot has 1000 observations on 20 binary
variables on travel motives. Data frame vacmotdesc has 1000
observations on sociodemographic descriptor variables, mean moral
obligation to protect the environment score, mean NEP score, and
mean environmental behaviour score; see Dolnicar & Leisch
(2008) for details.
In addition integer vector vacmot6 contains the 6
cluster partition presented in Dolnicar & Leisch (2008).
The data set was collected by the Institute for Innovation in Business and Social Research, University of Wollongong (NSW, Australia).
Sara Dolnicar and Friedrich Leisch. An investigation of tourists' patterns of obligation to protect the environment. Journal of Travel Research, 46:381-391, 2008.
Sara Dolnicar and Friedrich Leisch. Using graphical statistics to better understand market segmentation solutions. International Journal of Market Research, 56(2):97-120, 2014.
data(vacmot)
summary(vacmotdesc)
dotchart(sort(colMeans(vacmot)))
## reproduce Figure 6 from Dolnicar & Leisch (2008)
cl6 <- kcca(vacmot, k=vacmot6, control=list(iter=0))
barchart(cl6)
Part of an Australian survey on motivation of volunteers to work for non-profit organisations like Red Cross, State Emergency Service, Rural Fire Service, Surf Life Saving, Rotary, Parents and Citizens Associations, etc.
data(volunteers)
A data frame with 1415 observations on the following 21 variables: age and gender of respondents plus 19 binary motivation items (1 = applies, 0 = does not apply).
GENDER: Gender of respondent.
AGEG: Age group, a factor with categorized age of respondents.
meet.people: I can meet different types of people.
no.one.else: There is no-one else to do the work.
example: It sets a good example for others.
socialise: I can socialise with people who are like me.
help.others: It gives me the chance to help others.
give.back: I can give something back to society.
career: It will help my career prospects.
lonely: It makes me feel less lonely.
active: It keeps me active.
community: It will improve my community.
cause: I can support an important cause.
faith: I can put faith into action.
services: I want to maintain services that I may use one day.
children: My children are involved with the organisation.
good.job: I feel like I am doing a good job.
benefited: I know someone who has benefited from the organisation.
network: I can build a network of contacts.
recognition: I can gain recognition within the community.
mind.off: It takes my mind off other things.
The volunteering data was collected by the Institute for Innovation in Business and Social Research, University of Wollongong (NSW, Australia), using funding from Bushcare Wollongong and the Australian Research Council under the ARC Linkage Grant scheme (LP0453682).
Melanie Randle and Sara Dolnicar. Not Just Any Volunteers: Segmenting the Market to Attract the High-Contributors. Journal of Non-profit and Public Sector Marketing, 21(3), 271-282, 2009.
Melanie Randle and Sara Dolnicar. Self-congruity and volunteering: A multi-organisation comparison. European Journal of Marketing, 45(5), 739-758, 2011.
Melanie Randle, Friedrich Leisch, and Sara Dolnicar. Competition or collaboration? The effect of non-profit brand image on volunteer recruitment strategy. Journal of Brand Management, 20(8):689-704, 2013.