Title: Flexible Cluster Algorithms
Description: The main function kcca implements a general framework for k-centroids cluster analysis supporting arbitrary distance measures and centroid computation. Further cluster methods include hard competitive learning, neural gas, and QT clustering. There are numerous visualization methods for cluster results (neighborhood graphs, convex cluster hulls, barcharts of centroids, ...), and bootstrap methods for the analysis of cluster stability.
Authors: Friedrich Leisch [aut] (<https://orcid.org/0000-0001-7278-1983>, maintainer up to 2024), Evgenia Dimitriadou [ctb], Bettina Grün [ctb, cre]
Maintainer: Bettina Grün <[email protected]>
License: GPL-2
Version: 1.4-2
Built: 2024-10-25 06:35:59 UTC
Source: CRAN
Measurements at the beginning of the 4th grade (when the national average is 4.0) and of the 6th grade in 25 schools in New Haven.
data(achieve)
A data frame with 25 observations on the following 4 variables.
read4
4th grade reading.
arith4
4th grade arithmetic.
read6
6th grade reading.
arith6
6th grade arithmetic.
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
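A minimal usage sketch (the clustering step and the choice of k=3 are illustrative additions, not part of the original examples):

data(achieve)
summary(achieve)
## group the 25 schools by reading and arithmetic scores
cl <- cclust(achieve, k=3)
parameters(cl)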
A German manufacturer of premium cars asked customers approximately 3 months after a car purchase which characteristics of the car were most important for the decision to buy the car. The survey was done in 1983 and the data set contains all responses without missing values.
data(auto)
A data frame with 793 observations on the following 46 variables.
model
A factor with levels A, B, C, or D; model bought by the customer.
gear
A factor with levels 4 gears, 5 econo, 5 sport, or automatic.
leasing
A logical vector, was leasing used to finance the car?
usage
A factor with levels private, both, business.
previous_model
A factor describing which type of car was owned directly before the purchase.
other_consider
A factor with levels same manuf, other manuf, both, or none.
test_drive
A logical vector, did you do a test drive?
info_adv
A logical vector, was advertising an important source of information?
info_exp
A logical vector, was experience an important source of information?
info_rec
A logical vector, were recommendations an important source of information?
ch_clarity
A logical vector.
ch_economy
A logical vector.
ch_driving_properties
A logical vector.
ch_service
A logical vector.
ch_interior
A logical vector.
ch_quality
A logical vector.
ch_technology
A logical vector.
ch_model_continuity
A logical vector.
ch_comfort
A logical vector.
ch_reliability
A logical vector.
ch_handling
A logical vector.
ch_reputation
A logical vector.
ch_concept
A logical vector.
ch_character
A logical vector.
ch_power
A logical vector.
ch_resale_value
A logical vector.
ch_styling
A logical vector.
ch_safety
A logical vector.
ch_sporty
A logical vector.
ch_consumption
A logical vector.
ch_space
A logical vector.
satisfaction
A numeric vector describing overall satisfaction (1=very good, 10=very bad).
good1
Conception, styling, dimensions.
good2
Auto body.
good3
Driving and coupled axles.
good4
Engine.
good5
Electronics.
good6
Financing and customer service.
good7
Other.
sporty
What do you think about the balance of sportiness and comfort? (good, more sport, more comfort).
drive_char
Driving characteristics (gentle < speedy < powerfull < extreme).
tempo
Which average speed do you prefer on the German Autobahn in km/h? (< 130 < 130-150 < 150-180 < > 180)
consumption
An ordered factor with levels low < ok < high < too high.
gender
A factor with levels male and female.
occupation
A factor with levels self-employed, freelance, and employee.
household
Size of household, an ordered factor with levels 1-2 < >=3.
The original German data are in the public domain and available from LMU Munich (doi:10.5282/ubm/data.14). The variable names and help page were translated to English and converted into Rd format by Friedrich Leisch.
Open Data LMU (1983): Umfrage unter Kunden einer Automobilfirma, doi:10.5282/ubm/data.14
data(auto)
summary(auto)
Barplot of cluster centers or other cluster statistics.
## S4 method for signature 'kcca'
barplot(height, bycluster = TRUE, oneplot = TRUE,
        data = NULL, FUN = colMeans, main = deparse(substitute(height)),
        which = 1:height@k, names.arg = NULL, oma = par("oma"),
        col = NULL, mcol = "darkred", srt = 45, ...)

## S4 method for signature 'kcca'
barchart(x, data, xlab = "", strip.labels = NULL, strip.prefix = "Cluster ",
         col = NULL, mcol = "darkred", mlcol = mcol, which = NULL,
         legend = FALSE, shade = FALSE, diff = NULL, byvar = FALSE,
         clusters = 1:x@k, ...)

## S4 method for signature 'hclust'
barchart(x, data, xlab = "", strip.labels = NULL, strip.prefix = "Cluster ",
         col = NULL, mcol = "darkred", mlcol = mcol, which = NULL,
         shade = FALSE, diff = NULL, byvar = FALSE, k = 2, ...)

## S4 method for signature 'bclust'
barchart(x, data, xlab = "", strip.labels = NULL, strip.prefix = "Cluster ",
         col = NULL, mcol = "darkred", mlcol = mcol, which = NULL,
         legend = FALSE, shade = FALSE, diff = NULL, byvar = FALSE,
         k = x@k, clusters = 1:k, ...)
height, x: An object of class "kcca".
bycluster: If TRUE, make one barplot for each cluster; if FALSE, make one barplot for each input variable.
oneplot: If TRUE, all barplots are placed on one page.
data: If not NULL, cluster membership is predicted for the new data and used for the plots; by default the values from the training data are used.
FUN: The function to be applied to each cluster for calculating the bar heights. Only used if data is not NULL.
which: For barplot, index numbers of the clusters to plot; for barchart, index numbers or names of the variables to plot.
names.arg: A vector of names to be plotted below each bar.
main, oma, xlab, ...: Graphical parameters.
col: Vector of colors for the clusters.
mcol, mlcol: If not NULL, the overall mean of each variable is added to the panels using these colors (markers and lines, respectively).
srt: Number between 0 and 90, rotation of the x-axis labels.
strip.labels: Vector of strings for the strips of the Trellis display.
strip.prefix: Prefix string for the strips of the Trellis display.
legend: If TRUE, a legend explaining the shading is added to the plot.
shade: If TRUE, only bars deviating markedly from the overall mean are drawn in full color; see diff.
diff: A numerical vector of length two with absolute and relative deviations used for shading.
byvar: If TRUE, a panel is plotted for each variable; by default one panel per cluster is used.
clusters: Integer vector of clusters to plot.
k: Integer specifying the desired number of clusters.
The flexclust barchart method uses a horizontal arrangement of the bars and sorts them from top to bottom. Default barcharts in lattice are the other way round (bottom to top). See the examples below for how this affects, e.g., manual labels for the y-axis.
The barplot method is legacy code and is only maintained to keep up with changes in R; all active development is done on barchart.
Friedrich Leisch
Sara Dolnicar and Friedrich Leisch. Using graphical statistics to better understand market segmentation solutions. International Journal of Market Research, 56(2), 97-120, 2014.
cl <- cclust(iris[,-5], k=3)
barplot(cl)
barplot(cl, bycluster=FALSE)
## plot the maximum instead of mean value per cluster:
barplot(cl, bycluster=FALSE, data=iris[,-5],
        FUN=function(x) apply(x,2,max))
## use lattice for plotting:
barchart(cl)
## automatic abbreviation of labels
barchart(cl, scales=list(abbreviate=TRUE))
## origin of bars at zero
barchart(cl, scales=list(abbreviate=TRUE), origin=0)
## Use manual labels. Note that the flexclust barchart orders bars
## from top to bottom (the default does it the other way round), hence
## we have to rev() the labels:
LAB <- c("SL", "SW", "PL", "PW")
barchart(cl, scales=list(y=list(labels=rev(LAB))), origin=0)
## deviation of each cluster center from the population means
barchart(cl, origin=rev(cl@xcent), mlcol=NULL)
## use shading to highlight large deviations from population mean
barchart(cl, shade=TRUE)
## use smaller deviation limit than default and add a legend
barchart(cl, shade=TRUE, diff=0.2, legend=TRUE)
Cluster the data in x using the bagged clustering algorithm. A partitioning cluster algorithm such as cclust is run repeatedly on bootstrap samples from the original data. The resulting cluster centers are then combined using the hierarchical cluster algorithm hclust.

bclust(x, k = 2, base.iter = 10, base.k = 20, minsize = 0,
       dist.method = "euclidian", hclust.method = "average",
       FUN = "cclust", verbose = TRUE, final.cclust = FALSE,
       resample = TRUE, weights = NULL, maxcluster = base.k, ...)
## S4 method for signature 'bclust,missing'
plot(x, y, maxcluster = x@maxcluster, main = "", ...)
## S4 method for signature 'bclust,missing'
clusters(object, newdata, k, ...)
## S4 method for signature 'bclust'
parameters(object, k)
x: Matrix of inputs (or object of class "bclust" for the plot method).
k: Number of clusters.
base.iter: Number of runs of the base cluster algorithm.
base.k: Number of centers used in each repetition of the base method.
minsize: Minimum number of points in a base cluster.
dist.method: Distance method used for the hierarchical clustering, see dist.
hclust.method: Linkage method used for the hierarchical clustering, see hclust.
FUN: Partitioning cluster method used as base algorithm.
verbose: Output status messages.
final.cclust: If TRUE, a final cclust step is performed using the output of the bagged clustering as initialization.
resample: Logical, if TRUE the base method is run on bootstrap samples of x, else directly on x.
weights: Vector of length nrow(x), weights for the resampling of observations.
maxcluster: Maximum number of clusters memberships are to be computed for.
object: Object of class "bclust".
main: Main title of the plot.
...: Optional arguments to be passed to the base method in bclust, ignored in plot.
y: Missing.
newdata: An optional data matrix with the same number of columns as the cluster centers. If omitted, the fitted values are used.
First, base.iter bootstrap samples of the original data in x are created by drawing with replacement. The base cluster method is run on each of these samples with base.k centers. FUN must be the name of a partitioning cluster function returning an object with the same slots as the return value of cclust.

This results in a collection of base.iter * base.k centers, which are subsequently clustered using the hierarchical method hclust. Base centers with less than minsize points in their respective partitions are removed before the hierarchical clustering. The resulting dendrogram is then cut to produce k clusters.
bclust returns objects of class "bclust" including the slots

hclust: Return value of the hierarchical clustering of the collection of base centers (object of class "hclust").
cluster: Vector with indices of the clusters the inputs are assigned to.
centers: Matrix of centers of the final clusters. Only useful if the hierarchical clustering method produces convex clusters.
allcenters: Matrix of all base.iter * base.k base centers.
Friedrich Leisch
Friedrich Leisch. Bagged clustering. Working Paper 51, SFB “Adaptive Information Systems and Modeling in Economics and Management Science”, August 1999. https://epub.wu.ac.at/1272/1/document.pdf
Sara Dolnicar and Friedrich Leisch. Winter tourist segments in Austria: Identifying stable vacation styles using bagged clustering techniques. Journal of Travel Research, 41(3):281-292, 2003.
data(iris)
bc1 <- bclust(iris[,1:4], 3, base.k=5)
plot(bc1)
table(clusters(bc1, k=3))
parameters(bc1, k=3)
Birth and death rates for 70 countries.
data(birth)
A data frame with 70 observations on the following 2 variables.
birth
Birth rate (in percent).
death
Death rate (in percent).
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
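A minimal usage sketch (the clustering step and k=3 are illustrative additions, not part of the original examples):

data(birth)
plot(birth)
## cclust returns a full "kcca" object by default, so image() works
cl <- cclust(birth, k=3)
image(cl)
points(birth)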
Runs clustering algorithms repeatedly for different numbers of clusters on bootstrap replicates of the original data and returns the corresponding cluster assignments, centroids and (adjusted) Rand indices comparing pairs of partitions.
bootFlexclust(x, k, nboot=100, correct=TRUE, seed=NULL,
              multicore=TRUE, verbose=FALSE, ...)
## S4 method for signature 'bootFlexclust'
summary(object)
## S4 method for signature 'bootFlexclust,missing'
plot(x, y, ...)
## S4 method for signature 'bootFlexclust'
boxplot(x, ...)
## S4 method for signature 'bootFlexclust'
densityplot(x, data, ...)
x, k, ...: Passed on to the clustering function (see the FUN argument in the examples below).
nboot: Number of bootstrap pairs of partitions.
correct: Logical, correct the Rand index for agreement by chance (also called adjusted Rand index)?
seed: If not NULL, the random number generator is seeded with this value for reproducibility.
multicore: If TRUE, bootstrap samples are processed in parallel; may also be a cluster object created by package parallel, see the examples below.
verbose: If TRUE, progress information is shown during the computations.
y, data: Not used.
object: An object of class "bootFlexclust".
Availability of multicore is checked when flexclust is loaded. This information is stored and can be obtained using getOption("flexclust")$have_multicore. Set to FALSE for debugging and more sensible error messages in case something goes wrong.
Friedrich Leisch
## Not run:
## data uniform on unit square
x <- matrix(runif(400), ncol=2)
cl <- FALSE
## to run bootstrap replications on a workstation cluster do the following:
library("parallel")
cl <- makeCluster(2, type = "PSOCK")
clusterCall(cl, function() require("flexclust"))
## 50 bootstrap replicates for speed in example,
## use more for real applications
bcl <- bootFlexclust(x, k=2:7, nboot=50, FUN=cclust, multicore=cl)
bcl
summary(bcl)
## splitting the square into four quadrants should be the most stable
## solution (increase nboot if not)
plot(bcl)
densityplot(bcl, from=0)
## End(Not run)
Results of the elections 2002, 2005 or 2009 for the German Bundestag, the first chamber of the German parliament.
data(btw2002)
data(btw2005)
data(btw2009)
bundestag(year, second=TRUE, percent=TRUE, nazero=TRUE, state=FALSE)
year: Numeric or character, year of the election.
second: Logical, return second or first votes?
percent: Logical, return percentages or absolute numbers?
nazero: Logical, convert missing values to zero?
state: Logical or character. If TRUE, the federal state of each constituency is returned instead of the votes; a character string is used as a regular expression to select a subset of the states, see the examples below.
btw200x are data frames with 299 rows (corresponding to constituencies) and 17 columns. All columns except state are numeric.
state
Factor, the 16 German federal states.
eligible
Number of citizens eligible to vote.
votes
Number of eligible citizens who did vote.
invalid1, invalid2
Number of invalid first and second votes (see details below).
valid1, valid2
Number of valid first and second votes.
SPD1, SPD2
Number of first and second votes for the Social Democrats.
UNION1, UNION2
Number of first and second votes for CDU/CSU, the conservative Christian Democrats.
GRUENE1, GRUENE2
Number of first and second votes for the Green Party.
FDP1, FDP2
Number of first and second votes for the Liberal Party.
LINKE1, LINKE2
Number of first and second votes for the Left Party (PDS in 2002).
Missing values indicate that a party did not field a candidate in the corresponding constituency.
btw200x are the original data sets. bundestag() is a helper function which extracts first or second votes, calculates percentages (number of votes for a party divided by number of valid votes), replaces missing values by zero, and converts the result from a data frame to a matrix. By default it returns the percentage of second votes for each party, which determines the number of seats each party gets in parliament.
Half of the Members of the German Bundestag are elected directly from Germany's 299 constituencies, the other half on the parties' state lists. Accordingly, each voter has two votes in the elections to the German Bundestag. The first vote, allowing voters to elect their local representatives to the Bundestag, decides which candidates are sent to Parliament from the constituencies.
The second vote is cast for a party list. And it is this second vote that determines the relative strengths of the parties represented in the Bundestag. At least 598 Members of the German Bundestag are elected in this way. In addition to this, there are certain circumstances in which some candidates win what are known as “overhang mandates” when the seats are being distributed.
Homepage of the Bundestag: https://www.bundestag.de
p02 <- bundestag(2002)
pairs(p02)
p05 <- bundestag(2005)
pairs(p05)
p09 <- bundestag(2009)
pairs(p09)
state <- bundestag(2002, state=TRUE)
table(state)
start.with.b <- bundestag(2002, state="^B")
table(start.with.b)
pairs(p09, col=2-(state=="Bayern"))
Separate boxplots of the variables in each cluster, in comparison with boxplots for the complete sample.
## S4 method for signature 'kcca'
bwplot(x, data, xlab="", strip.labels=NULL, strip.prefix="Cluster ",
       col=NULL, shade=!is.null(shadefun), shadefun=NULL, byvar=FALSE, ...)
## S4 method for signature 'bclust'
bwplot(x, k=x@k, xlab="", strip.labels=NULL, strip.prefix="Cluster ",
       clusters=1:k, ...)
x: An object of class "kcca" or "bclust".
data: If not NULL, cluster membership is predicted for the new data and used for the plots; by default the values from the training data are used.
xlab, ...: Graphical parameters.
col: Vector of colors for the clusters.
strip.labels: Vector of strings for the strips of the Trellis display.
strip.prefix: Prefix string for the strips of the Trellis display.
shade: If TRUE, boxes deviating from the overall distribution are filled with color, see shadefun.
shadefun: A function or name of a function to compute which boxes are shaded, e.g. "medianInside" or "boxOverlap" (see the examples below).
byvar: If TRUE, a panel is plotted for each variable; by default one panel per cluster is used.
k: Number of clusters.
clusters: Integer vector of clusters to plot.
set.seed(1)
cl <- cclust(iris[,-5], k=3, save.data=TRUE)
bwplot(cl)
bwplot(cl, byvar=TRUE)
## fill only boxes with color which do not contain the overall median
## (grey dot of background box)
bwplot(cl, shade=TRUE)
## fill only boxes with color which do not overlap with the box of the
## complete sample (grey background box)
bwplot(cl, shadefun="boxOverlap")
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
cclust(x, k, dist = "euclidean", method = "kmeans", weights=NULL,
       control=NULL, group=NULL, simple=FALSE, save.data=FALSE)
x: A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
k: Either the number of clusters, or a vector of cluster assignments, or a matrix of initial (distinct) cluster centroids. If a number, a random set of (distinct) rows in x is chosen as the initial centroids.
dist: Distance measure, one of "euclidean" or "manhattan", see details below.
method: Clustering algorithm, one of "kmeans", "hardcl" or "neuralgas", see details below.
weights: An optional vector of weights for the observations (rows of the data matrix).
control: An object of class "cclustControl"; named lists are coerced to that class.
group: Currently ignored.
simple: Return an object of class "kccasimple"?
save.data: Save a copy of x in the returned object?
This function uses the same computational engine as the earlier function of the same name from package 'cclust'. The main difference is that it returns an S4 object of class "kcca", hence all available methods for "kcca" objects can be used. By default kcca and cclust use exactly the same algorithm, but cclust will usually be much faster because it uses compiled code.

If dist is "euclidean", the distance between the cluster center and the data points is the Euclidean distance (ordinary k-means algorithm), and cluster means are used as centroids. If "manhattan", the distance between the cluster center and the data points is the sum of the absolute values of the coordinate-wise differences, and the column-wise cluster medians are used as centroids.
If method is "kmeans", the classic k-means algorithm as given by MacQueen (1967) is used, which works by repeatedly moving all cluster centers to the mean of their respective Voronoi sets. If "hardcl", on-line updates are used (also known as hard competitive learning), which work by randomly drawing an observation from x and moving the closest center towards that point (e.g., Ripley 1996). If "neuralgas", the neural gas algorithm by Martinetz et al. (1993) is used. It is similar to hard competitive learning, but in addition to the closest centroid also the second-closest centroid is moved in each iteration.

The function returns an object of class "kcca".
Evgenia Dimitriadou and Friedrich Leisch
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
Martinetz T., Berkovich S., and Schulten K (1993). ‘Neural-Gas’ Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Transactions on Neural Networks, 4 (4), pp. 558–569.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge.
## a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd=0.3), ncol=2),
           matrix(rnorm(100, mean=1, sd=0.3), ncol=2))
cl <- cclust(x, 2)
plot(x, col=predict(cl))
points(cl@centers, pch="x", cex=2, col=3)

## a 3-dimensional example
x <- rbind(matrix(rnorm(150, sd=0.3), ncol=3),
           matrix(rnorm(150, mean=2, sd=0.3), ncol=3),
           matrix(rnorm(150, mean=4, sd=0.3), ncol=3))
cl <- cclust(x, 6, method="neuralgas", save.data=TRUE)
pairs(x, col=predict(cl))
plot(cl)
Returns a matrix of cluster similarities. Currently two methods for computing similarities of clusters are implemented, see details below.
## S4 method for signature 'kcca'
clusterSim(object, data=NULL, method=c("shadow", "centers"),
           symmetric=FALSE, ...)
## S4 method for signature 'kccasimple'
clusterSim(object, data=NULL, method=c("shadow", "centers"),
           symmetric=FALSE, ...)
object: Fitted object.
data: Data to use for computation of the shadow values. If the cluster object was created with save.data=TRUE, then these are used by default.
method: Type of similarities, see details below.
symmetric: Compute symmetric or asymmetric shadow values? Ignored if method="centers".
...: Currently not used.
If method="shadow" (the default), then the similarity of two clusters is proportional to the number of points in a cluster for which the centroid of the other cluster is second-closest. See Leisch (2006, 2008) for detailed formulas.

If method="centers", then first the pairwise distances between all centroids are computed and rescaled to [0,1]. The similarity between two clusters is then simply 1 minus the rescaled distance.
Friedrich Leisch
Friedrich Leisch. A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51 (2), 526–544, 2006.
Friedrich Leisch. Visualizing cluster analysis and finite mixture models. In Chun houh Chen, Wolfgang Haerdle, and Antony Unwin, editors, Handbook of Data Visualization, Springer Handbooks of Computational Statistics. Springer Verlag, 2008.
example(Nclus)
clusterSim(cl)
clusterSim(cl, symmetric=TRUE)
## should have similar structure but will be numerically different:
clusterSim(cl, symmetric=TRUE, data=Nclus[sample(1:550, 200),])
## different concept of cluster similarity
clusterSim(cl, method="centers")
These functions can be used to convert the results from cluster functions like kmeans or pam to objects of class "kcca" and vice versa.

as.kcca(object, ...)
## S3 method for class 'hclust'
as.kcca(object, data, k, family=NULL, save.data=FALSE, ...)
## S3 method for class 'kmeans'
as.kcca(object, data, save.data=FALSE, ...)
## S3 method for class 'partition'
as.kcca(object, data=NULL, save.data=FALSE, ...)
## S3 method for class 'skmeans'
as.kcca(object, data, save.data=FALSE, ...)
## S4 method for signature 'kccasimple,kmeans'
coerce(from, to="kmeans", strict=TRUE)
Cutree(tree, k=NULL, h=NULL)
object: Fitted object.
data: Data which were used to obtain the clustering. Optional for "partition" objects, which can contain the data.
save.data: Save a copy of the data in the return object?
k: Number of clusters.
family: Object of class "kccaFamily".
...: Currently not used.
from, to, strict: Usual arguments for coercion with as().
tree: A tree as produced by hclust.
h: Numeric scalar or vector with heights where the tree should be cut.
The standard cutree function orders clusters such that observation one is in cluster one, the first observation (as ordered in the data set) not in cluster one is in cluster two, etc. Cutree orders clusters as shown in the dendrogram from left to right such that similar clusters have similar numbers. The latter is used when converting to kcca.

For hierarchical clustering the cluster memberships of the converted object can be different from the result of Cutree, because one KCCA iteration has to be performed in order to obtain a valid kcca object. In this case a warning is issued.
Friedrich Leisch
data(Nclus)
cl1 <- kmeans(Nclus, 4)
cl1
cl1a <- as.kcca(cl1, Nclus)
cl1a
cl1b <- as(cl1a, "kmeans")
library("cluster")
cl2 <- pam(Nclus, 4)
cl2
cl2a <- as.kcca(cl2)
cl2a
## the same
cl2b <- as.kcca(cl2, Nclus)
cl2b
## hierarchical clustering
hc <- hclust(dist(USArrests))
plot(hc)
rect.hclust(hc, k=3)
c3 <- Cutree(hc, k=3)
k3 <- as.kcca(hc, USArrests, k=3)
barchart(k3)
table(c3, clusters(k3))
Mammals' teeth counts, divided into four groups: incisors, canines, premolars and molars.
data(dentitio)
A data frame with 66 observations on the following 8 variables.
top.inc
Top incisors.
bot.inc
Bottom incisors.
top.can
Top canines.
bot.can
Bottom canines.
top.pre
Top premolars.
bot.pre
Bottom premolars.
top.mol
Top molars.
bot.mol
Bottom molars.
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
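A minimal usage sketch (the k-medians clustering is an illustrative choice, not part of the original examples):

data(dentitio)
summary(dentitio)
## tooth counts are integers, so k-medians is a natural choice here
cl <- kcca(dentitio, k=4, family=kccaFamily("kmedians"))
barchart(cl, data=dentitio)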
Computes and returns the matrix of pairwise distances between the rows of two data matrices, using the specified distance measure.
dist2(x, y, method = "euclidean", p=2)
x: A data matrix.
y: A vector or second data matrix.
method: The distance measure to be used, with the same choices as for dist; any unambiguous substring can be given.
p: The power of the Minkowski distance.
This is a two-data-set equivalent of the standard function dist. It returns a matrix of all pairwise distances between rows in x and y. The current implementation is efficient only if y does not have too many rows (the code is vectorized in x but not in y).
The definition of Canberra distance was wrong for negative data prior to version 1.3-5.
Friedrich Leisch
x <- matrix(rnorm(20), ncol=4)
rownames(x) = paste("X", 1:nrow(x), sep=".")
y <- matrix(rnorm(12), ncol=4)
rownames(y) = paste("Y", 1:nrow(y), sep=".")
dist2(x, y)
dist2(x, y, "man")
data(milk)
dist2(milk[1:5,], milk[4:6,])
Helper functions to create kccaFamily objects.

distAngle(x, centers)
distCanberra(x, centers)
distCor(x, centers)
distEuclidean(x, centers)
distJaccard(x, centers)
distManhattan(x, centers)
distMax(x, centers)
distMinkowski(x, centers, p=2)
centAngle(x)
centMean(x)
centMedian(x)
centOptim(x, dist)
centOptim01(x, dist)

x: A data matrix.
centers: A matrix of centroids.
p: The power of the Minkowski distance.
dist: A distance function.
Friedrich Leisch
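A short sketch of how these helpers fit together; rebuilding the k-means family by hand as below is purely illustrative:

x <- matrix(rnorm(20), ncol=4)
centers <- matrix(rnorm(8), ncol=4)
distEuclidean(x, centers)   ## 5 x 2 matrix of point-to-centroid distances
centMean(x)                 ## the k-means centroid of x
## combine a distance and a centroid function into a custom family
fam <- kccaFamily(dist=distEuclidean, cent=centMean, name="my kmeans")
cl <- kcca(x, k=2, family=fam)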
Hyperparameters for cluster algorithms.
Objects can be created by calls of the form new("flexclustControl", ...). In addition, named lists can be coerced to flexclustControl objects; names are completed if unique (see examples).
Objects of class "flexclustControl" have the following slots:

iter.max: Maximum number of iterations.
tolerance: The algorithm is stopped when the (relative) change of the optimization criterion is smaller than tolerance.
verbose: If a positive integer, then progress is reported every verbose iterations. If 0, no output is generated during model fitting.
classify: Character string, one of "auto", "weighted", "hard" or "simann".
initcent: Character string, name of the function for initial centroids; currently "randomcent" (the default) and "kmeanspp" are available.
gamma: Gamma value for weighted hard competitive learning.
simann: Parameters for simulated annealing optimization (only used when classify="simann").
ntry: Number of trials per iteration for QT clustering.
min.size: Clusters smaller than this value are treated as outliers.
Objects of class "cclustControl" inherit from "flexclustControl" and have the following additional slots:

method: Learning rate for hard competitive learning, one of "polynomial" or "exponential".
pol.rate: Positive number a for a polynomial learning rate of the form 1/iter^a.
exp.rate: Vector of length 2 with parameters (r1, r2) for an exponential learning rate of the form r1 * (r2/r1)^(iter/iter.max).
ng.rate: Vector of length 4 with parameters for neural gas, see details below.
The neural gas algorithm uses updates of the form

  c_new = c_old + e * exp(-k/r) * (x - c_old)

for every centroid, where k is the order (minus 1) of the centroid with respect to its distance to the data point x (0 = closest, 1 = second closest, ...). The parameters e and r decrease over the iterations and are given by

  e = ng.rate[1] * (ng.rate[2]/ng.rate[1])^(iter/iter.max)
  r = ng.rate[3] * (ng.rate[4]/ng.rate[3])^(iter/iter.max)
See Martinetz et al (1993) for details of the algorithm, and the examples section on how to obtain default values.
Friedrich Leisch
Martinetz T., Berkovich S., and Schulten K. (1993). "Neural-Gas Network for Vector Quantization and its Application to Time-Series Prediction." IEEE Transactions on Neural Networks, 4 (4), pp. 558–569.
Arthur D. and Vassilvitskii S. (2007). "k-means++: the advantages of careful seeding". Proceedings of the 18th annual ACM-SIAM symposium on Discrete algorithms. pp. 1027-1035.
## have a look at the defaults
new("flexclustControl")
## coerce a list
mycont <- list(iter=500, tol=0.001, class="w")
as(mycont, "flexclustControl")
## some additional slots
as(mycont, "cclustControl")
## default values for ng.rate
new("cclustControl")@ng.rate
Create and access palettes for the plot methods.
flxColors(n=1:8, color=c("full","medium", "light","dark"), grey=FALSE)
flxPalette(n, ...)
n: Index numbers of the colors to return (1 to 8) for flxColors; length of the palette for flxPalette.
color: Type of color, see details.
grey: Return the grey values corresponding to the palette?
...: Passed on to flxColors.
This function creates color palettes in HCL space for up to 8 colors. All palettes have constant chroma and luminance, only the hue of the colors change within a palette.
Palettes "full" and "dark" have the same luminance, and palettes "medium" and "light" have the same luminance.
Friedrich Leisch
opar <- par(c("mfrow", "mar", "yaxt"))
par(mfrow=c(2, 2), mar=c(0, 0, 2, 0), yaxt="n")
x <- rep(1, 8)
barplot(x, col = flxColors(color="full"), main="full")
barplot(x, col = flxColors(color="dark"), main="dark")
barplot(x, col = flxColors(color="medium"), main="medium")
barplot(x, col = flxColors(color="light"), main="light")
par(opar)
Plot a histogram of the similarity of each observation to each cluster.
## S4 method for signature 'kccasimple,missing'
histogram(x, data, xlab="", ...)
## S4 method for signature 'kccasimple,data.frame'
histogram(x, data, xlab="", ...)
## S4 method for signature 'kccasimple,matrix'
histogram(x, data, xlab="Similarity", power=1, ...)
x: An object of class "kccasimple".
data: If not missing, the distance and thus similarity between observations and cluster centers is determined for the new data and used for the plots. By default the values from the training data are used.
xlab: Label for the x-axis.
power: Numeric indicating how similarities are transformed; for more details see Dolnicar et al. (2018).
...: Additional arguments passed on to the lattice function histogram.
Friedrich Leisch
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
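A minimal sketch (not part of the original manual; save.data=TRUE is used so that the training data are available for the default plot):

cl <- cclust(iris[,-5], k=3, save.data=TRUE)
histogram(cl)                    ## similarities on the training data
histogram(cl, data=iris[,-5])    ## same data passed explicitly
histogram(cl, data=as.matrix(iris[,-5]), power=2)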
Image plot of cluster segments overlaid by neighbourhood graph.
## S4 method for signature 'kcca'
image(x, which = 1:2, npoints = 100, xlab = "", ylab = "",
      fastcol = TRUE, col=NULL, clwd=0, graph=TRUE, ...)
x: An object of class "kcca".
which: Index numbers of the two dimensions of the input space to plot.
npoints: Number of grid points for the image.
fastcol: If TRUE, a greedy algorithm is used for the background colors of the segments, which may result in neighboring segments getting the same color.
col: Vector of background colors for the segments.
clwd: Line width of contour lines at cluster boundaries; use larger values for fastcol=TRUE.
graph: Logical, add a neighborhood graph to the plot?
xlab, ylab, ...: Graphical parameters.
This works only for "kcca" objects; no method is available for "kccasimple" objects.
Friedrich Leisch
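A minimal sketch along the lines of the kcca examples elsewhere in this manual:

data(Nclus)
cl <- cclust(Nclus, k=4)   ## returns a full "kcca" object by default
image(cl)
points(Nclus)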
Returns descriptive information about fitted flexclust objects like cluster sizes or sum of within-cluster distances.
## S4 method for signature 'flexclust,character'
info(object, which, drop=TRUE, ...)
object: Fitted object.
which: Which information to get. Use which="help" to list the information available for the object.
drop: Logical. If TRUE, the result is simplified to a vector if possible.
...: Passed to methods.
Function info can be used to access slots of fitted flexclust objects in a portable way, and in addition computes some meta-information like the sum of within-cluster distances.

Function infoCheck returns a logical value that is TRUE if the requested information can be computed from the object.
Friedrich Leisch
data("Nclus") plot(Nclus) cl1 <- cclust(Nclus, k=4) summary(cl1) ## these two are the same info(cl1) info(cl1, "help") ## cluster sizes i1 <- info(cl1, "size") i1 ## average within cluster distances i2 <- info(cl1, "av_dist") i2 ## the sum of all within-cluster distances i3 <- info(cl1, "distsum") i3 ## sum(i1*i2) must of course be the same as i3 stopifnot(all.equal(sum(i1*i2), i3)) ## This should return TRUE infoCheck(cl1, "size") ## and this FALSE infoCheck(cl1, "Homer Simpson") ## both combined i4 <- infoCheck(cl1, c("size", "Homer Simpson")) i4 stopifnot(all.equal(i4, c(TRUE, FALSE)))
data("Nclus") plot(Nclus) cl1 <- cclust(Nclus, k=4) summary(cl1) ## these two are the same info(cl1) info(cl1, "help") ## cluster sizes i1 <- info(cl1, "size") i1 ## average within cluster distances i2 <- info(cl1, "av_dist") i2 ## the sum of all within-cluster distances i3 <- info(cl1, "distsum") i3 ## sum(i1*i2) must of course be the same as i3 stopifnot(all.equal(sum(i1*i2), i3)) ## This should return TRUE infoCheck(cl1, "size") ## and this FALSE infoCheck(cl1, "Homer Simpson") ## both combined i4 <- infoCheck(cl1, c("size", "Homer Simpson")) i4 stopifnot(all.equal(i4, c(TRUE, FALSE)))
Perform k-centroids clustering on a data matrix.
kcca(x, k, family=kccaFamily("kmeans"), weights=NULL, group=NULL,
     control=NULL, simple=FALSE, save.data=FALSE)

kccaFamily(which=NULL, dist=NULL, cent=NULL, name=which,
           preproc = NULL, trim=0, groupFun = "minSumClusters")

## S4 method for signature 'kccasimple'
summary(object)
x: A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
k: Either the number of clusters, or a vector of cluster assignments, or a matrix of initial (distinct) cluster centroids. If a number, a random set of (distinct) rows in x is chosen as the initial centroids.
family: Object of class "kccaFamily".
weights: An optional vector of weights to be used in the clustering process, cannot be combined with all families.
group: An optional grouping vector for the data, see details below.
control: An object of class "flexclustControl"; named lists are coerced to that class.
simple: Return an object of class "kccasimple"?
save.data: Save a copy of x in the return object?
which: One of "kmeans", "kmedians", "angle", "jaccard" or "ejaccard", see details below.
name: Optional long name for the family, used only for show methods.
dist: A function for distance computation, ignored if which is specified.
cent: A function for centroid computation, ignored if which is specified.
preproc: Function for data preprocessing.
trim: A number in between 0 and 0.5; if non-zero, trimmed means are used for the "kmeans" family, ignored by all other families.
groupFun: Function or name of function to obtain clusters for grouped data, see details below.
object: Object of class "kccasimple".
See the paper A Toolbox for K-Centroids Cluster Analysis referenced below for details.
Function kcca returns objects of class "kcca" or "kccasimple" depending on the value of argument simple. The simpler objects contain fewer slots and hence are faster to compute, but contain no auxiliary information used by the plotting methods. Most plot methods for "kccasimple" objects do nothing and return a warning. If only centroids, cluster membership or prediction for new data are of interest, then the simple objects are sufficient.

Function kccaFamily() currently has the following predefined families (distance / centroid):

kmeans: Euclidean distance / mean
kmedians: Manhattan distance / median
angle: angle between observation and centroid / standardized mean
jaccard: Jaccard distance / numeric optimization
ejaccard: Jaccard distance / mean
See Leisch (2006) for details on all combinations.
If group is not NULL, then observations from the same group are restricted to belong to the same cluster (must-link constraint) or different clusters (cannot-link constraint) during the fitting process. If groupFun = "minSumClusters", then all group members are assigned to the cluster where the center has minimal average distance to the group members. If groupFun = "majorityClusters", then all group members are assigned to the cluster the majority would belong to without a constraint.

groupFun = "differentClusters" implements a cannot-link constraint, i.e., members of one group are not allowed to belong to the same cluster. The optimal allocation for each group is found by solving a linear sum assignment problem using solve_LSAP. Obviously the group sizes must be smaller than the number of clusters in this case.

Ties are broken at random in all cases.

Note that at the moment not all methods for fitted "kcca" objects respect the grouping information, most importantly the plot method when a data argument is specified.
Friedrich Leisch
Friedrich Leisch. A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51 (2), 526–544, 2006.
Friedrich Leisch and Bettina Gruen. Extending standard cluster algorithms to allow for group constraints. In Alfredo Rizzi and Maurizio Vichi, editors, Compstat 2006-Proceedings in Computational Statistics, pages 885-892. Physica Verlag, Heidelberg, Germany, 2006.
stepFlexclust, cclust, distances
data("Nclus") plot(Nclus) ## try kmeans cl1 <- kcca(Nclus, k=4) cl1 image(cl1) points(Nclus) ## A barplot of the centroids barplot(cl1) ## now use k-medians and kmeans++ initialization, cluster centroids ## should be similar... cl2 <- kcca(Nclus, k=4, family=kccaFamily("kmedians"), control=list(initcent="kmeanspp")) cl2 ## ... but the boundaries of the partitions have a different shape image(cl2) points(Nclus)
data("Nclus") plot(Nclus) ## try kmeans cl1 <- kcca(Nclus, k=4) cl1 image(cl1) points(Nclus) ## A barplot of the centroids barplot(cl1) ## now use k-medians and kmeans++ initialization, cluster centroids ## should be similar... cl2 <- kcca(Nclus, k=4, family=kccaFamily("kmedians"), control=list(initcent="kmeanspp")) cl2 ## ... but the boundaries of the partitions have a different shape image(cl2) points(Nclus)
Convert an object of class "kcca" to a data frame in long format.

kcca2df(object, data)
object: Object of class "kcca".
data: Optional data if not saved in the object.

The function returns a data.frame with columns value, variable and group.
c.iris <- cclust(iris[,-5], 3, save.data=TRUE)
df.c.iris <- kcca2df(c.iris)
summary(df.c.iris)
densityplot(~value|variable+group, data=df.c.iris)
The data set contains the ingredients of mammal's milk of 25 animals.
data(milk)
A data frame with 25 observations on the following 5 variables (all in percent).
water
Water.
protein
Protein.
fat
Fat.
lactose
Lactose.
ash
Ash.
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
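A minimal usage sketch (the hierarchical clustering step is purely illustrative, not part of the original examples):

data(milk)
summary(milk)
## group the 25 animals by the composition of their milk
hc <- hclust(dist(milk))
plot(hc)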
A simple artificial regression example with 4 clusters, all of them having a Gaussian distribution.
data(Nclus)
The Nclus data set can be re-created by loading package flexmix and running ExNclus(100) with set.seed(2602). It has been saved as a data set for simplicity of examples only.
data(Nclus)
cl <- cclust(Nclus, k=4, simple=FALSE, save.data=TRUE)
plot(cl)
The data set contains the measurements of nutrients in several types of meat, fish and fowl.
data(nutrient)
A data frame with 27 observations on the following 5 variables.
energy
Food energy (calories).
protein
Protein (grams).
fat
Fat (grams).
calcium
Calcium (milligrams).
iron
Iron (milligrams).
John A. Hartigan: Clustering Algorithms. Wiley, New York, 1975.
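A minimal usage sketch (standardizing and k=4 are illustrative choices, not part of the original examples):

data(nutrient)
summary(nutrient)
## the variables have very different scales, so standardize first
cl <- cclust(scale(nutrient), k=4)
info(cl, "size")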
Plot a matrix of neighbourhood graphs.
## S4 method for signature 'kcca'
pairs(x, which=NULL, project=NULL, oma=NULL, ...)
x: An object of class "kcca".
which: Index numbers of dimensions of (projected) input space to plot, default is to plot all dimensions.
project: Projection object for which a predict method exists, e.g., the result of prcomp.
oma: Outer margin.
...: Passed on to the plot method.
This works only for "kcca" objects; no method is available for "kccasimple" objects.
Friedrich Leisch
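A minimal sketch using the iris measurements as input (not part of the original manual):

cl <- cclust(iris[,-5], k=3)
pairs(cl)              ## all pairs of the four input dimensions
pairs(cl, which=1:3)   ## only the first three variables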
Returns the matrix of centroids of a fitted object of class "kcca".

## S4 method for signature 'kccasimple'
parameters(object, ...)
object: Fitted object.
...: Currently not used.
Friedrich Leisch
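A minimal sketch (the iris example is illustrative):

cl <- cclust(iris[,-5], k=3)
parameters(cl)   ## one row per cluster, one column per variable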
Plot the neighbourhood graph of a cluster solution together with projected data points.
## S4 method for signature 'kcca,missing'
plot(x, y, which=1:2, project=NULL, data=NULL, points=TRUE,
     hull=TRUE, hull.args=NULL, number = TRUE, simlines=TRUE,
     lwd=1, maxlwd=8*lwd, cex=1.5, numcol=FALSE, nodes=16,
     add=FALSE, xlab="", ylab="", xlim = NULL, ylim = NULL,
     pch=NULL, col=NULL, ...)
x: An object of class "kcca".
y: Not used.
which: Index numbers of dimensions of (projected) input space to plot.
project: Projection object for which a predict method exists, e.g., the result of prcomp.
data: Data to include in the plot. If the cluster object was created with save.data=TRUE, then these are used by default.
points: Logical, shall data points be plotted (if available)?
hull: If TRUE, a hull is drawn around each cluster.
hull.args: A list of arguments for the hull function.
number: Logical, plot number labels in nodes of graph?
numcol, cex: Color and size of number labels in nodes of the graph. If numcol is TRUE, the cluster colors are used.
nodes: Plotting symbol to use for nodes if no numbers are drawn.
simlines: Logical, plot edges of graph?
lwd, maxlwd: Numerical, thickness of lines.
add: Logical, add to existing plot?
xlab, ylab: Axis labels.
xlim, ylim: Axis range.
pch, col, ...: Plotting symbols and colors for data points.
This works only for "kcca" objects; no method is available for "kccasimple" objects.
Friedrich Leisch
Friedrich Leisch. Visualizing cluster analysis and finite mixture models. In Chun houh Chen, Wolfgang Haerdle, and Antony Unwin, editors, Handbook of Data Visualization, Springer Handbooks of Computational Statistics. Springer Verlag, 2008.
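A minimal sketch (the projection onto principal components is an illustrative use of the project argument):

data(Nclus)
cl <- cclust(Nclus, k=4, save.data=TRUE)
plot(cl)                          ## neighbourhood graph with cluster hulls
plot(cl, project=prcomp(Nclus))   ## the same plot in principal component space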
Return either the cluster membership of training data or predict for new data.
## S4 method for signature 'kccasimple'
predict(object, newdata, ...)
## S4 method for signature 'flexclust,ANY'
clusters(object, newdata, ...)
object: Object of a class inheriting from "flexclust".
newdata: An optional data matrix with the same number of columns as the cluster centers. If omitted, the fitted values are used.
...: Currently not used.
clusters can be used on any object of class "flexclust" and returns the cluster memberships of the training data.

predict can be used only on objects of class "kcca" (which inherit from "flexclust"). If no newdata argument is specified, the function is identical to clusters; if newdata is specified, then cluster memberships for the new data are predicted. clusters(object, newdata, ...) is an alias for predict(object, newdata, ...).
Friedrich Leisch
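A minimal sketch of both uses (the call to cclust and the toy newdata matrix are illustrative):

data(Nclus)
cl <- cclust(Nclus, k=4)
table(clusters(cl))           ## cluster sizes on the training data
newx <- matrix(rnorm(10), ncol=2)
predict(cl, newdata=newx)     ## memberships for 5 new observations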
Simple artificial 2-dimensional data to demonstrate clustering for market segmentation. One dimension is the hypothetical feature sophistication (or performance or quality, etc) of a product, the second dimension the price customers are willing to pay for the product.
priceFeature(n, which=c("2clust", "3clust", "3clustold", "5clust",
                        "ellipse", "triangle", "circle", "square",
                        "largesmall"))
n: Sample size.
which: Shape of data set.
Sara Dolnicar and Friedrich Leisch. Evaluation of structure and reproducibility of cluster solutions using the bootstrap. Marketing Letters, 21:83-101, 2010.
plot(priceFeature(200, "2clust"))
plot(priceFeature(200, "3clust"))
plot(priceFeature(200, "3clustold"))
plot(priceFeature(200, "5clust"))
plot(priceFeature(200, "ell"))
plot(priceFeature(200, "tri"))
plot(priceFeature(200, "circ"))
plot(priceFeature(200, "square"))
plot(priceFeature(200, "largesmall"))
Adds arrows for original coordinate axes to a projection plot.
projAxes(object, which=1:2, center=NULL, col="red", radius=NULL,
         minradius=0.1, textargs=list(col=col),
         col.names=getColnames(object), which.names="",
         group = NULL, groupFun = colMeans, plot=TRUE, ...)

placeLabels(object)
## S4 method for signature 'projAxes'
placeLabels(object)
object: Return value of a projection method like prcomp.
which: Index numbers of the dimensions of the (projected) input space that have been plotted.
center: Center of the coordinate system to use in projected space. Default is the center of the plotting region.
col: Color of arrows.
radius: Relative size of the arrows.
minradius: Minimum radius of arrows to include (relative to arrow size).
textargs: List of arguments for the text labels of the arrows.
col.names: Variable names of the original data.
which.names: A regular expression specifying which variable names to include in the plot.
group: An optional grouping variable for the original coordinates.
groupFun: Function used to aggregate the projected coordinates if group is specified.
plot: Logical; if TRUE, the axes are added to the current plot.
...: Passed on to the plotting functions.
projAxes invisibly returns an object of class "projAxes", which can be added to an existing plot by its plot method.
Friedrich Leisch
data(milk)
milk.pca <- prcomp(milk, scale=TRUE)
## create a biplot step by step
plot(predict(milk.pca), type="n")
text(predict(milk.pca), rownames(milk), col="green", cex=0.8)
projAxes(milk.pca)
## the same, but arrows are blue, centered at origin and all arrows are
## plotted
plot(predict(milk.pca), type="n")
text(predict(milk.pca), rownames(milk), col="green", cex=0.8)
projAxes(milk.pca, col="blue", center=0, minradius=0)
## use points instead of text, plot PC2 and PC3, manual radius
## specification, store result
plot(predict(milk.pca)[,c(2,3)])
arr <- projAxes(milk.pca, which=c(2,3), radius=1.2, plot=FALSE)
plot(arr)
## Not run:
## manually try to find new places for the labels: each arrow is marked
## active in turn, use the left mouse button to find a better location
## for the label. Use the right mouse button to go on to the next
## variable.
arr1 <- placeLabels(arr)
## now do the plot again:
plot(predict(milk.pca)[,c(2,3)])
plot(arr1)
## End(Not run)
Split a binary or numeric matrix by a grouping variable, run a series of tests on all variables, adjust for multiple testing and graphically represent results.
propBarchart(x, g, alpha=0.05, correct="holm", test="prop.test",
             sort=FALSE, strip.prefix="", strip.labels=NULL,
             which=NULL, byvar=FALSE, ...)
## S4 method for signature 'propBarchart'
summary(object, ...)

groupBWplot(x, g, alpha=0.05, correct="holm", xlab="", col=NULL,
            shade=!is.null(shadefun), shadefun=NULL, strip.prefix="",
            strip.labels=NULL, which=NULL, byvar=FALSE, ...)
x: A binary data matrix for propBarchart, a general numeric matrix for groupBWplot.
g: A factor specifying the groups.
alpha: Significance level for the tests of group differences.
correct: Correction method for multiple testing, passed to p.adjust.
test: Test to use for detecting significant differences in proportions.
sort: Logical, sort variables by total sample mean?
strip.prefix: Character string prepended to the strips of the Trellis display.
strip.labels: Character vector of labels to use for the strips of the Trellis display.
which: Index numbers or names of variables to plot.
byvar: If TRUE, a panel is plotted for each variable; by default one panel per group is used.
...: Passed on to the plotting functions.
object: Return value of propBarchart.
xlab: A title for the x-axis: see title.
col: Vector of colors for the panels.
shade: If TRUE, boxes deviating from the overall distribution are filled with color, see shadefun.
shadefun: A function or name of a function to compute which boxes are shaded, e.g. "medianInside" or "boxOverlap".
Function propBarchart splits a binary data matrix into subgroups, computes the percentage of ones in each column, and compares the proportions in the groups using prop.test. The p-values for all variables are adjusted for multiple testing and a barchart of group percentages is drawn, highlighting variables with significant differences in proportion. The summary method can be used to create a corresponding table for publications.
Function groupBWplot takes a general numeric matrix, also splits it into subgroups, and uses boxes instead of bars. By default kruskal.test is used to compute significant differences in location; in addition, the heuristics from bwplot,kcca-method can be used. Boxes of the complete sample are used as reference in the background.
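The following base-R sketch mirrors these computations (illustrative only, not the package's actual implementation): group-wise column proportions, one prop.test per variable, and Holm adjustment of the p-values.

## sketch of the computations behind propBarchart (assumed equivalent,
## not the actual implementation)
x <- apply(iris[,-5], 2, function(z) z > median(z))
g <- iris$Species
apply(x, 2, function(z) tapply(z, g, mean))       # group percentages
pvals <- apply(x, 2, function(z) prop.test(table(g, z))$p.value)
p.adjust(pvals, method="holm")                    # adjusted p-values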
Friedrich Leisch
barplot-methods, bwplot,kcca-method
## create a binary matrix from the iris data plus a random noise column
x <- apply(iris[,-5], 2, function(z) z>median(z))
x <- cbind(x, Noise=sample(0:1, 150, replace=TRUE))

## There are significant differences in all 4 original variables, Noise
## has most likely no significant difference (of course the difference
## will be significant in alpha percent of all random samples).
p <- propBarchart(x, iris$Species)
p
summary(p)
propBarchart(x, iris$Species, byvar=TRUE)

x <- iris[,-5]
x <- cbind(x, Noise=rnorm(150, mean=3))
groupBWplot(x, iris$Species)
groupBWplot(x, iris$Species, shade=TRUE)
groupBWplot(x, iris$Species, shadefun="medianInside")
groupBWplot(x, iris$Species, shade=TRUE, byvar=TRUE)
Perform stochastic QT clustering on a data matrix.
qtclust(x, radius, family = kccaFamily("kmeans"), control = NULL,
        save.data = FALSE, kcca = FALSE)
x
A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
radius
Maximum radius of clusters.
family
Object of class "kccaFamily".
control
An object of class "flexclustControl".
save.data
Save a copy of x in the return object?
kcca
Run kcca() afterwards, initialized on the QT cluster solution?
This function implements a variation of the QT clustering algorithm by Heyer et al. (1999), see Scharl and Leisch (2006). The main difference is that in each iteration not all possible cluster start points are considered, but only a random sample of size control@ntry. In addition, only points which have at least one other point within a circle of radius radius are considered as initial centers. In most cases the resulting solutions are almost the same at a considerable speed increase; in some cases even better solutions are obtained than with the original algorithm. If control@ntry is set to the size of the data set, an algorithm similar to the original algorithm proposed by Heyer et al. (1999) is obtained.
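As a hedged sketch (assuming control may be given as a named list, as elsewhere in the package), the following contrasts the stochastic default with a run where ntry equals the number of data points, which approximates the original algorithm:

x <- matrix(10*runif(1000), ncol=2)
cl.fast <- qtclust(x, radius=3)                    # stochastic default
cl.full <- qtclust(x, radius=3,
                   control=list(ntry=nrow(x)))     # close to original QT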
Function qtclust by default returns objects of class "kccasimple". If argument kcca is TRUE, function kcca() is run afterwards (initialized on the QT cluster solution). Data points not clustered by the QT cluster algorithm are omitted from the kcca() iterations, but filled back into the return object. All plot methods defined for objects of class "kcca" can be used.
Friedrich Leisch
Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research 9, 1106–1115.
Theresa Scharl and Friedrich Leisch. The stochastic QT-clust algorithm: evaluation of stability and variance on time-course microarray data. In Alfredo Rizzi and Maurizio Vichi, editors, Compstat 2006 – Proceedings in Computational Statistics, pages 1015-1022. Physica Verlag, Heidelberg, Germany, 2006.
x <- matrix(10*runif(1000), ncol=2)

## maximum distance of point to cluster center is 3
cl1 <- qtclust(x, radius=3)

## maximum distance of point to cluster center is 1
## -> more clusters, longer runtime
cl2 <- qtclust(x, radius=1)

opar <- par(c("mfrow","mar"))
par(mfrow=c(2,1), mar=c(2.1,2.1,1,1))
plot(x, col=predict(cl1), xlab="", ylab="")
plot(x, col=predict(cl2), xlab="", ylab="")
par(opar)
Compute the (adjusted) Rand, Jaccard and Fowlkes-Mallows index for agreement of two partitions.
comPart(x, y, type=c("ARI","RI","J","FM"))
## S4 method for signature 'flexclust,flexclust'
comPart(x, y, type)
## S4 method for signature 'numeric,numeric'
comPart(x, y, type)
## S4 method for signature 'flexclust,numeric'
comPart(x, y, type)
## S4 method for signature 'numeric,flexclust'
comPart(x, y, type)

randIndex(x, y, correct=TRUE, original=!correct)
## S4 method for signature 'table,missing'
randIndex(x, y, correct=TRUE, original=!correct)
## S4 method for signature 'ANY,ANY'
randIndex(x, y, correct=TRUE, original=!correct)
x
Either a 2-dimensional cross-tabulation of cluster assignments (for randIndex only), an object inheriting from class "flexclust", or an integer vector of cluster memberships.
y
An object inheriting from class "flexclust", or an integer vector of cluster memberships.
type
Character vector of abbreviations of indices to compute.
correct, original
Logical, correct the Rand index for agreement by chance?
A vector of indices.
Let A denote the number of all pairs of data points which are either put into the same cluster by both partitions or put into different clusters by both partitions. Conversely, let D denote the number of all pairs of data points that are put into one cluster in one partition, but into different clusters by the other partition. The partitions disagree for all pairs D and agree for all pairs A. We can measure the agreement by the Rand index A/(A+D), which is invariant with respect to permutations of cluster labels.
The index has to be corrected for agreement by chance if the sizes of the clusters are not uniform (which is usually the case), or if there are many clusters, see Hubert & Arabie (1985) for details.
If the number of clusters is very large, then usually the vast majority of pairs of points will not be in the same cluster. The Jaccard index tries to account for this by using only pairs of points that are in the same cluster in the definition of A.
Let A again denote the pairs of points that are in the same cluster in both partitions. Fowlkes-Mallows divides this number by the geometric mean of the sums of the number of pairs in each cluster of the two partitions. This gives the probability that a pair of points which are in the same cluster in one partition are also in the same cluster in the other partition.
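To make the pair-counting definitions concrete, the helper below (pairCounts is a hypothetical name, not part of the package) computes all four indices directly from a cross-tabulation; compare its output with comPart:

pairCounts <- function(tab){
    n <- sum(tab)
    total  <- choose(n, 2)                  # all pairs of points
    same.b <- sum(choose(tab, 2))           # same cluster in both partitions
    same.x <- sum(choose(rowSums(tab), 2))  # same cluster in partition x
    same.y <- sum(choose(colSums(tab), 2))  # same cluster in partition y
    A <- total - same.x - same.y + 2*same.b # agreements
    E <- same.x * same.y / total            # expected value under chance
    c(RI  = A/total,
      ARI = (same.b - E) / ((same.x + same.y)/2 - E),
      J   = same.b / (same.x + same.y - same.b),
      FM  = same.b / sqrt(same.x * same.y))
}
g1 <- sample(1:5, 100, replace=TRUE)
g2 <- sample(1:5, 100, replace=TRUE)
pairCounts(table(g1, g2))   # compare with comPart(g1, g2)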
Friedrich Leisch
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2, 193–218, 1985.
Marina Meila. Comparing clusterings - an axiomatic view. In Stefan Wrobel and Luc De Raedt, editors, Proceedings of the International Machine Learning Conference (ICML). ACM Press, 2005.
## no class correlations: corrected Rand almost zero
g1 <- sample(1:5, size=1000, replace=TRUE)
g2 <- sample(1:5, size=1000, replace=TRUE)
tab <- table(g1, g2)
randIndex(tab)

## uncorrected version will be large, because there are many points
## which are assigned to different clusters in both cases
randIndex(tab, correct=FALSE)
comPart(g1, g2)

## let pairs (g1=1,g2=1) and (g1=3,g2=3) agree better
k <- sample(1:1000, size=200)
g1[k] <- 1
g2[k] <- 1
k <- sample(1:1000, size=200)
g1[k] <- 3
g2[k] <- 3
tab <- table(g1, g2)

## the index should be larger than before
randIndex(tab, correct=TRUE, original=TRUE)
comPart(g1, g2)
Create a series of projection plots corresponding to a random tour through the data.
randomTour(object, ...)
## S4 method for signature 'ANY'
randomTour(object, ...)
## S4 method for signature 'matrix'
randomTour(object, ...)
## S4 method for signature 'flexclust'
randomTour(object, data=NULL, col=NULL, ...)

randomTourMatrix(x, directions=10, steps=100, sec=4, sleep=sec/steps,
                 axiscol=2, axislab=colnames(x), center=NULL,
                 radius=1, minradius=0.01, asp=1, ...)
object, x
A matrix or an object of class "flexclust".
data
Data to include in plot.
col
Plotting colors for data points.
directions
Integer value, how many different directions are toured.
steps
Integer, number of steps in each direction.
sec
Numerical, lower bound for the number of seconds each direction takes.
sleep
Numerical, sleep for as many seconds after each picture has been plotted.
axiscol
If not NULL, the projections of the original coordinate axes are drawn in these colors.
axislab
Optional labels for the projected axes.
center
Center of the coordinate system to use in projected space. Default is the center of the plotting region.
radius
Relative size of the arrows.
minradius
Minimum radius of arrows to include.
asp, ...
Passed on to plot.
Two random locations are chosen, and the data are projected onto hyperplanes which are orthogonal to step vectors interpolating the two locations. The first two coordinates of the projected data are plotted. If directions is larger than one, then after the first steps plots one more random location is chosen, and the procedure is repeated from the current position to the new location, and so on.
The whole procedure is similar to a grand tour, but no attempt is made to optimize subsequent directions; randomTour simply chooses a random direction in each iteration. Use rggobi for the real thing.
Obviously the function needs a reasonably fast computer and graphics device to give a smooth impression; for x11 it may be necessary to use type="Xlib" rather than cairo.
Friedrich Leisch
if(interactive()){
  par(ask=FALSE)
  randomTour(iris[,1:4], axiscol=2:5)
  randomTour(iris[,1:4], col=as.numeric(iris$Species), axiscol=4)

  x <- matrix(runif(300), ncol=3)
  x <- rbind(x, x+1, x+2)
  cl <- cclust(x, k=3, save.data=TRUE)
  randomTour(cl, center=0, axiscol="black")

  ## now use predicted cluster membership for new data as colors
  randomTour(cl, center=0, axiscol="black",
             data=matrix(rnorm(3000, mean=1, sd=2), ncol=3))
}
The clusters are relabelled to obtain a unique labeling.
relabel(object, by, ...)
## S4 method for signature 'kccasimple,character'
relabel(object, by, which = NULL, ...)
## S4 method for signature 'kccasimple,integer'
relabel(object, by, ...)
## S4 method for signature 'kccasimple,missing'
relabel(object, by, ...)
## S4 method for signature 'stepFlexclust,integer'
relabel(object, by = "series", ...)
## S4 method for signature 'stepFlexclust,missing'
relabel(object, by, ...)
object
An object of class "kccasimple" or "stepFlexclust".
by
If a character vector, it needs to be one of "mean", "median", "manual", "variable", "centers", "shadow", "symmshadow" or "series", see details.
which
Either an integer vector indicating the ordering or a vector of length one indicating the variable used for ordering.
...
Currently not used.
If by is a character vector with value "mean" or "median", the clusters are ordered by the mean or median values over all variables for each cluster. If by = "manual", which needs to be a vector indicating the ordering. If by = "variable", which needs to indicate the variable used to determine the ordering. If by is "centers", "shadow" or "symmshadow", cluster similarities are calculated using clusterSim and used to determine an ordering using seriate from package seriation. If by = "series", the relabeling is performed over a series of clusterings to minimize the misclassification.
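A brief usage sketch of some of the by methods (the manual ordering shown is arbitrary):

data("Nclus")
cl4 <- cclust(Nclus, k=4)
relabel(cl4, by="mean")                      # order by mean over variables
relabel(cl4, by="manual", which=c(3,1,4,2))  # explicit ordering
relabel(cl4, by="variable", which=1)         # order by the first variable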
Friedrich Leisch
Compute and plot shadows and silhouettes.
## S4 method for signature 'kccasimple'
shadow(object, ...)
## S4 method for signature 'kcca'
Silhouette(object, data=NULL, ...)
object
An object of class "kccasimple" (shadow) or "kcca" (Silhouette).
data
Data to compute silhouette values for. If the cluster object was created with save.data=TRUE, the stored data are used by default.
...
Currently not used.
The shadow value of each data point is defined as twice the distance to the closest centroid divided by the sum of distances to the closest and second-closest centroid. If the shadow value of a point is close to 0, then the point is close to its cluster centroid. If the shadow value is close to 1, it is almost equidistant to the two centroids. Thus, a cluster that is well separated from all other clusters should have many points with small shadow values.
The silhouette value of a data point is defined as the scaled difference between the average dissimilarity of a point to all points in its own cluster and the smallest average dissimilarity to the points of a different cluster. Large silhouette values indicate good separation.
The main difference between silhouette values and shadow values is that we replace average dissimilarities to points in a cluster by dissimilarities to point averages (=centroids). See Leisch (2009) for details.
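The shadow definition is short enough to state in code; a minimal sketch (shadowValues is a hypothetical helper), assuming d is a matrix of distances from each point (row) to each centroid (column):

## shadow value: 2*d1/(d1+d2), with d1 and d2 the distances to the
## closest and second-closest centroid (hypothetical helper)
shadowValues <- function(d){
    s <- t(apply(d, 1, sort))
    2 * s[,1] / (s[,1] + s[,2])
}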
Friedrich Leisch
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 20(4), 457–469, 2010.
data(Nclus)
set.seed(1)
c5 <- cclust(Nclus, 5, save.data=TRUE)
c5
plot(c5)

## high shadow values indicate clusters with *bad* separation
shadow(c5)
plot(shadow(c5))

## high Silhouette values indicate clusters with *good* separation
Silhouette(c5)
plot(Silhouette(c5))
Shadow star plots and corresponding panel functions.
shadowStars(object, which=1:2, project=NULL, width=1, varwidth=FALSE,
            panel=panelShadowStripes, box=NULL, col=NULL, add=FALSE, ...)

panelShadowStripes(x, col, ...)
panelShadowViolin(x, ...)
panelShadowBP(x, ...)
panelShadowSkeleton(x, ...)
object
An object of class "kcca".
which
Index numbers of dimensions of (projected) input space to plot.
project
Projection object for which a predict method exists, e.g., the result of prcomp.
width
Width of vertices connecting the cluster centroids.
varwidth
Logical, shall all vertices have the same width or should the width be proportional to the number of points shown on the vertex?
panel
Function used to draw vertices.
box
Color of rectangle drawn around each vertex.
col
A vector of colors for the clusters.
add
Logical, start a new plot?
...
Passed on to the panel function.
x
Shadow values of data points corresponding to the vertex.
The shadow value of each data point is defined as twice the distance to the closest centroid divided by the sum of distances to the closest and second-closest centroid. If the shadow value of a point is close to 0, then the point is close to its cluster centroid. If the shadow value is close to 1, it is almost equidistant to the two centroids. Thus, a cluster that is well separated from all other clusters should have many points with small shadow values.
The neighborhood graph of a cluster solution connects two centroids by a vertex if at least one data point has the two centroids as closest and second closest. The width of the vertex is proportional to the sum of shadow values of all points having these two as closest and second closest. A shadow star depicts the distribution of shadow values on the vertex, see Leisch (2009) for details.
Currently four panel functions are available:

panelShadowStripes:
line segment for each shadow value.
panelShadowViolin:
violin plot of shadow values.
panelShadowBP:
box-percentile plot of shadow values.
panelShadowSkeleton:
average shadow value.
Friedrich Leisch
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 20(4), 457–469, 2010.
data(Nclus)
set.seed(1)
c5 <- cclust(Nclus, 5, save.data=TRUE)
c5
plot(c5)

shadowStars(c5)
shadowStars(c5, varwidth=TRUE)
shadowStars(c5, panel=panelShadowViolin)
shadowStars(c5, panel=panelShadowBP)

## always use varwidth=TRUE with panelShadowSkeleton, otherwise a few
## large shadow values can lead to misleading results:
shadowStars(c5, panel=panelShadowSkeleton)
shadowStars(c5, panel=panelShadowSkeleton, varwidth=TRUE)
Create a segment level stability across solutions plot, possibly using an additional variable for coloring the nodes.
slsaplot(object, nodecol = NULL, ...)
object
An object returned by stepFlexclust.
nodecol
A numeric vector of length equal to the number of observations clustered in object, used to color the nodes.
...
Additional graphical parameters to modify the plot.
For more details see Dolnicar and Leisch (2017) and Dolnicar et al. (2018).
A list of length equal to the number of different cluster solutions minus one, containing numeric vectors of the entropy values used by default to color the nodes.
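A hedged usage sketch for nodecol, coloring the nodes by an external numeric variable instead of the default entropy values:

data("Nclus")
cl25 <- stepFlexclust(Nclus, k=2:5, multicore=FALSE)
slsaplot(cl25)                      # default: entropy coloring
slsaplot(cl25, nodecol=Nclus[,1])   # color nodes by the first variable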
Friedrich Leisch
Dolnicar S. and Leisch F. (2017) "Using Segment Level Stability to Select Target Segments in Data-Driven Market Segmentation Studies" Marketing Letters, 28 (3), pp. 423–436.
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
stepFlexclust, relabel, slswFlexclust
data("Nclus") cl25 <- stepFlexclust(Nclus, k=2:5) slsaplot(cl25) cl25 <- relabel(cl25) slsaplot(cl25)
data("Nclus") cl25 <- stepFlexclust(Nclus, k=2:5) slsaplot(cl25) cl25 <- relabel(cl25) slsaplot(cl25)
Assess segment level stability within solution.
slswFlexclust(x, object, ...)
## S4 method for signature 'resampleFlexclust,missing'
plot(x, y, ...)
## S4 method for signature 'resampleFlexclust'
boxplot(x, which=1, ylab=NULL, ...)
## S4 method for signature 'resampleFlexclust'
densityplot(x, data, which=1, ...)
## S4 method for signature 'resampleFlexclust'
summary(object)
x
A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns), passed to the clustering procedure; for the plot methods, an object of class "resampleFlexclust".
object
An object of class "kcca" (for slswFlexclust) or "resampleFlexclust" (for summary).
y
Missing.
which
Integer or character indicating which validation measure is used for plotting.
ylab
Axis label.
data
Not used.
...
Additional arguments; for details see below.
Function slswFlexclust accepts several additional arguments: nsamp is by default equal to 100 and sets the number of bootstrap pairs drawn; seed allows to set a random seed; multicore is by default TRUE and indicates if bootstrap samples should be drawn in parallel; verbose is by default FALSE and, if TRUE, progress information is shown during computations.
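For example (a sketch exercising the arguments described above):

data("Nclus")
cl3 <- kcca(Nclus, k=3)
slsw.cl3 <- slswFlexclust(Nclus, cl3, nsamp=50, seed=123,
                          multicore=FALSE, verbose=TRUE)
summary(slsw.cl3)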
There are plotting as well as printing and summary methods implemented for objects of class "resampleFlexclust". In addition to a standard plot method, methods for densityplot and boxplot are provided.
For more details see Dolnicar and Leisch (2017) and Dolnicar et al. (2018).
An object of class "resampleFlexclust".
Friedrich Leisch
Dolnicar S. and Leisch F. (2017) "Using Segment Level Stability to Select Target Segments in Data-Driven Market Segmentation Studies" Marketing Letters, 28 (3), pp. 423–436.
Dolnicar S., Gruen B., and Leisch F. (2018) Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful. Springer Singapore.
data("Nclus") cl3 <- kcca(Nclus, k = 3) slsw.cl3 <- slswFlexclust(Nclus, cl3, nsamp = 20) plot(Nclus, col = clusters(cl3)) plot(slsw.cl3) densityplot(slsw.cl3) boxplot(slsw.cl3)
data("Nclus") cl3 <- kcca(Nclus, k = 3) slsw.cl3 <- slswFlexclust(Nclus, cl3, nsamp = 20) plot(Nclus, col = clusters(cl3)) plot(slsw.cl3) densityplot(slsw.cl3) boxplot(slsw.cl3)
Runs clustering algorithms repeatedly for different numbers of clusters and returns the minimum within-cluster distance solution for each.
stepFlexclust(x, k, nrep=3, verbose=TRUE, FUN=kcca, drop=TRUE,
              group=NULL, simple=FALSE, save.data=FALSE, seed=NULL,
              multicore=TRUE, ...)

stepcclust(...)

## S4 method for signature 'stepFlexclust,missing'
plot(x, y, type=c("barplot", "lines"), totaldist=NULL,
     xlab=NULL, ylab=NULL, ...)

## S4 method for signature 'stepFlexclust'
getModel(object, which=1)
x, ...
Passed to FUN.
k
A vector of integers passed in turn to the k argument of FUN.
nrep
For each value of k, run FUN nrep times and keep only the best solution.
FUN
Cluster function to use, typically kcca or cclust.
verbose
If TRUE, show progress information during computations.
drop
If TRUE and k is of length 1, then a single cluster object is returned instead of a "stepFlexclust" object.
group
An optional grouping vector for the data, see kcca for details.
simple
Return an object of class "kccasimple"?
save.data
Save a copy of x in the return object?
seed
If not NULL, set the random seed before running the algorithms.
multicore
If TRUE, the repeated runs are computed in parallel.
y
Not used.
type
Create a barplot or lines plot.
totaldist
Include value for 1-cluster solution in plot? Default is TRUE if a 2-cluster solution is contained in the object.
xlab, ylab
Graphical parameters.
object
Object of class "stepFlexclust".
which
Number of model to get. If character, interpreted as number of clusters.
stepcclust is a simple wrapper for stepFlexclust(..., FUN=cclust).
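The two calls below should therefore be equivalent (a hedged sketch; seed fixes the random initialization):

data("Nclus")
s1 <- stepcclust(Nclus, k=2:4, seed=1, multicore=FALSE)
s2 <- stepFlexclust(Nclus, k=2:4, FUN=cclust, seed=1, multicore=FALSE)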
Friedrich Leisch
data("Nclus") plot(Nclus) ## multicore off for CRAN checks cl1 <- stepFlexclust(Nclus, k=2:7, FUN=cclust, multicore=FALSE) cl1 plot(cl1) # two ways to do the same: getModel(cl1, 4) cl1[[4]] opar <- par("mfrow") par(mfrow=c(2, 2)) for(k in 3:6){ image(getModel(cl1, as.character(k)), data=Nclus) title(main=paste(k, "clusters")) } par(opar)
data("Nclus") plot(Nclus) ## multicore off for CRAN checks cl1 <- stepFlexclust(Nclus, k=2:7, FUN=cclust, multicore=FALSE) cl1 plot(cl1) # two ways to do the same: getModel(cl1, 4) cl1[[4]] opar <- par("mfrow") par(mfrow=c(2, 2)) for(k in 3:6){ image(getModel(cl1, as.character(k)), data=Nclus) title(main=paste(k, "clusters")) } par(opar)
Plot distance of data points to cluster centroids using stripes.
stripes(object, groups=NULL, type=c("first", "second", "all"),
        beside=(type!="first"), col=NULL, gp.line=NULL, gp.bar=NULL,
        gp.bar2=NULL, number=TRUE, legend=!is.null(groups), ylim=NULL,
        ylab="distance from centroid", margins=c(2,5,3,2), ...)
object
An object of class "kcca".
groups
Grouping variable to color-code the stripes. By default cluster membership is used as groups.
type
Plot distance to closest, closest and second-closest, or to all centroids?
beside
Logical, make different stripes for different clusters?
col
Vector of colors for clusters or groups.
gp.line, gp.bar, gp.bar2
Graphical parameters for horizontal lines and background rectangular areas, see gpar.
number
Logical, write cluster numbers on x-axis?
legend
Logical, plot a legend for the groups?
ylim, ylab
Graphical parameters for the y-axis.
margins
Margins of the plot.
...
Further graphical parameters.
A simple, yet very effective plot for visualizing the distance of each point from its closest and second-closest cluster centroids is a stripes plot. For each of the k clusters we have a rectangular area, which we optionally vertically divide into k smaller rectangles (beside=TRUE). Then we draw a horizontal line segment for each data point, marking the distance of the data point from the corresponding centroid.
Friedrich Leisch
Friedrich Leisch. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing, 20(4), 457–469, 2010.
bw05 <- bundestag(2005)
bavaria <- bundestag(2005, state="Bayern")

set.seed(1)
c4 <- cclust(bw05, k=4, save.data=TRUE)
plot(c4)

stripes(c4)
stripes(c4, beside=TRUE)
stripes(c4, type="sec")
stripes(c4, type="sec", beside=FALSE)
stripes(c4, type="all")
stripes(c4, groups=bavaria)

## ugly, but shows how colors of all parts can be changed
library("grid")
stripes(c4, type="all",
        gp.bar=gpar(col="red", lwd=3, fill="white"),
        gp.bar2=gpar(col="green", lwd=3, fill="black"))
In 2006 a sample of 1000 respondents representative of the adult Australian population was asked about their environmental behaviour when on vacation. In addition, the survey included a list of statements about vacation motives like "I want to rest and relax," "I use my holiday for the health and beauty of my body," and "Cultural offers and sights are a crucial factor." Answers are binary ("applies", "does not apply").
data(vacmot)
Data frame vacmot has 1000 observations on 20 binary variables on travel motives. Data frame vacmotdesc has 1000 observations on sociodemographic descriptor variables, mean moral obligation to protect the environment score, mean NEP score, and mean environmental behaviour score, see Dolnicar & Leisch (2008) for details.
In addition, integer vector vacmot6 contains the 6-cluster partition presented in Dolnicar & Leisch (2008).
The data set was collected by the Institute for Innovation in Business and Social Research, University of Wollongong (NSW, Australia).
Sara Dolnicar and Friedrich Leisch. An investigation of tourists' patterns of obligation to protect the environment. Journal of Travel Research, 46:381-391, 2008.
Sara Dolnicar and Friedrich Leisch. Using graphical statistics to better understand market segmentation solutions. International Journal of Market Research, 56(2):97-120, 2014.
data(vacmot)
summary(vacmotdesc)
dotchart(sort(colMeans(vacmot)))

## reproduce Figure 6 from Dolnicar & Leisch (2008)
cl6 <- kcca(vacmot, k=vacmot6, control=list(iter=0))
barchart(cl6)
Part of an Australian survey on motivation of volunteers to work for non-profit organisations like Red Cross, State Emergency Service, Rural Fire Service, Surf Life Saving, Rotary, Parents and Citizens Associations, etc.
data(volunteers)
A data frame with 1415 observations on the following 21 variables: age and gender of respondents plus 19 binary motivation items (1 applies/ 0 does not apply).
GENDER
Gender of respondent.
AGEG
Age group, a factor with categorized age of respondents.
meet.people
I can meet different types of people.
no.one.else
There is no-one else to do the work.
example
It sets a good example for others.
socialise
I can socialise with people who are like me.
help.others
It gives me the chance to help others.
give.back
I can give something back to society.
career
It will help my career prospects.
lonely
It makes me feel less lonely.
active
It keeps me active.
community
It will improve my community.
cause
I can support an important cause.
faith
I can put faith into action.
services
I want to maintain services that I may use one day.
children
My children are involved with the organisation.
good.job
I feel like I am doing a good job.
benefited
I know someone who has benefited from the organisation.
network
I can build a network of contacts.
recognition
I can gain recognition within the community.
mind.off
It takes my mind off other things.
The volunteering data was collected by the Institute for Innovation in Business and Social Research, University of Wollongong (NSW, Australia), using funding from Bushcare Wollongong and the Australian Research Council under the ARC Linkage Grant scheme (LP0453682).
Melanie Randle and Sara Dolnicar. Not Just Any Volunteers: Segmenting the Market to Attract the High-Contributors. Journal of Non-profit and Public Sector Marketing, 21(3), 271-282, 2009.
Melanie Randle and Sara Dolnicar. Self-congruity and volunteering: A multi-organisation comparison. European Journal of Marketing, 45(5), 739-758, 2011.
Melanie Randle, Friedrich Leisch, and Sara Dolnicar. Competition or collaboration? The effect of non-profit brand image on volunteer recruitment strategy. Journal of Brand Management, 20(8):689-704, 2013.