Package 'clue'

Title: Cluster Ensembles
Description: CLUster Ensembles.
Authors: Kurt Hornik [aut, cre], Walter Böhm [ctb]
Maintainer: Kurt Hornik <[email protected]>
License: GPL-2
Version: 0.3-66
Built: 2024-11-25 14:51:14 UTC
Source: CRAN

Help Index


Additive Tree Distances

Description

Objects representing additive tree distances.

Usage

as.cl_addtree(x)

Arguments

x

an R object representing additive tree distances.

Details

Additive tree distances are object dissimilarities $d$ satisfying the so-called additive tree conditions, also known as four-point conditions $d_{ij} + d_{kl} \le \max(d_{ik} + d_{jl}, d_{il} + d_{jk})$ for all quadruples $i, j, k, l$. Equivalently, for each such quadruple, the largest two values of the sums $d_{ij} + d_{kl}$, $d_{ik} + d_{jl}$, and $d_{il} + d_{jk}$ must be equal. Centroid distances are additive tree distances where the inequalities in the four-point conditions are strengthened to equalities (such that all three sums are equal), and can be represented as $d_{ij} = g_i + g_j$, i.e., as sums of distances from a “centroid”. See, e.g., Barthélémy and Guénoche (1991) for more details on additive tree distances.
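To make the four-point condition concrete, here is a minimal sketch (the helper is_additive_quadruple is hypothetical and not part of clue) which checks the condition for a single quadruple of a dissimilarity matrix:

    ## The largest two of the three pair sums must coincide (up to a
    ## numerical tolerance).
    is_additive_quadruple <- function(d, i, j, k, l, tol = 1e-8) {
        s <- sort(c(d[i, j] + d[k, l],
                    d[i, k] + d[j, l],
                    d[i, l] + d[j, k]),
                  decreasing = TRUE)
        s[1] - s[2] < tol
    }
    ## Ultrametric (e.g., cophenetic) distances are additive, so e.g.:
    u <- as.matrix(cophenetic(hclust(dist(USArrests))))
    is_additive_quadruple(u, 1, 2, 3, 4)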

as.cl_addtree is a generic function. Its default method can handle objects representing ultrametric distances and raw additive distance matrices. In addition, there is a method for coercing objects of class "phylo" from package ape.

Functions ls_fit_addtree and ls_fit_centroid can be used to find the additive tree distance or centroid distance minimizing least squares distance (Euclidean dissimilarity) to a given dissimilarity object.

There is a plot method for additive tree distances.

Value

An object of class "cl_addtree" containing the additive tree distances.

References

J.-P. Barthélémy and A. Guénoche (1991). Trees and proximity representations. Chichester: John Wiley & Sons. ISBN 0-471-92263-3.


Cassini Data

Description

A Cassini data set with 1000 points in 2-dimensional space which are drawn from the uniform distribution on 3 structures. The two outer structures are banana-shaped; the “middle” structure in between them is a circle.

Usage

data("Cassini")

Format

A classed list with components

x

a matrix with 1000 rows and 2 columns giving the coordinates of the points.

classes

a factor indicating which structure the respective points belong to.

Details

Instances of Cassini data sets can be created using function mlbench.cassini in package mlbench. The data set at hand was obtained using

    library("mlbench")
    set.seed(1234)
    Cassini <- mlbench.cassini(1000)
  

Examples

data("Cassini")
op <- par(mfcol = c(1, 2))
## Plot the data set:
plot(Cassini$x, col = as.integer(Cassini$classes),
     xlab = "", ylab = "")
## Create a "random" k-means partition of the data:
set.seed(1234)
party <- kmeans(Cassini$x, 3)
## And plot that.
plot(Cassini$x, col = cl_class_ids(party),
     xlab = "", ylab = "")
## (We can see the problem ...)
par(op)

Cassini Data Partitions Obtained by K-Means

Description

A cluster ensemble of 50 $k$-means partitions of the Cassini data into three classes.

Usage

data("CKME")

Format

A cluster ensemble of 50 ($k$-means) partitions.

Details

The ensemble was generated via

    require("clue")
    data("Cassini")
    set.seed(1234)
    CKME <- cl_boot(Cassini$x, 50, 3)
  

Agreement Between Partitions or Hierarchies

Description

Compute the agreement between (ensembles) of partitions or hierarchies.

Usage

cl_agreement(x, y = NULL, method = "euclidean", ...)

Arguments

x

an ensemble of partitions or hierarchies and dissimilarities, or something coercible to that (see cl_ensemble).

y

NULL (default), or as for x.

method

a character string specifying one of the built-in methods for computing agreement, or a function to be taken as a user-defined method. If a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch. See Details for available built-in methods.

...

further arguments to be passed to methods.

Details

If y is given, its components must be of the same kind as those of x (i.e., components must either all be partitions, or all be hierarchies or dissimilarities).

If all components are partitions, the following built-in methods for measuring agreement between two partitions with respective membership matrices $u$ and $v$ (brought to a common number of columns) are available:

"euclidean"

$1 - d/m$, where $d$ is the Euclidean dissimilarity of the memberships, i.e., the square root of the minimal sum of the squared differences of $u$ and all column permutations of $v$, and $m$ is an upper bound for the maximal Euclidean dissimilarity. See Dimitriadou, Weingessel and Hornik (2002).

"manhattan"

$1 - d/m$, where $d$ is the Manhattan dissimilarity of the memberships, i.e., the minimal sum of the absolute differences of $u$ and all column permutations of $v$, and $m$ is an upper bound for the maximal Manhattan dissimilarity.

"Rand"

the Rand index (the rate of distinct pairs of objects both in the same class or both in different classes in both partitions), see Rand (1971) or Gordon (1999), page 198. For soft partitions, (currently) the Rand index of the corresponding nearest hard partitions is used.

"cRand"

the Rand index corrected for agreement by chance, see Hubert and Arabie (1985) or Gordon (1999), page 198. Can only be used for hard partitions.

"NMI"

Normalized Mutual Information, see Strehl and Ghosh (2002). For soft partitions, (currently) the NMI of the corresponding nearest hard partitions is used.

"KP"

the Katz-Powell index, i.e., the product-moment correlation coefficient between the elements of the co-membership matrices $C(u) = u u'$ and $C(v)$, respectively, see Katz and Powell (1953). For soft partitions, (currently) the Katz-Powell index of the corresponding nearest hard partitions is used. (Note that for hard partitions, the $(i,j)$ entry of $C(u)$ is one iff objects $i$ and $j$ are in the same class.)

"angle"

the maximal cosine of the angle between the elements of $u$ and all column permutations of $v$.

"diag"

the maximal co-classification rate, i.e., the maximal rate of objects with the same class ids in both partitions after arbitrarily permuting the ids.

"FM"

the index of Fowlkes and Mallows (1983), i.e., the ratio $N_{xy} / \sqrt{N_x N_y}$ of the number $N_{xy}$ of distinct pairs of objects in the same class in both partitions and the geometric mean of the numbers $N_x$ and $N_y$ of distinct pairs of objects in the same class in partition $x$ and partition $y$, respectively. For soft partitions, (currently) the Fowlkes-Mallows index of the corresponding nearest hard partitions is used.

"Jaccard"

the Jaccard index, i.e., the ratio of the numbers of distinct pairs of objects in the same class in both partitions and in at least one partition, respectively. For soft partitions, (currently) the Jaccard index of the corresponding nearest hard partitions is used.

"purity"

the purity of the classes of x with respect to those of y, i.e., $\sum_j \max_i n_{ij} / n$, where $n_{ij}$ is the joint frequency of objects in class $i$ for x and in class $j$ for y, and $n$ is the total number of objects.

"PS"

Prediction Strength, see Tibshirani and Walther (2005): the minimum, over all classes $j$ of y, of the maximal rate of objects in the same class for x and in class $j$ for y.

If all components are hierarchies, available built-in methods for measuring agreement between two hierarchies with respective ultrametrics $u$ and $v$ are as follows.

"euclidean"

$1 / (1 + d)$, where $d$ is the Euclidean dissimilarity of the ultrametrics (i.e., the square root of the sum of the squared differences of $u$ and $v$).

"manhattan"

$1 / (1 + d)$, where $d$ is the Manhattan dissimilarity of the ultrametrics (i.e., the sum of the absolute differences of $u$ and $v$).

"cophenetic"

The cophenetic correlation coefficient. (I.e., the product-moment correlation of the ultrametrics.)

"angle"

the cosine of the angle between the ultrametrics.

"gamma"

$1 - d$, where $d$ is the rate of inversions between the associated ultrametrics (i.e., the rate of pairs $(i,j)$ and $(k,l)$ for which $u_{ij} < u_{kl}$ and $v_{ij} > v_{kl}$). (This agreement measure is a linear transformation of Kruskal's $\gamma$.)

The measures based on ultrametrics also allow computing agreement with “raw” dissimilarities on the underlying objects (R objects inheriting from class "dist").

If a user-defined agreement method is to be employed, it must be a function taking two clusterings as its arguments.
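For instance, a minimal sketch of such a user-defined method (naive_agreement is hypothetical, ignores the optimal matching of class ids performed by the built-in methods, and assumes an ensemble pens of partitions as in the Examples below):

    ## Rate of objects with literally equal class ids.
    naive_agreement <- function(x, y)
        mean(cl_class_ids(x) == cl_class_ids(y))
    cl_agreement(pens[1:3], method = naive_agreement)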

Symmetric agreement objects of class "cl_agreement" are implemented as symmetric proximity objects with self-proximities identical to one, and inherit from class "cl_proximity". They can be coerced to dense square matrices using as.matrix. It is possible to use 2-index matrix-style subscripting for such objects; unless this uses identical row and column indices, this results in a (non-symmetric agreement) object of class "cl_cross_agreement".

Value

If y is NULL, an object of class "cl_agreement" containing the agreements between all pairs of components of x. Otherwise, an object of class "cl_cross_agreement" with the agreements between the components of x and the components of y.

References

E. Dimitriadou, A. Weingessel and K. Hornik (2002). A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16, 901–912. doi:10.1142/S0218001402002052.

E. B. Fowlkes and C. L. Mallows (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78, 553–569. doi:10.1080/01621459.1983.10478008.

A. D. Gordon (1999). Classification (2nd edition). Boca Raton, FL: Chapman & Hall/CRC.

L. Hubert and P. Arabie (1985). Comparing partitions. Journal of Classification, 2, 193–218. doi:10.1007/bf01908075.

W. M. Rand (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846–850. doi:10.2307/2284239.

L. Katz and J. H. Powell (1953). A proposed index of the conformity of one sociometric measurement to another. Psychometrika, 18, 249–256. doi:10.1007/BF02289063.

A. Strehl and J. Ghosh (2002). Cluster ensembles — A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617. https://www.jmlr.org/papers/volume3/strehl02a/strehl02a.pdf.

R. Tibshirani and G. Walther (2005). Cluster validation by Prediction Strength. Journal of Computational and Graphical Statistics, 14/3, 511–528. doi:10.1198/106186005X59243.

See Also

cl_dissimilarity; classAgreement in package e1071.

Examples

## An ensemble of partitions.
data("CKME")
pens <- CKME[1 : 20]		# for saving precious time ...
summary(c(cl_agreement(pens)))
summary(c(cl_agreement(pens, method = "Rand")))
summary(c(cl_agreement(pens, method = "diag")))
cl_agreement(pens[1:5], pens[6:7], method = "NMI")
## Equivalently, using subscripting.
cl_agreement(pens, method = "NMI")[1:5, 6:7]

## An ensemble of hierarchies.
d <- dist(USArrests)
hclust_methods <-
    c("ward", "single", "complete", "average", "mcquitty")
hclust_results <- lapply(hclust_methods, function(m) hclust(d, m))
names(hclust_results) <- hclust_methods 
hens <- cl_ensemble(list = hclust_results)
summary(c(cl_agreement(hens)))
## Note that the Euclidean agreements are *very* small.
## This is because the ultrametrics differ substantially in height:
u <- lapply(hens, cl_ultrametric)
round(sapply(u, max), 3)
## Rescaling the ultrametrics to [0, 1] gives:
u <- lapply(u, function(x) (x - min(x)) / (max(x) - min(x)))
shens <- cl_ensemble(list = lapply(u, as.cl_dendrogram))
summary(c(cl_agreement(shens)))
## Au contraire ...
summary(c(cl_agreement(hens, method = "cophenetic")))
cl_agreement(hens[1:3], hens[4:5], method = "gamma")

Bagging for Clustering

Description

Construct partitions of objects by running a base clustering algorithm on bootstrap samples from a given data set, and “suitably” aggregating these primary partitions.

Usage

cl_bag(x, B, k = NULL, algorithm = "kmeans", parameters = NULL, 
       method = "DFBC1", control = NULL)

Arguments

x

the data set of objects to be clustered, as appropriate for the base clustering algorithm.

B

an integer giving the number of bootstrap replicates.

k

NULL (default), or an integer giving the number of classes to be used for a partitioning base algorithm.

algorithm

a character string or function specifying the base clustering algorithm.

parameters

a named list of additional arguments to be passed to the base algorithm.

method

a character string indicating the bagging method to use. Currently, only method "DFBC1" is available, which implements algorithm BagClust1 in Dudoit & Fridlyand (2003).

control

a list of control parameters for the aggregation. Currently, not used.

Details

Bagging for clustering is really more a general conceptual framework than a specific algorithm. If the primary partitions generated in the bootstrap stage form a cluster ensemble (so that class memberships of the objects in x can be obtained), consensus methods for cluster ensembles (as implemented, e.g., in cl_consensus and cl_medoid) can be employed for the aggregation stage. In particular, (possibly new) bagging algorithms can easily be realized by directly running cl_consensus on the results of cl_boot, as sketched below.
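For instance, a minimal sketch of such a hand-rolled procedure, using the Cassini data shipped with the package and least squares Euclidean ("SE") consensus:

    ## Generate primary partitions by bootstrapping the base algorithm,
    ## then aggregate them into a consensus partition.
    data("Cassini")
    ens  <- cl_boot(Cassini$x, B = 20, k = 3)
    cons <- cl_consensus(ens, method = "SE")
    table(cl_class_ids(cons))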

In BagClust1, aggregation proceeds by generating a reference partition by running the base clustering algorithm on the whole given data set, and averaging the ensemble memberships after optimally matching them to the reference partition (in fact, by minimizing Euclidean dissimilarity, see cl_dissimilarity).

If the base clustering algorithm yields prototypes, aggregation can be based on clustering these. This is the idea underlying the “Bagged Clustering” algorithm introduced in Leisch (1999) and implemented by function bclust in package e1071.

Value

An R object representing a partition of the objects given in x.

References

S. Dudoit and J. Fridlyand (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19/9, 1090–1099. doi:10.1093/bioinformatics/btg038.

F. Leisch (1999). Bagged Clustering. Working Paper 51, SFB “Adaptive Information Systems and Modeling in Economics and Management Science”. doi:10.57938/9b129f95-b53b-44ce-a129-5b7a1168d832.

Examples

set.seed(1234)
## Run BagClust1 on the Cassini data.
data("Cassini")
party <- cl_bag(Cassini$x, 50, 3)
plot(Cassini$x, col = cl_class_ids(party), xlab = "", ylab = "")
## Actually, using fuzzy c-means as a base learner works much better:
if(require("e1071", quietly = TRUE)) {
    party <- cl_bag(Cassini$x, 20, 3, algorithm = "cmeans")
    plot(Cassini$x, col = cl_class_ids(party), xlab = "", ylab = "")
}

Bootstrap Resampling of Clustering Algorithms

Description

Generate bootstrap replicates of the results of applying a “base” clustering algorithm to a given data set.

Usage

cl_boot(x, B, k = NULL,
        algorithm = if (is.null(k)) "hclust" else "kmeans", 
        parameters = list(), resample = FALSE)

Arguments

x

the data set of objects to be clustered, as appropriate for the base clustering algorithm.

B

an integer giving the number of bootstrap replicates.

k

NULL (default), or an integer giving the number of classes to be used for a partitioning base algorithm.

algorithm

a character string or function specifying the base clustering algorithm.

parameters

a named list of additional arguments to be passed to the base algorithm.

resample

a logical indicating whether the data should be resampled in addition to “sampling from the algorithm”. If resampling is used, the class memberships of the objects given in x are predicted from the results of running the base algorithm on bootstrap samples of x.

Details

This is a rather simple-minded function with limited applicability, and mostly useful for studying the effect of (uncontrolled) random initializations of fixed-point partitioning algorithms such as kmeans or cmeans, see the examples. To study the effect of varying control parameters or explicitly providing random starting values, the respective cluster ensemble has to be generated explicitly (most conveniently by using replicate to create a list lst of suitable instances of clusterings obtained by the base algorithm, and using cl_ensemble(list = lst) to create the ensemble).
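A minimal sketch of this explicit construction, varying only the random initializations of kmeans on the Cassini data shipped with the package:

    ## Create a list of kmeans() results with different random starts,
    ## and turn it into a cluster ensemble.
    data("Cassini")
    lst <- replicate(10,
                     kmeans(Cassini$x, centers = 3),
                     simplify = FALSE)
    ens <- cl_ensemble(list = lst)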

Value

A cluster ensemble of length $B$, with either (if resampling is not used, default) the results of running the base algorithm on the given data set, or (if resampling is used) the memberships for the given data predicted from the results of running the base algorithm on bootstrap samples of the data.

Examples

## Study e.g. the effect of random kmeans() initializations.
data("Cassini")
pens <- cl_boot(Cassini$x, 15, 3)
diss <- cl_dissimilarity(pens)
summary(c(diss))
plot(hclust(diss))

Cluster Classes

Description

Extract the classes in a partition or hierarchy.

Usage

cl_classes(x)

Arguments

x

an R object representing a partition or hierarchy of objects.

Details

For partitions, the classes are the equivalence classes (“clusters”) of the partition; for soft partitions, the classes of the nearest hard partition are used.

For hierarchies represented by trees, the classes are the sets of objects corresponding to (joined at or split by) the nodes of the tree.

Value

A list inheriting from "cl_classes_of_objects" of vectors indicating the classes.


Consensus Partitions and Hierarchies

Description

Compute the consensus clustering of an ensemble of partitions or hierarchies.

Usage

cl_consensus(x, method = NULL, weights = 1, control = list())

Arguments

x

an ensemble of partitions or hierarchies, or something coercible to that (see cl_ensemble).

method

a character string specifying one of the built-in methods for computing consensus clusterings, or a function to be taken as a user-defined method, or NULL (default value). If a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch. See Details for available built-in methods and defaults.

weights

a numeric vector with non-negative case weights. Recycled to the number of elements in the ensemble given by x if necessary.

control

a list of control parameters. See Details.

Details

Consensus clusterings “synthesize” the information in the elements of a cluster ensemble into a single clustering, often by minimizing a criterion function measuring how dissimilar consensus candidates are from the (elements of) the ensemble (the so-called “optimization approach” to consensus clustering).

The most popular criterion functions are of the form $L(x) = \sum w_b d(x_b, x)^p$, where $d$ is a suitable dissimilarity measure (see cl_dissimilarity), $w_b$ is the case weight given to element $x_b$ of the ensemble, and $p \ge 1$. If $p = 1$ and minimization is over all possible base clusterings, a consensus solution is called a median of the ensemble; if minimization is restricted to the elements of the ensemble, a consensus solution is called a medoid (see cl_medoid). For $p = 2$, we obtain least squares consensus partitions and hierarchies (generalized means). See also Gordon (1999) for more information.

If all elements of the ensemble are partitions, the built-in consensus methods compute consensus partitions by minimizing a criterion of the form $L(x) = \sum w_b d(x_b, x)^p$ over all hard or soft partitions $x$ with a given (maximal) number $k$ of classes. Available built-in methods are as follows.

"SE"

a fixed-point algorithm for obtaining soft least squares Euclidean consensus partitions (i.e., for minimizing $L$ with Euclidean dissimilarity $d$ and $p = 2$ over all soft partitions with a given maximal number of classes).

This iterates between individually matching all partitions to the current approximation to the consensus partition, and computing the next approximation as the membership matrix closest to a suitable weighted average of the memberships of all partitions after permuting their columns for the optimal matchings of class ids.

The following control parameters are available for this method.

k

an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.

maxiter

an integer giving the maximal number of iterations to be performed. Defaults to 100.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

reltol

the relative convergence tolerance. Defaults to sqrt(.Machine$double.eps).

start

a matrix with number of rows equal to the number of objects of the cluster ensemble, and $k$ columns, to be used as a starting value, or a list of such matrices. By default, suitable random membership matrices are used.

verbose

a logical indicating whether to provide some output on minimization progress. Defaults to getOption("verbose").

In the case of multiple runs, the first optimum found is returned.

This method can also be referred to as "soft/euclidean".

"GV1"

the fixed-point algorithm for the “first model” in Gordon and Vichi (2001) for minimizing $L$ with $d$ being GV1 dissimilarity and $p = 2$ over all soft partitions with a given maximal number of classes.

This is similar to "SE", but uses GV1 rather than Euclidean dissimilarity.

Available control parameters are the same as for "SE".

"DWH"

an extension of the greedy algorithm in Dimitriadou, Weingessel and Hornik (2002) for (approximately) obtaining soft least squares Euclidean consensus partitions. The reference provides some structure theory relating finding the consensus partition to an instance of the multiple assignment problem, which is known to be NP-hard, and suggests a simple heuristic based on successively matching an individual partition $x_b$ to the current approximation to the consensus partition, and computing the memberships of the next approximation as a weighted average of those of the current one and of $x_b$ after permuting its columns for the optimal matching of class ids.

The following control parameters are available for this method.

k

an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.

order

a permutation of the integers from 1 to the size of the ensemble, specifying the order in which the partitions in the ensemble should be aggregated. Defaults to using a random permutation (unlike the reference, which does not permute at all).

"HE"

a fixed-point algorithm for obtaining hard least squares Euclidean consensus partitions (i.e., for minimizing $L$ with Euclidean dissimilarity $d$ and $p = 2$ over all hard partitions with a given maximal number of classes).

Available control parameters are the same as for "SE".

This method can also be referred to as "hard/euclidean".

"SM"

a fixed-point algorithm for obtaining soft median Manhattan consensus partitions (i.e., for minimizing $L$ with Manhattan dissimilarity $d$ and $p = 1$ over all soft partitions with a given maximal number of classes).

Available control parameters are the same as for "SE".

This method can also be referred to as "soft/manhattan".

"HM"

a fixed-point algorithm for obtaining hard median Manhattan consensus partitions (i.e., for minimizing $L$ with Manhattan dissimilarity $d$ and $p = 1$ over all hard partitions with a given maximal number of classes).

Available control parameters are the same as for "SE".

This method can also be referred to as "hard/manhattan".

"GV3"

a SUMT algorithm for the “third model” in Gordon and Vichi (2001) for minimizing $L$ with $d$ being co-membership dissimilarity and $p = 2$. (See sumt for more information on the SUMT approach.) This optimization problem is equivalent to finding the membership matrix $m$ for which the sum of the squared differences between $C(m) = m m'$ and the weighted average co-membership matrix $\sum_b w_b C(m_b)$ of the partitions is minimal.

Available control parameters are method, control, eps, q, and verbose, which have the same roles as for sumt, and the following.

k

an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

start

a matrix with number of rows equal to the size of the cluster ensemble, and $k$ columns, to be used as a starting value, or a list of such matrices. By default, a membership based on a rank $k$ approximation to the weighted average co-membership matrix is used.

In the case of multiple runs, the first optimum found is returned.

"soft/symdiff"

a SUMT approach for minimizing $L = \sum w_b d(x_b, x)$ over all soft partitions with a given maximal number of classes, where $d$ is the Manhattan dissimilarity of the co-membership matrices (coinciding with symdiff partition dissimilarity in the case of hard partitions).

Available control parameters are the same as for "GV3".

"hard/symdiff"

an exact solver for minimizing $L = \sum w_b d(x_b, x)$ over all hard partitions (possibly with a given maximal number of classes as specified by the control parameter k), where $d$ is symdiff partition dissimilarity (so that soft partitions in the ensemble are replaced by their closest hard partitions), or equivalently, Rand distance or pair-bonds (Boorman-Arabie $D$) distance. The consensus solution is found via mixed linear or quadratic programming.

By default, method "SE" is used for ensembles of partitions.

If all elements of the ensemble are hierarchies, the following built-in methods for computing consensus hierarchies are available.

"euclidean"

an algorithm for minimizing $L(x) = \sum w_b d(x_b, x)^2$ over all dendrograms, where $d$ is Euclidean dissimilarity. This is equivalent to finding the best least squares ultrametric approximation of the weighted average $d = \sum w_b u_b$ of the ultrametrics $u_b$ of the hierarchies $x_b$, which is attempted by calling ls_fit_ultrametric on $d$ with appropriate control parameters.

This method can also be referred to as "cophenetic".

"manhattan"

a SUMT for minimizing $L = \sum w_b d(x_b, x)$ over all dendrograms, where $d$ is Manhattan dissimilarity.

Available control parameters are the same as for "euclidean".

"majority"

a hierarchy obtained from an extension of the majority consensus tree of Margush and McMorris (1981), which minimizes $L(x) = \sum w_b d(x_b, x)$ over all dendrograms, where $d$ is the symmetric difference dissimilarity. The unweighted $p$-majority tree is the $n$-tree (hierarchy in the strict sense) consisting of all subsets of objects contained in more than $100 p$ percent of the $n$-trees $T_b$ induced by the dendrograms, where $1/2 \le p < 1$ and $p = 1/2$ (default) corresponds to the standard majority tree. In the weighted case, it consists of all subsets $A$ for which $\sum_{b: A \in T_b} w_b > W p$, where $W = \sum_b w_b$. We also allow for $p = 1$, which gives the strict consensus tree consisting of all subsets contained in each of the $n$-trees. The majority dendrogram returned is a representation of the majority tree where all splits have height one.

The fraction $p$ can be specified via the control parameter p.

By default, method "euclidean" is used for ensembles of hierarchies.

If a user-defined consensus method is to be employed, it must be a function taking the cluster ensemble, the case weights, and a list of control parameters as its arguments, with formals named x, weights, and control, respectively.

Most built-in methods use heuristics for solving hard optimization problems, and cannot be guaranteed to find a global minimum. Standard practice would recommend using the best solution found in “sufficiently many” replications of the methods.
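A minimal sketch of this replication strategy, assuming an ensemble ens of partitions and using the criterion minimized by method "SE" (the sum of squared Euclidean dissimilarities):

    ## Run the consensus method several times and keep the solution
    ## with the smallest criterion value.
    best_of <- function(ens, k, nruns = 10) {
        crit <- function(x) sum(cl_dissimilarity(ens, x) ^ 2)
        sols <- replicate(nruns,
                          cl_consensus(ens, method = "SE",
                                       control = list(k = k)),
                          simplify = FALSE)
        sols[[which.min(sapply(sols, crit))]]
    }

(For the built-in fixed-point methods, the control parameter nruns provides this directly.)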

Value

The consensus partition or hierarchy.

References

E. Dimitriadou, A. Weingessel and K. Hornik (2002). A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16, 901–912. doi:10.1142/S0218001402002052.

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.

A. D. Gordon (1999). Classification (2nd edition). Boca Raton, FL: Chapman & Hall/CRC.

T. Margush and F. R. McMorris (1981). Consensus $n$-trees. Bulletin of Mathematical Biology, 43, 239–244. doi:10.1007/BF02459446.

See Also

cl_medoid, consensus

Examples

## Consensus partition for the Rosenberg-Kim kinship terms partition
## data based on co-membership dissimilarities.
data("Kinship82")
m1 <- cl_consensus(Kinship82, method = "GV3",
                   control = list(k = 3, verbose = TRUE))
## (Note that one should really use several replicates of this.)
## Value for criterion function to be minimized:
sum(cl_dissimilarity(Kinship82, m1, "comem") ^ 2)
## Compare to the consensus solution given in Gordon & Vichi (2001).
data("Kinship82_Consensus")
m2 <- Kinship82_Consensus[["JMF"]]
sum(cl_dissimilarity(Kinship82, m2, "comem") ^ 2)
## Seems we get a better solution ...
## How dissimilar are these solutions?
cl_dissimilarity(m1, m2, "comem")
## How "fuzzy" are they?
cl_fuzziness(cl_ensemble(m1, m2))
## Do the "nearest" hard partitions fully agree?
cl_dissimilarity(as.cl_hard_partition(m1),
                 as.cl_hard_partition(m2))

## Consensus partition for the Gordon and Vichi (2001) macroeconomic
## partition data based on Euclidean dissimilarities.
data("GVME")
set.seed(1)
## First, using k = 2 classes.
m1 <- cl_consensus(GVME, method = "GV1",
                   control = list(k = 2, verbose = TRUE))
## (Note that one should really use several replicates of this.)
## Value of criterion function to be minimized:
sum(cl_dissimilarity(GVME, m1, "GV1") ^ 2)
## Compare to the consensus solution given in Gordon & Vichi (2001).
data("GVME_Consensus")
m2 <- GVME_Consensus[["MF1/2"]]
sum(cl_dissimilarity(GVME, m2, "GV1") ^ 2)
## Seems we get a slightly better solution ...
## But note that
cl_dissimilarity(m1, m2, "GV1")
## and that the maximal deviation of the memberships is
max(abs(cl_membership(m1) - cl_membership(m2)))
## so the differences seem to be due to rounding.
## Do the "nearest" hard partitions fully agree?
table(cl_class_ids(m1), cl_class_ids(m2))

## And now for k = 3 classes.
m1 <- cl_consensus(GVME, method = "GV1",
                   control = list(k = 3, verbose = TRUE))
sum(cl_dissimilarity(GVME, m1, "GV1") ^ 2)
## Compare to the consensus solution given in Gordon & Vichi (2001).
m2 <- GVME_Consensus[["MF1/3"]]
sum(cl_dissimilarity(GVME, m2, "GV1") ^ 2)
## This time we look much better ...
## How dissimilar are these solutions?
cl_dissimilarity(m1, m2, "GV1")
## Do the "nearest" hard partitions fully agree?
table(cl_class_ids(m1), cl_class_ids(m2))

Dissimilarity Between Partitions or Hierarchies

Description

Compute the dissimilarity between (ensembles) of partitions or hierarchies.

Usage

cl_dissimilarity(x, y = NULL, method = "euclidean", ...)

Arguments

x

an ensemble of partitions or hierarchies and dissimilarities, or something coercible to that (see cl_ensemble).

y

NULL (default), or as for x.

method

a character string specifying one of the built-in methods for computing dissimilarity, or a function to be taken as a user-defined method. If a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch. See Details for available built-in methods.

...

further arguments to be passed to methods.

Details

If y is given, its components must be of the same kind as those of x (i.e., components must either all be partitions, or all be hierarchies or dissimilarities).

If all components are partitions, the following built-in methods for measuring dissimilarity between two partitions with respective membership matrices $u$ and $v$ (brought to a common number of columns) are available:

"euclidean"

the Euclidean dissimilarity of the memberships, i.e., the square root of the minimal sum of the squared differences of $u$ and all column permutations of $v$. See Dimitriadou, Weingessel and Hornik (2002).

"manhattan"

the Manhattan dissimilarity of the memberships, i.e., the minimal sum of the absolute differences of $u$ and all column permutations of $v$.

"comemberships"

the Euclidean dissimilarity of the elements of the co-membership matrices $C(u) = u u'$ and $C(v)$, i.e., the square root of the sum of the squared differences of $C(u)$ and $C(v)$.

"symdiff"

the cardinality of the symmetric set difference of the sets of co-classified pairs of distinct objects in the partitions. I.e., the number of distinct pairs of objects in the same class in exactly one of the partitions. (Alternatively, the cardinality of the symmetric set difference between the (binary) equivalence relations corresponding to the partitions.) For soft partitions, (currently) the symmetric set difference of the corresponding nearest hard partitions is used.

"Rand"

the Rand distance, i.e., the rate of distinct pairs of objects in the same class in exactly one of the partitions. (Related to the Rand index $a$ via the linear transformation $d = (1 - a) / 2$.) For soft partitions, (currently) the Rand distance of the corresponding nearest hard partitions is used.

"GV1"

the square root of the dissimilarity $\Delta_1$ used for the first model in Gordon and Vichi (2001), i.e., the square root of the minimal sum of the squared differences of the matched non-zero columns of $u$ and $v$.

"BA/d"

distance measures for hard partitions discussed in Boorman and Arabie (1972), with d one of ‘A’, ‘C’, ‘D’, or ‘E’. For soft partitions, the distances of the corresponding nearest hard partitions are used.

"BA/A" is the minimum number of single element moves (move from one class to another or a new one) needed to transform one partition into the other. Introduced in Rubin (1967).

"BA/C" is the minimum number of lattice moves for transforming one partition into the other, where partitions are said to be connected by a lattice move if one is just finer than the other (i.e., there is no other partition between them) in the partition lattice (see cl_meet). Equivalently, with zz the join of x and y and SS giving the number of classes, this can be written as S(x)+S(y)2S(z)S(x) + S(y) - 2 S(z). Attributed to David Pavy.

"BA/D" is the “pair-bonds” distance, which can be defined as S(x)+S(y)2S(z)S(x) + S(y) - 2 S(z), with zz the meet of x and y and SS the supervaluation (i.e., non-decreasing with respect to the partial order on the partition lattice) function i(ni(ni1))/(n(n1))\sum_i (n_i (n_i - 1)) / (n (n - 1)), where the nin_i are the numbers of objects in the respective classes of the partition (such that ni(ni1)/2n_i (n_i - 1) / 2 are the numbers of pair bonds in the classes), and nn the total number of objects.

"BA/E" is the normalized information distance, defined as 1I/H1 - I / H, where II is the average mutual information between the partitions, and HH is the average entropy of the meet zz of the partitions. Introduced in Rajski (1961).

(Boorman and Arabie also discuss a distance measure ($B$) based on the minimum number of set moves needed to transform one partition into the other, which, differently from the $A$ and $C$ distance measures, is hard to compute (Day, 1981) and (currently) not provided.)

"VI"

Variation of Information, see Meila (2003). If ... has an argument named weights, it is taken to specify case weights.

"Mallows"

the Mallows-type distance by Zhou, Li and Zha (2005), which is related to the Monge-Kantorovich mass transfer problem, and given as the $p$-th root of the minimal value of the transportation problem $\sum_{j,k} w_{jk} \sum_i |u_{ij} - v_{ik}|^p$ with constraints $w_{jk} \ge 0$, $\sum_k w_{jk} = \alpha_j$, $\sum_j w_{jk} = \beta_k$, where $\sum_j \alpha_j = \sum_k \beta_k$. The parameters $p$, $\alpha$ and $\beta$ all default to one (in this case, the Mallows distance coincides with the Manhattan dissimilarity), and can be specified via additional arguments named p, alpha, and beta, respectively.

"CSSD"

the Cluster Similarity Sensitive Distance of Zhou, Li and Zha (2005), which is given as the minimal value of $\sum_{k,l} (1 - 2 w_{kl} / (\alpha_k + \beta_l)) L_{kl}$, where $L_{kl} = \sum_i u_{ik} v_{il} d(p_{x;k}, p_{y;l})$, with $p_{x;k}$ and $p_{y;l}$ the prototypes of the $k$-th class of x and the $l$-th class of y, respectively, $d$ the distance between these, and the $w_{kl}$ as for the Mallows distance. If prototypes are matrices, the Euclidean distance between these is used as default. Using the additional argument L, one can give a matrix of $L_{kl}$ values, or the function $d$. Parameters $\alpha$ and $\beta$ both default to one, and can be specified via additional arguments named alpha and beta, respectively.

For hard partitions, both Manhattan and squared Euclidean dissimilarity give twice the transfer distance (Charon et al., 2006), which is the minimum number of objects that must be removed so that the implied partitions (restrictions to the remaining objects) are identical. This is also known as the $R$-metric in Day (1981), i.e., the number of augmentations and removals of single objects needed to transform one partition into the other, and as the partition-distance in Gusfield (2002), and equals twice the single element moves (BA/A) distance of Boorman and Arabie.

For hard partitions, the pair-bonds (Boorman-Arabie $D$) distance is identical to the Rand distance, and can also be written as the Manhattan distance between the co-membership matrices corresponding to the partitions, or equivalently, their symdiff distance, normalized by $n (n - 1)$.
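For illustration, a small sketch computing both distances for two (randomly generated) hard partitions, which by the above differ only in their normalization:

    set.seed(1)
    p1 <- as.cl_partition(sample(3, 20, replace = TRUE))
    p2 <- as.cl_partition(sample(3, 20, replace = TRUE))
    cl_dissimilarity(p1, p2, method = "symdiff")
    cl_dissimilarity(p1, p2, method = "Rand")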

If all components are hierarchies, available built-in methods for measuring dissimilarity between two hierarchies with respective ultrametrics $u$ and $v$ are as follows.

"euclidean"

the Euclidean dissimilarity of the ultrametrics (i.e., the square root of the sum of the squared differences of $u$ and $v$).

"manhattan"

the Manhattan dissimilarity of the ultrametrics (i.e., the sum of the absolute differences of $u$ and $v$).

"cophenetic"

$1 - c^2$, where $c$ is the cophenetic correlation coefficient (i.e., the product-moment correlation of the ultrametrics).

"gamma"

the rate of inversions between the ultrametrics (i.e., the rate of pairs $(i,j)$ and $(k,l)$ for which $u_{ij} < u_{kl}$ and $v_{ij} > v_{kl}$).

"symdiff"

the cardinality of the symmetric set difference of the sets of classes (hierarchies in the strict sense) induced by the dendrograms. I.e., the number of sets of objects obtained by a split in exactly one of the hierarchies.

"Chebyshev"

the Chebyshev (maximal) dissimilarity of the ultrametrics (i.e., the maximum of the absolute differences of $u$ and $v$).

"Lyapunov"

the logarithm of the product of the maximal and minimal ratios of the ultrametrics. This is also known as the “Hilbert projective metric” on the cone represented by the ultrametrics (e.g., Jardine & Sibson (1971), page 107), and only defined for strict ultrametrics (which are strictly positive for distinct objects).

"BO"

the $m_\delta$ family of tree metrics by Boorman and Olivier (1973), which are of the form $m_\delta = \int_0^\infty \delta(p(h), q(h)) \, dh$, where $p(h)$ and $q(h)$ are the hard partitions obtained by cutting the trees (dendrograms) at height $h$, and $\delta$ is a suitable dissimilarity measure for partitions. In particular, when taking $\delta$ as symdiff or Rand dissimilarity, $m_\delta$ is the Manhattan dissimilarity of the hierarchies.

If ... has an argument named delta, it is taken to specify the partition dissimilarity $\delta$ to be employed.

"spectral"

the spectral norm (2-norm) of the differences of the ultrametrics, suggested in Mérigot, Durbec, and Gaertner (2010).

The measures based on ultrametrics also allow computing dissimilarity with “raw” dissimilarities on the underlying objects (R objects inheriting from class "dist").

If a user-defined dissimilarity method is to be employed, it must be a function taking two clusterings as its arguments.

Symmetric dissimilarity objects of class "cl_dissimilarity" are implemented as symmetric proximity objects with self-proximities identical to zero, and inherit from class "cl_proximity". They can be coerced to dense square matrices using as.matrix. It is possible to use 2-index matrix-style subscripting for such objects; unless this uses identical row and column indices, this results in a (non-symmetric dissimilarity) object of class "cl_cross_dissimilarity".

Symmetric dissimilarity objects also inherit from class "dist" (although they currently do not “strictly” extend this class), thus making it possible to use them directly for clustering algorithms based on dissimilarity matrices of this class, see the examples.

Value

If y is NULL, an object of class "cl_dissimilarity" containing the dissimilarities between all pairs of components of x. Otherwise, an object of class "cl_cross_dissimilarity" with the dissimilarities between the components of x and the components of y.

References

S. A. Boorman and P. Arabie (1972). Structural measures and the method of sorting. In R. N. Shepard, A. K. Romney, & S. B. Nerlove (eds.), Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, 1: Theory (pages 225–249). New York: Seminar Press.

S. A. Boorman and D. C. Olivier (1973). Metrics on spaces of finite trees. Journal of Mathematical Psychology, 10, 26–59. doi:10.1016/0022-2496(73)90003-5.

I. Charon, L. Denoeud, A. Guénoche and O. Hudry (2006). Maximum Transfer Distance Between Partitions. Journal of Classification, 23, 103–121. doi:10.1007/s00357-006-0006-2.

W. E. H. Day (1981). The complexity of computing metric distances between partitions. Mathematical Social Sciences, 1, 269–287. doi:10.1016/0165-4896(81)90042-1.

E. Dimitriadou, A. Weingessel and K. Hornik (2002). A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16, 901–912. doi:10.1142/S0218001402002052.

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.

D. Gusfield (2002). Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82, 159–164. doi:10.1016/S0020-0190(01)00263-0.

N. Jardine and R. Sibson (1971). Mathematical Taxonomy. London: Wiley.

M. Meila (2003). Comparing clusterings by the variation of information. In B. Schölkopf and M. K. Warmuth (eds.), Learning Theory and Kernel Machines, pages 173–187. Springer-Verlag: Lecture Notes in Computer Science 2777.

B. Mérigot, J.-P. Durbec and J.-C. Gaertner (2010). On goodness-of-fit measure for dendrogram-based analyses. Ecology, 91, 1850–1859. doi:10.1890/09-1387.1.

C. Rajski (1961). A metric space of discrete probability distributions, Information and Control, 4, 371–377. doi:10.1016/S0019-9958(61)80055-7.

J. Rubin (1967). Optimal classification into groups: An approach for solving the taxonomy problem. Journal of Theoretical Biology, 15, 103–144. doi:10.1016/0022-5193(67)90046-X.

D. Zhou, J. Li and H. Zha (2005). A new Mallows distance based metric for comparing clusterings. In Proceedings of the 22nd international Conference on Machine Learning (Bonn, Germany, August 07–11, 2005), pages 1028–1035. ICML '05, volume 119. ACM Press, New York, NY. doi:10.1145/1102351.1102481.

See Also

cl_agreement

Examples

## An ensemble of partitions.
data("CKME")
pens <- CKME[1 : 30]
diss <- cl_dissimilarity(pens)
summary(c(diss))
cl_dissimilarity(pens[1:5], pens[6:7])
## Equivalently, using subscripting.
diss[1:5, 6:7]
## Can use the dissimilarities for "secondary" clustering
## (e.g. obtaining hierarchies of partitions):
hc <- hclust(diss)
plot(hc)

## Example from Boorman and Arabie (1972).
P1 <- as.cl_partition(c(1, 2, 2, 2, 3, 3, 2, 2))
P2 <- as.cl_partition(c(1, 1, 2, 2, 3, 3, 4, 4))
cl_dissimilarity(P1, P2, "BA/A")
cl_dissimilarity(P1, P2, "BA/C")

## Hierarchical clustering.
d <- dist(USArrests)
x <- hclust(d)
cl_dissimilarity(x, d, "cophenetic")
cl_dissimilarity(x, d, "gamma")

Cluster Ensembles

Description

Creation and manipulation of cluster ensembles.

Usage

cl_ensemble(..., list = NULL)
as.cl_ensemble(x)
is.cl_ensemble(x)

Arguments

...

R objects representing clusterings of or dissimilarities between the same objects.

list

a list of R objects as in ....

x

for as.cl_ensemble, an R object as in ...; for is.cl_ensemble, an arbitrary R object.

Details

cl_ensemble creates “cluster ensembles”, which are realized as lists of clusterings (or dissimilarities) with additional class information, always inheriting from "cl_ensemble". All elements of the ensemble must have the same number of objects.

If all elements are partitions, the ensemble has class "cl_partition_ensemble"; if all elements are dendrograms, it has class "cl_dendrogram_ensemble" and inherits from "cl_hierarchy_ensemble"; if all elements are hierarchies (but not always dendrograms), it has class "cl_hierarchy_ensemble". Note that empty or “mixed” ensembles cannot be categorized according to the kind of elements they contain, and hence only have class "cl_ensemble".

The list representation makes it possible to use lapply for computations on the individual clusterings in (i.e., the components of) a cluster ensemble.
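For example (assuming an ensemble ens of partitions):

    ## Get the number of classes of each partition in the ensemble.
    sapply(ens, n_of_classes)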

Available methods for cluster ensembles include those for subscripting, c, rep, and print. There is also a plot method for ensembles for which all elements can be plotted (currently, additive trees, dendrograms and ultrametrics).

Value

cl_ensemble returns a list of the given clusterings or dissimilarities, with additional class information (see Details).

Examples

d <- dist(USArrests)
hclust_methods <-
    c("ward", "single", "complete", "average", "mcquitty")
hclust_results <- lapply(hclust_methods, function(m) hclust(d, m))
names(hclust_results) <- hclust_methods 
## Now create an ensemble from the results.
hens <- cl_ensemble(list = hclust_results)
hens
## Subscripting.
hens[1 : 3]
## Replication.
rep(hens, 3)
## Plotting.
plot(hens, main = names(hens))
## And continue to analyze the ensemble, e.g.
round(cl_dissimilarity(hens, method = "gamma"), 4)

Partition Fuzziness

Description

Compute the fuzziness of partitions.

Usage

cl_fuzziness(x, method = NULL, normalize = TRUE)

Arguments

x

a cluster ensemble of partitions, or an R object coercible to such.

method

a character string indicating the fuzziness measure to be employed, or NULL (default), or a function to be taken as a user-defined method. Currently available built-in methods are "PC" (Partition Coefficient) and "PE" (Partition Entropy), with the default corresponding to the first one. If method is a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch.

normalize

a logical indicating whether the fuzziness measure should be normalized in a way that hard partitions have value 0, and “completely fuzzy” partitions (where for all objects, all classes get the same membership) have value 1.

Details

If $m$ contains the membership values of a partition, the (unnormalized) Partition Coefficient and Partition Entropy are given by $\sum_{n,i} m_{n,i}^2$ and $\sum_{n,i} H(m_{n,i})$, respectively, where $H(u) = u \log u - (1-u) \log(1-u)$.
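A minimal sketch of the unnormalized measures, following the formulas above, for a given soft partition party with memberships strictly between 0 and 1 (so that the logarithms are finite):

    m  <- cl_membership(party)
    PC <- sum(m^2)                           # Partition Coefficient
    H  <- function(u) u * log(u) - (1 - u) * log(1 - u)
    PE <- sum(H(m))                          # Partition Entropy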

Note that the normalization used here is different from the normalizations typically found in the literature.

If a user-defined fuzziness method is to be employed, it must be a function taking a matrix of membership values and a logical to indicate whether normalization is to be performed as its arguments (in that order; argument names are not used).

Value

An object of class "cl_fuzziness" giving the fuzziness values.

References

J. C. Bezdek (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum.

See Also

Function fclustIndex in package e1071, which also computes several other “fuzzy cluster indexes” (typically based on more information than just the membership values).

Examples

if(require("e1071", quietly = TRUE)) {
    ## Use an on-line version of fuzzy c-means from package e1071 if
    ## available.
    data("Cassini")
    pens <- cl_boot(Cassini$x, B = 15, k = 3, algorithm = "cmeans",
                    parameters = list(method = "ufcl"))
    pens
    summary(cl_fuzziness(pens, "PC"))
    summary(cl_fuzziness(pens, "PE"))
}

Membership Margins

Description

Compute the margin of the memberships of a partition, i.e., the difference between the largest and second largest membership values of the respective objects.

Usage

cl_margin(x)

Arguments

x

an R object representing a partition of objects.

Details

For hard partitions, the margins are always 1.

For soft partitions, the margins may be taken as an indication of the “sureness” of classifying an object to the class with maximum membership value.
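A minimal sketch mirroring what cl_margin computes, directly from the membership matrix of a given soft partition party:

    ## Difference between the largest and second largest membership
    ## value of each object.
    m <- cl_membership(party)
    margins <- apply(m, 1, function(u) {
        u <- sort(u, decreasing = TRUE)
        u[1] - u[2]
    })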

Examples

data("GVME")
## Look at the classes obtained for 1980:
split(cl_object_names(GVME[["1980"]]), cl_class_ids(GVME[["1980"]]))
## Margins:
x <- cl_margin(GVME[["1980"]])
## Add names, and sort:
names(x) <- cl_object_names(GVME[["1980"]])
sort(x)
## Note the "uncertainty" of assigning Egypt to the "intermediate" class
## of nations.

Medoid Partitions and Hierarchies

Description

Compute the medoid of an ensemble of partitions or hierarchies, i.e., the element of the ensemble minimizing the sum of dissimilarities to all other elements.

Usage

cl_medoid(x, method = "euclidean")

Arguments

x

an ensemble of partitions or hierarchies, or something coercible to that (see cl_ensemble).

method

a character string or a function, as for argument method of function cl_dissimilarity.

Details

Medoid clusterings are special cases of “consensus” clusterings characterized as the solutions of an optimization problem. See Gordon (1999) for more information.

The dissimilarities $d$ for determining the medoid are obtained by calling cl_dissimilarity with arguments x and method. The medoid can then be found as the (first) row index for which the row sum of as.matrix(d) is minimal. Modulo possible differences in the case of ties, this gives the same results as (the medoid obtained by) pam in package cluster.
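A minimal sketch of this computation for a given ensemble ens:

    ## The medoid minimizes the sum of dissimilarities to all other
    ## elements of the ensemble.
    d <- as.matrix(cl_dissimilarity(ens, method = "euclidean"))
    medoid <- ens[[which.min(rowSums(d))]]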

Value

The medoid partition or hierarchy.

References

A. D. Gordon (1999). Classification (2nd edition). Boca Raton, FL: Chapman & Hall/CRC.

See Also

cl_consensus

Examples

## An ensemble of partitions.
data("CKME")
pens <- CKME[1 : 20]
m1 <- cl_medoid(pens)
diss <- cl_dissimilarity(pens)
require("cluster")
m2 <- pens[[pam(diss, 1)$medoids]]
## Agreement of medoid consensus partitions.
cl_agreement(m1, m2)
## Or, more straightforwardly:
table(cl_class_ids(m1), cl_class_ids(m2))

Memberships of Partitions

Description

Compute the membership values for objects representing partitions.

Usage

cl_membership(x, k = n_of_classes(x))
as.cl_membership(x)

Arguments

x

an R object representing a partition of objects (for cl_membership) or raw memberships or class ids (for as.cl_membership).

k

an integer giving the number of columns (corresponding to class ids) to be used in the membership matrix. Must not be less than, and defaults to, the number of classes in the partition.

Details

cl_membership is a generic function.

The methods provided in package clue handle the partitions obtained from clustering functions in the base R distribution, as well as packages RWeka, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

as.cl_membership can be used for coercing “raw” class ids (given as atomic vectors) or membership values (given as numeric matrices) to membership objects.
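For example:

    ## Coerce raw class ids ...
    as.cl_membership(c(1, 1, 2, 3, 3))
    ## ... or a raw (row-stochastic) membership matrix.
    as.cl_membership(matrix(c(0.8, 0.2,
                              0.3, 0.7),
                            nrow = 2, byrow = TRUE))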

Value

An object of class "cl_membership" with the matrix of membership values.

See Also

is.cl_partition

Examples

## Getting the memberships of a single soft partition.
d <- dist(USArrests)
hclust_methods <-
    c("ward", "single", "complete", "average", "mcquitty")
hclust_results <- lapply(hclust_methods, function(m) hclust(d, m))
names(hclust_results) <- hclust_methods 
## Now create an ensemble from the results.
hens <- cl_ensemble(list = hclust_results)
## And add the results of agnes and diana.
require("cluster")
hens <- c(hens, list(agnes = agnes(d), diana = diana(d)))
## Create a dissimilarity object from this.
d1 <- cl_dissimilarity(hens)
## And compute a soft partition.
party <- fanny(d1, 2)
round(cl_membership(party), 5)
## The "nearest" hard partition to this:
as.cl_hard_partition(party)
## (which has the same class ids as cl_class_ids(party)).

## Extracting the memberships from the elements of an ensemble of
## partitions.
pens <- cl_boot(USArrests, 30, 3)
pens
mems <- lapply(pens, cl_membership)
## And turning these raw memberships into an ensemble of partitions.
pens <- cl_ensemble(list = lapply(mems, as.cl_partition))
pens
pens[[length(pens)]]

Find Object Names

Description

Find the names of the objects from which a taxonomy (partition or hierarchy) or proximity was obtained.

Usage

cl_object_names(x)

Arguments

x

an R object representing a taxonomy or proximity.

Details

This is a generic function.

The methods provided in package clue handle the partitions and hierarchies obtained from clustering functions in the base R distribution, as well as packages RWeka, ape, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself), in as much as possible.

There is also a method for object dissimilarities which inherit from class "dist".

Value

A character vector of length n_of_objects(x) in case the names of the objects could be determined, or NULL.


K-Medoids Partitions of Clusterings

Description

Compute $k$-medoids partitions of clusterings.

Usage

cl_pam(x, k, method = "euclidean", solver = c("pam", "kmedoids"))

Arguments

x

an ensemble of partitions or hierarchies, or something coercible to that (see cl_ensemble).

k

an integer giving the number of classes to be used in the partition.

method

a character string or a function, as for argument method of function cl_dissimilarity.

solver

a character string indicating the $k$-medoids solver to be employed. May be abbreviated. If "pam" (default), the Partitioning Around Medoids (Kaufman & Rousseeuw (1990), Chapter 2) heuristic pam of package cluster is used. Otherwise, the exact algorithm of kmedoids is employed.

Details

An optimal $k$-medoids partition of the given cluster ensemble is defined as a partition of the objects $x_i$ (the elements of the ensemble) into $k$ classes $C_1, \ldots, C_k$ such that the criterion function $L = \sum_{l=1}^k \min_{j \in C_l} \sum_{i \in C_l} d(x_i, x_j)$ is minimized.

Such secondary partitions (e.g., Gordon & Vichi, 1998) are obtained by computing the dissimilarities $d$ of the objects in the ensemble for the given dissimilarity method, and applying a dissimilarity-based $k$-medoids solver to $d$.
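A hedged sketch of evaluating the criterion $L$ by hand, assuming a fit party <- cl_pam(ens, k, "symdiff") for an ensemble ens (cf. the Examples below):

    ## Sum, over all elements, of the dissimilarity to the medoid of
    ## the element's class.
    d   <- as.matrix(cl_dissimilarity(ens, method = "symdiff"))
    ids <- as.integer(cl_class_ids(party))
    sum(d[cbind(seq_along(ids), party$medoid_ids[ids])])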

Value

An object of class "cl_pam" representing the obtained “secondary” partition, which is a list with the following components.

cluster

the class ids of the partition.

medoid_ids

the indices of the medoids.

prototypes

a cluster ensemble with the $k$ prototypes (medoids).

criterion

the value of the criterion function of the partition.

description

a character string indicating the dissimilarity method employed.

References

L. Kaufman and P. J. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

A. D. Gordon and M. Vichi (1998). Partitions of partitions. Journal of Classification, 15, 265–285. doi:10.1007/s003579900034.

See Also

cl_pclust for more general prototype-based partitions of clusterings.

Examples

data("Kinship82")
party <- cl_pam(Kinship82, 3, "symdiff")
## Compare results with tables 5 and 6 in Gordon & Vichi (1998).
party
lapply(cl_prototypes(party), cl_classes)
table(cl_class_ids(party))

Prototype-Based Partitions of Clusterings

Description

Compute prototype-based partitions of a cluster ensemble by minimizing $\sum w_b u_{bj}^m d(x_b, p_j)^e$, the sum of the case-weighted and membership-weighted $e$-th powers of the dissimilarities between the elements $x_b$ of the ensemble and the prototypes $p_j$, for suitable dissimilarities $d$ and exponents $e$.

Usage

cl_pclust(x, k, method = NULL, m = 1, weights = 1,
          control = list())

Arguments

x

an ensemble of partitions or hierarchies, or something coercible to that (see cl_ensemble).

k

an integer giving the number of classes to be used in the partition.

method

the consensus method to be employed, see cl_consensus.

m

a number not less than 1 controlling the softness of the partition (as the “fuzzification parameter” of the fuzzy c-means algorithm). The default value of 1 corresponds to hard partitions obtained from a generalized k-means problem; values greater than one give partitions of increasing softness obtained from a generalized fuzzy c-means problem.

weights

a numeric vector of non-negative case weights. Recycled to the number of elements in the ensemble given by x if necessary.

control

a list of control parameters. See Details.

Details

Partitioning is performed using pclust via a family constructed from method. The dissimilarities d and exponent e are implied by the consensus method employed, and inferred via a registration mechanism currently only made available to built-in consensus methods. The default methods compute Least Squares Euclidean consensus clusterings, i.e., use Euclidean dissimilarity d and e = 2.

For m = 1, the partitioning procedure was introduced by Gaul and Schader (1988) for “Clusterwise Aggregation of Relations” (with the same domains), containing equivalence relations, i.e., hard partitions, as a special case.

Available control parameters are as for pclust.

The fixed point approach employed is a heuristic which cannot be guaranteed to find the global minimum (as this is already true for the computation of consensus clusterings). Standard practice would recommend using the best solution found in “sufficiently many” replications of the base algorithm.

Value

An object of class "cl_partition" representing the obtained “secondary” partition by an object of class "cl_pclust", which is a list containing at least the following components.

prototypes

a cluster ensemble with the k prototypes.

membership

an object of class "cl_membership" with the membership values ubju_{bj}.

cluster

the class ids of the nearest hard partition.

silhouette

Silhouette information for the partition, see silhouette.

validity

precomputed validity measures for the partition.

m

the softness control argument.

call

the matched call.

d

the dissimilarity function d = d(x, p) employed.

e

the exponent e employed.

References

J. C. Bezdek (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.

W. Gaul and M. Schader (1988). Clusterwise aggregation of relations. Applied Stochastic Models and Data Analysis, 4:273–282. doi:10.1002/asm.3150040406.

Examples

## Use a precomputed ensemble of 50 k-means partitions of the
## Cassini data.
data("CKME")
CKME <- CKME[1 : 30]		# for saving precious time ...
diss <- cl_dissimilarity(CKME)
hc <- hclust(diss)
plot(hc)
## This suggests using a partition with three classes, which can be
## obtained using cutree(hc, 3).  Could use cl_consensus() to compute
## prototypes as the least squares consensus clusterings of the classes,
## or alternatively:
set.seed(123)
x1 <- cl_pclust(CKME, 3, m = 1)
x2 <- cl_pclust(CKME, 3, m = 2)
## Agreement of solutions.
cl_dissimilarity(x1, x2)
table(cl_class_ids(x1), cl_class_ids(x2))

Predict Memberships

Description

Predict class ids or memberships from R objects representing partitions.

Usage

cl_predict(object, newdata = NULL,
           type = c("class_ids", "memberships"), ...)

Arguments

object

an R object representing a partition of objects.

newdata

an optional data set giving the objects to make predictions for. This must be of the same “kind” as the data set employed for obtaining the partition. If omitted, the original data are used.

type

a character string indicating whether class ids or memberships should be returned. May be abbreviated.

...

arguments to be passed to and from methods.

Details

Many algorithms resulting in partitions of a given set of objects can be taken to induce a partition of the underlying feature space for the measurements on the objects, so that class memberships for “new” objects can be obtained from the induced partition. Examples include partitions based on assigning objects to their “closest” prototypes, or providing mixture models for the distribution of objects in feature space.

This is a generic function. The methods provided in package clue handle the partitions obtained from clustering functions in the base R distribution, as well as packages RWeka, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

Value

Depending on type, an object of class "cl_class_ids" with the predicted class ids, or of class "cl_membership" with the matrix of predicted membership values.

Examples

## Run kmeans on a random subset of the Cassini data, and predict the
## memberships for the "test" data set.
data("Cassini")
nr <- NROW(Cassini$x)
ind <- sample(nr, 0.9 * nr, replace = FALSE)
party <- kmeans(Cassini$x[ind, ], 3)
table(cl_predict(party, Cassini$x[-ind, ]),
      Cassini$classes[-ind])

Partition Prototypes

Description

Determine prototypes for the classes of an R object representing a partition.

Usage

cl_prototypes(x)

Arguments

x

an R object representing a partition of objects.

Details

Many partitioning methods are based on prototypes (“centers”, “centroids”, “medoids”, ...). In typical cases, these are points in the feature space for the measurements on the objects to be partitioned, such that one can quantify the distance between the objects and the prototypes, and, e.g., classify objects to their closest prototype.

This is a generic function. The methods provided in package clue handle the partitions obtained from clustering functions in the base R distribution, as well as packages cba, cclust, cluster, e1071, flexclust, kernlab, and mclust (and of course, clue itself).

Examples

## Show how prototypes ("centers") vary across k-means runs on
## bootstrap samples from the Cassini data.
data("Cassini")
nr <- NROW(Cassini$x)
out <- replicate(50,
                 { kmeans(Cassini$x[sample(nr, replace = TRUE), ], 3) },
                 simplify = FALSE)
## Plot the data points in light gray, and the prototypes found.
plot(Cassini$x, col = gray(0.8))
points(do.call("rbind", lapply(out, cl_prototypes)), pch = 19)

Tabulate Vector Objects

Description

Tabulate the unique values in vector objects.

Usage

cl_tabulate(x)

Arguments

x

a vector.

Value

A data frame with components:

values

the unique values.

counts

an integer vector with the number of times each of the unique values occurs in x.

Examples

data("Kinship82")
tab <- cl_tabulate(Kinship82)
## The counts:
tab$counts
## The most frequent partition:
tab$values[[which.max(tab$counts)]]

Ultrametrics of Hierarchies

Description

Compute the ultrametric distances for objects representing (total indexed) hierarchies.

Usage

cl_ultrametric(x, size = NULL, labels = NULL)
as.cl_ultrametric(x)

Arguments

x

an R object representing a (total indexed) hierarchy of objects.

size

an integer giving the number of objects in the hierarchy.

labels

a character vector giving the names of the objects in the hierarchy.

Details

If x is not an ultrametric or a hierarchy with an ultrametric representation, cl_ultrametric uses cophenetic to obtain the ultrametric (also known as cophenetic) distances from the hierarchy, which in turn by default calls the S3 generic as.hclust on the hierarchy. Support for a class which represents hierarchies can thus be added by providing as.hclust methods for this class. In R 2.1.0 or better, cophenetic is an S3 generic as well, and one can also more directly provide methods for this if necessary.

as.cl_ultrametric is a generic function which can be used for coercing raw (non-classed) ultrametrics, represented as numeric vectors (of the lower-half entries) or numeric matrices, to ultrametric objects.

Ultrametric objects are implemented as symmetric proximity objects with a dissimilarity interpretation so that self-proximities are zero, and inherit from classes "cl_dissimilarity" and "cl_proximity". See section Details in the documentation for cl_dissimilarity for implications.

Ultrametric objects can also be coerced to classes "dendrogram" and "hclust", and hence, in particular, can use the plot methods for these classes. By default, plotting an ultrametric object uses the plot method for dendrograms.

Value

An object of class "cl_ultrametric" containing the ultrametric distances.

See Also

is.cl_hierarchy

Examples

hc <- hclust(dist(USArrests))
u <- cl_ultrametric(hc)
## Subscripting.
u[1 : 5, 1 : 5]
u[1 : 5, 6 : 7]
## Plotting.
plot(u)

Validity Measures for Partitions and Hierarchies

Description

Compute validity measures for partitions and hierarchies, attempting to measure how well these clusterings capture the underlying structure in the data they were obtained from.

Usage

cl_validity(x, ...)
## Default S3 method:
cl_validity(x, d, ...)

Arguments

x

an object representing a partition or hierarchy.

d

a dissimilarity object from which x was obtained.

...

arguments to be passed to or from methods.

Details

cl_validity is a generic function.

For partitions, its default method gives the “dissimilarity accounted for”, defined as 1 - a_w / a_t, where a_t is the average total dissimilarity, and the “average within dissimilarity” a_w is given by

\frac{\sum_{i,j} \sum_k m_{ik} m_{jk} d_{ij}}{\sum_{i,j} \sum_k m_{ik} m_{jk}}

where d and m are the dissimilarities and memberships, respectively, and the sums are over all pairs of objects and all classes.

For hierarchies, the validity measures computed by default are “variance accounted for” (VAF, e.g., Hubert, Arabie & Meulman, 2006) and “deviance accounted for” (DEV, e.g., Smith, 2001). If u is the ultrametric corresponding to the hierarchy x and d the dissimilarity x was obtained from, these validity measures are given by

\mathrm{VAF} = \max\left(0, 1 - \frac{\sum_{i,j} (d_{ij} - u_{ij})^2}{\sum_{i,j} (d_{ij} - \mathrm{mean}(d))^2}\right)

and

\mathrm{DEV} = \max\left(0, 1 - \frac{\sum_{i,j} |d_{ij} - u_{ij}|}{\sum_{i,j} |d_{ij} - \mathrm{median}(d)|}\right)

respectively. Note that VAF and DEV are not invariant under rescaling u, and may be “arbitrarily small” (i.e., 0 using the above definitions) even though u and d are “structurally close” in some sense.

For the results of using agnes and diana, the agglomerative and divisive coefficients are provided in addition to the default ones.

Value

A list of class "cl_validity" with the computed validity measures.

References

L. Hubert, P. Arabie and J. Meulman (2006). The structural representation of proximity matrices with MATLAB. Philadelphia, PA: SIAM.

T. J. Smith (2001). Constructing ultrametric and additive trees based on the L_1 norm. Journal of Classification, 18, 185–207. doi:10.1007/s00357-001-0015-0.

See Also

cluster.stats in package fpc for a variety of cluster validation statistics; fclustIndex in package e1071 for several fuzzy cluster indexes; clustIndex in package cclust; silhouette in package cluster.
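
Examples

## A small sketch using the default method: the dissimilarity
## accounted for by a k-means partition of the Cassini data (the
## exact value depends on the random initialization).
data("Cassini")
set.seed(1234)
party <- kmeans(Cassini$x, 3)
cl_validity(party, dist(Cassini$x))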


Fit Dissimilarities to a Hierarchy

Description

Find the ultrametric from a target equivalence class of hierarchies which minimizes weighted Euclidean or Manhattan dissimilarity to a given dissimilarity object.

Usage

ls_fit_ultrametric_target(x, y, weights = 1)
l1_fit_ultrametric_target(x, y, weights = 1)

Arguments

x

a dissimilarity object inheriting from class "dist".

y

a target hierarchy.

weights

a numeric vector or matrix with non-negative weights for obtaining a weighted fit. If a matrix, its numbers of rows and columns must be the same as the number of objects in x. Otherwise, it is recycled to the number of elements in x.

Details

The target equivalence class consists of all dendrograms for which the corresponding n-trees are the same as the one corresponding to y. That is, all splits are the same as for y, and optimization is over the heights of the splits.

The criterion function to be optimized over all ultrametrics from the equivalence class is \sum w_{ij} |x_{ij} - u_{ij}|^p, where p = 2 in the Euclidean and p = 1 in the Manhattan case, respectively.

The optimum can be computed as follows. Suppose split s joins object classes A and B. As the ultrametric dissimilarities of all objects in A to all objects in B must be the same value, say, u_{A,B} = u_s, the contribution from the split to the criterion function is of the form f_s(u_s) = \sum_{i \in A, j \in B} w_{ij} |x_{ij} - u_s|^p. We need to minimize \sum_s f_s(u_s) under the constraint that the u_s form a non-decreasing sequence, which is accomplished by the Pool Adjacent Violators Algorithm (PAVA), using the weighted mean (p = 2) or weighted median (p = 1) for solving the blockwise optimization problems.

Value

An object of class "cl_ultrametric" containing the optimal ultrametric distances.

See Also

ls_fit_ultrametric for finding the ultrametric minimizing Euclidean dissimilarity (without fixing the splits).

Examples

data("Phonemes")
## Note that the Phonemes data set has the consonant misclassification
## probabilities, i.e., the similarities between the phonemes.
d <- as.dist(1 - Phonemes)
## Find the maximal dominated and minimal dominating ultrametrics by
## hclust() with single and complete linkage:
y1 <- hclust(d, "single")
y2 <- hclust(d, "complete")
## Note that these are quite different:
cl_dissimilarity(y1, y2, "gamma")
## Now find the L2 optimal members of the respective dendrogram
## equivalence classes.
u1 <- ls_fit_ultrametric_target(d, y1)
u2 <- ls_fit_ultrametric_target(d, y2)
## Compute the L2 optimal ultrametric approximation to d.
u <- ls_fit_ultrametric(d)
## And compare ...
cl_dissimilarity(cl_ensemble(Opt = u, Single = u1, Complete = u2), d)
## The solution obtained via complete linkage is quite close:
cl_agreement(u2, u, "cophenetic")

Gordon-Vichi Macroeconomic Partition Ensemble Data

Description

Soft partitions of 21 countries based on macroeconomic data for the years 1975, 1980, 1985, 1990, and 1995.

Usage

data("GVME")

Format

A named cluster ensemble of 5 soft partitions of 21 countries into 2 or 3 classes. The names are the years to which the partitions correspond.

Details

The partitions were obtained using fuzzy c-means on measurements of the following variables: the annual per capita gross domestic product (GDP) in USD (converted to 1987 prices); the percentage of GDP provided by agriculture; the percentage of employees who worked in agriculture; and gross domestic investment, expressed as a percentage of the GDP. See Gordon and Vichi (2001), page 230, for more details.

Source

Table 1 in Gordon and Vichi (2001).

References

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.


Gordon-Vichi Macroeconomic Consensus Partition Data

Description

The soft (“fuzzy”) consensus partitions for the macroeconomic partition data given in Gordon and Vichi (2001).

Usage

data("GVME_Consensus")

Format

A named cluster ensemble of eight soft partitions of 21 countries into two or three classes.

Details

The elements of the ensemble are consensus partitions for the macroeconomic partition data in Gordon and Vichi (2001), which are available as data set GVME. Element names are of the form "m/k", where m indicates the consensus method employed (one of ‘MF1’, ‘MF2’, ‘JMF’, and ‘S&S’, corresponding to the application of models 1, 2, and 3 in Gordon and Vichi (2001) and the approach in Sato and Sato (1994), respectively), and k denotes the number of classes (2 or 3).

Source

Tables 4 and 5 in Gordon and Vichi (2001).

References

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.

M. Sato and Y. Sato (1994). On a multicriteria fuzzy clustering method for 3-way data. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2, 127–142. doi:10.1142/S0218488594000122.

Examples

## Load the consensus partitions.
data("GVME_Consensus")
## Pick the partitions into 2 classes.
GVME_Consensus_2 <- GVME_Consensus[1 : 4]
## Fuzziness using the Partition Coefficient.
cl_fuzziness(GVME_Consensus_2)
## (Corresponds to 1 - F in the source.)
## Dissimilarities:
cl_dissimilarity(GVME_Consensus_2)
cl_dissimilarity(GVME_Consensus_2, method = "comem")

Hierarchies

Description

Determine whether an R object represents a hierarchy of objects, or coerce to an R object representing such.

Usage

is.cl_hierarchy(x)
is.cl_dendrogram(x)

as.cl_hierarchy(x)
as.cl_dendrogram(x)

Arguments

x

an R object.

Details

These functions are generic functions.

The methods provided in package clue handle the partitions and hierarchies obtained from clustering functions in the base R distribution, as well as packages RWeka, ape, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

The hierarchies considered by clue are n-trees (hierarchies in the strict sense) and dendrograms (also known as valued n-trees or total indexed hierarchies), which are represented by the virtual classes "cl_hierarchy" and "cl_dendrogram" (which inherits from the former), respectively.

n-trees on a set X of objects correspond to collections H of subsets of X, usually called classes of the hierarchy, which satisfy the following properties:

  • H contains all singletons with objects of X, and X itself, but not the empty set;

  • the intersection of two sets A and B in H is either empty or one of the sets.

The classes of a hierarchy can be obtained by cl_classes.

Dendrograms are n-trees where additionally a height h is associated with each of the classes, so that for two classes A and B with non-empty intersection we have h(A) \le h(B) iff A is a subset of B. For each pair of objects one can then define u_{ij} as the height of the smallest class containing both i and j: this results in a dissimilarity on X which satisfies the ultrametric (3-point) conditions u_{ij} \le \max(u_{ik}, u_{jk}) for all triples (i, j, k) of objects. Conversely, an ultrametric dissimilarity induces a unique dendrogram.

The ultrametric dissimilarities of a dendrogram can be obtained by cl_ultrametric.

as.cl_hierarchy returns an object of class "cl_hierarchy" “containing” the given object x if this already represents a hierarchy (i.e., is.cl_hierarchy(x) is true), or the ultrametric obtained from x via as.cl_ultrametric.

as.cl_dendrogram returns an object which has class "cl_dendrogram" and inherits from "cl_hierarchy", and contains x if it represents a dendrogram (i.e., is.cl_dendrogram(x) is true), or the ultrametric obtained from x.

Conceptually, hierarchies and dendrograms are virtual classes, allowing for a variety of representations.

There are group methods for comparing dendrograms and computing their minimum, maximum, and range based on the meet and join operations, see cl_meet. There is also a plot method.

Value

For the testing functions, a logical indicating whether the given object represents a clustering of objects of the respective kind.

For the coercion functions, a container object inheriting from "cl_hierarchy", with a suitable representation of the hierarchy given by x.

Examples

hcl <- hclust(dist(USArrests))
is.cl_dendrogram(hcl)
is.cl_hierarchy(hcl)
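## The classes of the corresponding n-tree and the plot method for
## dendrograms can be used as well; a sketch:
cl_classes(hcl)
plot(as.cl_dendrogram(hcl))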

Rosenberg-Kim Kinship Terms Partition Data

Description

Partitions of 15 kinship terms given by 85 female undergraduates at Rutgers University who were asked to sort the terms into classes “on the basis of some aspect of meaning”.

Usage

data("Kinship82")

Format

A cluster ensemble of 85 hard partitions of the 15 kinship terms.

Details

Rosenberg and Kim (1975) describe an experiment where perceived similarities of the kinship terms were obtained from six different “sorting” experiments. These “original” Rosenberg-Kim kinship terms data were published in Arabie, Carroll and de Sarbo (1987), and are also contained in file ‘indclus.data’ in the shell archive https://netlib.org/mds/indclus.shar.

For one of the experiments, partitions of the terms were printed in Rosenberg (1982). Comparison with the original data indicates that the partition data have the “nephew” and “niece” columns interchanged, which is corrected in the data set at hand.

Source

Table 7.1 in Rosenberg (1982), with the “nephew” and “niece” columns interchanged.

References

P. Arabie, J. D. Carroll and W. S. de Sarbo (1987). Three-way scaling and clustering. Newbury Park, CA: Sage.

S. Rosenberg and M. P. Kim (1975). The method of sorting as a data-gathering procedure in multivariate research. Multivariate Behavioral Research, 10, 489–502. doi:10.1207/s15327906mbr1004_7.

S. Rosenberg (1982). The method of sorting in multivariate research with applications selected from cognitive psychology and person perception. In N. Hirschberg and L. G. Humphreys (eds.), Multivariate Applications in the Social Sciences, 117–142. Hillsdale, NJ: Erlbaum.


Gordon-Vichi Kinship82 Consensus Partition Data

Description

The soft (“fuzzy”) consensus partitions for the Rosenberg-Kim kinship terms partition data given in Gordon and Vichi (2001).

Usage

data("Kinship82_Consensus")

Format

A named cluster ensemble of three soft partitions of the 15 kinship terms into three classes.

Details

The elements of the ensemble are named "MF1", "MF2", and "JMF", and correspond to the consensus partitions obtained by applying models 1, 2, and 3 in Gordon and Vichi (2001) to the kinship terms partition data in Rosenberg (1982), which are available as data set Kinship82.

Source

Table 6 in Gordon and Vichi (2001).

References

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248. doi:10.1007/BF02294837.

S. Rosenberg (1982). The method of sorting in multivariate research with applications selected from cognitive psychology and person perception. In N. Hirschberg and L. G. Humphreys (eds.), Multivariate Applications in the Social Sciences, 117–142. Hillsdale, NJ: Erlbaum.

Examples

## Load the consensus partitions.
data("Kinship82_Consensus")
## Fuzziness using the Partition Coefficient.
cl_fuzziness(Kinship82_Consensus)
## (Corresponds to 1 - F in the source.)
## Dissimilarities:
cl_dissimilarity(Kinship82_Consensus)
cl_dissimilarity(Kinship82_Consensus, method = "comem")

K-Medoids Clustering

Description

Compute a k-medoids partition of a dissimilarity object.

Usage

kmedoids(x, k)

Arguments

x

a dissimilarity object inheriting from class "dist", or a square matrix of pairwise object-to-object dissimilarity values.

k

an integer giving the number of classes to be used in the partition.

Details

Let d denote the pairwise object-to-object dissimilarity matrix corresponding to x. A k-medoids partition of x is defined as a partition of the numbers from 1 to n, the number of objects in x, into k classes C_1, \ldots, C_k such that the criterion function L = \sum_l \min_{j \in C_l} \sum_{i \in C_l} d_{ij} is minimized.

This is an NP-hard optimization problem. PAM (Partitioning Around Medoids, see Kaufman & Rousseeuw (1990), Chapter 2) is a very popular heuristic for obtaining optimal k-medoids partitions, and is provided by pam in package cluster.

kmedoids is an exact algorithm based on a binary linear programming formulation of the optimization problem (e.g., Gordon & Vichi (1998), [P4']), using lp from package lpSolve as solver. Depending on available hardware resources (the number of constraints of the program is of the order n^2), it may not be possible to obtain a solution.

Value

An object of class "kmedoids" representing the obtained partition, which is a list with the following components.

cluster

the class ids of the partition.

medoid_ids

the indices of the medoids.

criterion

the value of the criterion function of the partition.

References

L. Kaufman and P. J. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

A. D. Gordon and M. Vichi (1998). Partitions of partitions. Journal of Classification, 15, 265–285. doi:10.1007/s003579900034.
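
Examples

## A small sketch of an exact k-medoids partition (requires package
## lpSolve; problem sizes are limited, as the number of constraints
## grows like n^2).
set.seed(1234)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))
party <- kmedoids(dist(x), 2)
party$medoid_ids
party$criterion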


Least Absolute Deviation Fit of Ultrametrics to Dissimilarities

Description

Find the ultrametric with minimal absolute distance (Manhattan dissimilarity) to a given dissimilarity object.

Usage

l1_fit_ultrametric(x, method = c("SUMT", "IRIP"), weights = 1,
                   control = list())

Arguments

x

a dissimilarity object inheriting from or coercible to class "dist".

method

a character string indicating the fitting method to be employed. Must be one of "SUMT" (default) or "IRIP", or a unique abbreviation thereof.

weights

a numeric vector or matrix with non-negative weights for obtaining a weighted fit. If a matrix, its numbers of rows and columns must be the same as the number of objects in x, and the lower diagonal part is used. Otherwise, it is recycled to the number of elements in x.

control

a list of control parameters. See Details.

Details

The problem to be solved is minimizing

L(u) = \sum_{i,j} w_{ij} |x_{ij} - u_{ij}|

over all u satisfying the ultrametric constraints (i.e., for all i, j, k, u_{ij} \le \max(u_{ik}, u_{jk})). This problem is known to be NP-hard (Krivanek and Moravek, 1986).

We provide two heuristics for solving this problem.

Method "SUMT" implements a SUMT (Sequential Unconstrained Minimization Technique, see sumt) approach using the sign function for the gradients of the absolute value function.

Available control parameters are method, control, eps, q, and verbose, which have the same roles as for sumt, and the following.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

start

a single dissimilarity, or a list of dissimilarities to be employed as starting values.

Method "IRIP" implements a variant of the Iteratively Reweighted Iterative Projection approach of Smith (2001), which attempts to solve the L1L_1 problem via a sequence of weighted L2L_2 problems, determining u(t+1)u(t+1) by minimizing the criterion function

i,jwij(xijuij)2/max(xijuij(t),m)\sum_{i,j} w_{ij} (x_{ij} - u_{ij})^2 / \max(|x_{ij} - u_{ij}(t)|, m)

with mm a “small” non-zero value to avoid zero divisors. We use the SUMT method of ls_fit_ultrametric for solving the weighted least squares problems.

Available control parameters are as follows.

maxiter

an integer giving the maximal number of iteration steps to be performed. Defaults to 100.

eps

a nonnegative number controlling the iteration, which stops when the maximal change in u is less than eps. Defaults to 10^{-6}.

reltol

the relative convergence tolerance. Iteration stops when the relative change in the L_1 criterion is less than reltol. Defaults to 10^{-6}.

MIN

the cutoff m. Defaults to 10^{-3}.

start

a dissimilarity object to be used as the starting value for u.

control

a list of control parameters to be used by the method of ls_fit_ultrametric employed for solving the weighted L_2 problems.

One may need to adjust the default control parameters to achieve convergence.

It should be noted that all methods are heuristics which cannot be guaranteed to find the global minimum.

Value

An object of class "cl_ultrametric" containing the fitted ultrametric distances.

References

M. Krivanek and J. Moravek (1986). NP-hard problems in hierarchical tree clustering. Acta Informatica, 23, 311–323. doi:10.1007/BF00289116.

T. J. Smith (2001). Constructing ultrametric and additive trees based on the L_1 norm. Journal of Classification, 18, 185–207. doi:10.1007/s00357-001-0015-0.

See Also

cl_consensus for computing least absolute deviation (Manhattan) consensus hierarchies; ls_fit_ultrametric.
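
Examples

## A sketch along the lines of the ls_fit_ultrametric() example:
## least absolute deviation fit to the Phonemes dissimilarities.
data("Phonemes")
d <- as.dist(1 - Phonemes)
u <- l1_fit_ultrametric(d)
## Cophenetic correlation with the fitted ultrametric:
cor(d, u)
plot(u)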


Cluster Lattices

Description

Computations on the lattice of all (hard) partitions, or the lattice of all dendrograms, or the meet semilattice of all hierarchies (n-trees) of/on a set of objects: meet, join, and comparisons.

Usage

cl_meet(x, y)
cl_join(x, y)

Arguments

x

an ensemble of partitions or dendrograms or hierarchies, or an R object representing a partition or dendrogram or hierarchy.

y

an R object representing a partition or dendrogram or hierarchy. Ignored if x is an ensemble.

Details

For a given finite set of objects X, the set H(X) of all (hard) partitions of X can be partially ordered by defining a partition P to be “finer” than a partition Q, i.e., P \le Q, if each class of P is contained in some class of Q. With this partial order, H(X) becomes a bounded lattice, with intersection and union of two elements given by their greatest lower bound (meet) and their least upper bound (join), respectively.

Specifically, the meet of two partitions computed by cl_meet is the partition obtained by intersecting the classes of the partitions; the classes of the join computed by cl_join are obtained by joining all elements in the same class in at least one of the partitions. Obviously, the least and greatest elements of the partition lattice are the partitions where each object is in a single class (sometimes referred to as the “splitter” partition) or in the same class (the “lumper” partition), respectively. Meet and join of an arbitrary number of partitions can be defined recursively.

In addition to computing the meet and join, the comparison operations corresponding to the above partial order as well as min, max, and range are available at least for R objects representing partitions inheriting from "cl_partition". The summary methods give the meet and join of the given partitions (for min and max), or a partition ensemble with the meet and join (for range).

If the partitions specified by x and y are soft partitions, the corresponding nearest hard partitions are used. Future versions may optionally provide suitable “soft” (fuzzy) extensions for computing meets and joins.

The set of all dendrograms on X can be ordered using pointwise inequality of the associated ultrametric dissimilarities: i.e., if D and E are the dendrograms with ultrametrics u and v, respectively, then D \le E if u_{ij} \le v_{ij} for all pairs (i, j) of objects. This again yields a lattice (of dendrograms). The join of D and E is the dendrogram with ultrametric given by \max(u_{ij}, v_{ij}) (as this gives an ultrametric); the meet is the dendrogram with the maximal ultrametric dominated by \min(u_{ij}, v_{ij}), and can be obtained by applying single linkage hierarchical clustering to the minima.

The set of all hierarchies on X can be ordered by set-wise inclusion of the classes: i.e., if H and G are two hierarchies, then H \le G if all classes of H are also classes of G. This yields a meet semilattice, with meet given by the classes contained in both hierarchies. The join only exists if the union of the classes is a hierarchy.

In each case, a modular semilattice is obtained, which allows for a natural metrization via least element (semi)lattice move distances, see Barthélémy, Leclerc and Monjardet (1986). These latticial metrics are given by the BA/C (partitions), Manhattan (dendrograms), and symdiff (hierarchies) dissimilarities, respectively (see cl_dissimilarity).

Value

For cl_meet and cl_join, an object of class "cl_partition" or "cl_dendrogram" with the class ids or ultrametric dissimilarities of the meet and join of the partitions or dendrograms, respectively.

References

J.-P. Barthélémy, B. Leclerc and B. Monjardet (1986). On the use of ordered sets in problems of comparison and consensus of classifications. Journal of Classification, 3, 187–224. doi:10.1007/BF01894188.

Examples

## Two simple partitions of 7 objects.
A <- as.cl_partition(c(1, 1, 2, 3, 3, 5, 5))
B <- as.cl_partition(c(1, 2, 2, 3, 4, 5, 5))
## These disagree on objects 1-3; B splits objects 4 and 5 into
## separate classes.  Objects 6 and 7 are always in the same class.
(A <= B) || (B <= A)
## (Neither partition is finer than the other.)
cl_meet(A, B)
cl_join(A, B)
## Meeting with the lumper (greatest) or joining with the splitter
## (least) partition does not make a difference: 
C_lumper <- as.cl_partition(rep(1, n_of_objects(A)))
cl_meet(cl_ensemble(A, B, C_lumper))
C_splitter <- as.cl_partition(seq_len(n_of_objects(A)))
cl_join(cl_ensemble(A, B, C_splitter))
## Another way of computing the join:
range(A, B, C_splitter)$max
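## Dendrograms can be handled as well (a sketch): the join has the
## pointwise maxima of the ultrametrics, the meet is obtained by
## single linkage on the pointwise minima.
d <- dist(USArrests)
D1 <- as.cl_dendrogram(hclust(d, "single"))
D2 <- as.cl_dendrogram(hclust(d, "complete"))
cl_meet(D1, D2)
cl_join(D1, D2)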

Least Squares Fit of Additive Tree Distances to Dissimilarities

Description

Find the additive tree distance or centroid distance minimizing least squares distance (Euclidean dissimilarity) to a given dissimilarity object.

Usage

ls_fit_addtree(x, method = c("SUMT", "IP", "IR"), weights = 1,
               control = list())
ls_fit_centroid(x)

Arguments

x

a dissimilarity object inheriting from class "dist".

method

a character string indicating the fitting method to be employed. Must be one of "SUMT" (default), "IP", or "IR", or a unique abbreviation thereof.

weights

a numeric vector or matrix with non-negative weights for obtaining a weighted least squares fit. If a matrix, its numbers of rows and columns must be the same as the number of objects in x, and the lower diagonal part is used. Otherwise, it is recycled to the number of elements in x.

control

a list of control parameters. See Details.

Details

See as.cl_addtree for details on additive tree distances and centroid distances.

With L(d) = \sum w_{ij} (x_{ij} - d_{ij})^2, the problem to be solved by ls_fit_addtree is minimizing L over all additive tree distances d. This problem is known to be NP-hard.

We provide three heuristics for solving this problem.

Method "SUMT" implements the SUMT (Sequential Unconstrained Minimization Technique, Fiacco and McCormick, 1968) approach of de Soete (1983). Incomplete dissimilarities are currently not supported.

Methods "IP" and "IR" implement the Iterative Projection and Iterative Reduction approaches of Hubert and Arabie (1995) and Roux (1988), respectively. Non-identical weights and incomplete dissimilarities are currently not supported.

See ls_fit_ultrametric for details on these methods and available control parameters.

It should be noted that all methods are heuristics which cannot be guaranteed to find the global minimum. Standard practice would recommend using the best solution found in “sufficiently many” replications of the base algorithm.

ls_fit_centroid finds the centroid distance d minimizing L(d) (currently, only for the case of identical weights). This optimization problem has a closed-form solution.

Value

An object of class "cl_addtree" containing the optimal additive tree distances.

References

A. V. Fiacco and G. P. McCormick (1968). Nonlinear programming: Sequential unconstrained minimization techniques. New York: John Wiley & Sons.

L. Hubert and P. Arabie (1995). Iterative projection strategies for the least squares fitting of tree structures to proximity data. British Journal of Mathematical and Statistical Psychology, 48, 281–317. doi:10.1111/j.2044-8317.1995.tb01065.x.

M. Roux (1988). Techniques of approximation for building two tree structures. In C. Hayashi and E. Diday and M. Jambu and N. Ohsumi (Eds.), Recent Developments in Clustering and Data Analysis, pages 151–170. New York: Academic Press.

G. de Soete (1983). A least squares algorithm for fitting additive trees to proximity data. Psychometrika, 48, 621–626. doi:10.1007/BF02293884.
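
Examples

## A sketch, following the ls_fit_ultrametric() example: least squares
## additive tree and centroid fits to the Phonemes dissimilarities.
data("Phonemes")
d <- as.dist(1 - Phonemes)
a <- ls_fit_addtree(d)
cor(d, a)
## The centroid fit has a closed-form solution:
g <- ls_fit_centroid(d)
cor(d, g)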


Least Squares Fit of Sums of Ultrametrics to Dissimilarities

Description

Find a sequence of ultrametrics with sum minimizing square distance (Euclidean dissimilarity) to a given dissimilarity object.

Usage

ls_fit_sum_of_ultrametrics(x, nterms = 1, weights = 1,
                           control = list())

Arguments

x

a dissimilarity object inheriting from or coercible to class "dist".

nterms

an integer giving the number of ultrametrics to be fitted.

weights

a numeric vector or matrix with non-negative weights for obtaining a weighted least squares fit. If a matrix, its numbers of rows and columns must be the same as the number of objects in x, and the lower diagonal part is used. Otherwise, it is recycled to the number of elements in x.

control

a list of control parameters. See Details.

Details

The problem to be solved is minimizing the criterion function

L(u(1), \dots, u(n)) = \sum_{i,j} w_{ij} \left(x_{ij} - \sum_{k=1}^n u_{ij}(k)\right)^2

over all u(1), \ldots, u(n) satisfying the ultrametric constraints.

We provide an implementation of the iterative heuristic suggested in Carroll & Pruzansky (1980) which in each step t sequentially refits the u(k) as the least squares ultrametric fit to the “residuals” x - \sum_{l \ne k} u(l) using ls_fit_ultrametric.

Available control parameters include

maxiter

an integer giving the maximal number of iteration steps to be performed. Defaults to 100.

eps

a nonnegative number controlling the iteration, which stops when the maximal change in all u(k) is less than eps. Defaults to 10^{-6}.

reltol

the relative convergence tolerance. Iteration stops when the relative change in the criterion function is less than reltol. Defaults to 10^{-6}.

method

a character string indicating the fitting method to be employed by the individual least squares fits.

control

a list of control parameters to be used by the method of ls_fit_ultrametric employed. By default, if the SUMT method is used, 10 inner SUMT runs are performed for each refitting.

It should be noted that the method used is a heuristic which cannot be guaranteed to find the global minimum.

Value

A list of objects of class "cl_ultrametric" containing the fitted ultrametric distances.

References

J. D. Carroll and S. Pruzansky (1980). Discrete and hybrid scaling models. In E. D. Lantermann and H. Feger (eds.), Similarity and Choice. Bern (Switzerland): Huber.
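
Examples

## A small sketch: fit a sum of two ultrametrics to a subset of the
## USArrests distances (this may take a few moments).
d <- dist(USArrests[1 : 20, ])
u <- ls_fit_sum_of_ultrametrics(d, nterms = 2)
## Residual sum of squares of the fitted sum (summing over the full
## symmetric matrices, hence divided by 2):
sum((as.matrix(d) - Reduce(`+`, lapply(u, as.matrix)))^2) / 2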


Least Squares Fit of Ultrametrics to Dissimilarities

Description

Find the ultrametric with minimal square distance (Euclidean dissimilarity) to given dissimilarity objects.

Usage

ls_fit_ultrametric(x, method = c("SUMT", "IP", "IR"), weights = 1,
                   control = list())

Arguments

x

a dissimilarity object inheriting from or coercible to class "dist", or an ensemble of such objects.

method

a character string indicating the fitting method to be employed. Must be one of "SUMT" (default), "IP", or "IR", or a unique abbreviation thereof.

weights

a numeric vector or matrix with non-negative weights for obtaining a weighted least squares fit. If a matrix, its numbers of rows and columns must be the same as the number of objects in x, and the lower diagonal part is used. Otherwise, it is recycled to the number of elements in x.

control

a list of control parameters. See Details.

Details

For a single dissimilarity object x, the problem to be solved is minimizing

L(u) = \sum_{i,j} w_{ij} (x_{ij} - u_{ij})^2

over all u satisfying the ultrametric constraints (i.e., for all i, j, k, u_{ij} \le \max(u_{ik}, u_{jk})). This problem is known to be NP-hard (Krivanek and Moravek, 1986).

For an ensemble of dissimilarity objects, the criterion function is

L(u) = \sum_b w_b \sum_{i,j} w_{ij} (x_{ij}(b) - u_{ij})^2,

where w_b is the weight given to element x_b of the ensemble and can be specified via control parameter weights (default: all ones). This problem reduces to the above basic problem with x as the w_b-weighted mean of the x_b.

We provide three heuristics for solving the basic problem.

Method "SUMT" implements the SUMT (Sequential Unconstrained Minimization Technique, Fiacco and McCormick, 1968) approach of de Soete (1986) which in turn simplifies the suggestions in Carroll and Pruzansky (1980). (See sumt for more information on the SUMT approach.) We then use a final single linkage hierarchical clustering step to ensure that the returned object exactly satisfies the ultrametric constraints. The starting value u0u_0 is obtained by “random shaking” of the given dissimilarity object (if not given). If there are missing values in x, i.e., the given dissimilarities are incomplete, we follow a suggestion of de Soete (1984), imputing the missing values by the weighted mean of the non-missing ones, and setting the corresponding weights to zero.

Available control parameters are method, control, eps, q, and verbose, which have the same roles as for sumt, and the following.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

start

a single dissimilarity, or a list of dissimilarities to be employed as starting values.

The default optimization using conjugate gradients should work reasonably well for medium to large size problems. For “small” ones, using nlm is usually faster. Note that the number of ultrametric constraints is of the order n^3, where n is the number of objects in the dissimilarity object, which suggests using the SUMT approach in favor of constrOptim.

If starting values for the SUMT are provided via start, the number of starting values gives the number of runs to be performed, and control option nruns is ignored. Otherwise, nruns starting values are obtained by random shaking of the dissimilarity to be fitted. In the case of multiple SUMT runs, the (first) best solution found is returned.

Method "IP" implements the Iterative Projection approach of Hubert and Arabie (1995). This iteratively projects the current dissimilarities to the closed convex set given by the ultrametric constraints (3-point conditions) for a single index triple (i,j,k)(i, j, k), in fact replacing the two largest values among dij,dik,djkd_{ij}, d_{ik}, d_{jk} by their mean. The following control parameters can be provided via the control argument.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

order

a permutation of the numbers from 1 to the number of objects in x, specifying the order in which the ultrametric constraints are considered, or a list of such permutations.

maxiter

an integer giving the maximal number of iterations to be employed.

tol

a double indicating the maximal convergence tolerance. The algorithm stops if the total absolute change in the dissimilarities in an iteration is less than tol.

verbose

a logical indicating whether to provide some output on minimization progress. Defaults to getOption("verbose").

If permutations are provided via order, the number of these gives the number of runs to be performed, and control option nruns is ignored. Otherwise, nruns randomly generated orders are tried. In the case of multiple runs, the (first) best solution found is returned.

Non-identical weights and incomplete dissimilarities are currently not supported.

Method "IR" implements the Iterative Reduction approach suggested by Roux (1988), see also Barthélémy and Guénoche (1991). This is similar to the Iterative Projection method, but modifies the dissimilarities between objects proportionally to the aggregated change incurred from the ultrametric projections. Available control parameters are identical to those of method "IP".

Non-identical weights and incomplete dissimilarities are currently not supported.

It should be noted that all methods are heuristics which cannot be guaranteed to find the global minimum. Standard practice would recommend using the best solution found in “sufficiently many” replications of the base algorithm.

Value

An object of class "cl_ultrametric" containing the fitted ultrametric distances.

References

J.-P. Barthélémy and A. Guénoche (1991). Trees and proximity representations. Chichester: John Wiley & Sons. ISBN 0-471-92263-3.

J. D. Carroll and S. Pruzansky (1980). Discrete and hybrid scaling models. In E. D. Lantermann and H. Feger (eds.), Similarity and Choice. Bern (Switzerland): Huber.

L. Hubert and P. Arabie (1995). Iterative projection strategies for the least squares fitting of tree structures to proximity data. British Journal of Mathematical and Statistical Psychology, 48, 281–317. doi:10.1111/j.2044-8317.1995.tb01065.x.

M. Krivanek and J. Moravek (1986). NP-hard problems in hierarchical tree clustering. Acta Informatica, 23, 311–323. doi:10.1007/BF00289116.

M. Roux (1988). Techniques of approximation for building two tree structures. In C. Hayashi and E. Diday and M. Jambu and N. Ohsumi (Eds.), Recent Developments in Clustering and Data Analysis, pages 151–170. New York: Academic Press.

G. de Soete (1984). Ultrametric tree representations of incomplete dissimilarity data. Journal of Classification, 1, 235–242. doi:10.1007/BF01890124.

G. de Soete (1986). A least squares algorithm for fitting an ultrametric tree to a dissimilarity matrix. Pattern Recognition Letters, 2, 133–137. doi:10.1016/0167-8655(84)90036-9.

See Also

cl_consensus for computing least squares (Euclidean) consensus hierarchies by least squares fitting of average ultrametric distances; l1_fit_ultrametric.

Examples

## Least squares fit of an ultrametric to the Miller-Nicely consonant
## phoneme confusion data.
data("Phonemes")
## Note that the Phonemes data set has the consonant misclassification
## probabilities, i.e., the similarities between the phonemes.
d <- as.dist(1 - Phonemes)
u <- ls_fit_ultrametric(d, control = list(verbose = TRUE))
## Cophenetic correlation:
cor(d, u)
## Plot:
plot(u)
## ("Basically" the same as Figure 1 in de Soete (1986).)

Classes in a Partition

Description

Determine the number of classes and the class ids in a partition of objects.

Usage

n_of_classes(x)
cl_class_ids(x)
as.cl_class_ids(x)

Arguments

x

an object representing a (hard or soft) partition (for n_of_classes and cl_class_ids), or raw class ids (for as.cl_class_ids).

Details

These functions are generic functions.

The methods provided in package clue handle the partitions obtained from clustering functions in the base R distribution, as well as packages RWeka, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

Note that the number of classes is taken as the number of distinct class ids actually used in the partition, and may differ from the number of columns in a membership matrix representing the partition.

as.cl_class_ids can be used for coercing “raw” class ids (given as atomic vectors) to class id objects.

Value

For n_of_classes, an integer giving the number of classes in the partition.

For cl_class_ids, a vector of integers with the corresponding class ids. For soft partitions, the class ids returned are those of the nearest hard partition obtained by taking the class ids of the (first) maximal membership values.

See Also

is.cl_partition

Examples

data("Cassini")
party <- kmeans(Cassini$x, 3)
n_of_classes(party)
## A simple confusion matrix:
table(cl_class_ids(party), Cassini$classes)
## For an "oversize" membership matrix representation:
n_of_classes(cl_membership(party, 6))

Number of Objects in a Partition or Hierarchy

Description

Determine the number of objects from which a partition or hierarchy was obtained.

Usage

n_of_objects(x)

Arguments

x

an R object representing a (hard or soft) partition or a hierarchy of objects, or dissimilarities between objects.

Details

This is a generic function.

The methods provided in package clue handle the partitions and hierarchies obtained from clustering functions in the base R distribution, as well as packages RWeka, ape, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

There is also a method for object dissimilarities which inherit from class "dist".

Value

An integer giving the number of objects.

See Also

is.cl_partition, is.cl_hierarchy

Examples

data("Cassini")
pcl <- kmeans(Cassini$x, 3)
n_of_objects(pcl)
hcl <- hclust(dist(USArrests))
n_of_objects(hcl)

Partitions

Description

Determine whether an R object represents a partition of objects, or coerce to an R object representing such.

Usage

is.cl_partition(x)
is.cl_hard_partition(x)
is.cl_soft_partition(x)

as.cl_partition(x)
as.cl_hard_partition(x)

Arguments

x

an R object.

Details

is.cl_partition and is.cl_hard_partition are generic functions.

The methods provided in package clue handle the partitions obtained from clustering functions in the base R distribution, as well as packages RWeka, cba, cclust, cluster, e1071, flexclust, flexmix, kernlab, mclust, movMF and skmeans (and of course, clue itself).

is.cl_soft_partition gives true iff is.cl_partition is true and is.cl_hard_partition is false.

as.cl_partition returns an object of class "cl_partition" “containing” the given object x if this already represents a partition (i.e., is.cl_partition(x) is true), or the memberships obtained from x via as.cl_membership.

as.cl_hard_partition(x) returns an object which has class "cl_hard_partition" and inherits from "cl_partition", and contains x if it already represents a hard partition (i.e., provided that is.cl_hard_partition(x) is true), or the class ids obtained from x, using x if this is an atomic vector of raw class ids, or, if x represents a soft partition or is a raw matrix of membership values, using the class ids of the nearest hard partition, defined by taking the class ids of the (first) maximal membership values.

Conceptually, partitions and hard partitions are virtual classes, allowing for a variety of representations.

There are group methods for comparing partitions and computing their minimum, maximum, and range based on the meet and join operations, see cl_meet.

Value

For the testing functions, a logical indicating whether the given object represents a clustering of objects of the respective kind.

For the coercion functions, a container object inheriting from "cl_partition", with a suitable representation of the partition given by x.

Examples

data("Cassini")
pcl <- kmeans(Cassini$x, 3)
is.cl_partition(pcl)
is.cl_hard_partition(pcl)
is.cl_soft_partition(pcl)
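## Coercion from a raw membership matrix (a sketch); rows give the
## membership values of the individual objects.
m <- matrix(c(0.9, 0.1,
              0.4, 0.6,
              0.2, 0.8),
            nrow = 3, byrow = TRUE)
p <- as.cl_partition(m)
is.cl_soft_partition(p)
## Class ids of the nearest hard partition:
cl_class_ids(as.cl_hard_partition(p))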

Prototype-Based Partitioning

Description

Obtain prototype-based partitions of objects by minimizing the criterion \sum w_b u_{bj}^m d(x_b, p_j)^e, the sum of the case-weighted and membership-weighted e-th powers of the dissimilarities between the objects x_b and the prototypes p_j, for suitable dissimilarities d and exponents e.

Usage

pclust(x, k, family, m = 1, weights = 1, control = list())
pclust_family(D, C, init = NULL, description = NULL, e = 1,
              .modify = NULL, .subset = NULL)
pclust_object(prototypes, membership, cluster, family, m = 1,
              value, ..., classes = NULL, attributes = NULL)

Arguments

x

the object to be partitioned.

k

an integer giving the number of classes to be used in the partition.

family

an object of class "pclust_family" as generated by pclust_family, containing the information about dd and ee.

m

a number not less than 1 controlling the softness of the partition (as the “fuzzification parameter” of the fuzzy c-means algorithm). The default value of 1 corresponds to hard partitions obtained from a generalized k-means problem; values greater than one give partitions of increasing softness obtained from a generalized fuzzy c-means problem.

weights

a numeric vector of non-negative case weights. Recycled to the number of elements given by x if necessary.

control

a list of control parameters. See Details.

D

a function for computing dissimilarities d between objects and prototypes.

C

a ‘consensus’ function with formals x, weights and control for computing a consensus prototype p minimizing \sum_b w_b d(x_b, p)^e.

init

a function with formals x and k initializing an object with k prototypes from the object x to be partitioned.

description

a character string describing the family.

e

a number giving the exponent e of the criterion.

.modify

a function with formals x, i and value for modifying a single prototype, or NULL (default).

.subset

a function with formals x and i for subsetting prototypes, or NULL (default).

prototypes

an object representing the prototypes of the partition.

membership

an object of class "cl_membership" with the membership values ubju_{bj}.

cluster

the class ids of the nearest hard partition.

value

the value of the criterion to be minimized.

...

further elements to be included in the generated pclust object.

classes

a character vector giving further classes to be given to the generated pclust object in addition to "pclust", or NULL (default).

attributes

a list of attributes, or NULL (default).

Details

For m = 1, a generalization of the Lloyd-Forgy variant of the k-means algorithm is used, which iterates between reclassifying objects to their closest prototypes (according to the dissimilarities given by D), and computing new prototypes as the consensus for the classes (using C).

For m > 1, a generalization of the fuzzy c-means recipe (e.g., Bezdek (1981)) is used, which alternates between computing optimal memberships for fixed prototypes, and computing new prototypes as the suitably weighted consensus clusterings for the classes.

This procedure is repeated until convergence occurs, or the maximal number of iterations is reached.

Currently, no local improvement heuristics are provided.

It is possible to perform several runs of the procedure via control arguments nruns or start (the default is to perform a single run), in which case the first partition with the smallest value of the criterion is returned.

The dissimilarity and consensus functions as well as the exponent e are specified via family. In principle, arbitrary representations of objects to be partitioned and prototypes (which do not necessarily have to be “of the same kind”) can be employed. In addition to D and C, what is needed are means to obtain an initial collection of k prototypes (init), to modify a single prototype (.modify), and to subset the prototypes (.subset). By default, lists and (currently, only dense) matrices (with the usual convention that the rows correspond to the objects) are supported. Otherwise, the family has to provide the functions needed.

Available control parameters are as follows.

maxiter

an integer giving the maximal number of iterations to be performed. Defaults to 100.

nruns

an integer giving the number of runs to be performed. Defaults to 1.

reltol

the relative convergence tolerance. Defaults to sqrt(.Machine$double.eps).

start

a list of prototype objects to be used as starting values.

verbose

a logical indicating whether to provide some output on minimization progress. Defaults to getOption("verbose").

control

control parameters to be used in the consensus function.

The fixed point approach employed is a heuristic which cannot be guaranteed to find the global minimum, in particular if C is not an exact minimizer. Standard practice would recommend using the best solution found in “sufficiently many” replications of the base algorithm.

Value

pclust returns the partition found as an object of class "pclust" (as obtained by calling pclust_object) which in addition to the default components contains call (the matched call) and a converged attribute indicating convergence status (i.e., whether the maximal number of iterations was reached).

pclust_family returns an object of class "pclust_family", which is a list with components corresponding to the formals of pclust_family.

pclust_object returns an object inheriting from class "pclust", which is a list with components corresponding to the formals (up to and including ...) of pclust_object, and additional classes and attributes specified by classes and attributes, respectively.

References

J. C. Bezdek (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.

See Also

kmeans, cmeans.
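
Examples

## A hedged sketch of a simple Euclidean k-means family for numeric
## matrices (rows are the objects).  The calling convention
## D(x, prototypes), returning the matrix of object-to-prototype
## dissimilarities, is an assumption; C computes the weighted mean,
## which minimizes the weighted sum of squared Euclidean distances.
fam <- pclust_family(
    D = function(x, prototypes)
        apply(prototypes, 1,
              function(p) sqrt(rowSums(sweep(x, 2, p)^2))),
    C = function(x, weights, control)
        colSums(weights * x) / sum(weights),
    init = function(x, k) x[sample.int(nrow(x), k), , drop = FALSE],
    description = "Euclidean k-means (sketch)",
    e = 2)
data("Cassini")
set.seed(1234)
party <- pclust(Cassini$x, 3, fam)
table(cl_class_ids(party), Cassini$classes)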


Miller-Nicely Consonant Phoneme Confusion Data

Description

Miller-Nicely data on the auditory confusion of 16 consonant phonemes.

Usage

data("Phonemes")

Format

A symmetric matrix of the misclassification probabilities of 16 English consonant phonemes.

Details

Miller and Nicely (1955) obtained the confusions by exposing female subjects to a series of syllables consisting of one of the 16 consonants followed by the vowel ‘⁠a⁠’ under 17 different experimental conditions. The data provided are obtained from aggregating the six so-called flat-noise conditions in which only the speech-to-noise ratio was varied into a single matrix of misclassification frequencies.

Source

The data set is also contained in file ‘mapclus.data’ in the shell archive https://netlib.org/mds/mapclus.shar.

References

G. A. Miller and P. E. Nicely (1955). An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27, 338–352. doi:10.1121/1.1907526.
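
Examples

## An added usage sketch (not from the original page): treat the
## confusion probabilities as similarities, convert them to
## dissimilarities, and fit a least-squares ultrametric using
## ls_fit_ultrametric() from this package.
data("Phonemes")
d <- as.dist(1 - Phonemes)
plot(ls_fit_ultrametric(d))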


Solve Linear Sum Assignment Problem

Description

Solve the linear sum assignment problem using the Hungarian method.

Usage

solve_LSAP(x, maximum = FALSE)

Arguments

x

a matrix with nonnegative entries and at least as many columns as rows.

maximum

a logical indicating whether to minimize or maximize the sum of assigned costs.

Details

If nr and nc are the numbers of rows and columns of x, solve_LSAP finds an optimal assignment of rows to columns, i.e., a one-to-one map p of the numbers from 1 to nr to the numbers from 1 to nc (a permutation of these numbers in case x is a square matrix) such that \sum_{i=1}^{nr} x[i, p[i]] is minimized or maximized.

This assignment can be found using a linear program (and package lpSolve provides a function lp.assign for doing so), but typically more efficiently and provably in polynomial time O(n^3) using primal-dual methods such as the so-called Hungarian method (see the references).

Value

An object of class "solve_LSAP" with the optimal assignment of rows to columns.

Author(s)

Walter Böhm [email protected] kindly provided C code implementing the Hungarian method.

References

C. Papadimitriou and K. Steiglitz (1982), Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.

Examples

x <- matrix(c(5, 1, 4, 3, 5, 2, 2, 4, 4), nrow = 3)
solve_LSAP(x)
solve_LSAP(x, maximum = TRUE)
## To get the optimal value (for now):
y <- solve_LSAP(x)
sum(x[cbind(seq_along(y), y)])
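## A small added illustration (not from the original page): rectangular
## instances with more columns than rows are also allowed.
x2 <- cbind(x, c(1, 2, 3))
solve_LSAP(x2)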

Sequential Unconstrained Minimization Technique

Description

Solve constrained optimization problems via the Sequential Unconstrained Minimization Technique (SUMT).

Usage

sumt(x0, L, P, grad_L = NULL, grad_P = NULL, method = NULL,
     eps = NULL, q = NULL, verbose = NULL, control = list())

Arguments

x0

a list of starting values, or a single starting value.

L

a function to minimize.

P

a non-negative penalty function such that P(x) is zero iff the constraints are satisfied.

grad_L

a function giving the gradient of L, or NULL (default).

grad_P

a function giving the gradient of P, or NULL (default).

method

a character string, or NULL. If not given, "CG" is used. If equal to "nlm", minimization is carried out using nlm. Otherwise, optim is used with the given method.

eps

the absolute convergence tolerance. The algorithm stops if the (maximum) distance between successive x values is less than eps.

Defaults to sqrt(.Machine$double.eps).

q

a double greater than one controlling the growth of the \rho_k as described in Details.

Defaults to 10.

verbose

a logical indicating whether to provide some output on minimization progress.

Defaults to getOption("verbose").

control

a list of control parameters to be passed to the minimization routine in case optim is used.

Details

The Sequential Unconstrained Minimization Technique is a heuristic for constrained optimization. To minimize a function L subject to constraints, one employs a non-negative function P penalizing violations of the constraints, such that P(x) is zero iff x satisfies the constraints. One iteratively minimizes L(x) + \rho_k P(x), where the \rho values are increased according to the rule \rho_{k+1} = q \rho_k for some constant q > 1, until convergence is obtained in the sense that the Euclidean distance between successive solutions x_k and x_{k+1} is small enough. Note that the “solution” x obtained does not necessarily satisfy the constraints, i.e., have zero P(x). Note also that there is no guarantee that global (approximately) constrained optima are found. Standard practice would recommend using the best solution found in “sufficiently many” replications of the algorithm.

The unconstrained minimizations are carried out by either optim or nlm, using analytic gradients if both grad_L and grad_P are given, and numeric ones otherwise.

If more than one starting value is given, the solution with the minimal augmented criterion function value is returned.
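
As a minimal sketch (an added illustration, not from the original page): minimizing L(x) = x_1^2 + x_2^2 subject to x_1 + x_2 = 1 via the quadratic penalty P(x) = (x_1 + x_2 - 1)^2, whose constrained optimum is x = (0.5, 0.5):

    library("clue")
    L <- function(x) sum(x^2)
    P <- function(x) (sum(x) - 1)^2
    grad_L <- function(x) 2 * x
    grad_P <- function(x) rep(2 * (sum(x) - 1), length(x))
    out <- sumt(c(0, 0), L, P, grad_L, grad_P)
    out$x  ## approximately c(0.5, 0.5)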

Value

A list inheriting from class "sumt", with components x, L, P, and rho giving the solution obtained, the values of the criterion and penalty functions at x, and the final \rho value used in the augmented criterion function.

References

A. V. Fiacco and G. P. McCormick (1968). Nonlinear programming: Sequential unconstrained minimization techniques. New York: John Wiley & Sons.