Title: | Silly Putty Clustering |
---|---|
Description: | Implements a simple, novel clustering algorithm based on optimizing the silhouette width. See <doi:10.1101/2023.11.07.566055> for details. |
Authors: | Kevin R. Coombes, Dwayne Tally |
Maintainer: | Kevin R. Coombes <[email protected]> |
License: | Apache License (== 2.0) |
Version: | 0.4.1 |
Built: | 2024-11-05 06:33:50 UTC |
Source: | CRAN |
The Euclidean distance matrix between 150 objects, used to illustrate the SillyPutty algorithms.
data(eucdist)
The binary R data file contains two objects. First, a dist
object representing Euclidean distances between 150 samples. Second, a
vector of the known (simulated) true groups to which each sample
belongs.
This data set was generated in the SillyPutty vignette from
tools in the Umpire R package. The simulated data was intended
to have five different clusters, all of approximately the same size.
Noise was added to make the clusters somewhat harder for most
algorithms to distinguish. The same data set is used in most of the
examples in the man pages.
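To make the structure of such a data set concrete, the following base R sketch builds a comparable dist object: five equally sized groups with added Gaussian noise. It is only an illustration under assumed settings (30 samples per group, 10 features, the names sim, truth, and simdist); it does not use Umpire and is not the code that produced eucdist.

```r
set.seed(2024)
n_per_group <- 30                       # assumed: 5 groups x 30 = 150 samples
n_features  <- 10                       # assumed number of features

# One center per group, spread out so the groups are distinguishable
centers <- matrix(rnorm(5 * n_features, sd = 3), nrow = 5)

# Known (simulated) true group of each sample
truth <- rep(1:5, each = n_per_group)

# Each sample is its group center plus Gaussian noise
noise <- matrix(rnorm(length(truth) * n_features, sd = 1), ncol = n_features)
sim <- centers[truth, ] + noise

# dist() computes Euclidean distances by default
simdist <- dist(sim)
class(simdist)
attr(simdist, "Size")
```

Increasing the noise standard deviation relative to the spread of the centers makes the groups harder to separate.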
This data set was generated in the SillyPutty vignette from the
tools in the Umpire R package, and saved using code that is now
hidden and disabled in the vignette source.
data(eucdist)
class(eucdist)
attr(eucdist, "Size")
A function that is designed to find an approximation of the true
number, K, of clusters in a data set. The findClusterNumber
function calls RandomSillyPutty for each value of K in the
range from start to end, performing N random starts each time.
NOTE: start must be greater than 1. The function can be slow, depending on the complexity of the data set and the number of iterations N.
findClusterNumber(distobj, start, end, N = 100, method = c("SillyPutty", "HCSP"), ...)
distobj | An object of class |
start | The minimum cluster number for the range of clusters |
end | The maximum cluster number for the range of clusters |
N | Number of iterations |
method | whether to use the full |
... | Extra arguments to the |
The findClusterNumber function processes one distance matrix at
a time, through N iterations. It returns a list of the maximum
silhouette width values obtained from the N iterations, each with
its associated cluster number.
A list containing the maximum silhouette width values per K clusters for each K in the range of possible cluster numbers.
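The idea of scanning a range of K and comparing silhouette widths can be sketched with base R and the cluster package alone. This is a simplified illustration, not the package's implementation: it scores plain hierarchical clustering cuts instead of calling RandomSillyPutty, and the simulated data and the names ks, mean_sw, and best_k are assumptions made for the example.

```r
library(cluster)

set.seed(1)
# Illustrative data: three well-separated groups in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 6),  ncol = 2),
           matrix(rnorm(40, mean = 12), ncol = 2))
d <- dist(x)
hc <- hclust(d, method = "ward.D2")

# Candidate cluster numbers; like 'start', the minimum must exceed 1
ks <- 2:6

# Mean silhouette width of the K-cluster cut, for each candidate K
mean_sw <- sapply(ks, function(k) {
  labels <- cutree(hc, k = k)
  mean(silhouette(labels, d)[, "sil_width"])
})
names(mean_sw) <- ks

# The K with the largest mean silhouette width is the suggested number
best_k <- ks[which.max(mean_sw)]
mean_sw
best_k
```

Plotting mean_sw against ks gives the same kind of curve as the example for findClusterNumber.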
Kevin R. Coombes [email protected], Dwayne G. Tally [email protected]
Pending.
data(eucdist)
set.seed(12)
y <- findClusterNumber(eucdist, start = 3, end = 7, method = "HCSP")
plot(names(y), y, xlab = "K", ylab = "Mean Silhouette Width",
     type = "b", lwd = 2, pch = 16)
Our simulations revealed that the fastest and most accurate
clustering algorithm for modest-sized continuous data sets is the
combination of hierarchical clustering (with Ward's linkage rule)
followed by SillyPutty. The function HCSP implements this
combination.
HCSP(dis, K, method = "ward.D2", ...)
dis | An object of class |
K | The desired number of clusters. |
method | Same as the corresponding argument for |
... | Extra arguments to the |
The HCSP function first runs hierarchical clustering, then
applies the SillyPutty algorithm.
A list containing two items: hc, the results of hierarchical
clustering, and sp, a SillyPutty object obtained by applying the
algorithm to the result of cutting the dendrogram to produce K
clusters.
Kevin R. Coombes [email protected]
Polina Bombina, Dwayne Tally, Zachary B. Abrams, Kevin R. Coombes. SillyPutty: Improved clustering by optimizing the silhouette width, bioRxiv 2023.11.07.566055; doi: https://doi.org/10.1101/2023.11.07.566055
data(eucdist)
set.seed(1234)
twostep <- HCSP(eucdist, K=5)
sw <- cluster::silhouette(twostep$sp@cluster, eucdist)
plot(sw)
A function to perform cluster assignments on a distance matrix based on optimizing silhouette width. The cluster assignments are based on maximum and minimum silhouette width scores obtained from N iterations.
RandomSillyPutty(distobj, K, N = 100, verbose = FALSE, ...)
## S4 method for signature 'RandomSillyPutty,matrix'
plot(x, y, distobj, col = NULL, ...)
## S4 method for signature 'RandomSillyPutty,missing'
plot(x, y, ...)
## S4 method for signature 'RandomSillyPutty'
summary(object, ...)
## S4 method for signature 'RSPSummary,missing'
plot(x, y, ...)
distobj | An object of class |
K | The number of clusters. |
N | The number of iterations you want to run. |
verbose | A logical value; should you print info while working |
... | Extra arguments to the |
x | An object of the |
object | An object of the |
y | A layout matrix. |
col | A character vector containing color names. |
The RandomSillyPutty function processes one distance matrix
at a time, given a prespecified number of clusters K and a number of
iterations N. RandomSillyPutty returns an S4 object. The MX
component of this object contains an integer vector of cluster
assignments from the iteration with the best silhouette width score
among the N iterations. The MN component contains an integer vector
of cluster assignments from the iteration with the worst silhouette
width score. The ave component contains the average silhouette width
over all N iterations. The labels component is a data frame containing
the cluster assignment of the best scoring silhouette width score per
iteration. The minItemSW component is a list containing the silhouette
width scores of all N iterations.
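The bookkeeping described above (keep the best and worst result over N random starts, plus their average) can be sketched in a few lines. This is a simplified illustration and not the package's code: it scores the random labelings directly and leaves out any per-start optimization, and the simulated data and the names starts, scores, best, worst, and ave_sw are assumptions made for the example.

```r
library(cluster)

set.seed(12)
# Illustrative data: two separated groups in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
d <- dist(x)
K <- 2       # number of clusters
N <- 25      # number of random starts

n <- attr(d, "Size")

# N random labelings; shuffling rep_len() keeps every cluster non-empty
starts <- replicate(N, sample(rep_len(seq_len(K), n)), simplify = FALSE)

# Mean silhouette width of each random labeling
scores <- vapply(starts,
                 function(lab) mean(silhouette(lab, d)[, "sil_width"]),
                 numeric(1))

best   <- starts[[which.max(scores)]]   # plays the role of MX
worst  <- starts[[which.min(scores)]]   # plays the role of MN
ave_sw <- mean(scores)                  # plays the role of ave
```

Because the labelings here are never refined, their scores stay close to zero; the spread between best and worst only becomes informative once each start is optimized.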
The constructor function, RandomSillyPutty, returns an object
of the RandomSillyPutty class.
MX: An integer vector containing cluster assignment that had the best silhouette width from running the iterations
MN: An integer vector containing cluster assignment that had the worst silhouette width from running the iterations
ave: An integer vector of average distribution of the silhouette width scores throughout N iterations
labels: A data frame of the cluster assignments of the best silhouette width score.
minItemSW: A list of silhouette width scores of all N iterations
signature(x = "RandomSillyPutty", y = "matrix")
: Plot the
clusterings with the maximum and minimum global silhouette widths.
signature(x = "RandomSillyPutty")
: .
Kevin R. Coombes [email protected], Dwayne G. Tally [email protected]
Pending.
data(eucdist)
# 'eucdist' is the Euclidean distance matrix (i.e., 'dist' object) from
# a simulated data set with 150 elements and 5 groups.
set.seed(12)
y <- RandomSillyPutty(eucdist, 6, N = 100)
summary(y)
A function that takes an existing starting point from an unsupervised clustering attempt, e.g., a k-means or hierarchical cluster assignment. SillyPutty optimizes the pre-existing cluster assignments based on the best silhouette width score.
SillyPutty(labels, dissim, maxIter = 1000, loopSize = 15, verbose = FALSE)
labels | A numeric vector containing pre-computed cluster labels |
dissim | An object of class |
maxIter | A numeric vector of length one; the maximum number of individual steps, each of which reclassifies only one object |
loopSize | How many steps to retain in memory to test if you have entered an infinite loop while rearranging objects. |
verbose | A logical vector of length one; should you output a lot of information while running? |
The SillyPutty function takes a pre-computed cluster assignment
along with a distance matrix and returns an S4 object. The cluster
component is a list of the new cluster assignments based on the best
silhouette width score. The silhouette component is a data frame
containing the silhouette width scores calculated by SillyPutty. The
minSW component contains a positive and a negative version of the
minimum silhouette width score. The meanSW component is a double
vector that holds the average silhouette width score after applying
SillyPutty to the cluster assignment.
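The kind of single-object reclassification that maxIter counts can be sketched with cluster::silhouette. The helper silly_sketch below is hypothetical and is not the package's implementation. It assumes that each step moves the object with the lowest silhouette width to its nearest neighboring cluster and stops once no silhouette width is negative. It also omits the loop detection controlled by loopSize and relies on maxIter alone to guarantee termination.

```r
library(cluster)

# Hypothetical helper, NOT part of the SillyPutty package
silly_sketch <- function(labels, dissim, maxIter = 1000) {
  labels <- as.integer(labels)
  for (iter in seq_len(maxIter)) {
    sw <- silhouette(labels, dissim)
    worst <- which.min(sw[, "sil_width"])
    # Stop when no object has a negative silhouette width
    if (sw[worst, "sil_width"] >= 0) break
    # Assumption: reclassify only the worst object, into its neighbor cluster
    labels[worst] <- as.integer(sw[worst, "neighbor"])
  }
  labels
}

set.seed(12)
# Illustrative data: three separated groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 6),  ncol = 2),
           matrix(rnorm(40, mean = 12), ncol = 2))
d <- dist(x)

# Deliberately imperfect starting labels: two objects put in the wrong group
start <- rep(1:3, each = 20)
start[c(1, 2)] <- 2L

fixed  <- silly_sketch(start, d)
before <- mean(silhouette(start, d)[, "sil_width"])
after  <- mean(silhouette(fixed, d)[, "sil_width"])
c(before = before, after = after)
```

Moving one object per step keeps every change easy to audit, but on harder data the labels can oscillate between states, which is why a record of recent steps such as loopSize is useful.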
The constructor function, SillyPutty, returns an object of
the SillyPutty class.
cluster: A list containing the adjusted cluster assignment that had the best silhouette width.
silhouette: A dataframe containing the silhouette width scores.
minSW: A silhouette double vector that contains the positive and negative version of the minimum silhouette width value.
meanSW: A double vector that contains the average silhouette width value.
Kevin R. Coombes [email protected], Dwayne G. Tally [email protected]
Pending
data(eucdist)
set.seed(12)
hc <- hclust(eucdist, "ward.D2")
clues <- cutree(hc, k = 5)
hcSilly <- SillyPutty(clues, eucdist)