---
title: "Introduction to Silly Putty"
author: "Dwayne Tally, Zachary B. Abrams, and Kevin R. Coombes"
date: "2024-11-05"
output:
  rmarkdown::html_document:
    theme: journal
    highlight: kate
---

Introduction

In many diseases, such as cancer, it is important to have a clear understanding of which potential clinical subgroup an individual patient belongs to. Unsupervised clustering is a useful analytic tool to address this problem. A variety of clustering methods already exist and are differentiated by the kinds of outcome measures they are intended to optimize. For example, K-means is designed to minimize the within-cluster sum of squared errors. Partitioning around medoids (PAM) generalizes this idea from the Euclidean distance metric defined by sums of squares to an arbitrary distance metric. The different linkage rules used in hierarchical clustering methods also change the nature of the value being optimized.
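
Concretely, for clusters $C_1, \ldots, C_K$ with centroids $\mu_1, \ldots, \mu_K$, the standard K-means objective is

$$ W = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 , $$

while PAM minimizes the analogous sum of dissimilarities $\sum_{k} \sum_{x \in C_k} d(x, m_k)$, where each medoid $m_k$ must itself be a member of cluster $C_k$.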

Ever since Kaufman and Rousseeuw introduced the idea of the silhouette width, researchers have used its average value to select the best method when applying different clustering methods to the same data set. To our knowledge, no one has tried to use the silhouette width as a quantity to be optimized directly when finding clusters. To test the idea that optimizing the silhouette width could be used to cluster elements, we developed a novel algorithm that we call “SillyPutty”. In brief, after elements have been assigned to clusters, we can calculate the silhouette width (SW) of each element, yielding numbers between -1 and +1. A positive value of SW indicates that an element is likely to be properly clustered, while a negative value of SW indicates that the element is probably not in the correct cluster. The repeated step in the SillyPutty algorithm is to reclassify the element with the most negative silhouette width by placing it into the cluster to which it is closest. This process can (usually) be repeated until there are no negative silhouette widths left in the data. (There is a small chance that the algorithm will fail to converge by entering an infinite loop in which the same elements are repeatedly rearranged, returning to an earlier configuration.)
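
For a point $i$ assigned to cluster $C$, let $a(i)$ denote its average distance to the other members of $C$, and let $b(i)$ denote the smallest average distance from $i$ to the members of any other cluster. The silhouette width is then

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} . $$

To make the “move the worst element” step concrete, here is a minimal sketch of the idea (not the actual implementation in the SillyPutty package), using the silhouette function from the cluster package. It assumes that clusters is an integer vector of cluster labels and that dmat is a dist object.

library(cluster)
sillySketch <- function(clusters, dmat, maxIter = 1000) {
  for (iter in seq_len(maxIter)) {
    sil <- silhouette(clusters, dmat)   # columns: cluster, neighbor, sil_width
    worst <- which.min(sil[, "sil_width"])
    if (sil[worst, "sil_width"] >= 0) break   # nothing is misplaced; stop
    ## move the worst-placed element to its nearest other cluster
    clusters[worst] <- sil[worst, "neighbor"]
  }
  clusters
}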

Setup

We must first load the necessary packages.

library(SillyPutty)
library(Umpire)
suppressMessages( library(Mercator) )
suppressMessages( library(mclust) ) # for adjusted rand index

Generating and Formatting Data

We use the Umpire R package (version 2.0.10) to generate more complex and realistic synthetic data. We then compute the Euclidean distances between elements. Then, we use the Mercator R package (version 1.1.5) to visualize the data. Finally, we use the mclust R package (version 6.1.1) to compute the Adjusted Rand index (ARI), a measure of cluster quality that compares clusters to externally known truth.

Assign Umpire Model Parameters

The next chunk of code creates the objects that we will use to simulate a data set. We set things up to represent a kind of cancer with four subtypes corresponding to recurrent sets of “hits”, where each hit can be thought of as an abstract “mutation” that affects the expression of a pathway of related genes.

set.seed(21315)
trueK <- 4
## Set up survival outcome; baseline is exponential
sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1)
## Build a CancerModel with four subtypes
nBlocks <- 20    
cm <- CancerModel(name="cansim",
                  nPossible=nBlocks,
                  nPattern=trueK,
                  OUT = function(n) rnorm(n, 0, 1), 
                  SURV= function(n) rnorm(n, 0, 1),
                  survivalModel=sm)
## Include 100 blocks/pathways that are not hit by cancer
nTotalBlocks <- nBlocks + 100
## Assign values to hyperparameters
## block size
blockSize <- round(rnorm(nTotalBlocks, 100, 30))
## log normal mean hyperparameters
mu0    <- 6
sigma0 <- 1.5
## log normal sigma hyperparameters
rate   <- 28.11
shape  <- 44.25
## block correlation
p <- 0.6
w <- 5
## Set up the baseline Engine
rho <- rbeta(nTotalBlocks, p*w, (1-p)*w)
base <- lapply(1:nTotalBlocks,
               function(i) {
                 bs <- blockSize[i]
                 co <- matrix(rho[i], nrow=bs, ncol=bs)
                 diag(co) <- 1
                 mu <- rnorm(bs, mu0, sigma0)
                 sigma <- matrix(1/rgamma(bs, rate=rate, shape=shape), nrow=1)
                 covo <- co *(t(sigma) %*% sigma)
                 MVN(mu, covo)
               })
eng <- Engine(base)
## Alter the means if there is a hit
altered <- alterMean(eng, normalOffset, delta=0, sigma=1)
## Build the CancerEngine using character strings
object <- CancerEngine(cm, "eng", "altered")
rm(sm, nBlocks, cm, nTotalBlocks, blockSize, mu0, sigma0, rate, shape, p, w, rho, base, eng, altered)

Simulate Data

Now we can take a random sample of 144 elements from the distribution that we just defined.

trueN <- 144
dset <- rand(object, trueN, keepall = TRUE) # contains two objects
labels <- dset$clinical$CancerSubType # the true clusters/types
d1 <- dset$data # the noise-free simulated data

To make our data set even more realistic, we are going to add noise that mimics what happens in some biological assays.

SpecialNoise <- function(nFeat, nu = 0.1, shape = 1.02, scale = 0.05/shape) {
  NoiseModel(nu = nu,
             tau = rgamma(nFeat, shape = shape, scale = scale),
             phi = 0)
}
nm <- SpecialNoise(nrow(d1), nu = 0)
d1 <- blur(nm, d1)
dim(d1)
## [1] 11808   144

Euclidean Distance Matrix

Now we compute the Euclidean distances between pairs of elements in our simulated data set.

tdis <- t(d1)
dimnames(tdis) <- list(paste("Sample", 1:nrow(tdis), sep=''),
                       paste("Feature", 1:ncol(tdis), sep=''))
dis <- dist(tdis)   ## This step is the rate-limiting factor. The only way to speed it up is to use fewer samples.
names(labels) <- rownames(tdis)

Mercator Visualization

As noted above, we will use the Mercator package for visualization. This function will ensure that we generate consistent sets of pictures.

mercViews <- function(object, main, tag = NULL) {
  opar <- par(mfrow = c(2, 2))
  on.exit(par(opar))
  pts <- barplot(object, main = main)
  if (!is.null(tag)) {
    gt <- as.vector(as.matrix(table(getClusters(object))))
    loc <- pts[round((c(0, cumsum(gt))[-(1 + length(gt))] + cumsum(gt))/2)]
    mtext(tag, side = 1, line = 0, at = loc, col = object@palette, font = 2)
  }
  plot(object, view = "tsne", main = "t-SNE")
  plot(object, view = "hclust")
  plot(object, view = "mds", main = "MDS")
}

Different Clustering Methods

We will apply various clustering methods to the data (represented primarily through its distance matrix). We want to demonstrate with this example that SillyPutty clustering can do a better job than hierarchical clustering or PAM.

Hierarchical Clustering

Figure 1 presents multiple views of the Euclidean distances between elements of our simulated data set. Since we know that we started with 4 clusters, we chose that as the number to find using the default hierarchical clustering method with Ward’s linkage rule. (We will later illustrate how to use SillyPutty to find the number of clusters.)

The silhouette width plot in the upper left panel indicates that each of the clusters contains some poorly classified elements, identified by their negative silhouette widths. Both the multidimensional scaling (MDS) plot in the lower right and the t-distributed stochastic neighbor embedding (t-SNE) plot in the upper right clearly display colored points that appear to be in the wrong regions.

set.seed(1987)
vis <- Mercator(dis, "euclid", "hclust", K = trueK)
palette <- vis@palette[c(1:3, 7, 8, 6, 10, 4, 11, 5, 15, 14, 17:18, 9, 12, 16, 19:24)]
vis@palette <- palette
vis <- addVisualization(vis, "mds")
vis <- addVisualization(vis, "tsne")
mercViews(vis, "Hierarchical Clustering, Five Clusters")
Figure 1 : Hierarchical Clustering, with four clusters.

The adjusted Rand index isn’t very good, either.

ari.hier <- adjustedRandIndex(labels, vis@clusters)
ari.hier
## [1] 0.7499108

Graphing Truth

Since we know the truth, we can reassign the clusters inside the Mercator object to see what everything is supposed to look like (Figure 2). Notice that the silhouette width plot agrees that everything is in the right place, and that the MDS and t-SNE plots are also consistent.

truebin <- remapColors(vis, setClusters(vis, labels))
mercViews(truebin, main = "True Cluster Types", 
          tag = unique(sort(labels)))
Figure 2 : Visualization of true cancer clusters.

PAM Clustering

Here we apply PAM clustering to the same distance matrix (Figure 3). These results are clearly much worse than hierarchical clustering.

pc <- pam(dis, k = trueK, diss=TRUE)
pamc <- remapColors(vis, setClusters(vis, pc$clustering))
mercViews(pamc, main = "PAM, K = 4", 
          tag = paste("P", 1:trueK, sep = ""))
Figure 3 : PAM Clustering, K = 4.

ari.pam <- adjustedRandIndex(labels, pamc@clusters)
ari.pam
## [1] 0.4450122

SillyPutty Clustering

RandomSillyPutty is the core of the SillyPutty package. It takes a distance matrix, the desired number of clusters K, and the number N of times you want to apply SillyPutty to the data set. Each time, you start with different random cluster assignments. You then apply the “move the worst element” algorithm described above. RandomSillyPutty saves the best and worst silhouette width scores along with their associated data clusters.

set.seed(12)
y2 <- suppressWarnings(RandomSillyPutty(dis, trueK, N = 100)) ## this is also slow
ari.max <- adjustedRandIndex(truebin@clusters, y2@MX)
ari.min <- adjustedRandIndex(truebin@clusters, y2@MN)
ari.max
## [1] 0.9833938
ari.min
## [1] 0.6565152

The adjusted Rand index of the best SillyPutty clustering is 0.98, meaning that we have almost completely recovered the true cluster assignments present in the data. Note that even the worst result that we obtained from SillyPutty has a better ARI (0.66) than PAM (0.45), though not quite as good as hierarchical clustering (0.75).
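
For a quick side-by-side comparison, we can collect the adjusted Rand indices computed so far into a single named vector (all four objects were created in the chunks above).

round(c(hclust = ari.hier, pam = ari.pam,
        silly.best = ari.max, silly.worst = ari.min), 2)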

We can now update the Mercator object using the cluster assignments defined by the best SillyPutty result (Figure 4). The silhouette width plot now says that it thinks all elements are in good clusters, and both the MDS and t-SNE plots support that conclusion.

randSillyBin <- remapColors(vis, setClusters(vis, y2@MX))
mercViews(randSillyBin, main = "SillyPutty Cluster Types, K = 4", 
          tag = paste("C", 1:trueK, sep = ""))
Figure 4 : Random SillyPutty clustering, K = 4.

We can also plot the cluster assignments that had the maximum and minimum silhouette widths from running the algorithm. We will use the multidimensional scaling layout and the color palette from the Mercator object.

plot(y2, randSillyBin@view[["mds"]], distobj = dis, col = randSillyBin@palette)
Figure 5 : Cluster assignments with best and worst silhouette widths after random starts.

summary(y2)
## Mean Silhouette Widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02218 0.02330 0.02330 0.02329 0.02330 0.02330

Combining SillyPutty With Hierarchical Clustering

Since SillyPutty can start with any existing cluster assignments (even random ones, as we just saw), we can combine it with any other method. Here we start with the results of hierarchical clustering and take one pass of SillyPutty to “improve” its results. To apply SillyPutty to the output of a precomputed clustering algorithm, you need the cluster assignments from that algorithm and the distance matrix of the data set. SillyPutty then refines the clusters from that starting point and returns the best silhouette width score along with the new cluster assignments.

hierSilly <- SillyPutty(vis@clusters, dis)
hierSillyBin <- remapColors(vis, setClusters(vis, hierSilly@cluster))
mercViews(hierSillyBin, main = "HClust + Silly, K = 4", tag = paste("C", 1:trueK, sep = ""))
Figure 6 : Hierarchical Clustering + SillyPutty, K = 4.

ari.Sillyhier <- adjustedRandIndex(labels, hierSillyBin@clusters)
ari.Sillyhier
## [1] 0.9833938

Finding the Number of Clusters With SillyPutty

The findClusterNumber function determines the best mean silhouette width achievable by SillyPutty for a range of cluster numbers. You can then use these best silhouette widths to approximate the actual number of clusters in the data set. Here we search over K from 2 to 12, using the "HCSP" method, which starts each run from a hierarchical clustering and refines it with SillyPutty.

Figure 7 shows the best silhouette width achieved with each possible number of clusters. The best overall value occurs when K = 4, which is the true number of clusters.

y <- findClusterNumber(dis, start = 2, end = 12, method = "HCSP")
plot(names(y), y, xlab = "K", ylab = "Silhouette Width", type = "b", lwd = 2, pch = 16)
Figure 7 : Best mean silhouette width, by number of clusters, found by combining hierarchical clustering with SillyPutty.

Appendix

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] mclust_6.1.1         Mercator_1.1.5       Thresher_1.1.4       PCDimension_1.1.13  
## [5] ClassDiscovery_3.4.5 oompaBase_3.2.9      cluster_2.1.6        Umpire_2.0.10       
## [9] SillyPutty_0.4.1    
## 
## loaded via a namespace (and not attached):
##  [1] ade4_1.7-22          changepoint_2.3      tidyselect_1.2.1     viridisLite_0.4.2   
##  [5] dplyr_1.1.4          viridis_0.6.5        fastmap_1.2.0        digest_0.6.37       
##  [9] lifecycle_1.0.4      mc2d_0.2.1           oompaData_3.1.4      cpm_2.3             
## [13] magrittr_2.0.3       kernlab_0.9-33       compiler_4.4.2       rlang_1.1.4         
## [17] sass_0.4.9           tools_4.4.2          igraph_2.1.1         utf8_1.2.4          
## [21] knitr_1.48           ggsignif_0.6.4       askpass_1.2.1        scatterplot3d_0.3-44
## [25] reticulate_1.39.0    abind_1.4-8          KernSmooth_2.23-24   Rtsne_0.17          
## [29] purrr_1.0.2          sys_3.4.3            nnet_7.3-19          grid_4.4.2          
## [33] stats4_4.4.2         fansi_1.0.6          ggpubr_0.6.0         movMF_0.2-8         
## [37] colorspace_2.1-1     ggplot2_3.5.1        scales_1.3.0         MASS_7.3-61         
## [41] cli_3.6.3            mvtnorm_1.3-2        rmarkdown_2.29       generics_0.1.3      
## [45] umap_0.2.10.0        RSpectra_0.16-2      cachem_1.1.0         modeltools_0.2-23   
## [49] vctrs_0.6.5          Matrix_1.7-1         jsonlite_1.8.9       slam_0.1-54         
## [53] carData_3.0-5        car_3.1-3            rstatix_0.7.2        Formula_1.2-5       
## [57] kohonen_3.0.12       maketools_1.3.1      dendextend_1.18.1    jquerylib_0.1.4     
## [61] tidyr_1.3.1          BimodalIndex_1.1.9   glue_1.8.0           Polychrome_1.5.1    
## [65] gtable_0.3.6         munsell_0.5.1        tibble_3.2.1         pillar_1.9.0        
## [69] htmltools_0.5.8.1    openssl_2.2.2        R6_2.5.1             evaluate_1.0.1      
## [73] lattice_0.22-6       highr_0.11           png_0.1-8            backports_1.5.0     
## [77] broom_1.0.7          bslib_0.8.0          Rcpp_1.0.13-1        flexmix_2.3-19      
## [81] gridExtra_2.3        xfun_0.49            zoo_1.8-12           buildtools_1.0.0    
## [85] pkgconfig_2.0.3