title: "Introduction to Silly Putty" author: "Dwayne Tally, Zachary B. Abrams, and Kevin R. Coombes" date: "`r Sys.Date()`" output: rmarkdown::html_document: theme: journal highlight: kate vignette: > %\VignetteIndexEntry{Introduction to SillyPutty} %\VignetteKeywords{SillyPutty,Clustering Algorithm,Clusters,Graphics} %\VignetteDepends{SillyPutty} %\VignettePackage{SillyPutty} %\VignetteEngine{knitr::rmarkdown} --- ```{r opts, echo=FALSE} knitr::opts_chunk$set(fig.width=8, fig.height=8) options(width=96) .format <- knitr::opts_knit$get("rmarkdown.pandoc.to") .tag <- function(N, cap ) ifelse(.format == "html", paste("Figure", N, ":", cap), cap) ``` ```{r mycss, results="asis", echo=FALSE} cat(' ') ``` # Introduction In many diseases, such as cancer, it is important to have a clear understanding of what potential clinical subgroup an individual patient belongs to. Unsupervised clustering is a useful analytic tool to address this problem. A variety of clustering methods already exist and are differentiated by the kinds of outcome measures they are intended to optimize. For example, K-means is designed to minimize the within-cluster sum of square errors. Partitioning around medoids generalizes this idea from the Euclidean distance metric defined by sums of squares to an arbitrary distance metric. The different linkage rules used in hierarchical clustering methods also change the nature of the value being optimized. Ever since Kaufmann and Rooseeuw introduced the idea of the silhouette width, researches have used its average value to select the best method when applying different clustering methods to the same data set. To out knowledge, no one has tried to use silhouette width as a quantity to be optimized directly when finding clusters. To test the idea that optimizing the silhouette width could be used to cluster elements, we developed a novel algorithm that we call "SillyPutty". In brief, after elements have been assigned to clusters, we can calculate the silhouette width (SW) of each element, yielding numbers between -1 and +1. A positive value of SW indicates that an element is likely to be properly clustered, while a negative value of SW indicates the element is probably not in the correct cluster. The repeated step in the SillyPutty algorithm is to reclassify the element with the most negative silhouette width by placing it into the cluster to which it is closest. This process can (usually) be repeated until there are no negative silhouette widths present in the data. (There is a small chance that this algorithm will fail to converge by entering a small infinite loop where the same elements are rearranged to get back to an earlier configuration.) ## Setup We must first load the necessary packages. ```{r Setup} library(SillyPutty) library(Umpire) suppressMessages( library(Mercator) ) suppressMessages( library(mclust) ) # for adjusted rand index ``` # Generating and Formatting Data We use the Umpire R package (version `r packageVersion("Umpire")`) to generate more complex and realistic synthetic data. We then compute the Euclidean distances between elements. Then, we use the Mercator R package (version `r packageVersion("Mercator")`) to visualize the data. Finally, we use the mclust R package (version `r packageVersion("mclust")`) to compute the Adjusted Rand index (ARI), a measure of cluster quality that compares clusters to externally known truth. ## Assign Umpire Model Parameters The next chunk of code creates the objects that we will use to simulate a data set. We set things up to represent a kind of cancer with four subtypes corresponding to recurrent sets of "hits", where each hit can be thought of as an abstract "mutation" that affects the expression of a pathway of related genes. ```{r genData} set.seed(21315) trueK <- 4 ## Set up survival outcome; baseline is exponential sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1) ## Build a CancerModel with four subtypes nBlocks <- 20 cm <- CancerModel(name="cansim", nPossible=nBlocks, nPattern=trueK, OUT = function(n) rnorm(n, 0, 1), SURV= function(n) rnorm(n, 0, 1), survivalModel=sm) ## Include 100 blocks/pathways that are not hit by cancer nTotalBlocks <- nBlocks + 100 ## Assign values to hyperparameters ## block size blockSize <- round(rnorm(nTotalBlocks, 100, 30)) ## log normal mean hyperparameters mu0 <- 6 sigma0 <- 1.5 ## log normal sigma hyperparameters rate <- 28.11 shape <- 44.25 ## block correlation p <- 0.6 w <- 5 ## Set up the baseline Engine rho <- rbeta(nTotalBlocks, p*w, (1-p)*w) base <- lapply(1:nTotalBlocks, function(i) { bs <- blockSize[i] co <- matrix(rho[i], nrow=bs, ncol=bs) diag(co) <- 1 mu <- rnorm(bs, mu0, sigma0) sigma <- matrix(1/rgamma(bs, rate=rate, shape=shape), nrow=1) covo <- co *(t(sigma) %*% sigma) MVN(mu, covo) }) eng <- Engine(base) ## Alter the means if there is a hit altered <- alterMean(eng, normalOffset, delta=0, sigma=1) ## Build the CancerEngine using character strings object <- CancerEngine(cm, "eng", "altered") rm(sm, nBlocks, cm, nTotalBlocks, blockSize, mu0, sigma0, rate, shape, p, w, rho, base, eng, altered) ``` ## Simulate Data Now we can take a random sample of 144 elements from the distribution that we just defined. ```{r simData} trueN <- 144 dset <- rand(object, trueN, keepall = TRUE) # contains two objects labels <- dset$clinical$CancerSubType # the true clusters/types d1 <- dset$data # the noise-free simulated data ``` To make our data set even more realistic, we are going to add noise that mimics what happens in some biological assays. ```{r noiseModel} SpecialNoise <- function(nFeat, nu = 0.1, shape = 1.02, scale = 0.05/shape) { NoiseModel(nu = nu, tau = rgamma(nFeat, shape = shape, scale = scale), phi = 0) } nm <- SpecialNoise(nrow(d1), nu = 0) d1 <- blur(nm, d1) dim(d1) ``` ## Euclidean Distance Matrix Now we compute the Euclidean distances between pairs of elements in our simulated data set. ```{r distancematrix} tdis <- t(d1) dimnames(tdis) <- list(paste("Sample", 1:nrow(tdis), sep=''), paste("Feature", 1:ncol(tdis), sep='')) dis <- dist(tdis) ## This step is the rate-liomiting factor. Only way to speed up is to use fewerw samples names(labels) <- rownames(tdis) ``` ```{r eval=FALSE, echo=FALSE, results='hide'} dataset <- tdis eucdist <- dis trueGroups <- labels save(eucdist, trueGroups, file="../data/eucdist.rda") ``` ## Mercator Visualization As noted above, we will use the Mercator package for visualization. This function will ensure that we generate consistent sets of pictures. ```{r mercViews} mercViews <- function(object, main, tag = NULL) { opar <- par(mfrow = c(2, 2)) on.exit(par(opar)) pts <- barplot(object, main = main) if (!is.null(tag)) { gt <- as.vector(as.matrix(table(getClusters(object)))) loc <- pts[round((c(0, cumsum(gt))[-(1 + length(gt))] + cumsum(gt))/2)] mtext(tag, side =1, line = 0, at = loc, col = object@palette, font = 2) } plot(object, view = "tsne", main = "t-SNE") plot(object, view = "hclust") plot(object, view = "mds", main = "MDS") } ``` # Different Clustering Methods We will apply various clustering methods to the data (represented primarily through its distance matrix). We want to demonstrate with this example that SillyPutty clustering can do a better job than hierarchical clustering or PAM. ## Hierarchical Clustering **Figure 1** presents multiple views of the Euclidean distances between our simulated data. Since we know that we started with `r trueK` clusters, we chose that as the number to find using the default method of hierarchical clustering with Ward's linkage rule. (We will later illustrate how to use SillyPutty to find the number of clusters.) The silhouette width plot in the upper left panel indicates that each of the clusters contains some poorly-classified elements, identified by their negative silhouette widths. Both the multidimensional scaling (MDS) plot in the lower right and the t-stochastic neighbor embedding (t-SNE) plot in the upper right clearly display colored points that appear to be in the wrong regions. ```{r fig01, fig.cap = .tag(1, "Hierachical Clustering, with four clusters.")} set.seed(1987) vis <- Mercator(dis, "euclid", "hclust", K = trueK) palette <- vis@palette[c(1:3, 7, 8, 6, 10, 4, 11, 5, 15, 14, 17:18, 9, 12, 16, 19:24)] vis@palette <- palette vis <- addVisualization(vis, "mds") vis <- addVisualization(vis, "tsne") mercViews(vis, "Hierarchical Clustering, Five Clusters") ``` The adjusted Rand index isn't very good, either. ```{r ari} ari.hier <- adjustedRandIndex(labels, vis@clusters) ari.hier ``` ## Graphing Truth Since we know the truth, we can reassign the clusters inside the Mercator object to see what everything is supposed to look like (**Figure 2**). Notice that the silhouette width plot agrees that everything is in the right place, and that the MDS and t-SNE plots are also consistent. ```{r fig02, fig.cap = .tag(2, "Visualization of true cancer clusters.")} truebin <- remapColors(vis, setClusters(vis, labels)) mercViews(truebin, main = "True Cluster Types", tag = unique(sort(labels))) ``` ## PAM Clustering Here we apply PAM clustering to the same distance matrix (**Figure 3**). These results are clearly much worse than hierarchical clustering. ```{r fig03, fig.cap = .tag(3, "PAM Clustering, K = 4.")} pc <- pam(dis, k = trueK, diss=TRUE) pamc <- remapColors(vis, setClusters(vis, pc$clustering)) mercViews(pamc, main = "PAM, K = 4", tag = paste("P", 1:trueK, sep = "")) ari.pam <- adjustedRandIndex(labels, pamc@clusters) ari.pam ``` # SillyPutty Clustering RandomSillyPutty is the core of the SillyPutty package. It takes a distance matrix, the desired number of clusters K, and the number N of times you want to apply SillyPutty to the data set. Each time, you start with _different_ random cluster assignments. You then apply the "move the worst element" algorithm described above. RandomSillyPutty saves the best and worst silhouette width scores along with their associated data clusters. ```{r RandomSilly} set.seed(12) y2 <- suppressWarnings(RandomSillyPutty(dis, trueK, N = 100)) ## this is also slow ari.max <- adjustedRandIndex(truebin@clusters, y2@MX) ari.min <- adjustedRandIndex(truebin@clusters, y2@MN) ari.max ari.min ``` The adjusted rand index of the best SillyPutty clustering is 0.98, meaning that we have almost completely recovered the true cluster assignments present in the data. Note that even the worst result that we obtained from SillyPutty has a better ARI (0.61) than PAM (0.43), though not quite as good as hierarchical clustering (0.71). We can now update the Mercator object using the cluster assignments defined by the best SillyPutty result (**Figure 4**). The silhouette width plot now says that it thinks all elements are in good clusters, and both the MDS and t-SNE plots support that conclusion. ```{r fig04, fig.cap = .tag(4, "Random SillyPutty clustering, K = 4.")} randSillyBin <- remapColors(vis, setClusters(vis, y2@MX)) mercViews(randSillyBin, main = "SillyPutty Cluster Types, K = 4", tag = paste("C", 1:trueK, sep = "")) ``` We can also plot the cluster assignments that had the maximum and minimum silhouette widths from running the \code{RandokmSillyPutty} algorithm. We will use the multidimensional scaling layout and the color palette from the Mercator object. ```{r fig05, fig.cap = .tag(5, "Cluster assignements with best and worst silhouette widths after random starts.")} plot(y2, randSillyBin@view[["mds"]], distobj = dis, col = randSillyBin@palette) summary(y2) ``` ## Combining SillyPutty With Hierarchical Clustering Since SillyPutty can start with any existing cluster assignments (even random ones, as we just saw), we can combine it with any other method. here we are going to start with the results of hierarchcial clustering, and just take one pass of SillyPutty to "improve" its results. To apply SillyPutty to an already precomputed clustering algorithm, you have to have the cluster identities of the clustering algorithm and the distance matrix of the data set. SillyPutty will then recalculate the clusters from a starting point within the post-clustered clusters and return the best silhouette width score and the new cluster identities. ```{r fig06, fig.cap = .tag(6, "Hierarchical Clustering + SillyPutty, K = 4.")} hierSilly <- SillyPutty(vis@clusters, dis) hierSillyBin <- remapColors(vis, setClusters(vis, hierSilly@cluster)) mercViews(hierSillyBin, main = "HClust + Silly, k = 4",tag = paste("C", 1:trueK, sep = "")) ari.Sillyhier <- adjustedRandIndex(labels, hierSillyBin@clusters) ari.Sillyhier ``` # Finding the Number of Clusters With SillyPutty RangeSillyPutty uses RandomSillyPutty to determine the best mean silhouette width, for a range of clusters values. Then you can use the best silhouette widths to apporximate the actual number of clusters within the dataset. **Figure 6** shows the best silhouette width achieved with each possible number of clusters. The best overall value occurs when K = 4, which is the true number of clusters. ```{r fig07, fig.width=6, fig.height=5, fig.cap=.tag(7, "Best mean siilhouette width, by number of clusters, found by combining huierarchical clustering with Silly Putty.")} y <- findClusterNumber(dis, start = 2, end = 12, method = "HCSP") plot(names(y), y, xlab = "K", ylab = "Silhouette Width", type = "b", lwd = 2, pch = 16) ``` # Appendix ```{r} sessionInfo() ```