---
title: "Using the Mercator Package"
author: "C.E. Coombes, Zachary B. Abrams, Suli Li, Kevin R. Coombes"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
vignette: >
%\VignetteIndexEntry{Using the Mercator Package}
%\VignetteKeywords{Mercator,clustering,visualization,PAM,Multi-dimensional scaling,t-distributed Stochastic Neighbor Embedding}
%\VignetteDepends{Mercator}
%\VignettePackage{Mercator}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE, results="hide"}
knitr::opts_chunk$set(echo = TRUE, fig.width=6,fig.height=5,echo=TRUE)
```
# Introduction
Modern biological experiments are increasingly producing interesting matrices.
These may represent the presence or absence of
specific gene mutations, copy number variants, microRNAs, or other
molecular or clinical phenomena. We have recently developed a tool,
*CytoGPS* [^Abrams and colleagues],
that converts conventional karyotypes from the standard text-based
notation (the International Standard for Human Cytogenetic Nomenclature;
ISCN) into a binary vector with three bits (loss, gain, or fusion) per
cytoband, which we call the "LGF model".
The Mercator package is intended to facilitate the
exploration of data sets. It implements metrics for continuous,
categorical, and binary data, including a subset of the 76
binary distance metrics described by [^Choi and colleagues], ensuring that
at least one representative of each of their major clusters is
included. Each resulting distance matrix can be combined with multiple
visualization techniques, providing a consistent interface to
thoroughly explore the data set.
# The BinaryMatrix Class
In this vignette, we focus initially on the tools for dealing with binary matrices.
We start by loading the package.
```{r library}
suppressMessages( suppressWarnings( library(Mercator) ) )
```
## A Limited Sample Dataset
We proceed with a model dataset of karyotypes from patients with
Chronic Myelogenous Leukemia (CML) with 400 chromosomal features recorded over
740 patients, from the public
Mitelman database. Cytogenetic abnormalities recorded in Mitelman as text
strings have been pre-processed with CytoGPS into
binary vectors. For the sake of clarity and efficiency we have chosen a subset
of patients and features for this example.
```{r data}
filename <- system.file("Examples/Mercator_Test_Data.csv", package="Mercator")
my.data <- read.csv(filename, header=TRUE)
dim(my.data)
```
## Generating the BinaryMatrix
Many of the functions of the Mercator package operate on a BinaryMatrix
S4 object, which serves as an input to subsequent functions and visualizations.
A BinaryMatrix object is formed from a matrix containing integer or numeric
values. Although Mercator was intially designed for processing and visualizing
binary data, the BinaryMatrix object and subsequent functions accept a variety
of integer and numeric values. In addition, all the visualization tools work with
an arbitrary distance matrix; see the vignette [Mercator for Continuous Data](./mercVis.html).
Row and column names, along with optional annotations, can be assigned as for data frames.
If no row or column headings are assigned, the BinaryMatrix takes the row and
column names of the input matrix by default. Notice that the resulting object always
includes a "history" element that tracks how it has been processed.
```{r}
my.data <- as.matrix(my.data)
my.binmat <- BinaryMatrix(my.data)
summary(my.binmat)
```
We wish to cluster the whole karyotypes of each patient to identify the
patterns of important chromosomal abnormalities that link them. To proceed,
we must transpose the BinaryMatrix. Transposition meaningfully transposes
the row and column headings of the BinaryMatrix, as well.
```{r}
my.binmat <- t(my.binmat)
summary(my.binmat)
```
## Remove Duplicates
The binary vectors are rarely unique.
Patients who have identical vectors of features can complicate some of the
clustering and visualization routines that we want to use (often by
introducing a division by zero). Similarly, features that have identical
occurence vectors in a patient cohort can also alter the
biological implications by automatically giving more "weight" to a
single genomic event (like a trisomy or monosomy). To deal with this
issue, the Mercator package includes a function to remove
duplicate or redundant binary vectors. Here we are going to remove
duplicate patients; that is, those with identical karyotypes.
```{r duplicates}
my.binmat <- removeDuplicates(my.binmat)
summary(my.binmat)
```
In the case of our data, only 136 of the 740 patient karyotypes are unique.
Some of the karyotypes are "not used" in the sense that they contain
none of the abnormalities selected for this limited subset.
```{r}
length(my.binmat@info$notUsed)
head(my.binmat@info$notUsed)
```
By contrast, many features are *used* but are simply redundant.
```{r}
length(my.binmat@info$redundant)
```
## Data Filtering with Thresher
Mercator provides easy access to functions of the Thresher R package,
which includes outlier detection and estimates of the number of clusters [^Wang and colleagues].
The underlying idea is that the features can be viewed as
"weight vectors" in a principal component space trying to display the samples.
The lengths of the vectors are a measure of their importance in the data set;
short vectors can (and probably should) be removed since they do not carry
much useful information. We have incorporated that feature into the
Mercator package.
We create a ThreshedBinaryMatrix to implement the algorithm.
Based on simulations described in the Thresher paper, a cutoff above approximately 0.3
can be used to select informative features. Then, we subset our ThreshedBinaryMatrix
to only include features above our given cutoff.
```{r thresher}
set.seed(21348)
my.binmat <- threshLGF(my.binmat, cutoff=0.3)
summary(my.binmat)
```
The red vertical line in the figure indicates the cutoff we have chosen
to separate uninformative features (<0.3) from informative ones (>0.3).
```{r delta, fig.cap="Histogram of weight vectors."}
Delta <- my.binmat@thresher@delta
hist(Delta, breaks=20, main="", xlab="Weight", col="gray")
abline(v=0.3, col='red')
```
The ThreshedBinaryMatrix object contains a reaper slot
that estimates the number of principal components and the number of clusters
after outliers have been removed. These values can be viewed numerically...
```{r pcdim}
my.binmat@reaper@pcdim
my.binmat@reaper@nGroups
```
... or they can be visualized with an *Auer-Gervini plot* (where we
are looking for a "long step") ...
```{r pcVis-1, fig.cap="Auer-Gervini plot."}
plot(my.binmat@reaper@ag, ylim=c(0, 30))
abline(h=my.binmat@reaper@pcdim, col="forestgreen", lwd=2)
abline(h=7, col="orange", lwd=2)
```
```{r pcVis-2, fig.cap="Scree plot."}
pts <- screeplot(my.binmat@reaper, xlim=c(0,30))
abline(v=pts[my.binmat@reaper@pcdim], col="forestgreen", lwd=2)
abline(v=pts[7], col="orange", lwd=2)
```
There is no method that can definitely identify the number of clusters
for a "perfect" solution, leaving part of the decision to the analyst's judgment.
Auer-Gervini and Scree plots provide informative visualizations to facilitate this
interference. The default value provided by the Auer-Gervini analysis (N=2; green) is
somewhat conservative. The value provided by the broken-stick model (N=7; orange)
overlaid on the scree plot is more aggressive, but is not too unreasonable
based on the Auer-Gervini plot. While one could make the argument that the correct
number of groups is at least seven, we will begin by using the number of groups
recommended by Thresher.
```{r kk}
kk <- 5
```
# Visualization
The Mercator package allows visualization of data with methods that include both
standard techniques (hierarchical clustering) and large-scale visualizations
(multidimensional scaling (MDS),
t-distributed Stochastic Neighbor Embedding (t-SNE), and iGraph.)
The Mercator package implements or provides access to 10 distance metrics:
Jaccard, Sokal-Michener, Hamming, Russell-Rao, Pearson, Goodman-Kruskal, Manhattan,
Canberra, Binary, and Euclidean. Although some of these metrics
can be used for continuous or categorical data, all are appropriate for
binary matrices. Mercator allows the user to compare different metrics
and select one that is useful for a given dataset.
## Jaccard Distance
Here, we will use the Jaccard distance because of its ease of
interpretability, common usage, and its appropriatness of application
to asymmetric binary data, such as the binary vector output of CytoGPS
in this dataset.
The Mercator constructor can be called with any initial visualization,
and visualizations can be added in an arbitrary order.
```{r distance, echo=TRUE}
jacc.Vis <- Mercator(my.binmat, "jaccard", "hclust", K=kk)
```
We can represent all the distances between features within the
dissimilarity matrix we have calculated on our data as a
histogram, as a visual representation of relatedness.
```{r jacc-hist, fig.cap = "Distribution of Jaccard distances."}
hist(jacc.Vis,
xlab="Jaccard Distance", main="Histogram of Distances")
```
### Hierarchical Clustering
Mercator allows us to implement common, standard visualizations
such as hierarchical clustering...
```{r jacc-hclust, fig.cap = "Hierarchical clustering using Jaccard distances."}
plot(jacc.Vis, view = "hclust")
```
In this dendrogram, there appear to be at least three different "blue"
branches, suggesting that there are more than the conservative estimate
of five clusters obtained from Thresher. We will take a look at
the data using another visualization technique before making a final
decision.
### t-Distributed Stochastic Neighbor Embedding
Mercator can use t-distributed Stochastic Neighbor Embedding
(t-SNE) plots for visualizing large-scale, high-dimensional data in
2-dimensional space.
```{r jacc-tsne5, fig.width=5, fig.height=5, fig.cap = "A t-SNE plot."}
jacc.Vis <- addVisualization(jacc.Vis, "tsne", perplexity=25)
plot(jacc.Vis, view = "tsne", main="t-SNE; Jaccard Distance")
```
Optional t-SNE parameters, such as perplexity, can be used to fine-tune
the plot as the visualization is created. Using addVisualization
to create a new, tuned plot of an existing type overwrites the
existing plot of that type.
```{r jacc-tsne10, fig.width=5, fig.height=5, fig.cap = "Another t-SNE plot."}
temp.Vis <- addVisualization(jacc.Vis, "tsne", perplexity = 10)
plot(temp.Vis, view = "tsne", main="t-SNE; Jaccard Distance; perplexity=10")
```
It does not appear that a smaller perplexity is useful for this data set.
### Multi-Dimensional Scaling
Mercator allows visualization of multi-dimensional scaling (MDS) plots, as well.
```{r jacc-mds1, fig.width=5, fig.height=5, fig.cap = "A multi-dimensioanl scaling plot."}
jacc.Vis <- addVisualization(jacc.Vis, "mds")
plot(jacc.Vis, view = "mds", main="MDS; Jaccard Distance")
```
### Silhouette-Width Barplots
At this point, we still don't know if we should increase the number of clusters
in the data set form 5 to 7. Fortunately, we have access to the "silhouette width",
which measures how well each element in the data set appears to be clustered.
```{r barp, fig.width=5, fig.height=4, fig.cap = "Histogram of silhouette widths."}
barplot(jacc.Vis)
```
Ths plot of the silhouette widths confirms our suspicion from the hierarchical
clustering dendrogram that the "blue" group is not as coherent/consistent as the
other grousp. So, we are going to try reclustering to label more clusters.
```{r reclue, fig.width=5, fig.height=4, fig.cap = "Silhouette widths for different K."}
jacc.Vis6 <- recluster(jacc.Vis, K = 6)
barplot(jacc.Vis6)
jacc.Vis7 <- recluster(jacc.Vis, K = 7)
barplot(jacc.Vis7)
```
There is some improvement when K=6, but a clearly worse result when K = 7. For the rest
of our analysis, we will switch to ooking at 6 groups.
```{r reset}
kk <- 6
jacc.Vis <- jacc.Vis6
rm(jacc.Vis6, jacc.Vis7)
```
### iGraph
Mercator can be used to visualize complex networks using
iGraph. To improve clarity of the visualization and computational
time, we have implement the downsample function to reduce the
number of data points to be linked and visualized.
The idea goes back to Peng Qiu's implementation of the
SPADE clustering algorithm for mass cytometry data. The main point is to *under* sample
the densest regions of the data space to make it more likely that rarer clusters
will still be adequately sampled. (While not required with the current data set,
this idea can be quite useful with data sets containing tens of thousands of objects.)
**Note:** The Mercator class includes a "subset" operator that tries to
preserve earlier visualizations. This operator is fast for MDS or t-SNE models, but
is very slow for large hierarchical clustering. (It uses the implementation in the
dendextend package, which works by removing a single leaf at a time from
the tree.) In the next code chunk, we downsample to create a meaningful subset,
and then compute a new dendrogram for each subset.
```{r downsample, fig.width=5, fig.height=5, fig.cap = "Downsample t-SNE plot."}
X <- jacc.Vis
X@view[["hclust"]] <- NULL # remove this view
N <- as.matrix(X@distance)
set.seed(87530)
P <- downsample(40, N, 0.1) # create a downsampled subset
J <- X[P]
names(J@view) # need to compute a new dendrogram
J <- addVisualization(J, "hclust", perplexity=5)
names(J@view)
plot(J, view = "tsne", main="Down-sampled t-SNE Plot")
```
Because our data set is small enough, we are going to create our graphical views
from the oriinal object rather than the downsampled one.
We can look at the resulting graph using three different "layouts".
```{r igraph, fig.width=4, fig.height=4, fig.cap = "Networks."}
jacc.Vis <- addVisualization(jacc.Vis, "graph", Q =0.5)
plot(jacc.Vis, view = "graph", layout = "tsne", main="T-SNE Layout")
plot(jacc.Vis, view = "graph", layout = "mds", main="MDS Layout")
plot(jacc.Vis, view = "graph", layout = "nicely",
main="Laid Out 'Nicely'",
xlim=c(-1,1))
```
## Cluster Identities
We can use the *getClusters* function to characterize each cluster and
use these for further manipulation.
We can easily determine cluster size...
```{r cluster1Identity}
my.clust <- getClusters(jacc.Vis)
tab <- table(my.clust)
tab
```
... or the patients that comprise each cluster.
```{r}
C <- my.binmat@columnInfo
Cl4 <- C[my.clust == 4 ,]
Cl4
```
## Sokal-Michener Metric
Finally, we are going to look at what happens if we use a different distance metric.
```{r sokal, fig.width=5, fig.height=5, fig.cap="Sokal-Michener distance t-SNE plot."}
set.seed(8642)
sokal.Vis <- Mercator(my.binmat, "sokal", "tsne", K=kk, peplexity = 10)
table(getClusters(sokal.Vis), getClusters(jacc.Vis))
plot(sokal.Vis, view = "tsne", main="t-SNE; Sokal-Michener Distance; perplexity=10")
```
The two largest groups get assigned the same colors (by chance) in the Jaccard and the
Sokal-Michener clusterings. However, it is not at all clear how to align the smaller
groups. For that purpose we can use the remapColors function.
```{r recolored}
SV <- remapColors(jacc.Vis, sokal.Vis)
table(getClusters(SV), getClusters(jacc.Vis))
plot(SV, view = "tsne", main="t-SNE; Sokal-Michener Distance; perplexity=10")
```
Now colors have been matched (as well as possible) between the two sets of visualizations.
# Changing the Color Palette
Each `Mercator` object stores its own palette internally, but you can change
the palette using the `slot` funciton.
```{r sv2, fig.width=5, fig.height=5, fig.cap = "Recolored Jaccard t-SNE plot."}
slot(jacc.Vis, "palette") <- c("red", "orange", "green", "blue",
"cyan", "magenta", "purple", "black")
plot(jacc.Vis, view = "tsne")
```
If the number of colors in the palette is smaller than the number of clusters,
they will be recycled, and other plotting symbols will be introduced.
```{r small, fig.width=5, fig.height=5, fig.cap = "Recolored Jaccard t-SNE plot."}
slot(jacc.Vis, "palette") <- c("red", "green", "blue", "purple")
plot(jacc.Vis, view = "tsne")
```
# References
[^Abrams and colleagues]: Abrams ZB, Zhang L, Abruzzo LV, Heerema NA, Li S,
Dillon T, Rodriguez R, Coombes KR, Payne PRO. CytoGPS: A Web-Enabled Karyotype
Analysis Tool for Cytogenetics. Bioinformatics. 2019 Jul 2.
doi: 10.1093/bioinformatics/btz520. (Epub ahead of print)
[^Choi and colleagues]: Choi SS, Cha SH, Tappert CC. A Survey of Binary Similarity and Distance Measures. Systemics, Cybernetics, and Informatics 2010;8(1):43-48.
[^Wang and colleagues]: Wang M, Abrams ZB, Kornblau SM, Coombes KR. Thresher: determining the number of clusters while removing outliers. BMC Bioinformatics 2018;19(1):9.