We want to illustrate the RPointCloud
package (Version
0.8.0) with a AML10.node287 data set. Not surprisingly, we start by
loading the package.
We also load several other useful packages (some of which may eventually get incorporated into the package requirements).
suppressMessages( library(Mercator) )
library(ClassDiscovery)
library(Polychrome)
data(Dark24)
data(Light24)
suppressMessages( library(igraph) )
suppressMessages( library("ape") )
suppressPackageStartupMessages( library(circlize) )
Now we fetch the sample AML10.node287 data set that is included with the package.
## [1] "AML10.node287" "AML10.node287.rips" "Arip" "Dark24"
## [5] "G" "G2" "H" "K"
## [9] "L" "Light24" "Lt" "P"
## [13] "U" "V" "W" "X"
## [17] "Y" "angle.df" "annote" "clinical"
## [21] "colorScheme" "colset" "cyc" "cyc1"
## [25] "cyc2" "cyc3" "cycles" "d0"
## [29] "d1" "d2" "daisydist" "diag"
## [33] "e" "eb" "edges" "ef"
## [37] "featMU" "featRai" "keg" "mds"
## [41] "mercury" "mu" "nu" "nzero"
## [45] "ob" "oopt" "pal" "persistence"
## [49] "poof" "rate" "ripdiag" "shape"
## [53] "sigma" "vd" "xx" "yy"
## [1] 214 18
## [1] "CD45RA" "CD133" "CD47" "p21" "CD90" "CycA" "CycB" "PCNA" "Ki-67" "pRb"
## [11] "CD99" "CD13" "pCDK1" "H2AX" "cPARP" "pS6" "pH3" "DNA-2"
The AML10.node187
object is a numeric matrix from a
patient identified as “AML10”. The complete data set [Behbehani et al.]
was processed by the SPADE algorithm [Qiu et al.], with individual cells
clustered into distinct nodes based on the expression patterns of cell
surface markers. All cells in “node187” were identified as early
monocytes. The columns in this data matrix are additional
protein/antibody markers that were measured in the experiment, which
focused on the cell cycle. The AML10.node187.rips
object is
a “Rips diagram” produced by applying the TDA
algorithm to
these data.
Here are some plots of the TDA
results using tools from
the original package. (I am not sure what any of these are really good
for.)
diag <- AML10.node287.rips[["diagram"]]
opar <- par(mfrow = c(1,2))
plot(diag, barcode = TRUE, main = "Barcode")
plot(diag, main = "Rips Diagram")
Now we use our Mercator
package to view the underlying
data.
mercury <- Mercator(amldist, metric = "euclidean", method = "hclust", K = 8)
mercury <- addVisualization(mercury, "mds")
mercury <- addVisualization(mercury, "tsne")
mercury <- addVisualization(mercury, "umap")
mercury <- addVisualization(mercury, "som")
opar <- par(mfrow = c(3,2), cex = 1.1)
plot(mercury, view = "hclust")
plot(mercury, view = "mds", main = "Mult-Dimensional Scaling")
plot(mercury, view = "tsne", main = "t-SNE")
plot(mercury, view = "umap", main = "UMAP")
barplot(mercury, main = "Silhouette Width")
plot(mercury, view = "som", main = "Self-Organizing Maps")
Here is a picture of the “zero-cycle” data, which can also be used ultimately to cluster the points (where each point is a patient). The connected lines are similar to a single-linkage clustering structure, showing when individual points are merged together as the TDA parameter increases.
nzero <- sum(diag[, "dimension"] == 0)
cycles <- AML10.node287.rips[["cycleLocation"]][1:nzero]
L <- sapply(cycles, length)
cycles <- cycles[L > 0]
W <- mercury@view[["umap"]]$layout
plot(W, main = "Connected Zero Cycles")
for (cyc in cycles) {
points(W[cyc[1], , drop = FALSE], pch = 16,col = "red")
X <- c(W[cyc[1], 1], W[cyc[2],1])
Y <- c(W[cyc[1], 2], W[cyc[2],2])
lines(X, Y)
}
We can convert the 0-dimensional cycle structure into a dendrogram,
by first passing them through the igraph
package. We start
by putting all the zero-cycle data together, which can be viewed as an
“edge-list” from the igraph
perspective.
edges <- t(do.call(cbind, cycles)) # this creates an "edgelist"
G <- graph_from_edgelist(edges)
G <- set_vertex_attr(G, "label", value = attr(amldist, "Labels"))
Note that we attached the sample names to the graph, obtaining them from the daisy distance matrix. Now we use two different algorithms to decide how to layout the graph.
opar <- par(mfrow = c(1,2), mai = c(0.01, 0.01, 1.02, 0.01))
plot(G, layout = Lt, main = "As Tree")
plot(G, layout = L, main = "Fruchterman-Reingold")
Note that the Fruchterman-Reingold layout gives the most informative structure.
There are a variety of community-finding algorithms that we can apply. (Communities in graph theory are similar to clusters in other machine learning areas of study.) “Edge-betweenness” seems to work best.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## 7 25 6 4 6 11 9 22 25 8 1 6 6 9 7 6 12 9 9 7 6 6 7
The first line in the next code chunk shows that we did actually produce a tree. We explore three different ways ro visualize it
## [1] TRUE
H <- as.hclust(keg)
H$labels <- attr(amldist, "Labels")
K <- 10
colset <- Light24[cutree(H, k=K)]
G2 <- set_vertex_attr(G, "color", value = colset)
e <- 0.01
opar <- par(mai = c(e, e, e, e))
plot(G2, layout = L)
P <- as.phylo(H)
opar <- par(mai = c(0.01, 0.01, 1.0, 0.01))
plot(P, type = "u", tip.color = colset, cex = 1.2, main = "Ape/Cladogram")
In any of the “scatter plot views” (e.g., MDS, UMAP, t-SNE) from Mercator, we may want to overlay different colors to represent different AML10.node287 features.
featKi67 <- Feature(AML10.node287[,"Ki-67"], "Ki-67", c("cyan", "red"), c("Low", "High"))
featCD99 <- Feature(AML10.node287[,"CD99"], "CD99", c("green", "magenta"), c("Low", "High"))
opar <- par(mfrow = c(1,2))
plot(W, main = "UMAP; Ki-67", xlab = "U1", ylab = "U2")
points(featKi67, W, pch = 16, cex = 1.4)
plot(W, main = "UMAP; CD99", xlab = "U1", ylab = "U2")
points(featCD99, W, pch = 16, cex = 1.4)
We have a statistical approach to deciding which of the detected cycles are statistically significant. Empirically, the persistence of 0-dimensional cycles looks like a gamma distribution, while the persistence of higher dimensional cycles looks like an exponential distribution. In both cases, we use an empirical Bayes approach, treating the observed distribution as a mixture of either gamma or exponential (as appropriate) with an unknown distribution contributing to heavier tails.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8442 1.4682 1.6475 1.7155 1.9176 3.6600
Now we want to determine if there are significant “loops” in the data, and, if so, how many?
d1 <- persistence[diag[, "dimension"] == 1]
ef <- ExpoFit(d1) # should be close to log(2)/median?
plot(ef)
## Warning in d1 > cutoff(0.8, 0.56, eb): longer object length is not a multiple of shorter object
## length
## [1] 74
## $low
## [1] 0.00116908
##
## $high
## [1] 0.2349851
## [1] 9
## [1] 35 59 60 75 87 96 103 116 133
## [1] 60
Let’s pick out the most persistent 1-cycle.
## [,1] [,2]
## [1,] 4 116
## [2,] 116 211
## [3,] 12 36
## [4,] 4 12
## [5,] 120 211
## [6,] 36 120
Each row represents an edge, by listing the IDs of the points at either end of the line segment. In this case, there are nine edges that link together to form a connect loop (or topological circle).
cyc2 <- Cycle(AML10.node287.rips, 1, 96, "red")
cyc3 <- Cycle(AML10.node287.rips, 1, 87, "purple")
opar <- par(mfrow = c(1, 3))
plot(cyc1, W, lwd = 2, pch = 16, col = "gray", xlab = "U1", ylab = "U2", main = "UMAP")
lines(cyc2, W, lwd=2)
lines(cyc3, W, lwd=2)
plot(U, pch = 16, col = "gray", main = "MDS")
lines(cyc1, U, lwd = 2)
lines(cyc2, U, lwd = 2)
lines(cyc3, U, lwd = 2)
plot(V, pch = 16, col = "gray", main = "t-SNE")
lines(cyc1, V, lwd = 2)
lines(cyc2, V, lwd = 2)
lines(cyc3, V, lwd = 2)
poof <- angleMeans(W, AML10.node287.rips, cyc3, AML10.node287)
poof[is.na(poof)] <- 0
angle.df <- poof[, c("Ki-67", "CD99", "pRb", "PCNA",
"CycA", "CycB")]
colorScheme <- list(c(M = "green", U = "magenta"),
c(Hi = "cyan", Lo ="red"),
c(Hi = "blue", Lo = "yellow"),
c(Hi = "#dddddd", Lo = "#111111"),
c(No = "#dddddd", Yes = "brown"),
c(No = "#dddddd", Yes = "purple"))
annote <- LoopCircos(cyc1, angle.df, colorScheme)
image(annote)
Now we want to determine if there are significant “voids” (empty interiors of spheres) in the data, and, if so, how many?
d2 <- persistence[diag[, "dimension"] == 2]
ef <- ExpoFit(d2) # should be close to log(2)/median?
plot(ef)
## Warning in d2 > cutoff(0.8, 0.75, eb): longer object length is not a multiple of shorter object
## length
## [1] 21
## Warning in d2 > cutoff(0.95, 0.75, eb): longer object length is not a multiple of shorter
## object length
## [1] 18
## $low
## [1] 0.0006739591
##
## $high
## [1] 0.1246824
## [1] 15
## [1] 1 2 7 13 14 15 20 22 24 26 27 29 32 33 34