Title: | Construct and Visualize TDA Mapper Graphs |
---|---|
Description: | Topological data analysis (TDA) is a method of data analysis that uses techniques from topology to analyze high-dimensional data. Here we implement Mapper, an algorithm from this area developed by Singh, Mémoli and Carlsson (2007) which generalizes the concept of a Reeb graph <https://en.wikipedia.org/wiki/Reeb_graph>. |
Authors: | George Clare Kennedy [aut, cre] |
Maintainer: | George Clare Kennedy <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.3.0 |
Built: | 2024-12-19 06:28:37 UTC |
Source: | CRAN |
Get a tester function for an interval.
check_in_interval(endpoints)
endpoints |
A vector of interval endpoints, namely a left and a right. Must be in order. |
A function that eats a data point and outputs TRUE if the datapoint is in the interval and FALSE if not.
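The tester described above is a closure over the two endpoints. The following is a minimal sketch in base R, not the package source; `make_interval_test` is a hypothetical name, and the closed-interval comparison is taken from the `create_mapper_object` example elsewhere in this manual.

```r
# Hypothetical stand-in for check_in_interval(): returns a function that
# tests membership in the closed interval [endpoints[1], endpoints[2]].
make_interval_test <- function(endpoints) {
  left <- endpoints[1]
  right <- endpoints[2]
  function(x) (x >= left) & (x <= right)
}

in_unit_interval <- make_interval_test(c(0, 1))
in_unit_interval(0.5)          # TRUE
in_unit_interval(c(-1, 0, 2))  # FALSE TRUE FALSE
```

Because the comparison is vectorized, the returned function can be applied to a whole vector of filter values at once.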
Compute dispersion of a single cluster
compute_tightness(dists, cluster)
dists |
A distance matrix for points in the cluster. |
cluster |
A list containing named vectors, whose names are data point names and whose values are cluster labels |
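This method computes a measure of cluster dispersion. It finds the medoid of the input data set and returns the average distance to the medoid. Formally, we say the tightness $\tau$ of a cluster $C$ is given by
$$\tau(C) = \frac{1}{|C| - 1} \sum_{i \neq j} \operatorname{dist}(x_i, x_j),$$
where $x_j$ is the medoid of $C$,
$$x_j = \underset{x_j \in C}{\arg\min} \sum_{x_i \in C,\ i \neq j} \operatorname{dist}(x_i, x_j).$$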
A smaller value indicates a tighter cluster based on this metric.
A nonnegative real number representing a measure of dispersion of a cluster.
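The medoid-based dispersion described above can be sketched in base R as follows. This is an illustration under assumptions, not the package implementation: `tightness_sketch` is a hypothetical name, it takes a vector of member names rather than the `cluster` argument, it averages over the other points in the cluster (dividing by the cluster size minus one), and it returns 0 for a single-point cluster.

```r
# dists: a "dist" object (or square matrix) with point names as labels
# members: character vector of the point names in one cluster
tightness_sketch <- function(dists, members) {
  d <- as.matrix(dists)[members, members, drop = FALSE]
  if (length(members) < 2) return(0)     # assumption: a lone point has zero dispersion
  total_dist <- rowSums(d)               # summed distance from each point to the rest
  medoid <- which.min(total_dist)        # medoid minimizes the summed distance
  unname(total_dist[medoid]) / (length(members) - 1)
}

pts <- data.frame(x = c(0, 1, 2, 10), row.names = c("a", "b", "c", "d"))
tightness_sketch(dist(pts), c("a", "b", "c"))  # medoid is "b"; result is 1
```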
The easiest clustering method
convert_to_clusters(bins)
bins |
A list of bins, each containing names of data from some data frame. |
A named vector whose names are data point names and whose values are cluster labels
Run mapper using a one-dimensional filter, a cover of intervals, and a clustering algorithm.
create_1D_mapper_object( data, dists, filtered_data, cover, clustering_method = "single", global_clustering = TRUE )
data |
A data frame. |
dists |
A distance matrix for the data frame. |
filtered_data |
The result of a function applied to the data frame; there should be one filter value per observation in the original data frame. |
cover |
A 2D array of interval left and right endpoints; rows should be intervals and columns left and right endpoints (in that order). |
clustering_method |
A string to pass to hclust to determine clustering method. |
global_clustering |
Whether you want clustering to happen in a global (all level sets visible) or local (only the current level set visible) context. |
A list of two data frames, one with node data containing bin membership, data points per cluster, and cluster dispersion, and one with edge data containing sources, targets, and weights representing overlap strength.
data = data.frame(x = sapply(1:100, function(x) cos(x)), y = sapply(1:100, function(x) sin(x)))
projx = data$x
num_bins = 10
percent_overlap = 25
cover = create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap)
create_1D_mapper_object(data, dist(data), projx, cover, "single")
Run mapper using an ε-net cover (greedily generated) and the 2D inclusion function as a filter.
create_ball_mapper_object(data, dists, eps)
data |
A data frame. |
dists |
A distance matrix for the data frame. |
eps |
A positive real number for your desired ball radius. |
A list of two data frames, one with node data containing ball size, data points per ball, and ball tightness, and one with edge data containing sources, targets, and weights representing overlap strength.
data = data.frame(x = sapply(1:100, function(x) cos(x)), y = sapply(1:100, function(x) sin(x)))
eps = .5
create_ball_mapper_object(data, dist(data), eps)
Make a cover of balls
create_balls(data, dists, eps)
data |
A data frame. |
dists |
A distance matrix for the data frame. |
eps |
A positive real number. |
A list of vectors of data point names, one list element per ball. The output is such that every data point is contained in a ball of radius ε (the eps argument), and no ball center is contained in more than one ball. The centers are data points themselves.
num_points = 5000
P.data = data.frame(
  x = sapply(1:num_points, function(x) sin(x) * 10) + rnorm(num_points, 0, 0.1),
  y = sapply(1:num_points, function(x) cos(x) ^ 2 * sin(x) * 10) + rnorm(num_points, 0, 0.1),
  z = sapply(1:num_points, function(x) 10 * sin(x) ^ 2 * cos(x)) + rnorm(num_points, 0, 0.1)
)
P.dist = dist(P.data)
balls = create_balls(data = P.data, dists = P.dist, eps = .25)
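One greedy way to build a ball cover with the properties stated for create_balls is sketched below in base R. This is an assumption about how such a cover could be built, not the package's code: `greedy_balls_sketch` is a hypothetical name, the centers are picked in data order, and a strict inequality is used for ball membership.

```r
# dists: a "dist" object with point names as labels; eps: positive ball radius
greedy_balls_sketch <- function(dists, eps) {
  d <- as.matrix(dists)
  point_names <- rownames(d)
  covered <- setNames(rep(FALSE, length(point_names)), point_names)
  balls <- list()
  while (!all(covered)) {
    # the next center is the first point not yet inside any ball
    center <- point_names[which(!covered)[1]]
    # the ball holds every point closer than eps to the center (center included)
    ball <- point_names[d[center, ] < eps]
    covered[ball] <- TRUE
    balls[[length(balls) + 1]] <- ball
  }
  balls
}

pts <- data.frame(x = c(0, 0.1, 1, 1.05, 2), row.names = paste0("p", 1:5))
balls <- greedy_balls_sketch(dist(pts), eps = 0.25)
# three balls: c("p1", "p2"), c("p3", "p4"), and "p5"
```

Since a new center is always an uncovered point, it lies at least eps away from every earlier center, so each center belongs only to its own ball, while non-center points may fall in several balls.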
Create bins of data
create_bins(data, filtered_data, cover_element_tests)
data |
A data frame. |
filtered_data |
The result of a function applied to the data frame; there should be one filter value per observation in the original data frame. |
cover_element_tests |
A list of membership test functions for a set of cover elements. In other words, each element of cover_element_tests is a function that returns TRUE or FALSE when given a filter value. |
A list of level sets, each containing a vector of the names of the data inside it.
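The binning step described above amounts to applying each membership test to the filter values and collecting the names of the rows that pass. A minimal base R sketch follows; `bins_sketch` is a hypothetical name, and the sketch assumes the test functions are vectorized and that data points are named by the row names of the data frame.

```r
bins_sketch <- function(data, filtered_data, cover_element_tests) {
  # one level set per cover element: names of rows whose filter value passes the test
  lapply(cover_element_tests, function(test) rownames(data)[test(filtered_data)])
}

data <- data.frame(x = c(0.1, 0.4, 0.6, 0.9), row.names = c("a", "b", "c", "d"))
tests <- list(
  function(v) v >= 0 & v <= 0.5,
  function(v) v >= 0.3 & v <= 1
)
bins <- bins_sketch(data, data$x, tests)
# bins[[1]] is c("a", "b"); bins[[2]] is c("b", "c", "d")
```

Point "b" lands in both bins because the two cover elements overlap, which is what later produces edges in the mapper graph.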
Run ball mapper, but additionally cluster within the balls. Can use two different distance matrices to accomplish this.
create_clusterball_mapper_object( data, dist1, dist2, eps, clustering_method, global_clustering = TRUE )
data |
A data frame. |
dist1 |
A distance matrix for the data frame; this will be used to ball the data. |
dist2 |
Another distance matrix for the data frame; this will be used to cluster the data after balling. |
eps |
A positive real number for your desired ball radius. |
clustering_method |
A string to pass to hclust to determine clustering method. |
global_clustering |
Whether you want clustering to happen in a global (all level sets visible) or local (only the current level set visible) context. |
A list of two dataframes, one with node data containing bin membership, datapoints per cluster, and cluster dispersion, and one with edge data containing sources, targets, and weights representing overlap strength.
data = data.frame(x = sapply(1:100, function(x) cos(x)), y = sapply(1:100, function(x) sin(x)))
data.dists = dist(data)
eps = 1
create_clusterball_mapper_object(data, data.dists, data.dists, eps, "single")
Run the mapper algorithm on a data frame.
create_mapper_object( data, dists, filtered_data, cover_element_tests, method = "none", global_clustering = TRUE )
data |
A data frame. |
dists |
A distance matrix for the data frame. |
filtered_data |
The result of a function applied to the data frame; there should be one filter value per observation in the original data frame. |
cover_element_tests |
A list of membership test functions for a set of cover elements. In other words, each element of cover_element_tests is a function that returns TRUE or FALSE when given a filter value. |
method |
A string to pass to hclust to determine clustering method. |
global_clustering |
Whether you want clustering to happen in a global (all level sets visible) or local (only the current level set visible) context. |
A list of two dataframes, one with node data and one with edge data.
data = data.frame(x = sapply(1:100, function(x) cos(x)), y = sapply(1:100, function(x) sin(x)))
projx = data$x
num_bins = 10
percent_overlap = 25
xcover = create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap)
check_in_interval <- function(endpoints) {
  return(function(x) (endpoints[1] - x <= 0) & (endpoints[2] - x >= 0))
}
# each of the "cover" elements will really be a function that checks if a data point lives in it
xcovercheck = apply(xcover, 1, check_in_interval)
# build the mapper object
xmapper = create_mapper_object(
  data = data,
  dists = dist(data),
  filtered_data = projx,
  cover_element_tests = xcovercheck,
  method = "single"
)
Create a bin of data
create_single_bin(data, filtered_data, cover_element_test)
data |
A data frame. |
filtered_data |
The result of a function applied to the data frame; there should be one filter value per observation in the original data frame. |
cover_element_test |
A membership test function for a cover element. It should return TRUE or FALSE when given a filter value. |
A vector of names of points from the data frame, representing a level set.
Generate a cover of an interval with some number of (possibly) overlapping, evenly spaced, identical-width subintervals.
create_width_balanced_cover(min_val, max_val, num_bins, percent_overlap)
min_val |
The left endpoint |
max_val |
The right endpoint |
num_bins |
The number of cover intervals with which to cover the interval. A positive integer. |
percent_overlap |
How much overlap is desired between the cover intervals (the percent of the intersection of each interval with its immediate neighbor relative to its length, e.g., $\text{percent\_overlap} = 100 \cdot \lvert I_i \cap I_{i+1} \rvert / \lvert I_i \rvert$). |
A 2D numeric array.
left_ends - The left endpoints of the cover intervals.
right_ends - The right endpoints of the cover intervals.
create_width_balanced_cover(min_val=0, max_val=100, num_bins=10, percent_overlap=15)
create_width_balanced_cover(-11.5, 10.33, 100, 2)
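The arithmetic behind a width-balanced cover can be worked out from the definition of percent_overlap given above. With n bins of common width w and fractional overlap p, consecutive left endpoints are (1 - p)w apart, so the total length is L = w(n - (n - 1)p). The base R sketch below follows that derivation; `width_balanced_cover_sketch` is a hypothetical name, and the package's own arithmetic may differ in detail.

```r
width_balanced_cover_sketch <- function(min_val, max_val, num_bins, percent_overlap) {
  overlap <- percent_overlap / 100
  # solve L = w * (n - (n - 1) * p) for the common bin width w
  bin_width <- (max_val - min_val) / (num_bins - (num_bins - 1) * overlap)
  # consecutive left endpoints are shifted by the non-overlapping part of a bin
  left_ends <- min_val + (0:(num_bins - 1)) * (1 - overlap) * bin_width
  cbind(left_ends = left_ends, right_ends = left_ends + bin_width)
}

cover <- width_balanced_cover_sketch(0, 100, 10, 15)
cover[1, ]   # first interval starts at 0
cover[10, ]  # last interval ends at 100
```

Rows are intervals and columns are left and right endpoints, matching the layout expected by the `cover` argument of create_1D_mapper_object.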
Cut a dendrogram
cut_dendrogram(dend, threshold)
dend |
A single dendrogram. |
threshold |
A minimum tallest branch value. |
The number of clusters is determined to be 1 if the tallest branch of the dendrogram is less than the threshold, or if the index of dispersion (standard deviation squared divided by mean) of the branch heights is below 0.015. Otherwise, we cut at the longest branch of the dendrogram to determine the number of clusters.
A named vector whose names are data point names and whose values are cluster labels.
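The cutting rule described above can be sketched in base R as follows. Several details here are assumptions rather than the package's implementation: `cut_dendrogram_sketch` is a hypothetical name, it takes an `hclust` object from `stats::hclust`, "branch" is read as the gap between consecutive merge heights, the index of dispersion is computed on the merge heights, and a dendrogram with a single merge is treated as one cluster.

```r
cut_dendrogram_sketch <- function(hc, threshold) {
  heights <- sort(hc$height)
  one_cluster <- setNames(rep(1L, length(hc$order)), hc$labels)
  if (length(heights) < 2) return(one_cluster)   # assumption: too few merges to cut
  branch_lengths <- diff(heights)                # gaps between consecutive merges
  tallest <- max(branch_lengths)
  dispersion <- var(heights) / mean(heights)     # standard deviation squared over mean
  if (tallest < threshold || dispersion < 0.015) return(one_cluster)
  # cut in the middle of the longest gap
  gap <- which.max(branch_lengths)
  cutree(hc, h = (heights[gap] + heights[gap + 1]) / 2)
}

pts <- data.frame(x = c(0, 0.1, 0.2, 10, 10.1, 10.2), row.names = letters[1:6])
hc <- hclust(dist(pts), method = "single")
clusters <- cut_dendrogram_sketch(hc, threshold = 1)
# points a, b, c share one label and d, e, f share another
```

Raising the threshold above the longest gap (here about 9.7) makes the same call return a single cluster.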
Compute eccentricity of data points
eccentricity_filter(dists)
dists |
A distance matrix associated to a data frame. |
A vector of centrality measures, calculated per data point as the sum of its distances to every other data point, divided by the number of points.
num_points = 100
P.data = data.frame(
  x = sapply(1:num_points, function(x) sin(x) * 10) + rnorm(num_points, 0, 0.1),
  y = sapply(1:num_points, function(x) cos(x) ^ 2 * sin(x) * 10) + rnorm(num_points, 0, 0.1)
)
P.dist = dist(P.data)
eccentricity = eccentricity_filter(P.dist)
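The per-point calculation described for eccentricity_filter (sum of distances to every other point, divided by the number of points) is short enough to write out directly. A base R sketch under that reading, with `eccentricity_sketch` as a hypothetical name:

```r
eccentricity_sketch <- function(dists) {
  d <- as.matrix(dists)
  # row sums give each point's total distance to the others (the diagonal is zero)
  rowSums(d) / nrow(d)
}

pts <- data.frame(x = c(0, 1, 2), row.names = c("a", "b", "c"))
ecc <- eccentricity_sketch(dist(pts))
# the middle point "b" has the smallest value: a = 1, b = 2/3, c = 1
```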
Recover bins
get_bin_vector(binclust_data)
binclust_data |
A list of bins, each containing named vectors whose names are those of data points and whose values are cluster ids. |
A vector of integers equal in length to the number of clusters, whose members identify which bin that cluster belongs to.
Compute cluster sizes
get_cluster_sizes(binclust_data)
binclust_data |
A list of bins, each containing named vectors whose names are those of data points and whose values are cluster ids. |
A vector of integers representing the lengths of the clusters in the input data.
Compute dispersion measures of a list of clusters
get_cluster_tightness_vector(dists, binclust_data)
dists |
A distance matrix for the data points inside all the input clusters |
binclust_data |
A list of named vectors whose names are those of data points and whose values are cluster ids |
A vector of nonnegative real numbers, each representing a measure of dispersion of a cluster, calculated according to compute_tightness.
Get data within a cluster
get_clustered_data(binclust_data)
binclust_data |
A list of bins, each containing named vectors whose names are those of data points and whose values are cluster ids |
A list of strings, each a comma separated list of the toString values of the data point names.
This function processes the binned data and global distance matrix to return freshly clustered data.
get_clusters(bins, dists, method, global_clustering = TRUE)
bins |
A list containing "bins" of vectors of names of data points. |
dists |
A distance matrix containing pairwise distances between named data points. |
method |
A string to pass to hclust to determine clustering method. |
global_clustering |
Whether you want clustering to happen in a global (all level sets visible) or local (only the current level set visible) context. |
A list containing named vectors (one per bin), whose names are data point names and whose values are cluster labels
Calculate edge weights
get_edge_weights(overlap_lengths, cluster_sizes, edges)
overlap_lengths |
A named vector of cluster overlap lengths, obtained by calling |
cluster_sizes |
A vector of cluster sizes. |
edges |
A 2D array of source and target nodes, representing an edge list. Should be ordered consistently with the overlap_lengths parameter. |
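This value is calculated per edge by dividing the number of data points in the overlap by the number of points in the cluster on either end, and taking the maximum value. Formally, for an edge between clusters $c_i$ and $c_j$,
$$w(\{c_i, c_j\}) = \max\left\{ \frac{|c_i \cap c_j|}{|c_i|},\ \frac{|c_i \cap c_j|}{|c_j|} \right\}.$$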
A vector of real numbers representing cluster overlap strength.
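The edge weight rule above (overlap size divided by the size of the cluster at either end, keeping the larger ratio) can be sketched in base R. `edge_weights_sketch` is a hypothetical name, and the sketch assumes that `edges` holds integer indices into `cluster_sizes`, with one row per entry of `overlap_lengths`.

```r
edge_weights_sketch <- function(overlap_lengths, cluster_sizes, edges) {
  source_sizes <- cluster_sizes[edges[, 1]]
  target_sizes <- cluster_sizes[edges[, 2]]
  # per edge: the larger of the two overlap-to-cluster-size ratios
  pmax(overlap_lengths / source_sizes, overlap_lengths / target_sizes)
}

edges <- rbind(c(1, 2), c(2, 3))
weights <- edge_weights_sketch(
  overlap_lengths = c(2, 1),
  cluster_sizes = c(4, 10, 5),
  edges = edges
)
# edge 1-2: max(2/4, 2/10) = 0.5; edge 2-3: max(1/10, 1/5) = 0.2
```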
Obtain edge list from cluster intersections
get_edgelist_from_overlaps(overlaps, num_vertices)
overlaps |
A named list of edges, whose elements contain the names of clusters in the overlap represented by that edge; output of get_overlaps. |
num_vertices |
The number of vertices in the graph. |
A 2D array representing the edge list of a graph.
Perform hierarchical clustering and process dendrograms
get_hierarchical_clusters(dist_mats, method, global_clustering = TRUE)
dist_mats |
A list of distance matrices to be used for clustering. |
method |
A string to pass to hclust to determine clustering method. |
global_clustering |
Whether you want clustering to happen in a global (all level sets visible) or local (only the current level set visible) context. |
A list containing named vectors (one per dendrogram), whose names are data point names and whose values are cluster labels
Get cluster overlaps
get_overlaps(binclust_data)
binclust_data |
A list of bins, each containing named vectors whose names are those of data points and whose values are cluster ids. |
A named list of edges, whose elements contain the names of clusters in the overlap represented by that edge.
Find the tallest branch of a dendrogram
get_tallest_branch(dend)
dend |
A single dendrogram. |
The height of the tallest branch (longest time between merge heights) of the input dendrogram.
Get a tester function for a ball.
is_in_ball(ball)
ball |
A list of data points. |
A function that eats a data point and returns TRUE or FALSE depending if the point is in the ball or not.
Make an igraph object from a mapper object
mapper_object_to_igraph(mapperobject)
mapperobject |
A mapper object generated by mappeR. |
An igraph object.
data = data.frame(x = sapply(1:100, function(x) cos(x)), y = sapply(1:100, function(x) sin(x)))
projy = data$y
cover = create_width_balanced_cover(min(projy), max(projy), 10, 25)
mapperobj = create_1D_mapper_object(data, dist(data), data$y, cover, "single")
mapper_object_to_igraph(mapperobj)
Find which triangular number you're on
next_triangular(x)
x |
A positive integer. |
The index of the next greatest or equal triangular number to x.
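The n-th triangular number is n(n + 1)/2, so the smallest index n whose triangular number is at least x can be found by inverting that quadratic. The closed form below is one way to compute the value described above, not necessarily how the package does it; `next_triangular_sketch` is a hypothetical name.

```r
next_triangular_sketch <- function(x) {
  # smallest n with n * (n + 1) / 2 >= x
  ceiling((sqrt(8 * x + 1) - 1) / 2)
}

sapply(1:7, next_triangular_sketch)  # 1 2 2 3 3 3 4
```

For example, 4, 5, and 6 all map to index 3 because the third triangular number is 6.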
Cut many dendrograms
process_dendrograms(dends, global_clustering = TRUE)
dends |
A list of dendrograms to be cut. |
global_clustering |
Whether you want clustering to happen in a global (all level sets visible) or local (only the current level set visible) context. |
This function uses a value of 10 percent of the tallest branch across dendrograms as a threshold for cut_dendrogram.
A list of named vectors (one per dendrogram) whose names are data point names and whose values are cluster labels.
This function tells the computer to look away for a second, so the goblins come and cluster your data while it's not watching.
run_cluster_machine(dist_mats, method, global_clustering = TRUE)
dist_mats |
A list of distance matrices of each bin that is to be clustered. |
method |
A string to pass to hclust to determine clustering method. |
global_clustering |
Whether you want clustering to happen in a global (all level sets visible) or local (only the current level set visible) context. |
A list containing named vectors (one per bin), whose names are data point names and whose values are cluster labels (within each bin)
Perform single linkage clustering
run_link(dist, method)
dist |
A distance matrix. |
method |
A string to pass to hclust to determine clustering method. |
A dendrogram generated by fastcluster.
Construct mapper graph from data
run_mapper(binclust_data, dists, binning = TRUE)
binclust_data |
A list of bins, each containing named vectors whose names are those of data points and whose values are cluster ids |
dists |
A distance matrix for the data that has been binned and clustered. |
binning |
Whether the output data frame should sort vertices into "bins" or not. Should be TRUE if using clustering and FALSE otherwise. |
A list of two dataframes, one with node data containing bin membership, datapoints per cluster, and cluster dispersion, and one with edge data containing sources, targets, and weights representing overlap strength.
Subset a distance matrix
subset_dists(bin, dists)
bin |
A list of names of data points. |
dists |
A distance matrix for data points in the bin, possibly including extra points. |
A distance matrix for only the data points in the input bin.
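Restricting a distance matrix to the points in one bin can be done in base R by indexing the full matrix by name and converting back to a `dist` object. The sketch below illustrates that idea; `subset_dists_sketch` is a hypothetical name, and it assumes the labels of `dists` are the data point names.

```r
subset_dists_sketch <- function(bin, dists) {
  # keep only the rows and columns named in the bin, then convert back to "dist"
  as.dist(as.matrix(dists)[bin, bin, drop = FALSE])
}

pts <- data.frame(x = c(0, 1, 3, 6), row.names = c("a", "b", "c", "d"))
sub <- subset_dists_sketch(c("a", "c", "d"), dist(pts))
# pairwise distances among a, c, d only: a-c = 3, a-d = 6, c-d = 3
```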