--- title: "Getting started with geokmeans" author: "The geokmeans authors" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with geokmeans} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 ) set.seed(1) ``` ## Introduction **geokmeans** provides fast C++ implementations of several *k*-means clustering algorithms behind a single, uniform R interface. The headline method is **Geometric-*k*-means**, a bound-free algorithm that uses geometry (scalar projection) to focus distance computations only on the points that can actually change cluster membership, while skipping the rest. It returns the *same* clustering as Lloyd's *k*-means when initialized identically, but typically with far fewer distance computations, less memory, and lower energy use (Sharma et al., 2026, \doi{10.1007/s10994-025-06891-1}). The package exposes seven algorithms: | Function | Algorithm | Type | |---------------------|----------------------|-------------| | `geo_kmeans()` | Geometric-*k*-means | bound-free | | `ball_kmeans()` | Ball *k*-means++ | bound-free | | `lloyd_kmeans()` | Lloyd's *k*-means | baseline | | `elkan_kmeans()` | Elkan | bounded | | `hamerly_kmeans()` | Hamerly | bounded | | `annulus_kmeans()` | Annulus | bounded | | `exponion_kmeans()` | Exponion | bounded | All share the same interface and return value; they differ only in the acceleration strategy used internally. ```{r library} library(geokmeans) ``` ## A first clustering Every function takes a numeric matrix (observations in rows, features in columns) and `centers`, which is either the number of clusters or a matrix of initial centroids. ```{r first} X <- rbind( matrix(rnorm(200, mean = 0), ncol = 2), matrix(rnorm(200, mean = 6), ncol = 2) ) fit <- geo_kmeans(X, centers = 2) fit ``` The result is a `geokmeans` object: a list with the final centroids, the per-point cluster assignment, and computational statistics. ```{r structure} str(fit) ``` ```{r plot, fig.alt = "Two clusters coloured by assignment with centroids marked"} plot(X, col = fit$cluster, pch = 19, cex = 0.6, xlab = "x1", ylab = "x2", main = "geo_kmeans result") points(fit$centroids, pch = 8, cex = 2, lwd = 2) ``` ## Choosing an algorithm Because every variant returns the same partition (they are all exact), the choice is about speed, not quality. Call a function directly, or use the `kmeans_dc()` dispatcher with `method`: ```{r dispatch} kmeans_dc(X, centers = 2, method = "elkan")$centroids ``` ### Comparing distance computations The algorithms differ in how many point-to-centroid distances they compute. On the same data and seed, they reach the same solution with very different amounts of work: ```{r compare} set.seed(42) Y <- do.call(rbind, lapply(seq_len(6), function(i) matrix(rnorm(500, mean = 4 * i), ncol = 2))) methods <- c("lloyd", "hamerly", "annulus", "exponion", "ball", "geokmeans") comparison <- data.frame( method = methods, distance_calcs = vapply(methods, function(m) { kmeans_dc(Y, centers = 6, method = m, seed = 1)$distance_calculations }, numeric(1)), row.names = NULL ) comparison[order(comparison$distance_calcs), ] ``` Lower is better. The relative savings grow with the number of observations and clusters. ## Initialization and reproducibility For a numeric `centers`, the starting centroids are chosen by `init`: * `"random"` (default) -- random observations, drawn with R's RNG; * `"sequential"` -- the first `k` observations. Because initialization uses R's random number generator, results are reproducible. Passing a non-`NULL` `seed` calls `set.seed()` internally so a call is self-contained and repeatable; the default `seed = NULL` leaves the RNG untouched and honours a preceding `set.seed()`. ```{r seed} a <- geo_kmeans(X, centers = 2, seed = 7) b <- geo_kmeans(X, centers = 2, seed = 7) identical(a$centroids, b$centroids) ``` You can also supply your own starting centroids as a matrix (mirroring `stats::kmeans()`): ```{r custom-init} init <- X[c(1, 101), ] geo_kmeans(X, centers = init)$centroids ``` ## Working with the bundled data Two small datasets ship with the package under `extdata`: ```{r realdata} path <- system.file("extdata", "Breastcancer.csv", package = "geokmeans") bc <- as.matrix(read.csv(path, header = FALSE)) dim(bc) bc_fit <- geo_kmeans(bc, centers = 2, seed = 1) table(bc_fit$cluster) ``` ## Safeguards for degenerate inputs You cannot form more non-empty clusters than there are distinct observations. Requesting too many is an error with an explanatory message: ```{r toomany, error = TRUE} D <- rbind(matrix(0.1, 50, 2), matrix(9, 50, 2)) # only 2 distinct rows geo_kmeans(D, centers = 3) ``` If a run leaves a cluster empty, it is dropped by default and the labels are renumbered (`drop_empty = TRUE`); set `drop_empty = FALSE` to keep the requested number of centroids. Set `with_labels = FALSE` to skip computing per-point assignments when you only need the centroids. ## Citation ```{r cite, eval = FALSE} citation("geokmeans") ``` Please cite the paper: > Sharma, P., Malec, M., Kurban, H., Kulekci, O., & Dalkilic, M. (2026). > *Geometric-k-means: A Bound Free Approach to Fast and Eco-Friendly k-means.* > Machine Learning, 115(2), article 30. \doi{10.1007/s10994-025-06891-1}