Getting started with geokmeans

Introduction

geokmeans provides fast C++ implementations of several k-means clustering algorithms behind a single, uniform R interface. The headline method is Geometric-k-means, a bound-free algorithm that uses geometry (scalar projection) to focus distance computations only on the points that can actually change cluster membership, while skipping the rest. It returns the same clustering as Lloyd’s k-means when initialized identically, but typically with far fewer distance computations, less memory, and lower energy use (Sharma et al., 2026, ).

The package exposes seven algorithms:

Function	Algorithm	Type
`geo_kmeans()`	Geometric-k-means	bound-free
`ball_kmeans()`	Ball k-means++	bound-free
`lloyd_kmeans()`	Lloyd’s k-means	baseline
`elkan_kmeans()`	Elkan	bounded
`hamerly_kmeans()`	Hamerly	bounded
`annulus_kmeans()`	Annulus	bounded
`exponion_kmeans()`	Exponion	bounded

All share the same interface and return value; they differ only in the acceleration strategy used internally.

library(geokmeans)

A first clustering

Every function takes a numeric matrix (observations in rows, features in columns) and centers, which is either the number of clusters or a matrix of initial centroids.

X <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2)
)

fit <- geo_kmeans(X, centers = 2)
fit
#> k-means clustering (geokmeans) with 2 clusters
#> Iterations: 2 | Distance computations: 601
#> Cluster sizes: 100, 100
#> Centroids:
#>        [,1]    [,2]
#> [1,] 6.0297  6.0516
#> [2,] 0.1089 -0.0378

The result is a geokmeans object: a list with the final centroids, the per-point cluster assignment, and computational statistics.

str(fit)
#> List of 6
#>  $ centroids            : num [1:2, 1:2] 6.0297 0.1089 6.0516 -0.0378
#>  $ iterations           : int 2
#>  $ distance_calculations: num 601
#>  $ cluster              : int [1:200] 2 2 2 2 2 2 2 2 2 2 ...
#>  $ method               : chr "geokmeans"
#>  $ k                    : int 2
#>  - attr(*, "class")= chr "geokmeans"

plot(X, col = fit$cluster, pch = 19, cex = 0.6,
     xlab = "x1", ylab = "x2", main = "geo_kmeans result")
points(fit$centroids, pch = 8, cex = 2, lwd = 2)

Two clusters coloured by assignment with centroids marked

Choosing an algorithm

Because every variant returns the same partition (they are all exact), the choice is about speed, not quality. Call a function directly, or use the kmeans_dc() dispatcher with method:

kmeans_dc(X, centers = 2, method = "elkan")$centroids
#>           [,1]        [,2]
#> [1,] 0.1088874 -0.03780808
#> [2,] 6.0296755  6.05160093

Comparing distance computations

The algorithms differ in how many point-to-centroid distances they compute. On the same data and seed, they reach the same solution with very different amounts of work:

set.seed(42)
Y <- do.call(rbind, lapply(seq_len(6), function(i)
  matrix(rnorm(500, mean = 4 * i), ncol = 2)))

methods <- c("lloyd", "hamerly", "annulus", "exponion",
             "ball", "geokmeans")

comparison <- data.frame(
  method = methods,
  distance_calcs = vapply(methods, function(m) {
    kmeans_dc(Y, centers = 6, method = m, seed = 1)$distance_calculations
  }, numeric(1)),
  row.names = NULL
)
comparison[order(comparison$distance_calcs), ]
#>      method distance_calcs
#> 4  exponion          16789
#> 6 geokmeans          16954
#> 5      ball          17414
#> 3   annulus          19458
#> 2   hamerly          21241
#> 1     lloyd          54000

Lower is better. The relative savings grow with the number of observations and clusters.

Initialization and reproducibility

For a numeric centers, the starting centroids are chosen by init:

"random" (default) – random observations, drawn with R’s RNG;
"sequential" – the first k observations.

Because initialization uses R’s random number generator, results are reproducible. Passing a non-NULL seed calls set.seed() internally so a call is self-contained and repeatable; the default seed = NULL leaves the RNG untouched and honours a preceding set.seed().

a <- geo_kmeans(X, centers = 2, seed = 7)
b <- geo_kmeans(X, centers = 2, seed = 7)
identical(a$centroids, b$centroids)
#> [1] TRUE

You can also supply your own starting centroids as a matrix (mirroring stats::kmeans()):

init <- X[c(1, 101), ]
geo_kmeans(X, centers = init)$centroids
#>           [,1]        [,2]
#> [1,] 0.1088874 -0.03780808
#> [2,] 6.0296755  6.05160093

Working with the bundled data

Two small datasets ship with the package under extdata:

path <- system.file("extdata", "Breastcancer.csv", package = "geokmeans")
bc <- as.matrix(read.csv(path, header = FALSE))
dim(bc)
#> [1] 569  30

bc_fit <- geo_kmeans(bc, centers = 2, seed = 1)
table(bc_fit$cluster)
#> 
#>   1   2 
#> 457 112

Safeguards for degenerate inputs

You cannot form more non-empty clusters than there are distinct observations. Requesting too many is an error with an explanatory message:

D <- rbind(matrix(0.1, 50, 2), matrix(9, 50, 2))  # only 2 distinct rows
geo_kmeans(D, centers = 3)
#> Error:
#> ! Requested 3 clusters, but 'data' has only 2 distinct row(s).
#> k-means cannot form more non-empty clusters than there are distinct observations.
#> Set 'centers' to at most 2, or provide data with at least 3 distinct rows.

If a run leaves a cluster empty, it is dropped by default and the labels are renumbered (drop_empty = TRUE); set drop_empty = FALSE to keep the requested number of centroids. Set with_labels = FALSE to skip computing per-point assignments when you only need the centroids.

Citation

citation("geokmeans")

Please cite the paper:

Sharma, P., Malec, M., Kurban, H., Kulekci, O., & Dalkilic, M. (2026). Geometric-k-means: A Bound Free Approach to Fast and Eco-Friendly k-means. Machine Learning, 115(2), article 30.