Title: | K-Means with Simultaneous Outlier Detection |
---|---|
Description: | An implementation of the 'k-means--' algorithm proposed by Chawla and Gionis, 2013 in their paper, "k-means-- : A unified approach to clustering and outlier detection. SIAM International Conference on Data Mining (SDM13)", <doi:10.1137/1.9781611972832.21> and using 'ordering' described by Howe, 2013 in the thesis, "Clustering and anomaly detection in tropical cyclones". Useful for creating (potentially) tighter clusters than standard k-means and simultaneously finding outliers inexpensively in multidimensional space. |
Authors: | David Charles Howe [aut, cre] |
Maintainer: | David Charles Howe <[email protected]> |
License: | GPL-3 |
Version: | 0.2.0 |
Built: | 2024-12-01 08:49:44 UTC |
Source: | CRAN |
An implementation of the 'k-means--' algorithm proposed by Chawla and Gionis, 2013 in their paper, "k-means-- : A unified approach to clustering and outlier detection. SIAM International Conference on Data Mining (SDM13)", doi:10.1137/1.9781611972832.21 and using 'ordering' described by Howe, 2013 in the thesis, "Clustering and anomaly detection in tropical cyclones".
Useful for creating (potentially) tighter clusters than standard k-means and simultaneously finding outliers inexpensively in multidimensional space.
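The general idea of k-means-- (assign points to their nearest centroid, set aside the l points farthest from any centroid as outliers, then recompute centroids from the remaining points) can be sketched in base R. This is a minimal illustration under stated assumptions, not the code used by kmod: the function name `kmeans_mm_sketch`, the random initialisation, and the exact-equality stopping rule are all choices made for this example, and the 'ordering' step from Howe's thesis is not reproduced.

```r
# Illustrative sketch only; kmeans_mm_sketch is a hypothetical helper,
# not part of the kmod package.
kmeans_mm_sketch <- function(X, k, l, i_max = 100) {
  X <- as.matrix(X)
  n <- nrow(X)
  # Assumption: initialise centroids with k rows chosen at random
  C <- X[sample.int(n, k), , drop = FALSE]
  outliers <- integer(0)
  for (i in seq_len(i_max)) {
    # Squared Euclidean distance from every point to every centroid (n x k)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, C[j, ])^2))
    cluster <- max.col(-d2, ties.method = "first")   # index of nearest centroid
    nearest <- d2[cbind(seq_len(n), cluster)]        # distance to that centroid
    # The l points farthest from their nearest centroid are set aside
    outliers <- if (l > 0) order(nearest, decreasing = TRUE)[seq_len(l)] else integer(0)
    keep <- setdiff(seq_len(n), outliers)
    # Recompute each centroid from its non-outlier members only
    C_new <- C
    for (j in seq_len(k)) {
      members <- keep[cluster[keep] == j]
      if (length(members) > 0) C_new[j, ] <- colMeans(X[members, , drop = FALSE])
    }
    if (all(C_new == C)) break   # centroids stopped moving
    C <- C_new
  }
  list(C = C, cluster = cluster, L_index = sort(outliers), iterations = i)
}

# Two tight groups plus three planted far-away points
set.seed(1)
X <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2),
           c(20, 20), c(-20, 20), c(20, -20))
res <- kmeans_mm_sketch(X, k = 2, l = 3)
print(res$L_index)
```

Because the starting centroids are random, which points end up flagged can vary between runs; a planted point that happens to be picked as an initial centroid may keep its own cluster instead of being set aside.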
kmod( X, k = 5, l = 0, i_max = 100, conv_method = "delta_C", conv_error = 0, allow_empty_c = FALSE )
X |
matrix of numeric data or an object that can be coerced to such a matrix (such as a data frame with numeric columns only). |
k |
the number of clusters (default = 5) |
l |
the number of outliers (default = 0) |
i_max |
the maximum number of iterations permissible (default = 100) |
conv_method |
character: the method used to assess if kmod has converged (default = "delta_C") |
conv_error |
numeric: the tolerance permissible when assessing convergence (default = 0) |
allow_empty_c |
logical: set whether empty clusters are permissible (default = FALSE) |
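As a hedged illustration of the arguments above, the call below passes a numeric-only data frame and names every argument explicitly. The data and the non-default values (`k = 2`, `l = 5`, `i_max = 50`) are arbitrary choices for this sketch, and the call is guarded so the snippet still runs when kmod is not installed.

```r
# Synthetic data: a data frame with numeric columns only, which the
# documentation says can be coerced to a matrix.
set.seed(1)
df <- data.frame(
  x = c(rnorm(50, sd = 0.3), rnorm(50, mean = 1, sd = 0.3)),
  y = c(rnorm(50, sd = 0.3), rnorm(50, mean = 1, sd = 0.3))
)

if (requireNamespace("kmod", quietly = TRUE)) {
  cl <- kmod::kmod(df, k = 2, l = 5, i_max = 50,
                   conv_method = "delta_C", conv_error = 0,
                   allow_empty_c = FALSE)
  print(cl$iterations)   # iterations taken to converge
}
```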
kmod returns a list comprising the following components
k
the number of clusters specified
l
the number of outliers specified
C
the set of cluster centroids
C_sizes
cluster sizes
C_ss
the sum of squares for each cluster
L
the set of outliers
L_dist_sqr
the squared distance from each outlier to C
L_index
the index of each outlier in the supplied dataset
XC_dist_sqr_assign
the squared distance and cluster assignment of each point in the supplied dataset
within_ss
the within cluster sum of squares (excludes outliers)
between_ss
the between cluster sum of squares
tot_ss
the total sum of squares
iterations
the number of iterations taken to converge
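The components listed above can be read from the returned list with `$`. The sketch below is illustrative only: it assumes `L_index` is an integer vector of row numbers (as its description suggests) and uses `str()` for `XC_dist_sqr_assign` rather than guessing its layout. The call is guarded so the snippet still runs when kmod is not installed.

```r
set.seed(42)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

if (requireNamespace("kmod", quietly = TRUE)) {
  cl <- kmod::kmod(x, k = 2, l = 5)
  print(cl$C)         # cluster centroids
  print(cl$C_sizes)   # number of points per cluster
  print(cl$L_index)   # positions of the outliers in x
  # Assumption: L_index holds row numbers, so negative indexing drops outliers
  x_clean <- x[-cl$L_index, , drop = FALSE]
  str(cl$XC_dist_sqr_assign)   # inspect the structure instead of assuming it
}
```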
# a 2-dimensional example with 2 clusters and 5 outliers
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmod(x, 2, 5))

# cluster a dataset with 8 clusters and 0 outliers
x <- kmod(x, 8)