--- title: "UAHDataScienceUC: A Comprehensive Guide to Clustering Algorithms" author: "Andriy Protsak" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using the Unified Interface in clustlearn 1.1.0} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The UAHDataScienceUC package provides a robust collection of clustering algorithms implemented in R. This package, developed at the Universidad de Alcalá de Henares, offers both traditional and advanced clustering methods, making it a valuable tool for data scientists and researchers. In this vignette, we'll explore the various clustering algorithms available in the package and learn how to use them effectively. ## Installation You can install the package from CRAN using: ```{r, eval = FALSE} install.packages("UAHDataScienceUC") ``` ## Available algorithms The package implements several clustering algorithms, each with its own strengths and use cases: * K-Means Clustering * Agglomerative Hierarchical Clustering * Divisive Hierarchical Clustering * DBSCAN (Density-Based Spatial Clustering) * Gaussian Mixture Models * Genetic K-Means * Correlation-Based Clustering ```{r} # Load library library(UAHDataScienceUC) # Load data data(db5) # Create sample data data <- db5[1:10, ] ``` ## K-Means Clustering It partitions n observations into k clusters where each observation belongs to the cluster with the nearest mean. ```{r} # Perform k-means clustering result <- kmeans_(data, centers = 3, max_iterations = 10) # Plot results plot(data, col = result$cluster, pch = 20) points(result$centers, col = 1:3, pch = 8, cex = 2) ``` ## Agglomerative Hierarchical Clustering This algorithm builds a hierarchy of clusters from bottom-up, starting with individual observations and progressively merging them into clusters. ```{r} # Perform hierarchical clustering result <- agglomerative_clustering( data, proximity = "single", distance_method = "euclidean", learn = TRUE ) ``` ## DBSCAN DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is particularly effective at finding clusters of arbitrary shape and identifying noise points. ```{r} result <- dbscan( data, epsilon = 0.3, min_pts = 4, learn = TRUE ) ``` ## Gaussian Mixture Models This probabilistic model assumes that the data points are generated from a mixture of several Gaussian distributions. ```{r} result <- gaussian_mixture( data, k = 3, max_iter = 100, learn = TRUE ) # Plot results with contours plot(data, col = result$cluster, pch = 20) ``` ## Genetic K-Means This algorithm combines traditional k-means with genetic algorithm concepts for potentially better cluster optimization. ```{r} result <- genetic_kmeans( data, k = 3, population_size = 10, mut_probability = 0.5, max_generations = 10, learn = TRUE ) ``` ## Correlation Clustering Correlation clustering performs hierarchical clustering by analyzing relationships between data points and a target, with support for weighted features. 
## Correlation Clustering

Correlation clustering performs hierarchical clustering by analyzing the relationship between each data point and a target vector, with optional per-feature weights.

```{r}
# Create sample data: 5 observations of 3 features
data <- matrix(c(1, 2, 1, 4, 5, 1, 8, 2, 9, 6, 3, 5, 8, 5, 4), ncol = 3)
dataFrame <- data.frame(data)
target <- c(1, 2, 3)
weights <- c(0.1, 0.6, 0.3)

# Perform correlation clustering
result <- correlation_clustering(
  dataFrame,
  target = target,
  weight = weights,
  distance_method = "euclidean",
  normalize = TRUE,
  learn = TRUE
)
```

## Distances

The package supports several distance metrics for algorithms such as agglomerative clustering and correlation clustering: Euclidean, Manhattan, Canberra, and Chebyshev. You can specify the metric in any algorithm that accepts a `distance_method` parameter:

```{r}
# Using different distance metrics
agglomerative_clustering(data, distance_method = "euclidean")
agglomerative_clustering(data, distance_method = "manhattan")
agglomerative_clustering(data, distance_method = "canberra")
agglomerative_clustering(data, distance_method = "chebyshev")
```
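To see how the choice of metric affects the final grouping, you can cut the resulting trees into the same number of clusters and compare the assignments. This is a minimal sketch that assumes `agglomerative_clustering()` returns an object compatible with `stats::hclust()` (so that `cutree()` applies); if the package returns a different structure, use its own accessor instead, which is why the chunk is not evaluated.

```{r, eval = FALSE}
# Cut each dendrogram into 3 clusters and cross-tabulate the assignments;
# assumes hclust-compatible return values
res_euclidean <- agglomerative_clustering(data, distance_method = "euclidean")
res_manhattan <- agglomerative_clustering(data, distance_method = "manhattan")

table(
  euclidean = cutree(res_euclidean, k = 3),
  manhattan = cutree(res_manhattan, k = 3)
)
```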