--- title: "UAHDataScienceUC: A Comprehensive Guide to Clustering Algorithms" author: "Andriy Protsak" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using the Unified Interface in clustlearn 1.1.0} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The UAHDataScienceUC package provides a robust collection of clustering algorithms implemented in R. This package, developed at the Universidad de Alcalá de Henares, offers both traditional and advanced clustering methods, making it a valuable tool for data scientists and researchers. In this vignette, we'll explore the various clustering algorithms available in the package and learn how to use them effectively. ## Installation You can install the package from CRAN using: ```{r, eval = FALSE} install.packages("UAHDataScienceUC") ``` ## Available algorithms The package implements several clustering algorithms, each with its own strengths and use cases: * K-Means Clustering * Agglomerative Hierarchical Clustering * Divisive Hierarchical Clustering * DBSCAN (Density-Based Spatial Clustering) * Gaussian Mixture Models * Genetic K-Means * Correlation-Based Clustering ```{r} # Load library library(UAHDataScienceUC) # Load data data(db5) # Create sample data data <- db5[1:10, ] ``` ## K-Means Clustering It partitions n observations into k clusters where each observation belongs to the cluster with the nearest mean. ```{r} # Perform k-means clustering result <- kmeans_(data, centers = 3, max_iterations = 10) # Plot results plot(data, col = result$cluster, pch = 20) points(result$centers, col = 1:3, pch = 8, cex = 2) ``` ## Agglomerative Hierarchical Clustering This algorithm builds a hierarchy of clusters from bottom-up, starting with individual observations and progressively merging them into clusters. ```{r} # Perform hierarchical clustering result <- agglomerative_clustering( data, proximity = "single", distance_method = "euclidean", learn = TRUE ) ``` ## DBSCAN DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is particularly effective at finding clusters of arbitrary shape and identifying noise points. ```{r} result <- dbscan( data, epsilon = 0.3, min_pts = 4, learn = TRUE ) ``` ## Gaussian Mixture Models This probabilistic model assumes that the data points are generated from a mixture of several Gaussian distributions. ```{r} result <- gaussian_mixture( data, k = 3, max_iter = 100, learn = TRUE ) # Plot results with contours plot(data, col = result$cluster, pch = 20) ``` ## Genetic K-Means This algorithm combines traditional k-means with genetic algorithm concepts for potentially better cluster optimization. ```{r} result <- genetic_kmeans( data, k = 3, population_size = 10, mut_probability = 0.5, max_generations = 10, learn = TRUE ) ``` ## Correlation Clustering Correlation clustering performs hierarchical clustering by analyzing relationships between data points and a target, with support for weighted features. 
## Correlation Clustering

Correlation clustering performs hierarchical clustering by analyzing the relationship between each data point and a target vector, with optional per-feature weights.

```{r}
# Create sample data: 5 observations of 3 features
data <- matrix(c(1, 2, 1, 4, 5, 1, 8, 2, 9, 6, 3, 5, 8, 5, 4), ncol = 3)
dataFrame <- data.frame(data)
target <- c(1, 2, 3)
weights <- c(0.1, 0.6, 0.3)

# Perform correlation clustering
result <- correlation_clustering(
  dataFrame,
  target = target,
  weight = weights,
  distance_method = "euclidean",
  normalize = TRUE,
  learn = TRUE
)
```

## Distances

The package supports several distance metrics for algorithms such as agglomerative clustering and correlation clustering: Euclidean, Manhattan, Canberra, and Chebyshev. You can specify the metric in any algorithm that accepts a `distance_method` parameter:

```{r}
# Using different distance metrics
agglomerative_clustering(data, distance_method = "euclidean")
agglomerative_clustering(data, distance_method = "manhattan")
agglomerative_clustering(data, distance_method = "canberra")
agglomerative_clustering(data, distance_method = "chebyshev")
```
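To see how the choice of metric affects the final grouping, you can cut the resulting trees into the same number of clusters and compare the assignments. This is a minimal sketch that assumes `agglomerative_clustering()` returns an object compatible with `stats::hclust()` (so that `cutree()` applies); if the package returns a different structure, use its own accessor instead, which is why the chunk is not evaluated.

```{r, eval = FALSE}
# Cut each dendrogram into 3 clusters and cross-tabulate the assignments;
# assumes hclust-compatible return values
res_euclidean <- agglomerative_clustering(data, distance_method = "euclidean")
res_manhattan <- agglomerative_clustering(data, distance_method = "manhattan")

table(
  euclidean = cutree(res_euclidean, k = 3),
  manhattan = cutree(res_manhattan, k = 3)
)
```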