Package 'drclust'

Title: Simultaneous Clustering and (or) Dimensionality Reduction
Description: Methods for simultaneous clustering and dimensionality reduction such as: Double k-means, Reduced k-means, Factorial k-means, Clustering with Disjoint PCA but also methods for exclusively dimensionality reduction: Disjoint PCA, Disjoint FA. The statistical methods implemented refer to the following articles: de Soete G., Carroll J. (1994) "K-means clustering in a low-dimensional Euclidean space" <doi:10.1007/978-3-642-51175-2_24> ; Vichi M. (2001) "Double k-means Clustering for Simultaneous Classification of Objects and Variables" <doi:10.1007/978-3-642-59471-7_6> ; Vichi M., Kiers H.A.L. (2001) "Factorial k-means analysis for two-way data" <doi:10.1016/S0167-9473(00)00064-5> ; Vichi M., Saporta G. (2009) "Clustering and disjoint principal component analysis" <doi:10.1016/j.csda.2008.05.028> ; Vichi M. (2017) "Disjoint factor analysis with cross-loadings" <doi:10.1007/s11634-016-0263-9>.
Authors: Ionel Prunila [aut, cre], Maurizio Vichi [aut]
Maintainer: Ionel Prunila <[email protected]>
License: GPL (>= 3)
Version: 0.1
Built: 2024-10-31 21:24:48 UTC
Source: CRAN


pseudoF (pF or Calinski-Harabasz) index for choosing k in partitioning models

Description

Calculates and plots the Calinski-Harabasz (CH, or pseudoF) index for k = 2, ..., maxK. Around each pF value the function builds an interval of width 2*tol*pF, so that the choice of K is less conservative. Instead of simply choosing the K with the maximum pF, it picks the value of K whose upper bound is larger than the maximum pF, if such a value exists.

Usage

apseudoF(data, maxK, tol, model, Q)

Arguments

data

Units x variables numeric data matrix.

maxK

Maximum number of clusters for the units to be tested.

tol

Approximation value. It is half of the relative width of the interval built around each pF (the interval is pF +/- tol*pF), with 0 <= tol < 1. Its default value is 0.05.

model

Partitioning Models to run for each value of k. (1 = doublekm; 2 = redkm; 3 = factkm; 4 = dpcakm)

Q

Number of principal components w.r.t. variables, used for each of the maxK - 1 partitions to be tested.

Value

bestK

best value of K (scalar).

Author(s)

Ionel Prunila, Maurizio Vichi

References

Calinski T., Harabasz J. (1974) "A dendrite method for cluster analysis" <doi:10.1080/03610927408827101>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

apF <- apseudoF(iris, maxK=10, tol = 0.05, model = 3, Q = 2)
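
The selection rule described above can be illustrated with base R only. This is a minimal sketch, not the package's implementation: it uses stats::kmeans in place of the drclust models, and it assumes that, among the values of K whose upper bound pF*(1 + tol) reaches the maximum pF, the smallest one is chosen (the manual does not state how ties among candidates are resolved). The names pF, upper and bestK_sketch are hypothetical.

```r
# Sketch: pseudoF (Calinski-Harabasz) for k = 2, ..., maxK with a tolerance band.
# Assumption: plain k-means replaces the drclust models; smallest admissible K wins.
set.seed(1)
X <- scale(as.matrix(datasets::iris[, -5]))
n <- nrow(X)
maxK <- 6
tol <- 0.05
ks <- 2:maxK

pF <- sapply(ks, function(k) {
  km <- kmeans(X, centers = k, nstart = 10)
  # between-cluster deviance per degree of freedom over within-cluster deviance
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
})

# Upper bound of the interval pF +/- tol * pF
upper <- pF * (1 + tol)

# Smallest K whose upper bound is at least as large as the maximum pF
bestK_sketch <- ks[which(upper >= max(pF))[1]]
```

With tol = 0 the rule reduces to picking the K with the maximum pF.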

Ward-dendrogram of centroids of partitioning models

Description

Plots the Ward-dendrogram of the centroids of a partitioning model. The plot is useful as a diagnostic tool for the choice of the number of clusters.

Usage

centree(drclust_out)

Arguments

drclust_out

Output of either doublekm, redkm, factkm or dpcakm.

Value

centroids-dkm

Centroids x centroids distance matrix.

Author(s)

Ionel Prunila, Maurizio Vichi

References

Ward J. H. (1963) "Hierarchical Grouping to Optimize an Objective Function" <doi:10.1080/01621459.1963.10500845>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

dc_out <- dpcakm(iris, 20, 3)
d <- centree(dc_out)
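
The idea behind the plot can be sketched in base R: compute the centroids of a partition, take their pairwise distances, and apply Ward's hierarchical clustering to them. The sketch below assumes plain k-means centroids and the "ward.D2" variant of stats::hclust; which Ward variant centree uses is not stated in this manual.

```r
# Sketch: Ward dendrogram built on the centroids of a k-means partition.
# Assumptions: kmeans() centroids stand in for drclust output; "ward.D2" is used.
set.seed(1)
X <- scale(as.matrix(datasets::iris[, -5]))
K <- 6
km <- kmeans(X, centers = K, nstart = 10)

# Centroids x centroids Euclidean distances
d_centroids <- dist(km$centers)

# Ward's agglomerative clustering of the K centroids
hc <- hclust(d_centroids, method = "ward.D2")

# Draw the dendrogram only in an interactive session
if (interactive()) plot(hc, main = "Ward dendrogram of centroids")
```

Large jumps in the merge heights (hc$height) suggest that fewer clusters than K may describe the data adequately.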

classification variable

Description

Recodes the binary and row-stochastic membership matrix U into the classification variable (similar to the "cluster" output returned by kmeans()).

Usage

cluster(U)

Arguments

U

Binary and row-stochastic matrix.

Value

cl

vector of length n indicating, for each element, the index of the cluster to which it has been assigned.

Author(s)

Ionel Prunila, Maurizio Vichi

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# standardizing the data
iris <- scale(iris)

# reduced k-means with 3 unit-clusters and 2 components for the variables
p1 <- redkm(iris, K = 3, Q = 2)
cl <- cluster(p1$U)
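
The recoding itself is simple to reproduce in base R. The sketch below builds a binary, row-stochastic matrix U from a hypothetical label vector (cl_true) and recovers the labels by multiplying U by the vector of cluster indices; it illustrates the mapping and is not the package's code.

```r
# Sketch: from a label vector to a membership matrix U and back.
cl_true <- c(2, 1, 3, 3, 1)   # hypothetical cluster labels for 5 units
K <- 3

# Row i of U is the dummy coding of the cluster of unit i
U <- diag(K)[cl_true, ]

# Each row contains a single 1, so U %*% (1:K) returns the column index of that 1
cl <- as.vector(U %*% seq_len(K))
```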

Cronbach Alpha

Description

Computes the Cronbach Alpha index on a units x variables data matrix. It measures the internal reliability, i.e., the propensity of J variables of a data matrix (n units x J variables) to be concordantly correlated with a single factor (composite indicator).

Usage

CronbachAlpha(X)

Arguments

X

Units x variables numeric data matrix.

Value

as

Cronbach's Alpha

Author(s)

Ionel Prunila, Maurizio Vichi

References

Cronbach L. J. (1951) "Coefficient alpha and the internal structure of tests" <doi:10.1007/BF02310555>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# standardizing the data
iris <- scale(iris)

# compute Cronbach's Alpha
as <- CronbachAlpha(iris)
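
For reference, the classical (raw) form of Cronbach's Alpha compares the sum of the item variances with the variance of the row totals. The function below is a sketch of that textbook formula; whether CronbachAlpha uses this raw form or a standardized variant is an assumption, and cronbach_alpha_sketch is a hypothetical name.

```r
# Sketch: raw Cronbach's Alpha, alpha = J/(J-1) * (1 - sum(var_j) / var(total)).
cronbach_alpha_sketch <- function(X) {
  J <- ncol(X)
  item_var <- apply(X, 2, var)   # variance of each variable
  total_var <- var(rowSums(X))   # variance of the unit totals
  (J / (J - 1)) * (1 - sum(item_var) / total_var)
}

X <- scale(as.matrix(datasets::iris[, -5]))
alpha_iris <- cronbach_alpha_sketch(X)
```

The index equals 1 when all variables are identical and decreases as the variables become less concordant.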

Disjoint Factor Analysis

Description

Performs disjoint factor analysis, i.e., a Factor Analysis with a simple structure. Each factor is defined by a disjoint subset of variables, which results in a simplified, easier-to-interpret loading matrix A and factors. Estimation is carried out via Maximum Likelihood.

Usage

disfa(X, Q, Rndstart, verbose, maxiter, tol, constr, prep, print)

Arguments

X

Units x variables numeric data matrix.

Q

Number of factors.

Rndstart

Number of runs to be performed (default is 20).

verbose

Outputs basic summary statistics for each run (1 = enabled; 0 = disabled, default option).

maxiter

Maximum number of iterations allowed if convergence is not yet reached (default is 100).

tol

Tolerance threshold: the maximum difference between the values of the objective function of two consecutive iterations such that convergence is assumed (default is 1e-6).

constr

Vector of length J (number of variables) pre-specifying to which cluster some of the variables must be assigned. Each component of the vector can take an integer value from 1 to Q (see the example for more details), or 0 if no constraint is imposed on the variable (i.e., it will be assigned by the plain algorithm).

prep

Pre-processing of the data. 1 performs the z-score transform (default choice); 2 performs the min-max transform; 0 leaves the data un-pre-processed.

print

Prints summary statistics of the performed method (1 = enabled; 0 = disabled, default option).

Value

returns a list of estimates and some descriptive quantities of the final results.

V

Variables x factors membership matrix (binary and row-stochastic). Each row is a dummy variable indicating to which cluster each variable has been assigned.

A

Variables x components loading matrix.

Psi

Specific variance of each observed variable, not accounted for by the common factors (matrix).

discrepancy

Value of the objective function, to be minimized. Difference between the observed and estimated covariance matrices (scalar).

RMSEA

Root Mean Square Error of Approximation (scalar).

AIC

Akaike Information Criterion (scalar).

BIC

Bayesian Information Criterion (scalar).

GFI

Goodness of Fit Index (scalar).

Author(s)

Ionel Prunila, Maurizio Vichi

References

Vichi M. (2017) "Disjoint factor analysis with cross-loadings" <doi:10.1007/s11634-016-0263-9>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# No constraint on variables
out <- disfa(iris, Q = 2)

# Constraint: the first two variables must contribute to the same factor.
outc <- disfa(iris, Q = 2, constr = c(1,1,0,0))
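
To make the role of A, Psi and the discrepancy more concrete, the sketch below evaluates the standard Maximum Likelihood discrepancy between an observed correlation matrix S and a model-implied matrix Sigma = A A' + Psi. The loadings in A are hypothetical numbers with a disjoint structure, not estimates; whether disfa reports exactly this function is an assumption, and ml_discrepancy is a hypothetical name.

```r
# Sketch: ML discrepancy F = log|Sigma| + tr(S Sigma^-1) - log|S| - J.
ml_discrepancy <- function(S, Sigma) {
  J <- nrow(S)
  log_det <- function(M) as.numeric(determinant(M, logarithm = TRUE)$modulus)
  log_det(Sigma) + sum(diag(S %*% solve(Sigma))) - log_det(S) - J
}

S <- cor(as.matrix(datasets::iris[, -5]))

# Hypothetical disjoint loadings: variables 1-2 load on factor 1, variables 3-4 on factor 2
A <- matrix(0, nrow = 4, ncol = 2)
A[1:2, 1] <- c(0.7, -0.5)
A[3:4, 2] <- c(0.9, 0.9)

# Specific variances chosen so that the implied matrix has unit diagonal
Psi <- diag(1 - rowSums(A^2))
Sigma <- A %*% t(A) + Psi

discrepancy_sketch <- ml_discrepancy(S, Sigma)
```

The function is zero when Sigma reproduces S exactly and positive otherwise, which is why it is minimized.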

Disjoint Principal Components Analysis

Description

Performs disjoint PCA, that is, a simplified version of PCA. Each of the Q principal components is computed from a different subset of the J variables, which results in a simplified, easier-to-interpret loading matrix A.

Usage

dispca(X, Q, Rndstart, verbose, maxiter, tol, prep, print, constr)

Arguments

X

Units x variables numeric data matrix.

Q

Number of principal components.

Rndstart

Number of runs to be performed (default is 20).

verbose

Outputs basic summary statistics for each run (1 = enabled; 0 = disabled, default option).

maxiter

Maximum number of iterations allowed if convergence is not yet reached (default is 100).

tol

Tolerance threshold (maximum difference between the values of the objective function of two consecutive iterations such that convergence is assumed). Default is 1e-6.

prep

Pre-processing of the data. 1 performs the z-score transform (default choice); 2 performs the min-max transform; 0 leaves the data un-pre-processed.

print

Prints summary statistics of the results (1 = enabled; 0 = disabled, default option).

constr

Vector of length J (number of variables) pre-specifying to which cluster some of the variables must be assigned. Each component of the vector can take an integer value from 1 to Q (see the example for more details), or 0 if no constraint is imposed on the variable (i.e., it will be assigned by the plain algorithm).

Value

returns a list of estimates and some descriptive quantities of the final results.

V

Variables x factors membership matrix (binary and row-stochastic). Each row is a dummy variable indicating to which cluster each variable has been assigned.

A

Variables x components loading matrix.

betweenss

Amount of deviance captured by the model (scalar).

totss

total amount of deviance (scalar).

size

Number of variables assigned to each column-cluster (vector).

loop

The index of the (best) run from which the results have been chosen.

it

the number of iterations performed during the (best) run.

Author(s)

Ionel Prunila, Maurizio Vichi

References

Vichi M., Saporta G. (2009) "Clustering and disjoint principal component analysis" <doi:10.1016/j.csda.2008.05.028>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# No constraint on variables
out <- dispca(iris, Q = 2)

# Constraint: the first two variables must contribute to the same factor.
outc <- dispca(iris, Q = 2, constr = c(1,1,0,0))
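
The structure of the output can be illustrated for a fixed assignment of variables to components. In the sketch below the assignment v is hypothetical and given in advance (dispca searches for it); for each group of variables, the first principal component of that group fills the corresponding column of A, and all other entries of that column are zero.

```r
# Sketch: disjoint loading matrix A for a FIXED, hypothetical variable partition.
X <- scale(as.matrix(datasets::iris[, -5]))
v <- c(1, 1, 2, 2)   # hypothetical assignment of the 4 variables to Q = 2 components
J <- ncol(X)
Q <- max(v)

# Variables x components membership matrix (binary and row-stochastic)
V <- diag(Q)[v, , drop = FALSE]

A <- matrix(0, nrow = J, ncol = Q)
for (q in seq_len(Q)) {
  idx <- which(v == q)
  # First eigenvector of the covariance matrix of the variables in group q
  e <- eigen(cov(X[, idx, drop = FALSE]), symmetric = TRUE)
  A[idx, q] <- e$vectors[, 1]
}

# Component scores: each column uses only its own subset of variables
scores <- X %*% A
```

Because the columns of A have disjoint supports and unit norm, A is orthonormal by construction.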

Double k-means Clustering

Description

Performs simultaneous k-means partitioning on units and variables (rows and columns of the data matrix).

Usage

doublekm(Xs, K, Q, Rndstart, verbose, maxiter, tol, prep, print)

Arguments

Xs

Units x variables numeric data matrix.

K

Number of clusters for the units.

Q

Number of clusters for the variables.

Rndstart

Number of runs to be performed (default is 20).

verbose

Outputs basic summary statistics for each run (1 = enabled; 0 = disabled, default option).

maxiter

Maximum number of iterations allowed if convergence is not yet reached (default is 100).

tol

Tolerance threshold. It is the maximum difference between the values of the objective function of two consecutive iterations such that convergence is assumed (default is 1e-6).

prep

Pre-processing of the data. 1 performs the z-score transform (default choice); 2 performs the min-max transform; 0 leaves the data un-pre-processed.

print

Prints summary statistics of the results (1 = enabled; 0 = disabled, default option).

Value

returns a list of estimates and some descriptive quantities of the final results.

U

Units x clusters membership matrix (binary and row-stochastic). Each row is a dummy variable indicating to which unit-cluster each unit has been assigned.

V

Variables x clusters membership matrix (binary and row-stochastic). Each row is a dummy variable indicating to which variable-cluster each variable has been assigned.

centers

K x Q matrix of centers containing the row means expressed in terms of column means.

totss

The total sum of squares (scalar).

withinss

Vector of within-row-cluster sum of squares, one component per cluster.

columnwise_withinss

Vector of within-column-cluster sum of squares, one component per cluster.

betweenss

Amount of deviance captured by the model (scalar).

K-size

Number of units assigned to each row-cluster (vector).

Q-size

Number of variables assigned to each column-cluster (vector).

pseudoF

Calinski-Harabasz index of the resulting (row-) partition (scalar).

loop

The index of the (best) run from which the results have been chosen.

it

the number of iterations performed during the (best) run.

Author(s)

Ionel Prunila, Maurizio Vichi

References

Vichi M. (2001) "Double k-means Clustering for Simultaneous Classification of Objects and Variables" <doi:10.1007/978-3-642-59471-7_6>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# double k-means with 3 unit-clusters and 2 variable-clusters
out <- doublekm(iris, K = 3, Q = 2)
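
The K x Q matrix of centers can be read as a table of block means. The sketch below assumes the usual double k-means centroid formula, centers = (U'U)^-1 U' X V (V'V)^-1, and evaluates it for hypothetical, fixed partitions of units and variables (doublekm estimates U and V itself).

```r
# Sketch: block means for FIXED, hypothetical row and column partitions.
X <- scale(as.matrix(datasets::iris[, -5]))
row_cl <- as.integer(datasets::iris$Species)   # hypothetical unit partition (K = 3)
col_cl <- c(1, 1, 2, 2)                        # hypothetical variable partition (Q = 2)
K <- 3
Q <- 2

U <- diag(K)[row_cl, ]   # units x clusters membership matrix
V <- diag(Q)[col_cl, ]   # variables x clusters membership matrix

# Entry (k, q) is the mean of the block of X defined by row-cluster k and column-cluster q
centers <- solve(crossprod(U)) %*% t(U) %*% X %*% V %*% solve(crossprod(V))

# Model reconstruction and residual sum of squares
X_hat <- U %*% centers %*% t(V)
rss <- sum((X - X_hat)^2)
```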

Clustering with Disjoint Principal Components Analysis

Description

Performs simultaneously k-means partitioning on units and disjoint PCA on the variables, computing each principal component from a different subset of variables. The result is a simplified, easier to interpret loading matrix A, the principal components and the clustering. The reduced subspace is identified by the centroids.

Usage

dpcakm(X, K, Q, Rndstart, verbose, maxiter, tol, constr, print, prep)

Arguments

X

Units x variables numeric data matrix.

K

Number of clusters for the units.

Q

Number of principal components.

Rndstart

Number of runs to be performed (default is 20).

verbose

Outputs basic summary statistics for each run (1 = enabled; 0 = disabled, default option).

maxiter

Maximum number of iterations allowed if convergence is not yet reached (default is 100).

tol

Tolerance threshold: the maximum difference between the values of the objective function of two consecutive iterations such that convergence is assumed (default is 1e-6).

constr

Vector of length J (number of variables) pre-specifying to which cluster some of the variables must be assigned. Each component of the vector can take an integer value from 1 to Q (the number of variable-clusters / principal components; see the examples for more details), or 0 if no constraint is imposed on the variable (i.e., it will be assigned by the plain algorithm).

print

Prints summary statistics of the results (1 = enabled; 0 = disabled, default option).

prep

Pre-processing of the data. 1 performs the z-score transform (default choice); 2 performs the min-max transform; 0 leaves the data un-pre-processed.

Value

returns a list of estimates and some descriptive quantities of the final results.

V

Variables x factors membership matrix (binary and row-stochastic). Each row is a dummy variable indicating to which cluster each variable has been assigned.

U

Units x clusters membership matrix (binary and row-stochastic). Each row is a dummy variable indicating to which cluster each unit has been assigned.

A

Variables x components loading matrix.

centers

K x Q matrix of centers containing the row means expressed in the reduced space of Q principal components.

totss

The total sum of squares (scalar).

withinss

Vector of within-cluster sum of squares, one component per cluster.

betweenss

Amount of deviance captured by the model (scalar).

K-size

Number of units assigned to each row-cluster (vector).

Q-size

Number of variables assigned to each column-cluster (vector).

pseudoF

Calinski-Harabasz index of the resulting partition (scalar).

loop

The index of the (best) run from which the results have been chosen.

it

the number of iterations performed during the (best) run.

Author(s)

Ionel Prunila, Maurizio Vichi

References

Vichi M., Saporta G. (2009) "Clustering and disjoint principal component analysis" <doi:10.1016/j.csda.2008.05.028>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# No constraint on variables
out <- dpcakm(iris, K = 3, Q = 2, Rndstart = 5)

# Constraint: the first two variables must contribute to the same factor.
outc <- dpcakm(iris, K = 3, Q = 2, Rndstart = 5,constr = c(1,1,0,0))

double pseudoF (Calinski-Harabasz) index

Description

A pseudoF version for double partitioning, for the choice of the number of clusters of the units and variables (rows and columns of the data matrix). It is a diagnostic tool for inspecting simultaneously the optimal number of unit-clusters and variable-clusters.

Usage

dpseudoF(data, maxK, maxQ)

Arguments

data

Units x variables numeric data matrix.

maxK

Maximum number of clusters for the units to be tested.

maxQ

Maximum number of clusters for the variables to be tested.

Value

dpseudoF

matrix containing the pF value for each pair of K and Q within the specified range

Author(s)

Ionel Prunila, Maurizio Vichi

References

Rocci R., Vichi M. (2008) "Two-mode multi-partitioning" <doi:10.1016/j.csda.2007.06.025>

Calinski T., Harabasz J. (1974) "A dendrite method for cluster analysis" <doi:10.1080/03610927408827101>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

dpF <- dpseudoF(iris, maxK = 10, maxQ = 3)

Factorial k-means

Description

Performs simultaneously k-means partitioning on units and principal component analysis on the variables. Identifies the best partition in a Least-Squares sense in the best reduced space of the data. Both the data and the centroids are used to identify the best Least-Squares reduced subspace, where their distances are also measured.

Usage

factkm(X, K, Q, Rndstart, verbose, maxiter, tol, rot, prep, print)

Arguments

X

Units x variables numeric data matrix.

K

Number of clusters for the units.

Q

Number of principal components w.r.t. variables.

Rndstart

Number of runs to be performed (default is 20).

verbose

Outputs basic summary statistics for each run (1 = enabled; 0 = disabled, default option).

maxiter

Maximum number of iterations allowed if convergence is not yet reached (default is 100).

tol

Tolerance threshold: the maximum difference between the values of the objective function of two consecutive iterations such that convergence is assumed (default is 1e-6).

rot

Performs varimax rotation of the axes obtained via PCA (1 = enabled; 0 = disabled, default option).

prep

Pre-processing of the data. 1 performs the z-score transform (default choice); 2 performs the min-max transform; 0 leaves the data un-pre-processed.

print

Prints summary statistics of the results (1 = enabled; 0 = disabled, default option).

Value

returns a list of estimates and some descriptive quantities of the final results.

U

Units x clusters membership matrix (binary and row-stochastic). Each row is a dummy variable indicating to which cluster each unit has been assigned.

A

Variables x components loading matrix (orthonormal).

centers

K x Q matrix of centers containing the row means expressed in the reduced space of Q principal components.

totss

The total sum of squares.

withinss

Vector of within-cluster sum of squares, one component per cluster.

betweenss

amount of deviance captured by the model.

size

Number of units assigned to each cluster.

pseudoF

Calinski-Harabasz index of the resulting partition.

loop

The index of the (best) run from which the results have been chosen.

it

the number of iterations performed during the (best) run.

Author(s)

Ionel Prunila, Maurizio Vichi

References

Vichi M., Kiers H.A.L. (2001) "Factorial k-means analysis for two-way data" <doi:10.1016/S0167-9473(00)00064-5>

Kaiser H.F. (1958) "The varimax criterion for analytic rotation in factor analysis" <doi:10.1007/BF02289233>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# factorial k-means with 3 unit-clusters and 2 components for the variables
out <- factkm(iris, K = 3, Q = 2, Rndstart = 15, verbose = 0, maxiter = 100, tol = 1e-7, rot = 1)
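
The difference between factorial and reduced k-means can be made concrete by evaluating both least-squares criteria for the same fixed quantities. The sketch below assumes the formulations usually given in the cited papers: reduced k-means measures ||X - U Ybar A'||^2 in the full space, while factorial k-means measures ||X A - U Ybar||^2 in the reduced space. Here A comes from a plain PCA and U from k-means on the scores, both hypothetical stand-ins for the estimates that factkm and redkm compute jointly.

```r
# Sketch: the two least-squares criteria evaluated for FIXED A and U.
set.seed(1)
X <- scale(as.matrix(datasets::iris[, -5]))
K <- 3
Q <- 2

# Orthonormal J x Q loading matrix from a plain PCA (hypothetical choice)
A <- eigen(crossprod(X), symmetric = TRUE)$vectors[, 1:Q]

# Unit partition from k-means on the component scores (hypothetical choice)
cl <- kmeans(X %*% A, centers = K, nstart = 10)$cluster
U <- diag(K)[cl, ]

# K x Q centroids in the reduced space
Ybar <- solve(crossprod(U)) %*% t(U) %*% X %*% A

rkm_obj <- sum((X - U %*% Ybar %*% t(A))^2)   # reduced k-means criterion
fkm_obj <- sum((X %*% A - U %*% Ybar)^2)      # factorial k-means criterion
residual <- sum((X - X %*% A %*% t(A))^2)     # part of X outside the subspace
```

For an orthonormal A the reduced k-means criterion equals the factorial k-means criterion plus the residual term, so the two methods differ in whether the information outside the subspace enters the loss.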

Heatmap of a partition in a reduced subspace

Description

Plots the heatmap of a partition on a reduced subspace obtained via either: doublekm, redkm, factkm or dpcakm.

Usage

heatm(data, drclust_out)

Arguments

data

Units x variables data matrix.

drclust_out

Output of either doublekm, redkm, factkm or dpcakm.

Value

No return value, called for side effects

Author(s)

Ionel Prunila, Maurizio Vichi

References

Kolde R. (2019) "pheatmap: Pretty Heatmaps" <https://cran.r-project.org/web/packages/pheatmap/index.html>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# standardizing the data
iris <- scale(iris)

# applying a clustering algorithm
drclust_out <- dpcakm(iris, 20, 3)

# obtain a heatmap based on the output of the clustering algorithm and the data
h <- heatm(iris, drclust_out)

Selecting the number of principal components to be extracted from a dataset

Description

Selects the optimal number of principal components to be extracted from a dataset based on Kaiser's criterion

Usage

kaiserCrit(data)

Arguments

data

Units x variables data matrix.

Value

bestQ

Number of components to be extracted (scalar).

Author(s)

Ionel Prunila, Maurizio Vichi

References

Kaiser H. F. (1960) "The Application of Electronic Computers to Factor Analysis" <doi:10.1177/001316446002000>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- scale(as.matrix(iris[,-5])) 

# Apply the Kaiser rule
h <- kaiserCrit(iris)
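
Kaiser's rule is commonly stated as: retain the components whose eigenvalue of the correlation matrix exceeds 1. The base R sketch below applies that textbook form; whether kaiserCrit works on the correlation matrix in exactly this way is an assumption, and bestQ_sketch is a hypothetical name.

```r
# Sketch: Kaiser's rule on the correlation matrix.
X <- as.matrix(datasets::iris[, -5])

# Eigenvalues of the correlation matrix (they sum to the number of variables)
ev <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

# Number of components with eigenvalue larger than 1
bestQ_sketch <- sum(ev > 1)
```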

Adjusted Rand Index

Description

Computes the Adjusted Rand Index (ARI) on a confusion matrix (row-by-column product of two partition matrices). The ARI is a measure of the similarity between two data clusterings.

Usage

mrand(N)

Arguments

N

Confusion matrix.

Value

mri

Adjusted Rand Index of a confusion matrix (scalar).

Author(s)

Ionel Prunila, Maurizio Vichi

References

Rand W. M. (1971) "Objective criteria for the evaluation of clustering methods" <doi:10.2307/2284239>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# standardizing the data
iris <- scale(iris)

# reduced k-means (2 components) and double k-means (2 variable-clusters), both with 3 unit-clusters
p1 <- redkm(iris, K = 3, Q = 2, Rndstart = 10)
p2 <- doublekm(iris, K=3, Q=2, Rndstart = 10)
mri <- mrand(t(p1$U)%*%p2$U)
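
The index can be computed from the confusion matrix alone with the usual pair-counting formula. The function below is a sketch of that formula (ari_sketch is a hypothetical name, not part of the package); the confusion matrix is built as t(U1) %*% U2 from two membership matrices, as in the example above.

```r
# Sketch: Adjusted Rand Index from a confusion matrix N (pair-counting form).
ari_sketch <- function(N) {
  n <- sum(N)
  sum_ij <- sum(choose(N, 2))          # pairs placed together in both partitions
  sum_a <- sum(choose(rowSums(N), 2))  # pairs together in the first partition
  sum_b <- sum(choose(colSums(N), 2))  # pairs together in the second partition
  expected <- sum_a * sum_b / choose(n, 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

# Two hypothetical partitions of 6 units; the second only relabels the first
cl1 <- c(1, 1, 2, 2, 3, 3)
cl2 <- c(3, 3, 1, 1, 2, 2)
U1 <- diag(3)[cl1, ]
U2 <- diag(3)[cl2, ]
N <- t(U1) %*% U2
ari_same <- ari_sketch(N)
```

Relabeling the clusters does not change the index, which is why identical partitions with permuted labels still score 1.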

k-means on a reduced subspace

Description

Performs simultaneously k-means partitioning on units and principal component analysis on the variables.

Usage

redkm(X, K, Q, Rndstart, verbose, maxiter, tol, rot, prep, print)

Arguments

X

Units x variables numeric data matrix.

K

Number of clusters for the units.

Q

Number of principal components w.r.t. variables.

Rndstart

Number of runs to be performed (default is 20).

verbose

Outputs basic summary statistics for each run (1 = enabled; 0 = disabled, default option).

maxiter

Maximum number of iterations allowed if convergence is not yet reached (default is 100).

tol

Tolerance threshold: the maximum difference between the values of the objective function of two consecutive iterations such that convergence is assumed (default is 1e-6).

rot

Performs varimax rotation of the axes obtained via PCA (1 = enabled; 0 = disabled, default option).

prep

Pre-processing of the data. 1 performs the z-score transform (default choice); 2 performs the min-max transform; 0 leaves the data un-pre-processed.

print

Prints summary statistics of the performed method (1 = enabled; 0 = disabled, default option).

Value

returns a list of estimates and some descriptive quantities of the final results.

U

Units x clusters membership matrix (binary and row-stochastic). Each row is a dummy variable indicating to which cluster each unit has been assigned.

A

Variables x components loading matrix (orthonormal).

centers

K x Q matrix of centers containing the row means expressed in the reduced space of Q principal components.

totss

The total sum of squares (scalar).

withinss

Vector of within-cluster sum of squares, one component per cluster.

betweenss

Amount of deviance captured by the model (scalar).

size

Number of units assigned to each cluster (vector).

pseudoF

Calinski-Harabasz index of the resulting partition (scalar).

loop

The index of the (best) run from which the results have been chosen.

it

the number of iterations performed during the (best) run.

Author(s)

Ionel Prunila, Maurizio Vichi

References

de Soete G., Carroll J. (1994) "K-means clustering in a low-dimensional Euclidean space" <doi:10.1007/978-3-642-51175-2_24>

Kaiser H.F. (1958) "The varimax criterion for analytic rotation in factor analysis" <doi:10.1007/BF02289233>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# reduced k-means with 3 unit-clusters and 2 components for the variables
out <- redkm(iris, K = 3, Q = 2, Rndstart = 15, verbose = 0, maxiter = 100, tol = 1e-7, rot = 1)

Silhouette

Description

Computes and plots the silhouette of a partition

Usage

silhouette(data, drclust_out)

Arguments

data

Units x variables data matrix.

drclust_out

Output of either doublekm, redkm, factkm or dpcakm.

Value

cl.silhouette

Silhouette index for the given partition, for each object (matrix).

fe.silhouette

Factoextra silhouette graphical object

Author(s)

Ionel Prunila, Maurizio Vichi

References

Rousseeuw P. J. (1987) "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis" <doi:10.1016/0377-0427(87)90125-7>

Maechler M. et al. (2023) "cluster: Cluster Analysis Basics and Extensions" <https://CRAN.R-project.org/package=cluster>

Kassambara A. (2022) "factoextra: Extract and Visualize the Results of Multivariate Data Analyses" <https://cran.r-project.org/web/packages/factoextra/index.html>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

#standardizing the data
iris <- scale(iris)

#applying a clustering algorithm
drclust_out <- dpcakm(iris, 20, 3)

#silhouette based on the data and the output of the clustering algorithm
d <- silhouette(iris, drclust_out)
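
For readers who want to see what the silhouette index measures, the sketch below computes it by hand in base R for a k-means partition. It follows Rousseeuw's definition, s(i) = (b(i) - a(i)) / max(a(i), b(i)), and is independent of the packages the function above relies on; the partition and the object names are hypothetical.

```r
# Sketch: silhouette widths computed directly from a distance matrix.
set.seed(1)
X <- scale(as.matrix(datasets::iris[, -5]))
cl <- kmeans(X, centers = 3, nstart = 10)$cluster
D <- as.matrix(dist(X))

sil <- sapply(seq_len(nrow(X)), function(i) {
  own <- cl == cl[i]
  if (sum(own) == 1) return(0)   # convention for singleton clusters
  # a: mean distance to the other members of the unit's own cluster
  a <- sum(D[i, own]) / (sum(own) - 1)
  # b: smallest mean distance to the members of any other cluster
  b <- min(tapply(D[i, !own], cl[!own], mean))
  (b - a) / max(a, b)
})

# Average silhouette width of the partition
avg_sil <- mean(sil)
```

Values close to 1 indicate well-assigned units, values near 0 units lying between clusters, and negative values possible misassignments.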