Title: | Projection Pursuit Based on Gaussian Mixtures and Evolutionary Algorithms |
---|---|
Description: | Projection Pursuit (PP) algorithm for dimension reduction based on Gaussian Mixture Models (GMMs) for density estimation using Genetic Algorithms (GAs) to maximise an approximated negentropy index. For more details see Scrucca and Serafini (2019) <doi:10.1080/10618600.2019.1598871>. |
Authors: | Alessio Serafini [aut] , Luca Scrucca [aut, cre] |
Maintainer: | Luca Scrucca <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.3 |
Built: | 2024-12-12 07:09:00 UTC |
Source: | CRAN |
An R package implementing a Projection Pursuit (PP) algorithm based on finite Gaussian Mixture Models (GMMs) for density estimation using Genetic Algorithms (GAs) to maximise an approximated negentropy index. The ppgmmga algorithm provides a method to visualise high-dimensional data in a lower-dimensional space.
An introduction to the ppgmmga package is provided in the accompanying vignette "A quick tour of ppgmmga".
Serafini A. [email protected]
Scrucca L. [email protected]
Scrucca, L. and Serafini, A. (2019) Projection pursuit based on Gaussian mixtures and evolutionary algorithms. Journal of Computational and Graphical Statistics, 28:4, 847–860. DOI: 10.1080/10618600.2019.1598871
ppgmmga, plot.ppgmmga, ppgmmga-class, ppgmmga.options, summary.ppgmmga
Compute the number of classes for a histogram as the maximum of the "Sturges" and "FD" (Freedman-Diaconis) estimators, as in the NumPy library for Python.
nclass.numpy(x, ...)
x |
A vector of values. |
... |
Further arguments passed to or from other methods. |
Scrucca L. [email protected]
## Not run:
library(ggplot2)

x <- rnorm(100)
ggplot() + geom_histogram(aes(x), col = "grey92", bins = nclass.numpy(x))

x <- rnorm(1000)
ggplot() + geom_histogram(aes(x), col = "grey92", bins = nclass.numpy(x))

# Number of classes as a function of the sample size
n <- c(50, seq(100, 1000, by = 100))
brks <- rep(NA, length(n))
for(i in seq_along(n)) brks[i] <- nclass.numpy(rnorm(n[i]))
ggplot() + geom_point(aes(x = n, y = brks))
## End(Not run)
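The rule described above (take the larger of the Sturges and Freedman-Diaconis class counts) can be sketched with base R alone. The helper name `nclass_numpy_sketch` is hypothetical and is not the package's source code; it only assumes that `grDevices::nclass.Sturges()` and `grDevices::nclass.FD()` are acceptable stand-ins for the two estimators.

```r
# Hypothetical re-implementation for illustration; not the package's own code.
nclass_numpy_sketch <- function(x) {
  x <- x[is.finite(x)]                      # drop NA/Inf before counting classes
  max(grDevices::nclass.Sturges(x),         # Sturges estimator
      grDevices::nclass.FD(x))              # Freedman-Diaconis estimator
}

set.seed(1)
x <- rnorm(1000)
nclass_numpy_sketch(x)
```

Taking the maximum favours the finer of the two binnings, so the histogram is never coarser than either estimator alone would suggest.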
Plot method for objects of class 'ppgmmga'.
## S3 method for class 'ppgmmga'
plot(x, class = NULL, dim = seq(x$d), drawAxis = TRUE, bins = nclass.numpy, ...)
x |
An object of class 'ppgmmga'. |
class |
A numeric or character vector indicating the classification of the observations/cases to be plotted. |
dim |
A numeric vector indicating the dimensions to use for plotting. By default, all the dimensions of the projection subspace are used (dim = seq(x$d)). |
drawAxis |
A logical value specifying whether or not the axes should be included in the 2D scatterplot. By default TRUE. |
bins |
An R function to be used for computing the number of classes for the histogram. By default nclass.numpy. |
... |
Further arguments. |
Plots the cloud of points projected onto a subspace after applying the Projection Pursuit algorithm based on Gaussian mixtures and Genetic Algorithms implemented in the ppgmmga function.
Returns an object of class ggplot.
Serafini A. [email protected]
Scrucca L. [email protected]
Scrucca, L. and Serafini, A. (2019) Projection pursuit based on Gaussian mixtures and evolutionary algorithms. Journal of Computational and Graphical Statistics, 28:4, 847–860. DOI: 10.1080/10618600.2019.1598871
## Not run:
data(iris)
X <- iris[,-5]
Class <- iris$Species

# 1D
pp1 <- ppgmmga(data = X, d = 1, approx = "UT")
summary(pp1, check = TRUE)
plot(pp1)
plot(pp1, Class)

# 2D
pp2 <- ppgmmga(data = X, d = 2, approx = "UT")
summary(pp2, check = TRUE)
plot(pp2)
plot(pp2, Class)

# 3D
pp3 <- ppgmmga(data = X, d = 3)
summary(pp3, check = TRUE)
plot(pp3)
plot(pp3, Class)
plot(pp3, Class, dim = c(1,3))
plot(pp3, Class, dim = c(2,3))
## End(Not run)
A Projection Pursuit (PP) method for dimension reduction seeking "interesting" data structures in low-dimensional projections. A negentropy index is computed from the density estimated using Gaussian Mixture Models (GMMs). Then, the PP index is maximised by Genetic Algorithms (GAs) to find the optimal projection basis.
ppgmmga(data, d, approx = c("UT", "VAR", "SOTE", "none"), center = TRUE, scale = TRUE, GMM = NULL, gatype = c("ga", "gaisl"), options = ppgmmga.options(), seed = NULL, verbose = interactive(), ...)
data |
A |
d |
An integer specifying the dimension of the subspace onto which the data are projected and visualised. |
approx |
A string specifying the type of computation to perform to obtain the negentropy for GMMs. Possible values are "UT" (Unscented Transformation approximation), "VAR" (variational approximation), "SOTE" (Second Order Taylor Expansion approximation) and "none". |
center |
A logical value indicating whether or not the data are centred. By default TRUE. |
scale |
A logical value indicating whether or not the data are scaled. By default TRUE. |
GMM |
An object of class |
gatype |
A string specifying the type of genetic algorithm to be used to maximise the negentropy. Possible values are "ga" and "gaisl". |
options |
A list of options containing all the important arguments to pass to |
seed |
An integer value with the random number generator state. It may be used to replicate the results of the ppgmmga algorithm. |
verbose |
A logical value controlling whether the evolution of the GA search is shown. By default interactive(). |
... |
Further arguments passed to or from other methods. |
Projection pursuit (PP) is a feature extraction method for analysing high-dimensional data through low-dimensional projections, found by maximising a projection index over orthogonal projections. A general PP procedure can be summarised in a few steps: the data may be transformed, the PP index is chosen, and the subspace dimension is fixed. Then, the PP index is optimised.
For cluster visualisation the negentropy index is considered. Since this index requires an estimate of the underlying data density, Gaussian mixture models (GMMs) are used to approximate the density. Genetic Algorithms are then employed to maximise the negentropy with respect to the basis of the projection subspace.
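To make the negentropy index concrete, the following base R sketch estimates it by Monte Carlo for a univariate two-component Gaussian mixture. It uses the standard definition of negentropy as the entropy of a Gaussian with the same variance minus the entropy of the density itself. The mixture parameters and all variable names are illustrative assumptions, and the code does not reproduce the package's internal computations or its "UT", "VAR" and "SOTE" approximations.

```r
# Illustrative mixture parameters (assumptions, not taken from the package)
pro   <- c(0.5, 0.5)   # mixing proportions
mu    <- c(-2, 2)      # component means
sigma <- c(1, 1)       # component standard deviations

# Mixture density evaluated at z
dmix <- function(z) {
  pro[1] * dnorm(z, mu[1], sigma[1]) + pro[2] * dnorm(z, mu[2], sigma[2])
}

# Draw a sample from the mixture: pick a component, then draw from it
set.seed(123)
n <- 1e5
k <- sample(1:2, size = n, replace = TRUE, prob = pro)
z <- rnorm(n, mean = mu[k], sd = sigma[k])

# Monte Carlo estimate of the differential entropy of the mixture
h_mix <- -mean(log(dmix(z)))

# Entropy of a Gaussian with the same variance as the mixture
v <- sum(pro * (sigma^2 + mu^2)) - sum(pro * mu)^2
h_gauss <- 0.5 * log(2 * pi * exp(1) * v)

# Negentropy: zero for a Gaussian, positive for non-Gaussian structure
negentropy <- h_gauss - h_mix
negentropy
```

A clearly bimodal density such as this one yields a positive value, which is why maximising negentropy tends to favour projections that reveal clusters.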
Returns an object of class 'ppgmmga'. See ppgmmga-class for a description of the object.
Serafini A. [email protected]
Scrucca L. [email protected]
Scrucca, L. and Serafini, A. (2019) Projection pursuit based on Gaussian mixtures and evolutionary algorithms. Journal of Computational and Graphical Statistics, 28:4, 847–860. DOI: 10.1080/10618600.2019.1598871
summary.ppgmmga, plot.ppgmmga, ppgmmga-class
## Not run:
data(iris)
X <- iris[,-5]
Class <- iris$Species

# 1-dimensional PPGMMGA
PP1D <- ppgmmga(data = X, d = 1)
summary(PP1D)
plot(PP1D, bins = 11)
plot(PP1D, bins = 11, Class)

# 2-dimensional PPGMMGA
PP2D <- ppgmmga(data = X, d = 2)
summary(PP2D)
plot(PP2D)
plot(PP2D, Class)

## Unscented Transformation approximation
PP2D_1 <- ppgmmga(data = X, d = 2, approx = "UT")
summary(PP2D_1)
plot(PP2D_1, Class)

## VARiational approximation
PP2D_2 <- ppgmmga(data = X, d = 2, approx = "VAR")
summary(PP2D_2)
plot(PP2D_2, Class)

## Second Order Taylor Expansion approximation
PP2D_3 <- ppgmmga(data = X, d = 2, approx = "SOTE")
summary(PP2D_3)
plot(PP2D_3, Class)

# 3-dimensional PPGMMGA
PP3D <- ppgmmga(data = X, d = 3)
summary(PP3D)
plot(PP3D, Class)

# A rotating 3D plot can be obtained using:
# if(!require("msir")) install.packages("msir")
# msir::spinplot(PP3D$Z, markby = Class,
#                col.points = ppgmmga.options("classPlotColors")[1:3])
## End(Not run)
'ppgmmga'
An S3 class object for the ppgmmga algorithm.
Objects can be created by calls to the ppgmmga function.
The input data matrix.
The dimension of the projection subspace.
The type of approximation used for computing negentropy.
An object of class 'densityMclust' containing the Gaussian mixture density estimation. See densityMclust for details.
An object of class 'ga' containing the Genetic Algorithm search. See ga for details.
The value of maximised negentropy.
The matrix basis of the projection subspace.
The matrix of projected data.
Serafini A. [email protected]
Scrucca L. [email protected]
ppgmmga, plot.ppgmmga, summary.ppgmmga
Set or retrieve default values to be used by the ppgmmga package.
ppgmmga.options(...)
... |
A single character vector, or a named list with components. In the one argument case, the form |
This function can be used to set or retrieve the values to be used by the ppgmmga package.
The function sets the arguments globally for the current R session. The default options are restored with a new R session. To change the options temporarily for a single call to the ppgmmga function, use the options argument of ppgmmga.
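A per-call change might look like the sketch below. It assumes that calling `ppgmmga.options()` with no arguments returns the full list of current options, as suggested by the default `options = ppgmmga.options()` in the usage of ppgmmga; the particular values chosen for `popSize` and `maxiter` are arbitrary and only keep the run short.

```r
# Guarded so the snippet is skipped when the package is not installed
if (requireNamespace("ppgmmga", quietly = TRUE)) {
  library(ppgmmga)

  # Copy the current options and modify the copy only (assumed to be a list)
  opts <- ppgmmga.options()
  opts$popSize <- 50
  opts$maxiter <- 100

  # Pass the modified copy to a single call; global defaults are not touched
  pp <- ppgmmga(data = iris[, -5], d = 2, options = opts,
                seed = 1, verbose = FALSE)

  # The session-wide value should still be the one set globally
  print(ppgmmga.options("popSize"))
}
```

Working on a copy keeps later calls in the same session on the global defaults, which is usually what is wanted when experimenting with GA settings.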
Available options are:
modelNames
A string specifying the GMM to fit. See mclustModelNames for the available models.
G
An integer value or a vector of integer values specifying the number of mixture components. If more than a single value is provided, the best model is selected using the BIC criterion. By default G = 1:9.
initMclust
A string specifying the type of initialisation to be used for the EM algorithm. See mclust.options for more details.
popSize
The GA population size. By default popSize = 100.
pcrossover
The probability of crossover. By default pcrossover = 0.8.
pmutation
The probability of mutation. By default pmutation = 0.1.
maxiter
An integer value specifying the maximum number of iterations before stopping the GA. By default maxiter = 1000.
run
An integer value indicating the number of consecutive generations without improvement in the best fitness value before the GA is stopped. By default run = 100.
selection
An R function performing the selection genetic operator. See ga_Selection for details. By default selection = gareal_lsSelection.
crossover
An R function performing the crossover genetic operator. See ga_Crossover for details. By default crossover = gareal_laCrossover.
mutation
An R function performing the mutation genetic operator. See ga_Mutation for details. By default mutation = gareal_raMutation.
parallel
A logical value specifying whether or not the GA should be run in parallel. By default parallel = FALSE.
numIslands
An integer value specifying the number of islands to be used in the Island Genetic Algorithm. By default numIslands = 4.
migrationRate
A value specifying the fraction of migration between islands. By default migrationRate = 0.1.
migrationInterval
An integer value specifying the number of generations to run before each migration. By default migrationInterval = 10.
optim
A logical value specifying whether or not a local search should be performed. By default optim = TRUE.
optimPoptim
A value specifying the probability that a local search is performed at each GA generation. By default optimPoptim = 0.05.
optimPressel
A value in [0, 1] specifying the selection pressure. Values close to 1 tend to assign higher selection probabilities to solutions with higher fitness, whereas values close to 0 tend to assign equal selection probability to any solution. By default optimPressel = 0.5.
optimMethod
A string specifying the general-purpose optimisation method to be used for local search. See optim for the available algorithms. By default optimMethod = "L-BFGS-B".
optimMaxit
An integer value specifying the number of iterations for the local search algorithm. By default optimMaxit = 100.
orth
A string specifying the method employed to orthogonalise the matrix basis. Available methods are the QR decomposition "QR" and the Singular Value Decomposition "SVD". By default orth = "QR".
classPlotSymbols
A vector whose entries are either integers corresponding to graphics symbols or single characters for indicating classifications when plotting data. Classes are assigned symbols in the given order.
classPlotColors
A vector whose entries correspond to colors for indicating classifications when plotting data. Classes are assigned colors in the given order.
For more details about options related to Gaussian mixture modelling see densityMclust, and for those related to genetic algorithms see ga and gaisl.
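The two choices offered by the orth option can be illustrated in base R. This is only a sketch of what orthogonalising a basis by QR or by SVD generally means; the matrix `B` is random, the variable names are hypothetical, and the package's internal implementation may differ in detail.

```r
# A random 4 x 2 matrix standing in for a candidate projection basis
set.seed(1)
B <- matrix(rnorm(4 * 2), nrow = 4, ncol = 2)

# QR: the Q factor has orthonormal columns spanning the same space as B
B_qr <- qr.Q(qr(B))

# SVD: U V' is the orthonormal-column matrix closest to B (polar factor)
s <- svd(B)
B_svd <- s$u %*% t(s$v)

# Both cross-products should equal the 2 x 2 identity up to rounding error
round(crossprod(B_qr), 10)
round(crossprod(B_svd), 10)
```

Both results span the same subspace as `B`; they differ only by a rotation within that subspace.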
Serafini A. [email protected]
Scrucca L. [email protected]
Scrucca, L., Fop, M., Murphy, T. B. and Raftery, A. E. (2016) mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 205-233. https://journal.r-project.org/archive/2016-1/scrucca-fop-murphy-etal.pdf
Scrucca, L. (2013) GA: A Package for Genetic Algorithms in R. Journal of Statistical Software, 53(4), 1-37. http://www.jstatsoft.org/v53/i04/
Scrucca, L. (2017) On some extensions to GA package: hybrid optimisation, parallelisation and islands evolution. The R Journal, 9(1), 187-206. https://journal.r-project.org/archive/2017/RJ-2017-008
Scrucca, L. and Serafini, A. (2019) Projection pursuit based on Gaussian mixtures and evolutionary algorithms. Journal of Computational and Graphical Statistics, 28:4, 847–860. DOI: 10.1080/10618600.2019.1598871
## Not run:
ppgmmga.options()

# Print a single option
ppgmmga.options("popSize")

# Change (globally) an option
ppgmmga.options("popSize" = 10)
ppgmmga.options("popSize")
## End(Not run)
Summary method for objects of class 'ppgmmga'.
## S3 method for class 'ppgmmga'
summary(object, check = (object$approx != "none"), ...)
## S3 method for class 'summary.ppgmmga'
print(x, digits = getOption("digits"), ...)
object |
An object of class 'ppgmmga'. |
check |
A logical value specifying whether or not a Monte Carlo negentropy approximation check should be performed. By default (object$approx != "none"), i.e. TRUE whenever an approximation is used. |
x |
An object of class 'summary.ppgmmga'. |
digits |
The number of significant digits. |
... |
Further arguments passed to or from other methods. |
The summary function returns an object of class summary.ppgmmga, which can be printed by the corresponding print method. A list with the information from the ppgmmga algorithm is returned.
If the optional argument check = TRUE, the value of negentropy is compared to the Monte Carlo negentropy calculated for the same optimal projection basis selected by the algorithm. This check shows whether the value returned by the chosen approximation is close to the Monte Carlo approximation of the "true" negentropy.
The ratio between the approximated value returned by the algorithm and the value computed with Monte Carlo is called Relative Accuracy. This value should be close to 1 for a good approximation.
Serafini A. [email protected]
Scrucca L. [email protected]