mclustAddons is a contributed R package that extends
the functionality available in the mclust package
(Scrucca et al. 2016, Scrucca et al. 2023).
In particular, the following methods are included:
density estimation for data with bounded support (Scrucca, 2019);
modal clustering using modal EM algorithm for Gaussian mixtures (Scrucca, 2021);
entropy estimation via Gaussian mixture modeling (Robin & Scrucca, 2023).
This document gives a quick tour of mclustAddons (version 0.9). It was written in R Markdown, using the knitr package for production.
References on the methodologies implemented are provided at the end of this document.
x <- rchisq(200, 3)
xgrid <- seq(-2, max(x), length=1000)
f <- dchisq(xgrid, 3) # true density
dens <- densityMclustBounded(x, lbound = 0)
summary(dens, parameters = TRUE)
## ── Density estimation for bounded data via GMMs ───────────
##
## Boundaries: x
## lower 0
## upper Inf
##
## Model E (univariate, equal variance) model with 1 component
## on the transformation scale:
##
## log-likelihood n df BIC ICL
## -390.0517 200 3 -795.9983 -795.9983
##
## x
## Range-power transformation: 0.3715163
##
## Mixing probabilities:
## 1
## 1
##
## Means:
## 1
## 0.9191207
##
## Variances:
## 1
## 1.309037
plot(dens, what = "density")
lines(xgrid, f, lty = 2)
x <- rbeta(200, 5, 1.5)
xgrid <- seq(-0.1, 1.1, length=1000)
f <- dbeta(xgrid, 5, 1.5) # true density
dens <- densityMclustBounded(x, lbound = 0, ubound = 1)
summary(dens, parameters = TRUE)
## ── Density estimation for bounded data via GMMs ───────────
##
## Boundaries: x
## lower 0
## upper 1
##
## Model E (univariate, equal variance) model with 1 component
## on the transformation scale:
##
## log-likelihood n df BIC ICL
## 120.4011 200 3 224.9072 224.9072
##
## x
## Range-power transformation: -0.2200992
##
## Mixing probabilities:
## 1
## 1
##
## Means:
## 1
## 1.152107
##
## Variances:
## 1
## 0.5212129
plot(dens, what = "density")
x1 <- rchisq(200, 3)
x2 <- 0.5*x1 + sqrt(1-0.5^2)*rchisq(200, 5)
x <- cbind(x1, x2)
dens <- densityMclustBounded(x, lbound = c(0,0))
summary(dens, parameters = TRUE)
## ── Density estimation for bounded data via GMMs ───────────
##
## Boundaries: x1 x2
## lower 0 0
## upper Inf Inf
##
## Model VEE (ellipsoidal, equal shape and orientation) model with 1 component
## on the transformation scale:
##
## log-likelihood n df BIC ICL
## -889.9509 200 7 -1816.99 -1816.99
##
## x1 x2
## Range-power transformation: 0.2957409 0.3172159
##
## Mixing probabilities:
## 1
## 1
##
## Means:
## [,1]
## x1 1.041532
## x2 2.209400
##
## Variances:
## [,,1]
## x1 x2
## x1 1.2671762 0.3912456
## x2 0.3912456 0.8249777
plot(dens, what = "BIC")
The data consist in the lengths of 86 spells of psychiatric treatment undergone by control patients in a suicide study (Silverman, 1986).
data("suicide")
dens <- densityMclustBounded(suicide, lbound = 0)
summary(dens, parameters = TRUE)
## ── Density estimation for bounded data via GMMs ───────────
##
## Boundaries: suicide
## lower 0
## upper Inf
##
## Model E (univariate, equal variance) model with 1 component
## on the transformation scale:
##
## log-likelihood n df BIC ICL
## -497.8204 86 3 -1009.004 -1009.004
##
## suicide
## Range-power transformation: 0.1929267
##
## Mixing probabilities:
## 1
## 1
##
## Means:
## 1
## 6.700073
##
## Variances:
## 1
## 7.788326
plot(dens, what = "density",
lwd = 2, col = "dodgerblue2",
data = suicide, breaks = 15,
xlab = "Length of psychiatric treatment")
rug(suicide)
This dataset provides the proportion of white student enrollment in 56 school districts in Nassau County (Long Island, New York), for the 1992-1993 school year (Simonoff 1996, Sec. 3.2).
data("racial")
x <- racial$PropWhite
dens <- densityMclustBounded(x, lbound = 0, ubound = 1)
summary(dens, parameters = TRUE)
## ── Density estimation for bounded data via GMMs ───────────
##
## Boundaries: x
## lower 0
## upper 1
##
## Model E (univariate, equal variance) model with 1 component
## on the transformation scale:
##
## log-likelihood n df BIC ICL
## 42.4598 56 3 72.84355 72.84355
##
## x
## Range-power transformation: 0.3869476
##
## Mixing probabilities:
## 1
## 1
##
## Means:
## 1
## 2.795429
##
## Variances:
## 1
## 5.253254
plot(dens, what = "density",
lwd = 2, col = "dodgerblue2",
data = x, breaks = 15,
xlab = "Proportion of white student enrolled in schools")
rug(x)
data(Baudry_etal_2010_JCGS_examples, package = "mclust")
GMM <- Mclust(ex4.1)
plot(GMM, what = "classification")
MEM <- MclustMEM(GMM)
summary(MEM)
## ── Modal EM for GMMs ───────────────────
##
## Data dimensions = 600 x 2
## Mclust model = EEV,6
## MEM iterations = 17
## Number of modes = 4
##
## Modes:
## X1 X2
## mode1 8.06741504 -0.01772230
## mode2 8.07370160 4.98485099
## mode3 1.10622966 4.97230749
## mode4 -0.01639289 0.06464381
##
## Modal clustering:
## 1 2 3 4
## 118 122 228 132
plot(MEM)
MEM <- MclustMEM(GMM)
summary(MEM)
## ── Modal EM for GMMs ───────────────────
##
## Data dimensions = 300 x 3
## Mclust model = EVI,3
## MEM iterations = 15
## Number of modes = 2
##
## Modes:
## X1 X2 X3
## mode1 0.7964915 0.7444244 0.4547285
## mode2 0.4996361 0.5014374 0.4957522
##
## Modal clustering:
## 1 2
## 78 222
plot(MEM)
Univariate Gaussian
EntropyGauss(1) # population entropy
## [1] 1.418939
x = rnorm(1000) # generate sample
EntropyGauss(var(x)) # sample entropy assuming Gaussian distribution
## [1] 1.384565
mod = densityMclust(x, plot = FALSE)
EntropyGMM(mod) # GMM-based entropy estimate
## [1] 1.384065
plot(mod, what = "density", data = x, breaks = 31); rug(x)
Univariate Mixed-Gaussian
Consider the mixed-Gaussian distribution f(x) = 0.5 × N(−2, 1) + 0.5 × N(2, 1),
whose entropy is 2.051939 in the population.
cl = rbinom(1000, size = 1, prob = 0.5)
x = ifelse(cl == 1, rnorm(1000, 2, 1), rnorm(1000, -2, 1)) # generate sample
mod = densityMclust(x, plot = FALSE)
EntropyGMM(mod) # GMM-based entropy estimate
## [1] 2.037679
plot(mod, what = "density", data = x, breaks = 31); rug(x)
Multivariate Chi-squared
Consider a 10-dimensional independent χ2 distribution, whose
entropy is 24.23095 in the population.
Scrucca L., Fraley C., Murphy T. B. and Raftery A. E. (2023) Model-Based Clustering, Classification, and Density Estimation Using mclust in R. Chapman & Hall/CRC, ISBN: 978-1032234953, https://mclust-org.github.io/book/
Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal 8/1, pp. 289-317. https://doi.org/10.32614/RJ-2016-021
Scrucca L. (2019) A transformation-based approach to Gaussian mixture density estimation for bounded data, Biometrical Journal, 61:4, 873–888. https://doi.org/10.1002/bimj.201800174
Scrucca L. (2021) A fast and efficient Modal EM algorithm for Gaussian mixtures. Statistical Analysis and Data Mining, 14:4, 305–314. https://doi.org/10.1002/sam.11527
Robin S. and Scrucca L. (2023) Mixture-based estimation of entropy. Computational Statistics & Data Analysis, 177, 107582. https://doi.org/10.1016/j.csda.2022.107582
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mclustAddons_0.9 mclust_6.1.1 knitr_1.49 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] doParallel_1.0.17 cli_3.6.3 rlang_1.1.4 xfun_0.49
## [5] jsonlite_1.8.9 buildtools_1.0.0 rngtools_1.5.2 htmltools_0.5.8.1
## [9] maketools_1.3.1 sys_3.4.3 sass_0.4.9 evaluate_1.0.1
## [13] jquerylib_0.1.4 fastmap_1.2.0 yaml_2.3.10 foreach_1.5.2
## [17] lifecycle_1.0.4 doRNG_1.8.6 compiler_4.4.2 codetools_0.2-20
## [21] Rcpp_1.0.13-1 digest_0.6.37 R6_2.5.1 parallel_4.4.2
## [25] bslib_0.8.0 tools_4.4.2 iterators_1.0.14 cachem_1.1.0