Title: | Overfitting Bayesian Mixtures of Factor Analyzers with Parsimonious Covariance and Unknown Number of Components |
---|---|
Description: | Model-based clustering of multivariate continuous data using Bayesian mixtures of factor analyzers (Papastamoulis (2019) <DOI:10.1007/s11222-019-09891-z> (2018) <DOI:10.1016/j.csda.2018.03.007>). The number of clusters is estimated using overfitting mixture models (Rousseau and Mengersen (2011) <DOI:10.1111/j.1467-9868.2011.00781.x>): suitable prior assumptions ensure that asymptotically the extra components will have zero posterior weight, therefore, the inference is based on the ``alive'' components. A Gibbs sampler is implemented in order to (approximately) sample from the posterior distribution of the overfitting mixture. A prior parallel tempering scheme is also available, which allows to run multiple parallel chains with different prior distributions on the mixture weights. These chains run in parallel and can swap states using a Metropolis-Hastings move. Eight different parameterizations give rise to parsimonious representations of the covariance per cluster (following Mc Nicholas and Murphy (2008) <DOI:10.1007/s11222-008-9056-0>). The model parameterization and number of factors is selected according to the Bayesian Information Criterion. Identifiability issues related to label switching are dealt by post-processing the simulated output with the Equivalence Classes Representatives algorithm (Papastamoulis and Iliopoulos (2010) <DOI:10.1198/jcgs.2010.09008>, Papastamoulis (2016) <DOI:10.18637/jss.v069.c01>). |
Authors: | Panagiotis Papastamoulis [aut, cre] |
Maintainer: | Panagiotis Papastamoulis <[email protected]> |
License: | GPL-2 |
Version: | 5.1 |
Built: | 2024-12-09 06:38:49 UTC |
Source: | CRAN |
Model-based clustering of multivariate continuous data using Bayesian mixtures of factor analyzers (Papastamoulis (2019) <DOI:10.1007/s11222-019-09891-z> (2018) <DOI:10.1016/j.csda.2018.03.007>). The number of clusters is estimated using overfitting mixture models (Rousseau and Mengersen (2011) <DOI:10.1111/j.1467-9868.2011.00781.x>): suitable prior assumptions ensure that asymptotically the extra components will have zero posterior weight, therefore, the inference is based on the “alive” components. A Gibbs sampler is implemented in order to (approximately) sample from the posterior distribution of the overfitting mixture. A prior parallel tempering scheme is also available, which allows to run multiple parallel chains with different prior distributions on the mixture weights. These chains run in parallel and can swap states using a Metropolis-Hastings move. Eight different parameterizations give rise to parsimonious representations of the covariance per cluster (following Mc Nicholas and Murphy (2008) <DOI:10.1007/s11222-008-9056-0>). The model parameterization and number of factors is selected according to the Bayesian Information Criterion. Identifiability issues related to label switching are dealt by post-processing the simulated output with the Equivalence Classes Representatives algorithm (Papastamoulis and Iliopoulos (2010) <DOI:10.1198/jcgs.2010.09008>, Papastamoulis (2016) <DOI:10.18637/jss.v069.c01>).
The main fuction of the package is fabMix
.
Panagiotis Papastamoulis
Maintainer: Panagiotis Papastamoulis <[email protected]>
Fokoue, E. and Titterington, D.M. (2003). Mixtures of Factor Analysers: Bayesian Estimation and Inference by Stochastic Simulation. Machine Learing, 50(1): 73-94.
McNicholas, P.D. and Murphy, T.B. Statistics and Computing (2008) 18: 285. https://doi.org/10.1007/s11222-008-9056-0.
Papastamoulis P. and Iliopoulos G. (2010). An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. Journal of Computational and Graphical Statistics, 19: 313-331.
Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society, Series B (methodological), 73(5): 689-710.
van Havre, Z., White, N., Rousseau, J. and Mengersen, K. (2015). Overfitting Bayesian Mixture Models with an Unknown Number of Components. PLOS ONE, 10(7): 1-27.
Papastamoulis, P. (2016). label.switching
: An R Package for Dealing with the Label Switching Problem in MCMC Outputs. Journal of Statistical Software, 69(1), 1-24.
Papastamoulis, P. (2018). Overfitting Bayesian mixtures of factor analyzers with an unknown number of components. Computational Statistics and Data Analysis, 124: 220-234. DOI: 10.1016/j.csda.2018.03.007.
Papastamoulis, P (2019). Clustering multivariate data using factor analytic Bayesian mixtures with an unknown number of components. Statistics and Computing, doi: 10.1007/s11222-019-09891-z.
# TOY EXAMPLE (very small numbers... only for CRAN check purposes) ################################################################# # (a) using 2 cores in parallel, each one running 2 heated chains. ################################################################# library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) # Run `fabMix` for a _small_ number of iterations # considering only `UUU` (maximal model), # using the default prior parallel heating parameters `dirPriorAlphas`. # NOTE: `dirPriorAlphas` may require some tuning in general. qRange <- 2 # values for the number of factors (only the true number # is considered here) Kmax <- 2 # number of components for the overfitted mixture model nChains <- 2 # number of parallel heated chains set.seed(1) fm <- fabMix( model = c("UUU"), nChains = nChains, rawData = syntheticDataset$data, outDir = "toyExample", Kmax = Kmax, mCycles = 4, burnCycles = 1, q = qRange, g = 0.5, h = 0.5, alpha_sigma = 0.5, beta_sigma = 0.5, warm_up_overfitting = 2, warm_up = 5) # WARNING: the following parameters: # Kmax, nChains, mCycles, burnCycles, warm_up_overfitting, warm_up # should take (much) _larger_ values. E.g. a typical implementation consists of: # Kmax = 20, nChains >= 3, mCycles = 1100, burnCycles = 100, # warm_up_overfitting = 500, warm_up = 5000. # Now print a run summary and produce some plots. print(fm) # you may also plot, summary the output. ################################################################# # (b) using 12 cores_____________________________________________ #_______4 models with 3 heated chains running in parallel________ #_______considering all 8 model parameterizations________________ ################################################################# ## Not run: library('fabMix') set.seed(99) n = 100 # sample size p = 30 # number of variables q = 2 # number of factors K = 5 # number of clusters sINV_diag = rep(1/100,p) # diagonal of inverse variance of errors syntheticDataset <- simData(sameLambda=FALSE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) qRange <- 1:3 # range of values for the number of factors Kmax <- 20 # number of components for the overfitted mixture model nChains <- 3 # number of parallel heated chains # the next command takes ~ 1 hour in a Linux workstation with 12 threads. fm <- fabMix( parallelModels = 4, nChains = nChains, model = c("UUU","CUU","UCU","CCU","UCC","UUC","CUC","CCC"), rawData = syntheticDataset$data, outDir = "toyExample_b", Kmax = Kmax, mCycles = 600, burnCycles = 100, q = qRange, g = 0.5, h = 0.5, alpha_sigma = 0.5, beta_sigma = 0.5, warm_up_overfitting = 500, warm_up = 5000) print(fm) plot(fm, what = "BIC") plot(fm, what = "classification_pairs") ## End(Not run)
# TOY EXAMPLE (very small numbers... only for CRAN check purposes) ################################################################# # (a) using 2 cores in parallel, each one running 2 heated chains. ################################################################# library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) # Run `fabMix` for a _small_ number of iterations # considering only `UUU` (maximal model), # using the default prior parallel heating parameters `dirPriorAlphas`. # NOTE: `dirPriorAlphas` may require some tuning in general. qRange <- 2 # values for the number of factors (only the true number # is considered here) Kmax <- 2 # number of components for the overfitted mixture model nChains <- 2 # number of parallel heated chains set.seed(1) fm <- fabMix( model = c("UUU"), nChains = nChains, rawData = syntheticDataset$data, outDir = "toyExample", Kmax = Kmax, mCycles = 4, burnCycles = 1, q = qRange, g = 0.5, h = 0.5, alpha_sigma = 0.5, beta_sigma = 0.5, warm_up_overfitting = 2, warm_up = 5) # WARNING: the following parameters: # Kmax, nChains, mCycles, burnCycles, warm_up_overfitting, warm_up # should take (much) _larger_ values. E.g. a typical implementation consists of: # Kmax = 20, nChains >= 3, mCycles = 1100, burnCycles = 100, # warm_up_overfitting = 500, warm_up = 5000. # Now print a run summary and produce some plots. print(fm) # you may also plot, summary the output. ################################################################# # (b) using 12 cores_____________________________________________ #_______4 models with 3 heated chains running in parallel________ #_______considering all 8 model parameterizations________________ ################################################################# ## Not run: library('fabMix') set.seed(99) n = 100 # sample size p = 30 # number of variables q = 2 # number of factors K = 5 # number of clusters sINV_diag = rep(1/100,p) # diagonal of inverse variance of errors syntheticDataset <- simData(sameLambda=FALSE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) qRange <- 1:3 # range of values for the number of factors Kmax <- 20 # number of components for the overfitted mixture model nChains <- 3 # number of parallel heated chains # the next command takes ~ 1 hour in a Linux workstation with 12 threads. fm <- fabMix( parallelModels = 4, nChains = nChains, model = c("UUU","CUU","UCU","CCU","UCC","UUC","CUC","CCC"), rawData = syntheticDataset$data, outDir = "toyExample_b", Kmax = Kmax, mCycles = 600, burnCycles = 100, q = qRange, g = 0.5, h = 0.5, alpha_sigma = 0.5, beta_sigma = 0.5, warm_up_overfitting = 500, warm_up = 5000) print(fm) plot(fm, what = "BIC") plot(fm, what = "classification_pairs") ## End(Not run)
Complete log-likelihood function for models with same error variance per component (xCx).
complete.log.likelihood(x_data, w, mu, Lambda, SigmaINV, z)
complete.log.likelihood(x_data, w, mu, Lambda, SigmaINV, z)
x_data |
|
w |
a vector of length |
mu |
|
Lambda |
|
SigmaINV |
|
z |
A vector of length |
complete log-likelihood value
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) Lambda <- array( runif(K*p*q), dim = c(K,p,q) ) SigmaINV <- array(1, dim = c(p,p)) # compute the complete.log.likelihood complete.log.likelihood(x_data = x_data, w = w, mu = mu, Lambda = Lambda, SigmaINV = SigmaINV, z = z)
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) Lambda <- array( runif(K*p*q), dim = c(K,p,q) ) SigmaINV <- array(1, dim = c(p,p)) # compute the complete.log.likelihood complete.log.likelihood(x_data = x_data, w = w, mu = mu, Lambda = Lambda, SigmaINV = SigmaINV, z = z)
Complete log-likelihood function for models with different error variance per component (xUx) and diagonal covariance structure per component (.
complete.log.likelihood_q0(x_data, w, mu, SigmaINV, z)
complete.log.likelihood_q0(x_data, w, mu, SigmaINV, z)
x_data |
|
w |
a vector of length |
mu |
|
SigmaINV |
|
z |
A vector of length |
complete log-likelihood value
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- scale(waveDataset1500[ 1:20, -1]) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( -0.1 + 0.2*runif(K * p), dim = c(K,p) ) SigmaINV <- array( 1, dim = c(K,p,p)) # compute the complete.log.likelihood ( -inf ) complete.log.likelihood_q0(x_data = x_data, w = w, mu = mu, SigmaINV = SigmaINV, z = z)
library('fabMix') data(waveDataset1500) x_data <- scale(waveDataset1500[ 1:20, -1]) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( -0.1 + 0.2*runif(K * p), dim = c(K,p) ) SigmaINV <- array( 1, dim = c(K,p,p)) # compute the complete.log.likelihood ( -inf ) complete.log.likelihood_q0(x_data = x_data, w = w, mu = mu, SigmaINV = SigmaINV, z = z)
Complete log-likelihood function for models with same error variance per component (xCx) and diagonal covariance structure per component (.
complete.log.likelihood_q0_sameSigma(x_data, w, mu, SigmaINV, z)
complete.log.likelihood_q0_sameSigma(x_data, w, mu, SigmaINV, z)
x_data |
|
w |
a vector of length |
mu |
|
SigmaINV |
|
z |
A vector of length |
complete log-likelihood value
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- scale(waveDataset1500[ 1:20, -1]) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( -0.1 + 0.2*runif(K * p), dim = c(K,p) ) SigmaINV <- array( 1, dim = c(p,p)) # compute the complete.log.likelihood ( -inf ) complete.log.likelihood_q0_sameSigma(x_data = x_data, w = w, mu = mu, SigmaINV = SigmaINV, z = z)
library('fabMix') data(waveDataset1500) x_data <- scale(waveDataset1500[ 1:20, -1]) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( -0.1 + 0.2*runif(K * p), dim = c(K,p) ) SigmaINV <- array( 1, dim = c(p,p)) # compute the complete.log.likelihood ( -inf ) complete.log.likelihood_q0_sameSigma(x_data = x_data, w = w, mu = mu, SigmaINV = SigmaINV, z = z)
Complete log-likelihood function for models with different error variance per component (xUx).
complete.log.likelihood_Sj(x_data, w, mu, Lambda, SigmaINV, z)
complete.log.likelihood_Sj(x_data, w, mu, Lambda, SigmaINV, z)
x_data |
|
w |
a vector of length |
mu |
|
Lambda |
|
SigmaINV |
|
z |
The allocation vector. |
complete log-likelihood value
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) Lambda <- array( runif(K*p*q), dim = c(K,p,q) ) SigmaINV <- array( c(0.5, 0.75, 1), dim = c(K,p,p)) # compute the complete.log.likelihood complete.log.likelihood_Sj(x_data = x_data, w = w, mu = mu, Lambda = Lambda, SigmaINV = SigmaINV, z = z)
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) Lambda <- array( runif(K*p*q), dim = c(K,p,q) ) SigmaINV <- array( c(0.5, 0.75, 1), dim = c(K,p,p)) # compute the complete.log.likelihood complete.log.likelihood_Sj(x_data = x_data, w = w, mu = mu, Lambda = Lambda, SigmaINV = SigmaINV, z = z)
This function simulates and
.
compute_A_B_G_D_and_simulate_mu_Lambda(SigmaINV, suff_statistics, OmegaINV, K, priorConst1, T_INV, v_r)
compute_A_B_G_D_and_simulate_mu_Lambda(SigmaINV, suff_statistics, OmegaINV, K, priorConst1, T_INV, v_r)
SigmaINV |
Precision matrix |
suff_statistics |
Sufficient statistics |
OmegaINV |
Prior parameter: |
K |
Number of overfitting mixture components |
priorConst1 |
Prior constant: |
T_INV |
Prior parameter: |
v_r |
This vector is used to set to zero the upper right |
A list containing a draw from the conditional distributions of and
:
Lambdas |
|
mu |
|
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- scale(as.matrix(waveDataset1500[ 1:20, -1])) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes T_INV <- array(data = 0, dim = c(p,p)) diag(T_INV) <- diag(var(x_data)) diag(T_INV) <- 1/diag(T_INV) ksi <- colMeans(x_data) priorConst1 <- array(diag(T_INV)*ksi, dim =c(p,1)) # give some arbitrary values to the parameters: set.seed(1) mu <- array( runif(K * p), dim = c(K,p) ) y <- array(rnorm(n = q*n), dim = c(n,q)) SigmaINV <- array(data = 0, dim = c(p,p) ) diag(SigmaINV) <- 0.5 + 0.5*runif(p) OmegaINV <- diag(q) # compute sufficient stats suf_stat <- compute_sufficient_statistics(y = y, z = z, K = K, x_data = x_data) v_r <- numeric(p) #indicates the non-zero values of Lambdas for( r in 1:p ){ v_r[r] <- min(r,q) } # now simulate mu and Lambda f2 <- compute_A_B_G_D_and_simulate_mu_Lambda(SigmaINV = SigmaINV, suff_statistics = suf_stat, OmegaINV = OmegaINV, K = K, priorConst1 = priorConst1, T_INV = T_INV, v_r = v_r) # f2$mu contains the simulated means # f2$Lambdas contains the simulated factor loadings
library('fabMix') data(waveDataset1500) x_data <- scale(as.matrix(waveDataset1500[ 1:20, -1])) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes T_INV <- array(data = 0, dim = c(p,p)) diag(T_INV) <- diag(var(x_data)) diag(T_INV) <- 1/diag(T_INV) ksi <- colMeans(x_data) priorConst1 <- array(diag(T_INV)*ksi, dim =c(p,1)) # give some arbitrary values to the parameters: set.seed(1) mu <- array( runif(K * p), dim = c(K,p) ) y <- array(rnorm(n = q*n), dim = c(n,q)) SigmaINV <- array(data = 0, dim = c(p,p) ) diag(SigmaINV) <- 0.5 + 0.5*runif(p) OmegaINV <- diag(q) # compute sufficient stats suf_stat <- compute_sufficient_statistics(y = y, z = z, K = K, x_data = x_data) v_r <- numeric(p) #indicates the non-zero values of Lambdas for( r in 1:p ){ v_r[r] <- min(r,q) } # now simulate mu and Lambda f2 <- compute_A_B_G_D_and_simulate_mu_Lambda(SigmaINV = SigmaINV, suff_statistics = suf_stat, OmegaINV = OmegaINV, K = K, priorConst1 = priorConst1, T_INV = T_INV, v_r = v_r) # f2$mu contains the simulated means # f2$Lambdas contains the simulated factor loadings
This function simulates and
for the CCU model.
compute_A_B_G_D_and_simulate_mu_Lambda_CCU(SigmaINV, suff_statistics, OmegaINV, K, priorConst1, T_INV, v_r)
compute_A_B_G_D_and_simulate_mu_Lambda_CCU(SigmaINV, suff_statistics, OmegaINV, K, priorConst1, T_INV, v_r)
SigmaINV |
Precision matrix |
suff_statistics |
Sufficient statistics |
OmegaINV |
Prior parameter: |
K |
Number of overfitting mixture components |
priorConst1 |
Prior constant: |
T_INV |
Prior parameter: |
v_r |
This vector is used to set to zero the upper right |
A list containing a draw from the conditional distributions of and
:
Lambdas |
|
mu |
|
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- scale(as.matrix(waveDataset1500[ 1:20, -1])) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) mu <- array( runif(K * p), dim = c(K,p) ) y <- array(rnorm(n = q*n), dim = c(n,q)) SigmaINV <- array(data = 0, dim = c(p,p) ) diag(SigmaINV) = 0.5 + 0.5*runif(p) OmegaINV <- diag(q) # compute sufficient stats suf_stat <- compute_sufficient_statistics_given_mu(y = y, z = z, K = K, x_data = x_data, mu = mu) v_r <- numeric(p) #indicates the non-zero values of Lambdas for( r in 1:p ){ v_r[r] <- min(r,q) } T_INV <- array(data = 0, dim = c(p,p)) diag(T_INV) <- diag(var(x_data)) diag(T_INV) <- 1/diag(T_INV) ksi <- colMeans(x_data) priorConst1 <- array(diag(T_INV)*ksi, dim =c(p,1)) # now simulate mu and Lambda f2 <- compute_A_B_G_D_and_simulate_mu_Lambda_CCU(SigmaINV = SigmaINV, suff_statistics = suf_stat, OmegaINV = OmegaINV, K = K, priorConst1 = priorConst1, T_INV = T_INV, v_r = v_r) # f2$mu contains the simulated means # f2$Lambdas contains the simulated factor loadings
library('fabMix') data(waveDataset1500) x_data <- scale(as.matrix(waveDataset1500[ 1:20, -1])) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) mu <- array( runif(K * p), dim = c(K,p) ) y <- array(rnorm(n = q*n), dim = c(n,q)) SigmaINV <- array(data = 0, dim = c(p,p) ) diag(SigmaINV) = 0.5 + 0.5*runif(p) OmegaINV <- diag(q) # compute sufficient stats suf_stat <- compute_sufficient_statistics_given_mu(y = y, z = z, K = K, x_data = x_data, mu = mu) v_r <- numeric(p) #indicates the non-zero values of Lambdas for( r in 1:p ){ v_r[r] <- min(r,q) } T_INV <- array(data = 0, dim = c(p,p)) diag(T_INV) <- diag(var(x_data)) diag(T_INV) <- 1/diag(T_INV) ksi <- colMeans(x_data) priorConst1 <- array(diag(T_INV)*ksi, dim =c(p,1)) # now simulate mu and Lambda f2 <- compute_A_B_G_D_and_simulate_mu_Lambda_CCU(SigmaINV = SigmaINV, suff_statistics = suf_stat, OmegaINV = OmegaINV, K = K, priorConst1 = priorConst1, T_INV = T_INV, v_r = v_r) # f2$mu contains the simulated means # f2$Lambdas contains the simulated factor loadings
This function simulates and
for the CUU model.
compute_A_B_G_D_and_simulate_mu_Lambda_CUU(SigmaINV, suff_statistics, OmegaINV, K, priorConst1, T_INV, v_r)
compute_A_B_G_D_and_simulate_mu_Lambda_CUU(SigmaINV, suff_statistics, OmegaINV, K, priorConst1, T_INV, v_r)
SigmaINV |
Precision matrix |
suff_statistics |
Sufficient statistics |
OmegaINV |
Prior parameter: |
K |
Number of overfitting mixture components |
priorConst1 |
Prior constant: |
T_INV |
Prior parameter: |
v_r |
This vector is used to set to zero the upper right |
A list containing a draw from the conditional distributions of and
:
Lambdas |
|
mu |
|
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- scale(as.matrix(waveDataset1500[ 1:20, -1])) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) mu <- array( runif(K * p), dim = c(K,p) ) y <- array(rnorm(n = q*n), dim = c(n,q)) SigmaINV <- array(data = 0, dim = c(K,p,p) ) for(k in 1:K){ diag(SigmaINV[k,,]) <- 0.5 + 0.5*runif(p) } OmegaINV <- diag(q) # compute sufficient stats suf_stat <- compute_sufficient_statistics_given_mu(y = y, z = z, K = K, x_data = x_data, mu = mu) v_r <- numeric(p) #indicates the non-zero values of Lambdas for( r in 1:p ){ v_r[r] <- min(r,q) } T_INV <- array(data = 0, dim = c(p,p)) diag(T_INV) <- diag(var(x_data)) diag(T_INV) <- 1/diag(T_INV) ksi <- colMeans(x_data) priorConst1 <- array(diag(T_INV)*ksi, dim =c(p,1)) # now simulate mu and Lambda f2 <- compute_A_B_G_D_and_simulate_mu_Lambda_CUU(SigmaINV = SigmaINV, suff_statistics = suf_stat, OmegaINV = OmegaINV, K = K, priorConst1 = priorConst1, T_INV = T_INV, v_r = v_r) # f2$mu contains the simulated means # f2$Lambdas contains the simulated factor loadings
library('fabMix') data(waveDataset1500) x_data <- scale(as.matrix(waveDataset1500[ 1:20, -1])) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) mu <- array( runif(K * p), dim = c(K,p) ) y <- array(rnorm(n = q*n), dim = c(n,q)) SigmaINV <- array(data = 0, dim = c(K,p,p) ) for(k in 1:K){ diag(SigmaINV[k,,]) <- 0.5 + 0.5*runif(p) } OmegaINV <- diag(q) # compute sufficient stats suf_stat <- compute_sufficient_statistics_given_mu(y = y, z = z, K = K, x_data = x_data, mu = mu) v_r <- numeric(p) #indicates the non-zero values of Lambdas for( r in 1:p ){ v_r[r] <- min(r,q) } T_INV <- array(data = 0, dim = c(p,p)) diag(T_INV) <- diag(var(x_data)) diag(T_INV) <- 1/diag(T_INV) ksi <- colMeans(x_data) priorConst1 <- array(diag(T_INV)*ksi, dim =c(p,1)) # now simulate mu and Lambda f2 <- compute_A_B_G_D_and_simulate_mu_Lambda_CUU(SigmaINV = SigmaINV, suff_statistics = suf_stat, OmegaINV = OmegaINV, K = K, priorConst1 = priorConst1, T_INV = T_INV, v_r = v_r) # f2$mu contains the simulated means # f2$Lambdas contains the simulated factor loadings
.
This function simulates .
compute_A_B_G_D_and_simulate_mu_Lambda_q0(SigmaINV, suff_statistics, K, priorConst1, T_INV, v_r)
compute_A_B_G_D_and_simulate_mu_Lambda_q0(SigmaINV, suff_statistics, K, priorConst1, T_INV, v_r)
SigmaINV |
Precision matrix |
suff_statistics |
Sufficient statistics |
K |
Number of overfitting mixture components |
priorConst1 |
Prior constant: |
T_INV |
Prior parameter: |
v_r |
This vector is used to set to zero the upper right |
A list containing a draw from the conditional distributions of :
mu |
|
Panagiotis Papastamoulis
.
This function simulates .
compute_A_B_G_D_and_simulate_mu_Lambda_q0_sameSigma(SigmaINV, suff_statistics, K, priorConst1, T_INV, v_r)
compute_A_B_G_D_and_simulate_mu_Lambda_q0_sameSigma(SigmaINV, suff_statistics, K, priorConst1, T_INV, v_r)
SigmaINV |
Precision matrix |
suff_statistics |
Sufficient statistics |
K |
Number of overfitting mixture components |
priorConst1 |
Prior constant: |
T_INV |
Prior parameter: |
v_r |
This vector is used to set to zero the upper right |
A list containing a draw from the conditional distributions of :
mu |
|
Panagiotis Papastamoulis
This function simulates and
.
compute_A_B_G_D_and_simulate_mu_Lambda_Sj(SigmaINV, suff_statistics, OmegaINV, K, priorConst1, T_INV, v_r)
compute_A_B_G_D_and_simulate_mu_Lambda_Sj(SigmaINV, suff_statistics, OmegaINV, K, priorConst1, T_INV, v_r)
SigmaINV |
Precision matrix |
suff_statistics |
Sufficient statistics |
OmegaINV |
Prior parameter: |
K |
Number of overfitting mixture components |
priorConst1 |
Prior constant: |
T_INV |
Prior parameter: |
v_r |
This vector is used to set to zero the upper right |
A list containing a draw from the conditional distributions of and
:
Lambdas |
|
mu |
|
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- scale(as.matrix(waveDataset1500[ 1:20, -1])) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) mu <- array( runif(K * p), dim = c(K,p) ) y <- array(rnorm(n = q*n), dim = c(n,q)) SigmaINV <- array(data = 0, dim = c(K,p,p) ) for(k in 1:K){ diag(SigmaINV[k,,]) <- 0.5 + 0.5*runif(p) } OmegaINV <- diag(q) # compute sufficient stats suf_stat <- compute_sufficient_statistics(y = y, z = z, K = K, x_data = x_data) v_r <- numeric(p) #indicates the non-zero values of Lambdas for( r in 1:p ){ v_r[r] <- min(r,q) } T_INV <- array(data = 0, dim = c(p,p)) diag(T_INV) <- diag(var(x_data)) diag(T_INV) <- 1/diag(T_INV) ksi <- colMeans(x_data) priorConst1 <- array(diag(T_INV)*ksi, dim =c(p,1)) # now simulate mu and Lambda f2 <- compute_A_B_G_D_and_simulate_mu_Lambda_Sj(SigmaINV = SigmaINV, suff_statistics = suf_stat, OmegaINV = OmegaINV, K = K, priorConst1 = priorConst1, T_INV = T_INV, v_r = v_r) # f2$mu contains the simulated means # f2$Lambdas contains the simulated factor loadings
library('fabMix') data(waveDataset1500) x_data <- scale(as.matrix(waveDataset1500[ 1:20, -1])) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) mu <- array( runif(K * p), dim = c(K,p) ) y <- array(rnorm(n = q*n), dim = c(n,q)) SigmaINV <- array(data = 0, dim = c(K,p,p) ) for(k in 1:K){ diag(SigmaINV[k,,]) <- 0.5 + 0.5*runif(p) } OmegaINV <- diag(q) # compute sufficient stats suf_stat <- compute_sufficient_statistics(y = y, z = z, K = K, x_data = x_data) v_r <- numeric(p) #indicates the non-zero values of Lambdas for( r in 1:p ){ v_r[r] <- min(r,q) } T_INV <- array(data = 0, dim = c(p,p)) diag(T_INV) <- diag(var(x_data)) diag(T_INV) <- 1/diag(T_INV) ksi <- colMeans(x_data) priorConst1 <- array(diag(T_INV)*ksi, dim =c(p,1)) # now simulate mu and Lambda f2 <- compute_A_B_G_D_and_simulate_mu_Lambda_Sj(SigmaINV = SigmaINV, suff_statistics = suf_stat, OmegaINV = OmegaINV, K = K, priorConst1 = priorConst1, T_INV = T_INV, v_r = v_r) # f2$mu contains the simulated means # f2$Lambdas contains the simulated factor loadings
Compute sufficient statistics given and
.
compute_sufficient_statistics(y, z, K, x_data)
compute_sufficient_statistics(y, z, K, x_data)
y |
|
z |
Allocation vector |
K |
Number of components |
x_data |
|
A list with six entries of sufficient statistics.
cluster_size |
Integer vector of length |
sx |
|
sy |
|
sxx |
Not used |
syy |
|
sxy |
|
Panagiotis Papastamoulis
data(waveDataset1500) x_data <- as.matrix(waveDataset1500[ 1:20, -1]) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) y <- array(rnorm(n = q*n), dim = c(n,q)) # compute sufficient stats suf_stat <- compute_sufficient_statistics(y = y, z = z, K = K, x_data = x_data)
data(waveDataset1500) x_data <- as.matrix(waveDataset1500[ 1:20, -1]) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) y <- array(rnorm(n = q*n), dim = c(n,q)) # compute sufficient stats suf_stat <- compute_sufficient_statistics(y = y, z = z, K = K, x_data = x_data)
Compute sufficient statistics given ,
and
.
compute_sufficient_statistics_given_mu(y, z, K, x_data,mu)
compute_sufficient_statistics_given_mu(y, z, K, x_data,mu)
y |
|
z |
Allocation vector |
K |
Number of components |
x_data |
|
mu |
|
A list with six entries of sufficient statistics.
cluster_size |
Integer vector of length |
sx |
|
sy |
|
sxx |
Not used |
syy |
|
sxy |
|
Panagiotis Papastamoulis
data(waveDataset1500) x_data <- as.matrix(waveDataset1500[ 1:20, -1]) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) mu <- array( runif(K * p), dim = c(K,p) ) y <- array(rnorm(n = q*n), dim = c(n,q)) # compute sufficient stats suf_stat <- compute_sufficient_statistics_given_mu(y = y, z = z, K = K, x_data = x_data, mu = mu)
data(waveDataset1500) x_data <- as.matrix(waveDataset1500[ 1:20, -1]) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) mu <- array( runif(K * p), dim = c(K,p) ) y <- array(rnorm(n = q*n), dim = c(n,q)) # compute sufficient stats suf_stat <- compute_sufficient_statistics_given_mu(y = y, z = z, K = K, x_data = x_data, mu = mu)
Compute sufficient statistics given .
compute_sufficient_statistics_q0(z, K, x_data)
compute_sufficient_statistics_q0(z, K, x_data)
z |
Allocation vector |
K |
Number of components |
x_data |
Data |
A list with six entries of sufficient statistics.
cluster_size |
Integer vector of length |
sx |
|
sy |
Not used here |
sxx |
Not used |
syy |
Not used here |
sxy |
Not used here |
Panagiotis Papastamoulis
data(waveDataset1500) x_data <- as.matrix(waveDataset1500[ 1:20, -1]) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # compute sufficient stats suf_stat <- compute_sufficient_statistics_q0( z = z, K = K, x_data = x_data)
data(waveDataset1500) x_data <- as.matrix(waveDataset1500[ 1:20, -1]) # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] n <- dim(x_data)[1] q <- 2 K <- length(table(z)) # 3 classes # compute sufficient stats suf_stat <- compute_sufficient_statistics_q0( z = z, K = K, x_data = x_data)
Compute quantiles for the correlation matrix per cluster based on the MCMC output stored in a fabMix.object
.
CorMat_mcmc_summary(x, quantile_probs)
CorMat_mcmc_summary(x, quantile_probs)
x |
An object of class |
quantile_probs |
Vector of probabilities for computing the corresponding quantiles. |
quantiles |
A list containing the quantiles for the correlation matrix per component. Each element is a |
p_matrix |
A |
Panagiotis Papastamoulis
Compute quantiles for the covariance matrix per cluster based on the MCMC output stored in a fabMix.object
.
CovMat_mcmc_summary(x, quantile_probs)
CovMat_mcmc_summary(x, quantile_probs)
x |
An object of class |
quantile_probs |
Vector of probabilities for computing the corresponding quantiles. |
A list containing the quantiles for the covariance matrix per component. Each element is a array, where
and
denote the dimension of the multivariate data and number of alive components for the selected model, respectively.
Panagiotis Papastamoulis
This functions is a wrapper for the label.switching
package and applies the ECR
and ECR.ITERATIVE.1
algorithms. The model may have the same variance of error terms per cluster or not.
dealWithLabelSwitching(sameSigma, x_data, outputFolder, q, burn, z.true, compute_regularized_expression, Km)
dealWithLabelSwitching(sameSigma, x_data, outputFolder, q, burn, z.true, compute_regularized_expression, Km)
sameSigma |
Logical value indicating whether the parameterization with the same error variance per cluster is used. |
x_data |
Data |
outputFolder |
Name of the folder where the |
q |
Number of factors |
burn |
Discard observations as burn-in period (optional). |
z.true |
An (optional) vector of cluster assignments which is considered as the groun-truth clustering of the data. Useful for direct comparisons against the real parameter values in simulated data. |
compute_regularized_expression |
Logical. Should regularized expression be computed? |
Km |
Number of components in the overfitted mixture model. |
The following files are produced in the output folder:
Panagiotis Papastamoulis
Papastamoulis, P. (2016). label.switching
: An R Package for Dealing with the Label Switching Problem in MCMC Outputs. Journal of Statistical Software, 69(1), 1-24.
This function runs parallel chains under a prior tempering scheme of the Dirichlet prior distribution of mixture weights.
fabMix(model, nChains, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, rmDir, parallelModels, lowerTriangular)
fabMix(model, nChains, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, rmDir, parallelModels, lowerTriangular)
model |
Any subset of "UUU" "CUU" "UCU" "CCU" "UCC" "UUC" "CUC", "CCC" indicating the fitted models. By default, all models are fitted. |
nChains |
The number of parallel heated chains. When 'dirPriorAlphas' is supplied, 'nChains' can be ignored. |
dirPriorAlphas |
vector of length |
rawData |
The observed data as an |
outDir |
Name of the output folder. An error is thrown if the directory already exists inside the current working directory. Note: it should NOT correspond to an absolute path, e.g.: |
Kmax |
Number of components in the overfitted mixture. Default: 20. |
mCycles |
Number of MCMC cycles. Each cycle consists of |
burnCycles |
Number of cycles that will be discarded as burn-in period. |
g |
Prior parameter |
h |
Prior parameter |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
q |
A vector containing the number of factors to be fitted. |
normalize |
Should the observed data be normalized? Default value: TRUE. (Recommended) |
thinning |
Optional integer denoting the thinning of the keeped MCMC cycles. |
zStart |
Optional starting value for the allocation vector. |
nIterPerCycle |
Number of iteration per MCMC cycle. Default value: 10. |
gibbs_z |
Select the gibbs sampling scheme for updating latent allocations of mixture model. Default value: 1. |
warm_up_overfitting |
Number of iterations for the overfitting initialization scheme. Default value: 500. |
warm_up |
Number of iterations that will be used to initialize the models before starting proposing switchings. Default value: 5000. |
overfittingInitialization |
Logical value indicating whether the chains are initialized via the overfitting initialization scheme. Default: TRUE. |
progressGraphs |
Logical value indicating whether to plot successive states of the chains while the sampler runs. Default: FALSE. |
gwar |
Initialization parameter. Default: 0.05. |
rmDir |
Logical value indicating whether to delete the |
parallelModels |
Model-level parallelization: An optional integer specifying the number of cores that will be used in order to fit in parallel each member of |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
Let ;
denote independent
-dimensional random vectors. Let
with
denote the latent factor for observation
. In the typical factor analysis model, each observation is modelled as
, with
, where
and
are assumed independent,
. The
matrix
consists of the factor loadings. Assume that there are
clusters and let
denotes the latent allocation of observation
to one amongs the
clusters, with prior probability
,
, independent for
. Following McNicholas et al (2008), the following parameterizations are used:
UUU model: , with
UCU model: , with
UUC model: , with
UCC model: , with
CUU model: , with
CCU model: , with
CUC model: , with
CCC model: , with
In all cases, and
are assumed independent,
. Note that
and
denote positive definite matrices,
denotes the
identity matrix and
,
denote positive scalars.
An object of class fabMix.object
, that is, a list consisting of the following entries:
bic |
Bayesian Information Criterion per model and number of factors. |
class |
The estimated single best clustering of the observations according to the selected model. |
n_Clusters_per_model |
The most probable number of clusters (number of non-empty components of the overfitted mixture) per model and number of factors. |
posterior_probability |
The posterior probability of the estimated allocations according to the selected model. |
covariance_matrix |
The estimated posterior mean of the covariance matrix per cluster according to the selected model. |
mu |
The estimated posterior mean of the mean per cluster according to the selected model. |
weights |
The estimated posterior mean of the mixing proportions according to the selected model. |
selected_model |
Data frame containing the parameterization, number of clusters and factors of the selected model. |
mcmc |
A list containing the MCMC draws for the parameters of the selected model. Each entry is returned as an |
data |
The observed data. |
regularizedExpression |
The regularized expressions of variable scores to each factor per cluster (see Papastamoulis 2018, CSDA). |
Kmap_prob |
The posterior probability of the Maximum A Posteriori number of alive clusters for each parameterization and factor level. |
It is recommended to use: normalize = TRUE
(default). Tuning of dirPriorAlphas
may be necessary to achieve reasonable acceptance rates of chain swaps. Note that the output is reordered in order to deal with the label switching problem, according to the ECR algorithm applied by dealWithLabelSwitching
function.
Parallelization is enabled in both the chain-level as well as the model-level. By default all heated chains (specified by the nchains
argument) run in parallel using (at most) the same number of threads (if available). If parallelModels = NULL
(default), then the selected parameterizations will run (serially) on the same thread. Otherwise, if parallelModels = x
(where x
denotes a positive integer), the algorithm will first use x
threads to fit the specified model parameterizations in parallel, and furthermore will also parallelize the heated chains (according to the remaining free cores on the user's system). The user should combine parallelModels
with nChains
efficiently, for example: if the number of available threads equals 12 and the user wishes to run 3 heated chains per model (recall that there are 8 parameterizations in total), then, an ideal allocation would be parallelModels = 4
and nChains = 3
because all available threads will be constantly busy. If the user wishes to run nChains = 4
heated chains per model using 12 threads, then an ideal allocation would be parallelModels = 3
models running in parallel. In the case where parallelModels*nChains
> m
, with m
denoting the available number of threads, the algorithm will first allocate min(parallelModels
, m
) threads to run the same number of parameterizations in parallel, and then the remaining threads (if any) will be used to process the parallel heated chains. If no other threads are available, the heated chains will be allocated to single threads.
Panagiotis Papastamoulis
Martyn Plummer, Nicky Best, Kate Cowles and Karen Vines (2006). CODA: Convergence Diagnosis and Output Analysis for MCMC, R News, vol 6, 7-11.
McNicholas, P.D. and Murphy, T.B. Stat Comput (2008) 18: 285. https://doi.org/10.1007/s11222-008-9056-0.
Papastamoulis, P. (2018). Overfitting Bayesian mixtures of factor analyzers with an unknown number of components. Computational Statistics and Data Analysis, 124: 220-234. DOI: 10.1016/j.csda.2018.03.007.
Papastamoulis, P (2019). Clustering multivariate data using factor analytic Bayesian mixtures with an unknown number of components. Statistics and Computing, doi: 10.1007/s11222-019-09891-z.
# TOY EXAMPLE (very small numbers... only for CRAN check purposes) ################################################################# # (a) using 2 cores in parallel, each one running 2 heated chains. ################################################################# library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) # Run `fabMix` for a _small_ number of iterations # but considering only the `UUU` (maximal model), # using the default prior parallel heating parameters `dirPriorAlphas`. # NOTE: `dirPriorAlphas` may require some tuning in general. qRange <- 2 # values for the number of factors (only the true number # is considered here) Kmax <- 2 # number of components for the overfitted mixture model nChains <- 2 # number of parallel heated chains set.seed(1) fm <- fabMix( model = "UUU", nChains = nChains, rawData = syntheticDataset$data, outDir = "toyExample", Kmax = Kmax, mCycles = 4, burnCycles = 1, q = qRange, g = 0.5, h = 0.5, alpha_sigma = 0.5, beta_sigma = 0.5, warm_up_overfitting = 2, warm_up = 5) # WARNING: the following parameters: # Kmax, nChains, mCycles, burnCycles, warm_up_overfitting, warm_up # should take (much) _larger_ values. E.g. a typical implementation consists of: # Kmax = 20, nChains >= 3, mCycles = 1100, burnCycles = 100, # warm_up_overfitting = 500, warm_up = 5000. # You may also print and plot # print(fm) # plot(fm) ################################################################# # (b) using 12 cores_____________________________________________ #_______4 models with 3 heated chains running in parallel________ #_______considering all 8 model parameterizations________________ ################################################################# ## Not run: library('fabMix') set.seed(99) n = 200 # sample size p = 30 # number of variables q = 2 # number of factors K = 5 # number of clusters sINV_diag = rep(1/20,p) # diagonal of inverse variance of errors syntheticDataset <- simData(sameLambda=FALSE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) qRange <- 1:3 # range of values for the number of factors Kmax <- 20 # number of components for the overfitted mixture model nChains <- 3 # number of parallel heated chains # the next command takes ~ 2 hours in a Linux machine with 12 threads. fm <- fabMix( parallelModels = 4, nChains = nChains, model = c("UUU","CUU","UCU","CCU","UCC","UUC","CUC","CCC"), rawData = syntheticDataset$data, outDir = "toyExample_b", Kmax = Kmax, mCycles = 1100, burnCycles = 100, q = qRange) print(fm) plot(fm, what = "BIC") # see also # plot(fm); summary(fm) ## End(Not run)
# TOY EXAMPLE (very small numbers... only for CRAN check purposes) ################################################################# # (a) using 2 cores in parallel, each one running 2 heated chains. ################################################################# library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) # Run `fabMix` for a _small_ number of iterations # but considering only the `UUU` (maximal model), # using the default prior parallel heating parameters `dirPriorAlphas`. # NOTE: `dirPriorAlphas` may require some tuning in general. qRange <- 2 # values for the number of factors (only the true number # is considered here) Kmax <- 2 # number of components for the overfitted mixture model nChains <- 2 # number of parallel heated chains set.seed(1) fm <- fabMix( model = "UUU", nChains = nChains, rawData = syntheticDataset$data, outDir = "toyExample", Kmax = Kmax, mCycles = 4, burnCycles = 1, q = qRange, g = 0.5, h = 0.5, alpha_sigma = 0.5, beta_sigma = 0.5, warm_up_overfitting = 2, warm_up = 5) # WARNING: the following parameters: # Kmax, nChains, mCycles, burnCycles, warm_up_overfitting, warm_up # should take (much) _larger_ values. E.g. a typical implementation consists of: # Kmax = 20, nChains >= 3, mCycles = 1100, burnCycles = 100, # warm_up_overfitting = 500, warm_up = 5000. # You may also print and plot # print(fm) # plot(fm) ################################################################# # (b) using 12 cores_____________________________________________ #_______4 models with 3 heated chains running in parallel________ #_______considering all 8 model parameterizations________________ ################################################################# ## Not run: library('fabMix') set.seed(99) n = 200 # sample size p = 30 # number of variables q = 2 # number of factors K = 5 # number of clusters sINV_diag = rep(1/20,p) # diagonal of inverse variance of errors syntheticDataset <- simData(sameLambda=FALSE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) qRange <- 1:3 # range of values for the number of factors Kmax <- 20 # number of components for the overfitted mixture model nChains <- 3 # number of parallel heated chains # the next command takes ~ 2 hours in a Linux machine with 12 threads. fm <- fabMix( parallelModels = 4, nChains = nChains, model = c("UUU","CUU","UCU","CCU","UCC","UUC","CUC","CCC"), rawData = syntheticDataset$data, outDir = "toyExample_b", Kmax = Kmax, mCycles = 1100, burnCycles = 100, q = qRange) print(fm) plot(fm, what = "BIC") # see also # plot(fm); summary(fm) ## End(Not run)
CUC
and CCC
models
This function runs parallel chains under a prior tempering scheme of the Dirichlet prior distribution of mixture weights.
fabMix_CxC(sameSigma, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, cccStart, lowerTriangular)
fabMix_CxC(sameSigma, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, cccStart, lowerTriangular)
sameSigma |
Logical value denoting the parameterization of the error variance per component. If |
dirPriorAlphas |
The prior Dirichlet parameters for each chain. |
rawData |
The observed data as an |
outDir |
Name of the output folder. |
Kmax |
Number of components in the overfitted mixture. Default: 20. |
mCycles |
Number of MCMC cycles. Each cycle consists of |
burnCycles |
Number of cycles that will be discarded as burn-in period. |
g |
Prior parameter |
h |
Prior parameter |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
q |
Number of factors |
normalize |
Should the observed data be normalized? Default value: TRUE. |
thinning |
Optional integer denoting the thinning of the keeped MCMC cycles. |
zStart |
Optional starting value for the allocation vector. |
nIterPerCycle |
Number of iteration per MCMC cycle. Default value: 10. |
gibbs_z |
Select the gibbs sampling scheme for updating latent allocations of mixture model. Default value: 1. |
warm_up_overfitting |
Number of iterations for the overfitting initialization scheme. Default value: 100. |
warm_up |
Number of iterations that will be used to initialize the models before starting proposing switchings. Default value: 500. |
overfittingInitialization |
Logical value indicating whether the chains are initialized via the overfitting initialization scheme. Default: TRUE. |
progressGraphs |
Logical value indicating whether to plot successive states of the chains while the sampler runs. Default: FALSE. |
gwar |
Initialization parameter. Default: 0.05. |
cccStart |
Initialization from the CCC model. |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
List of files written to outDir
It is recommended to always use: normalize = TRUE
(default). Tuning of dirPriorAlphas
may be necessary to achieve reasonable acceptance rates of chain swaps. Also note that the output is not identifiable due to label switching and the user has to subsequently call the dealWithLabelSwitching
function. See the fabMix
function for examples.
Panagiotis Papastamoulis
CCU
and CUU
models
This function runs parallel chains under a prior tempering scheme of the Dirichlet prior distribution of mixture weights.
fabMix_CxU(sameSigma, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, lowerTriangular)
fabMix_CxU(sameSigma, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, lowerTriangular)
sameSigma |
Logical value denoting the parameterization of the error variance per component. If |
dirPriorAlphas |
The prior Dirichlet parameters for each chain. |
rawData |
The observed data as an |
outDir |
Name of the output folder. |
Kmax |
Number of components in the overfitted mixture. Default: 20. |
mCycles |
Number of MCMC cycles. Each cycle consists of |
burnCycles |
Number of cycles that will be discarded as burn-in period. |
g |
Prior parameter |
h |
Prior parameter |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
q |
Number of factors |
normalize |
Should the observed data be normalized? Default value: TRUE. |
thinning |
Optional integer denoting the thinning of the keeped MCMC cycles. |
zStart |
Optional starting value for the allocation vector. |
nIterPerCycle |
Number of iteration per MCMC cycle. Default value: 10. |
gibbs_z |
Select the gibbs sampling scheme for updating latent allocations of mixture model. Default value: 1. |
warm_up_overfitting |
Number of iterations for the overfitting initialization scheme. Default value: 100. |
warm_up |
Number of iterations that will be used to initialize the models before starting proposing switchings. Default value: 500. |
overfittingInitialization |
Logical value indicating whether the chains are initialized via the overfitting initialization scheme. Default: TRUE. |
progressGraphs |
Logical value indicating whether to plot successive states of the chains while the sampler runs. Default: FALSE. |
gwar |
Initialization parameter. Default: 0.05. |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
List of files written to outDir
It is recommended to always use: normalize = TRUE
(default). Tuning of dirPriorAlphas
may be necessary to achieve reasonable acceptance rates of chain swaps. Also note that the output is not identifiable due to label switching and the user has to subsequently call the dealWithLabelSwitching
function. See the fabMix
function for examples.
Panagiotis Papastamoulis
This function runs parallel chains under a prior tempering scheme of the Dirichlet prior distribution of mixture weights. Missing values are simulated from their full conditional posterior distribution.
fabMix_missing_values(sameSigma, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up, progressGraphs, gwar, lowerTriangular)
fabMix_missing_values(sameSigma, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up, progressGraphs, gwar, lowerTriangular)
sameSigma |
Logical value denoting the parameterization of the error variance per component. If |
dirPriorAlphas |
The prior Dirichlet parameters for each chain. |
rawData |
The observed data as an |
outDir |
Name of the output folder. |
Kmax |
Number of components in the overfitted mixture. Default: 20. |
mCycles |
Number of MCMC cycles. Each cycle consists of |
burnCycles |
Number of cycles that will be discarded as burn-in period. |
g |
Prior parameter |
h |
Prior parameter |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
q |
Number of factors |
normalize |
Should the observed data be normalized? Default value: TRUE. |
thinning |
Optional integer denoting the thinning of the keeped MCMC cycles. |
zStart |
Optional starting value for the allocation vector. |
nIterPerCycle |
Number of iteration per MCMC cycle. Default value: 10. |
gibbs_z |
Select the gibbs sampling scheme for updating latent allocations of mixture model. Default value: 1. |
warm_up |
NUmber of iterations that will be used to initialize the models before starting proposing switchings. Default value: 500. |
progressGraphs |
Logical value indicating whether to plot successive states of the chains while the sampler runs. Default: FALSE. |
gwar |
Initialization parameter. Default: 0.05. |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
List of files written to outDir
It is recommended to always use: normalize = TRUE
(default). Tuning of dirPriorAlphas
may be necessary to achieve reasonable acceptance rates of chain swaps. Also note that the output is not identifiable due to label switching and the user has to subsequently call the dealWithLabelSwitching
function.
Panagiotis Papastamoulis
This function runs multiple copies of the fabMix
function in parallel.
fabMix_parallelModels(model, nChains, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, rmDir, parallelModels, lowerTriangular)
fabMix_parallelModels(model, nChains, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, rmDir, parallelModels, lowerTriangular)
model |
Any subset of "UUU" "CUU" "UCU" "CCU" "UCC" "UUC" "CUC", "CCC" indicating the fitted models. |
nChains |
The number of parallel heated chains. When 'dirPriorAlphas' is supplied, 'nChains' can be ignored. |
dirPriorAlphas |
vector of length |
rawData |
The observed data as an |
outDir |
Name of the output folder. An error is thrown if this directory already exists. |
Kmax |
Number of components in the overfitted mixture. Default: 20. |
mCycles |
Number of MCMC cycles. Each cycle consists of |
burnCycles |
Number of cycles that will be discarded as burn-in period. |
g |
Prior parameter |
h |
Prior parameter |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
q |
A vector containing the number of factors to be fitted. |
normalize |
Should the observed data be normalized? Default value: TRUE. (Recommended) |
thinning |
Optional integer denoting the thinning of the keeped MCMC cycles. |
zStart |
Optional starting value for the allocation vector. |
nIterPerCycle |
Number of iteration per MCMC cycle. Default value: 10. |
gibbs_z |
Select the gibbs sampling scheme for updating latent allocations of mixture model. Default value: 1. |
warm_up_overfitting |
Number of iterations for the overfitting initialization scheme. Default value: 500. |
warm_up |
Number of iterations that will be used to initialize the models before starting proposing switchings. Default value: 5000. |
overfittingInitialization |
Logical value indicating whether the chains are initialized via the overfitting initialization scheme. Default: TRUE. |
progressGraphs |
Logical value indicating whether to plot successive states of the chains while the sampler runs. Default: FALSE. |
gwar |
Initialization parameter. Default: 0.05. |
rmDir |
Logical value indicating whether to delete the |
parallelModels |
Model-level parallelization: An optional integer specifying the number of cores that will be used in order to fit in parallel each member of |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
An object of class fabMix.object
(see the fabMix
function).
See the fabMix
function for examples.
Panagiotis Papastamoulis
UUC
and UCC
models
This function runs parallel chains under a prior tempering scheme of the Dirichlet prior distribution of mixture weights.
fabMix_UxC(sameSigma, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, lowerTriangular)
fabMix_UxC(sameSigma, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, lowerTriangular)
sameSigma |
Logical value denoting the parameterization of the error variance per component. If |
dirPriorAlphas |
The prior Dirichlet parameters for each chain. |
rawData |
The observed data as an |
outDir |
Name of the output folder. |
Kmax |
Number of components in the overfitted mixture. Default: 20. |
mCycles |
Number of MCMC cycles. Each cycle consists of |
burnCycles |
Number of cycles that will be discarded as burn-in period. |
g |
Prior parameter |
h |
Prior parameter |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
q |
Number of factors |
normalize |
Should the observed data be normalized? Default value: TRUE. |
thinning |
Optional integer denoting the thinning of the keeped MCMC cycles. |
zStart |
Optional starting value for the allocation vector. |
nIterPerCycle |
Number of iteration per MCMC cycle. Default value: 10. |
gibbs_z |
Select the gibbs sampling scheme for updating latent allocations of mixture model. Default value: 1. |
warm_up_overfitting |
Number of iterations for the overfitting initialization scheme. Default value: 100. |
warm_up |
Number of iterations that will be used to initialize the models before starting proposing switchings. Default value: 500. |
overfittingInitialization |
Logical value indicating whether the chains are initialized via the overfitting initialization scheme. Default: TRUE. |
progressGraphs |
Logical value indicating whether to plot successive states of the chains while the sampler runs. Default: FALSE. |
gwar |
Initialization parameter. Default: 0.05. |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
List of files written to outDir
It is recommended to always use: normalize = TRUE
(default). Tuning of dirPriorAlphas
may be necessary to achieve reasonable acceptance rates of chain swaps. Also note that the output is not identifiable due to label switching and the user has to subsequently call the dealWithLabelSwitching
function. See the fabMix
function for examples.
Panagiotis Papastamoulis
UUU
and UCU
model
This function runs parallel chains under a prior tempering scheme of the Dirichlet prior distribution of mixture weights.
fabMix_UxU(sameSigma, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, lowerTriangular)
fabMix_UxU(sameSigma, dirPriorAlphas, rawData, outDir, Kmax, mCycles, burnCycles, g, h, alpha_sigma, beta_sigma, q, normalize, thinning, zStart, nIterPerCycle, gibbs_z, warm_up_overfitting, warm_up, overfittingInitialization, progressGraphs, gwar, lowerTriangular)
sameSigma |
Logical value denoting the parameterization of the error variance per component. If |
dirPriorAlphas |
The prior Dirichlet parameters for each chain. |
rawData |
The observed data as an |
outDir |
Name of the output folder. |
Kmax |
Number of components in the overfitted mixture. Default: 20. |
mCycles |
Number of MCMC cycles. Each cycle consists of |
burnCycles |
Number of cycles that will be discarded as burn-in period. |
g |
Prior parameter |
h |
Prior parameter |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
q |
Number of factors |
normalize |
Should the observed data be normalized? Default value: TRUE. |
thinning |
Optional integer denoting the thinning of the keeped MCMC cycles. |
zStart |
Optional starting value for the allocation vector. |
nIterPerCycle |
Number of iteration per MCMC cycle. Default value: 10. |
gibbs_z |
Select the gibbs sampling scheme for updating latent allocations of mixture model. Default value: 1. |
warm_up_overfitting |
Number of iterations for the overfitting initialization scheme. Default value: 100. |
warm_up |
Number of iterations that will be used to initialize the models before starting proposing switchings. Default value: 500. |
overfittingInitialization |
Logical value indicating whether the chains are initialized via the overfitting initialization scheme. Default: TRUE. |
progressGraphs |
Logical value indicating whether to plot successive states of the chains while the sampler runs. Default: FALSE. |
gwar |
Initialization parameter. Default: 0.05. |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
List of files written to outDir
It is recommended to always use: normalize = TRUE
(default). Tuning of dirPriorAlphas
may be necessary to achieve reasonable acceptance rates of chain swaps. Also note that the output is not identifiable due to label switching and the user has to subsequently call the dealWithLabelSwitching
function. See the fabMix
function for examples.
Panagiotis Papastamoulis
This function computes four information criteria for a given run of the fabMix
algorithm, namely: AIC, BIC, DIC and . Given various runs with different number of factors, the selected model corresponds to the one with the smalled value of the selected criterion.
getStuffForDIC(sameSigma, sameLambda, isotropic, x_data, outputFolder, q, burn, Km, normalize, discardLower)
getStuffForDIC(sameSigma, sameLambda, isotropic, x_data, outputFolder, q, burn, Km, normalize, discardLower)
sameSigma |
Logical value indicating whether the parameterization with the same variance of errors per component is used. Default: TRUE. |
sameLambda |
Logical value indicating whether the parameterization with same loadings per component is used. Default: FALSE. |
isotropic |
Logical value indicating whether the parameterization with isotropic error variance per component is used. Default: FALSE. |
x_data |
Observed data. |
outputFolder |
Name of the folder where the |
q |
Number of factors. Note that this should coincide with the number of factors in the |
burn |
Discard observations as burn-in period (optional). |
Km |
Number of components in the overfitted mixture model. Note that this should coincide with the same entry in the |
normalize |
Should the observed data be normalized? Note that this should coincide with the same entry in the |
discardLower |
Discard draws with log-likelihood values lower than the specific quantile. This applied only for the DIC computation. |
The information criteria are saved to the informationCriteria_map_model.txt
file in the outputFolder
.
It is well known that DIC tends to overfit, so it advised to compare models with different number of factors using AIC or BIC. The main function of the package uses BIC.
Panagiotis Papastamoulis
Log-density function of the Dirichlet distribution
log_dirichlet_pdf(alpha, weights)
log_dirichlet_pdf(alpha, weights)
alpha |
Parameter vector |
weights |
Vector of weights |
Log-density of the evaluated at
.
Panagiotis Papastamoulis
Generate a random draw from the Dirichlet distribution .
myDirichlet(alpha)
myDirichlet(alpha)
alpha |
Parameter vector |
Simulated vector
Panagiotis Papastamoulis
Log-likelihood of the mixture model evaluated only at the alive components.
observed.log.likelihood0(x_data, w, mu, Lambda, Sigma, z)
observed.log.likelihood0(x_data, w, mu, Lambda, Sigma, z)
x_data |
The observed data |
w |
Vector of mixture weights |
mu |
Vector of marginal means |
Lambda |
Factor loadings |
Sigma |
Diagonal of the common covariance matrix of the errors per cluster |
z |
Allocation vector |
Log-likelihood value
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) Lambda <- array( runif(K*p*q), dim = c(K,p,q) ) SigmaINV <- array(1, dim = c(p,p)) Sigma <- 1/diag(SigmaINV) # compute the complete.log.likelihood observed.log.likelihood0(x_data = x_data, w = w, mu = mu, Lambda = Lambda, Sigma = Sigma, z = z)
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) Lambda <- array( runif(K*p*q), dim = c(K,p,q) ) SigmaINV <- array(1, dim = c(p,p)) Sigma <- 1/diag(SigmaINV) # compute the complete.log.likelihood observed.log.likelihood0(x_data = x_data, w = w, mu = mu, Lambda = Lambda, Sigma = Sigma, z = z)
and same variance of errors
Log-likelihood of the mixture model evaluated only at the alive components.
observed.log.likelihood0_q0_sameSigma(x_data, w, mu, Sigma, z)
observed.log.likelihood0_q0_sameSigma(x_data, w, mu, Sigma, z)
x_data |
The observed data |
w |
Vector of mixture weights |
mu |
Vector of marginal means |
Sigma |
Covariance matrix of the errors per cluster |
z |
Allocation vector |
Log-likelihood value
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) SigmaINV <- array(1, dim = c(p,p)) Sigma <- 1/diag(SigmaINV) # compute the complete.log.likelihood observed.log.likelihood0_q0_sameSigma(x_data = x_data, w = w, mu = mu, Sigma = Sigma, z = z)
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) SigmaINV <- array(1, dim = c(p,p)) Sigma <- 1/diag(SigmaINV) # compute the complete.log.likelihood observed.log.likelihood0_q0_sameSigma(x_data = x_data, w = w, mu = mu, Sigma = Sigma, z = z)
Log-likelihood of the mixture model evaluated only at the alive components.
observed.log.likelihood0_Sj(x_data, w, mu, Lambda, Sigma, z)
observed.log.likelihood0_Sj(x_data, w, mu, Lambda, Sigma, z)
x_data |
The observed data |
w |
Vector of mixture weights |
mu |
Vector of marginal means |
Lambda |
Factor loadings |
Sigma |
|
z |
Allocation vector |
Log-likelihood value
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) Lambda <- array( runif(K*p*q), dim = c(K,p,q) ) Sigma <- matrix(1:K, nrow = K, ncol = p) # compute the complete.log.likelihood observed.log.likelihood0_Sj(x_data = x_data, w = w, mu = mu, Lambda = Lambda, Sigma = Sigma, z = z)
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) Lambda <- array( runif(K*p*q), dim = c(K,p,q) ) Sigma <- matrix(1:K, nrow = K, ncol = p) # compute the complete.log.likelihood observed.log.likelihood0_Sj(x_data = x_data, w = w, mu = mu, Lambda = Lambda, Sigma = Sigma, z = z)
Log-likelihood of the mixture model evaluated only at the alive components.
observed.log.likelihood0_Sj_q0(x_data, w, mu, Sigma, z)
observed.log.likelihood0_Sj_q0(x_data, w, mu, Sigma, z)
x_data |
The observed data |
w |
Vector of mixture weights |
mu |
Vector of marginal means |
Sigma |
|
z |
Allocation vector |
Log-likelihood value
Panagiotis Papastamoulis
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) Sigma <- matrix(1:K, nrow = K, ncol = p) # compute the complete.log.likelihood observed.log.likelihood0_Sj_q0(x_data = x_data, w = w, mu = mu, Sigma = Sigma, z = z)
library('fabMix') data(waveDataset1500) x_data <- waveDataset1500[ 1:20, -1] # data z <- waveDataset1500[ 1:20, 1] # class p <- dim(x_data)[2] q <- 2 K <- length(table(z)) # 3 classes # give some arbitrary values to the parameters: set.seed(1) w <- rep(1/K, K) mu <- array( runif(K * p), dim = c(K,p) ) Sigma <- matrix(1:K, nrow = K, ncol = p) # compute the complete.log.likelihood observed.log.likelihood0_Sj_q0(x_data = x_data, w = w, mu = mu, Sigma = Sigma, z = z)
Gibbs sampling for fitting a mixture model with diagonal covariance structure.
overfitting_q0(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfitting_q0(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
List of files
Panagiotis Papastamoulis
and same error variance parameterization
Gibbs sampling for fitting a mixture model with diagonal covariance structure.
overfitting_q0_sameSigma(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfitting_q0_sameSigma(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
List of files
Panagiotis Papastamoulis
UCU
model
Gibbs sampling for fitting a mixture model of factor analyzers.
overfittingMFA(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfittingMFA(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
Set of files written in outputDirectory
.
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
CCC
model
Gibbs sampling for fitting a CCC mixture model of factor analyzers.
overfittingMFA_CCC(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfittingMFA_CCC(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
Set of files written in outputDirectory
.
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_CCC <- overfittingMFA_CCC(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_CCC <- overfittingMFA_CCC(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
CCU
model
Gibbs sampling for fitting a CCU mixture model of factor analyzers.
overfittingMFA_CCU(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfittingMFA_CCU(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
Set of files written in outputDirectory
.
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_CCU(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_CCU(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
CUC
model
Gibbs sampling for fitting a CUC mixture model of factor analyzers.
overfittingMFA_CUC(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfittingMFA_CUC(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
Set of files written in outputDirectory
.
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_CUC(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_CUC(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
CUU
model
Gibbs sampling for fitting a CUU mixture model of factor analyzers.
overfittingMFA_CUU(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfittingMFA_CUU(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
Set of files written in outputDirectory
.
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_CUU(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_CUU(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
Gibbs sampling for fitting a mixture model of factor analyzers.
overfittingMFA_missing_values(missing_entries, x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfittingMFA_missing_values(missing_entries, x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
missing_entries |
list which contains the row number (1st entry) and column indexes (subsequent entries) for every row containing missing values. |
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
List of files
Panagiotis Papastamoulis
UUU
model
Gibbs sampling for fitting a mixture model of factor analyzers.
overfittingMFA_Sj(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfittingMFA_Sj(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
Set of files written in outputDirectory
.
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_Sj(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_Sj(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
Gibbs sampling for fitting a mixture model of factor analyzers.
overfittingMFA_Sj_missing_values(missing_entries, x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfittingMFA_Sj_missing_values(missing_entries, x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
missing_entries |
list which contains the row number (1st entry) and column indexes (subsequent entries) for every row containing missing values. |
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
List of files
Panagiotis Papastamoulis
UCC
model
Gibbs sampling for fitting a UCC mixture model of factor analyzers.
overfittingMFA_UCC(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfittingMFA_UCC(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
Set of files written in outputDirectory
.
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_UCC(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_UCC(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
UUC
model
Gibbs sampling for fitting a UUC mixture model of factor analyzers.
overfittingMFA_UUC(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
overfittingMFA_UUC(x_data, originalX, outputDirectory, Kmax, m, thinning, burn, g, h, alpha_prior, alpha_sigma, beta_sigma, start_values, q, zStart, gibbs_z, lowerTriangular)
x_data |
normalized data |
originalX |
observed raw data (only for plotting purpose) |
outputDirectory |
Name of the output folder |
Kmax |
Number of mixture components |
m |
Number of iterations |
thinning |
Thinning of chain |
burn |
Burn-in period |
g |
Prior parameter |
h |
Prior parameter |
alpha_prior |
Parameters of the Dirichlet prior distribution of mixture weights. |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
start_values |
Optional (not used) |
q |
Number of factors. |
zStart |
Optional (not used) |
gibbs_z |
Optional |
lowerTriangular |
logical value indicating whether a lower triangular parameterization should be imposed on the matrix of factor loadings (if TRUE) or not. Default: TRUE. |
Set of files written in outputDirectory
.
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_UUC(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) colnames(syntheticDataset$data) <- paste0("x_",1:p) Kmax <- 4 # number of components for the overfitted mixture model set.seed(1) overfittingMFA_UUC(x_data = syntheticDataset$data, originalX = syntheticDataset$data, outputDirectory = 'outDir', Kmax = Kmax, m = 5, burn = 1, g = 0.5, h = 0.5, alpha_prior = rep(1, Kmax), alpha_sigma = 0.5, beta_sigma = 0.5, start_values = FALSE, q = 2, gibbs_z = 1) list.files('outDir') unlink('outDir', recursive = TRUE)
This function plots fabMix
function.
## S3 method for class 'fabMix.object' plot(x, what, variableSubset, class_mfrow, sig_correlation, confidence, ...)
## S3 method for class 'fabMix.object' plot(x, what, variableSubset, class_mfrow, sig_correlation, confidence, ...)
x |
An object of class |
what |
One of the "BIC", "classification_matplot", "classification_pairs", "correlation", "factor_loadings". The plot will display the BIC values per model and number of factors (along with the most probable number of clusters as text), a matplot per cluster for the selected model, scatterplots pairs, the estimated correlation matrix per cluster, and the MAP estimate of factor loadings, respectively. |
variableSubset |
An optional subset of the variables. By default, all variables are selected. |
class_mfrow |
An optional integer vector of length 2, that will be used to set the |
sig_correlation |
The “significance-level” for plotting the correlation between variables. Note that this is an estimate of a posterior probability and not a significance level as defined in frequentist statistics. Default value: NULL (all correlations are plotted). |
confidence |
Confidence level(s) for plotting the Highest Density Interval(s) (as shown via |
... |
ignored. |
When the BIC values are plotted, a number indicates the most probable number of “alive” clusters. The pairwise scatterplots (what = "classification_pairs"
) are created using the coordProj
function of the mclust
package. The what = "correlation"
is plotted using the corrplot
package. Note that the what = "classification_matplot"
plots the original data (before scaling and centering). On the other hand, the option what = "classification_pairs"
plots the centered and scaled data.
Panagiotis Papastamoulis
Luca Scrucca and Michael Fop and Thomas Brendan Murphy and Adrian E. Raftery (2017). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1): 205–233.
Taiyun Wei and Viliam Simko (2017). R package "corrplot": Visualization of a Correlation Matrix (Version 0.84). Available from https://github.com/taiyun/corrplot
This function prints a summary of objects returned by the fabMix
function.
## S3 method for class 'fabMix.object' print(x, ...)
## S3 method for class 'fabMix.object' print(x, ...)
x |
An object of class |
... |
ignored. |
The function prints some basic information for a fabMix.object
.
Panagiotis Papastamoulis
Function to read Lambda values from file.
readLambdaValues(myFile,K,p,q)
readLambdaValues(myFile,K,p,q)
myFile |
File containing Lambda values |
K |
Number of components |
p |
Number of variables |
q |
Number of factors |
array of factor loadings.
Panagiotis Papastamoulis
Simulate data from a multivariate normal mixture using a mixture of factor analyzers mechanism.
simData(sameSigma, sameLambda, p, q, K.true, n, loading_means, loading_sd, sINV_values)
simData(sameSigma, sameLambda, p, q, K.true, n, loading_means, loading_sd, sINV_values)
sameSigma |
Logical. |
sameLambda |
Logical. |
p |
The dimension of the multivariate normal distribution ( |
q |
Number of factors. It should be strictly smaller than p. |
K.true |
The number of mixture components (clusters). |
n |
Sample size. |
loading_means |
A vector which contains the means of blocks of factor loadings. Default: |
loading_sd |
A vector which contains the standard deviations of blocks of factor loadings. Default: |
sINV_values |
A vector which contains the values of the diagonal of the (common) inverse covariance matrix, if Default: |
A list with the following entries:
data |
|
class |
|
factorLoadings |
|
means |
|
variance |
|
factors |
|
weights |
|
The marginal variance for cluster is equal to
.
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) summary(syntheticDataset)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) summary(syntheticDataset)
Simulate data from a multivariate normal mixture using a mixture of factor analyzers mechanism.
simData2(sameSigma, p, q, K.true, n, loading_means, loading_sd, sINV_values)
simData2(sameSigma, p, q, K.true, n, loading_means, loading_sd, sINV_values)
sameSigma |
Logical. |
p |
The dimension of the multivariate normal distribution ( |
q |
Number of factors. It should be strictly smaller than p. |
K.true |
The number of mixture components (clusters). |
n |
Sample size. |
loading_means |
A vector which contains the means of blocks of factor loadings. Default: |
loading_sd |
A vector which contains the standard deviations of blocks of factor loadings. Default: |
sINV_values |
A vector which contains the values of the diagonal of the (common) inverse covariance matrix, if Default: |
A list with the following entries:
data |
|
class |
|
factorLoadings |
|
means |
|
variance |
|
factors |
|
weights |
|
The marginal variance for cluster is equal to
.
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData2(K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) summary(syntheticDataset)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData2(K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) summary(syntheticDataset)
S3 method for printing a summary of a fabMix.object
.
## S3 method for class 'fabMix.object' summary(object, quantile_probs, ...)
## S3 method for class 'fabMix.object' summary(object, quantile_probs, ...)
object |
An object of class |
quantile_probs |
A vector of quantiles to evaluate for each variable. |
... |
Ignored. |
The function prints and returns a summary of the estimated posterior means for the parameters of the selected model for a fabMix.object
. In particular, the method prints the ergodic means of the mixing proportions, marginal means and covariance matrix per component, as well as the corresponding quantiles.
A list consisting of the following entries:
alive_cluster_labels |
The labels of the “alive” components of the overfitting mixture model. |
posterior_means |
Posterior means of mixing proportion, marginal means and covariance matrix per (alive) cluster. |
quantiles |
A matrix containing the quantiles for each parameter. |
The summary
function of the coda
package to the mcmc
object object$mcmc
is used for computing quantiles.
Panagiotis Papastamoulis
in xCx
model
Gibbs sampling for updating the factors for models with same variance of errors per component.
update_all_y(x_data, mu, SigmaINV, Lambda, z)
update_all_y(x_data, mu, SigmaINV, Lambda, z)
x_data |
|
mu |
|
SigmaINV |
|
Lambda |
|
z |
Allocation vector |
A matrix with generated factors
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and simulate factors update_all_y(x_data = syntheticDataset$data, mu = syntheticDataset$means, SigmaINV = diag(1/diag(syntheticDataset$variance)), Lambda = syntheticDataset$factorLoadings, z = syntheticDataset$class)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and simulate factors update_all_y(x_data = syntheticDataset$data, mu = syntheticDataset$means, SigmaINV = diag(1/diag(syntheticDataset$variance)), Lambda = syntheticDataset$factorLoadings, z = syntheticDataset$class)
in xUx
model
Gibbs sampling for updating the factors for models with different variance of errors per component.
update_all_y_Sj(x_data, mu, SigmaINV, Lambda, z)
update_all_y_Sj(x_data, mu, SigmaINV, Lambda, z)
x_data |
|
mu |
|
SigmaINV |
|
Lambda |
|
z |
Allocation vector |
A matrix with generated factors
Panagiotis Papastamoulis
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # add some noise here: SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # use the real values as input and simulate factors update_all_y_Sj(x_data = syntheticDataset$data, mu = syntheticDataset$means, SigmaINV = SigmaINV, Lambda = syntheticDataset$factorLoadings, z = syntheticDataset$class)
library('fabMix') n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # add some noise here: SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # use the real values as input and simulate factors update_all_y_Sj(x_data = syntheticDataset$data, mu = syntheticDataset$means, SigmaINV = SigmaINV, Lambda = syntheticDataset$factorLoadings, z = syntheticDataset$class)
Gibbs sampling for
update_OmegaINV(Lambda, K, g, h)
update_OmegaINV(Lambda, K, g, h)
Lambda |
Factor loadings |
K |
Number of components |
g |
Prior parameter |
h |
Prior parameter |
matrix
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # use the real values as input and simulate allocations update_OmegaINV(Lambda = syntheticDataset$factorLoadings, K = K, g=0.5, h = 0.5)
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # use the real values as input and simulate allocations update_OmegaINV(Lambda = syntheticDataset$factorLoadings, K = K, g=0.5, h = 0.5)
for Cxx model
Gibbs sampling for for Cxx model
update_OmegaINV_Cxx(Lambda, K, g, h)
update_OmegaINV_Cxx(Lambda, K, g, h)
Lambda |
Factor loadings, in the form of |
K |
Number of components |
g |
Prior parameter |
h |
Prior parameter |
matrix
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # Use the real values as input and simulate allocations. # Mmake sure that in this case Lambda[k,,] is the same # for all k = 1,..., K update_OmegaINV_Cxx(Lambda = syntheticDataset$factorLoadings, K = K, g=0.5, h = 0.5)
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # Use the real values as input and simulate allocations. # Mmake sure that in this case Lambda[k,,] is the same # for all k = 1,..., K update_OmegaINV_Cxx(Lambda = syntheticDataset$factorLoadings, K = K, g=0.5, h = 0.5)
Gibbs sampling for updating for the
xCU
model.
update_SigmaINV_faster(x_data, z, y, Lambda, mu, K, alpha_sigma, beta_sigma)
update_SigmaINV_faster(x_data, z, y, Lambda, mu, K, alpha_sigma, beta_sigma)
x_data |
|
z |
Allocation vector |
y |
|
Lambda |
|
mu |
|
K |
Number of components |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
matrix with the common variance of errors per component
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and update SigmaINV update_SigmaINV_faster(x_data = syntheticDataset$data, z = syntheticDataset$class, y = syntheticDataset$factors, Lambda = syntheticDataset$factorLoadings, mu = syntheticDataset$means, K = K, alpha_sigma = 0.5, beta_sigma = 0.5)
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and update SigmaINV update_SigmaINV_faster(x_data = syntheticDataset$data, z = syntheticDataset$class, y = syntheticDataset$factors, Lambda = syntheticDataset$factorLoadings, mu = syntheticDataset$means, K = K, alpha_sigma = 0.5, beta_sigma = 0.5)
per component for
Gibbs sampling for per component
update_SigmaINV_faster_q0( z, mu, K, alpha_sigma, beta_sigma, x_data)
update_SigmaINV_faster_q0( z, mu, K, alpha_sigma, beta_sigma, x_data)
z |
Allocation vector |
mu |
Marginal means |
K |
Number of components |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
x_data |
Data |
Panagiotis Papastamoulis
per component for
Gibbs sampling for per component
update_SigmaINV_faster_q0_sameSigma( z, mu, K, alpha_sigma, beta_sigma, x_data)
update_SigmaINV_faster_q0_sameSigma( z, mu, K, alpha_sigma, beta_sigma, x_data)
z |
Allocation vector |
mu |
Marginal means |
K |
Number of components |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
x_data |
Data |
Panagiotis Papastamoulis
per component
Gibbs sampling for updating for the
xUU
model.
update_SigmaINV_faster_Sj(x_data, z, y, Lambda, mu, K, alpha_sigma, beta_sigma)
update_SigmaINV_faster_Sj(x_data, z, y, Lambda, mu, K, alpha_sigma, beta_sigma)
x_data |
|
z |
Allocation vector |
y |
|
Lambda |
|
mu |
|
K |
Number of components |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
array with the variance of errors per component
,
.
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and update SigmaINV update_SigmaINV_faster_Sj(x_data = syntheticDataset$data, z = syntheticDataset$class, y = syntheticDataset$factors, Lambda = syntheticDataset$factorLoadings, mu = syntheticDataset$means, K = K, alpha_sigma = 0.5, beta_sigma = 0.5)
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and update SigmaINV update_SigmaINV_faster_Sj(x_data = syntheticDataset$data, z = syntheticDataset$class, y = syntheticDataset$factors, Lambda = syntheticDataset$factorLoadings, mu = syntheticDataset$means, K = K, alpha_sigma = 0.5, beta_sigma = 0.5)
for xCC models
Gibbs sampling for for xCC models
update_SigmaINV_xCC(x_data, z, y, Lambda, mu, K, alpha_sigma, beta_sigma)
update_SigmaINV_xCC(x_data, z, y, Lambda, mu, K, alpha_sigma, beta_sigma)
x_data |
|
z |
Allocation vector |
y |
|
Lambda |
|
mu |
|
K |
Number of components |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
matrix with the common variance of errors per component
.
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and update SigmaINV update_SigmaINV_xCC(x_data = syntheticDataset$data, z = syntheticDataset$class, y = syntheticDataset$factors, Lambda = syntheticDataset$factorLoadings, mu = syntheticDataset$means, K = K, alpha_sigma = 0.5, beta_sigma = 0.5)
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and update SigmaINV update_SigmaINV_xCC(x_data = syntheticDataset$data, z = syntheticDataset$class, y = syntheticDataset$factors, Lambda = syntheticDataset$factorLoadings, mu = syntheticDataset$means, K = K, alpha_sigma = 0.5, beta_sigma = 0.5)
per component for xUC models
Gibbs sampling for per component for xUC models
update_SigmaINV_xUC(x_data, z, y, Lambda, mu, K, alpha_sigma, beta_sigma)
update_SigmaINV_xUC(x_data, z, y, Lambda, mu, K, alpha_sigma, beta_sigma)
x_data |
|
z |
Allocation vector |
y |
|
Lambda |
|
mu |
|
K |
Number of components |
alpha_sigma |
Prior parameter |
beta_sigma |
Prior parameter |
array containing the inverse variance of errors per component under the restriction:
, where
.
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and update SigmaINV update_SigmaINV_xUC(x_data = syntheticDataset$data, z = syntheticDataset$class, y = syntheticDataset$factors, Lambda = syntheticDataset$factorLoadings, mu = syntheticDataset$means, K = K, alpha_sigma = 0.5, beta_sigma = 0.5)
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and update SigmaINV update_SigmaINV_xUC(x_data = syntheticDataset$data, z = syntheticDataset$class, y = syntheticDataset$factors, Lambda = syntheticDataset$factorLoadings, mu = syntheticDataset$means, K = K, alpha_sigma = 0.5, beta_sigma = 0.5)
Gibbs sampling for : here the full conditional distribution is being used (that is, the distribution is also conditioned on the values of factors
).
update_z_b(w, mu, Lambda, y, SigmaINV, K, x_data)
update_z_b(w, mu, Lambda, y, SigmaINV, K, x_data)
w |
vector with length |
mu |
|
Lambda |
|
y |
|
SigmaINV |
Precision matrix |
K |
Number of components |
x_data |
Data |
Allocation vector
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and simulate allocations update_z_b(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, y = syntheticDataset$factors, SigmaINV = diag(1/diag(syntheticDataset$variance)), K = K, x_data = syntheticDataset$data)$z
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and simulate allocations update_z_b(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, y = syntheticDataset$factors, SigmaINV = diag(1/diag(syntheticDataset$variance)), K = K, x_data = syntheticDataset$data)$z
Gibbs sampling for : here the full conditional distribution is being used (that is, the distribution is also conditioned on the values of factors
).
update_z_b_Sj(w, mu, Lambda, y, SigmaINV, K, x_data)
update_z_b_Sj(w, mu, Lambda, y, SigmaINV, K, x_data)
w |
vector with length |
mu |
|
Lambda |
|
y |
|
SigmaINV |
|
K |
Number of components |
x_data |
|
Allocation vector
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # use the real values as input and simulate allocations update_z_b_Sj(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, y = syntheticDataset$factors, SigmaINV = SigmaINV, K = K, x_data = syntheticDataset$data)$z
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # use the real values as input and simulate allocations update_z_b_Sj(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, y = syntheticDataset$factors, SigmaINV = SigmaINV, K = K, x_data = syntheticDataset$data)$z
for
Gibbs sampling for
update_z_q0(w, mu, SigmaINV, K, x_data)
update_z_q0(w, mu, SigmaINV, K, x_data)
w |
Mixture weights |
mu |
Marginal means |
SigmaINV |
Precision matrix per component |
K |
Number of components |
x_data |
Data |
Allocation vector
Panagiotis Papastamoulis
for
Gibbs sampling for
update_z_q0_sameSigma(w, mu, SigmaINV, K, x_data)
update_z_q0_sameSigma(w, mu, SigmaINV, K, x_data)
w |
Mixture weights |
mu |
Marginal means |
SigmaINV |
Precision matrix per component |
K |
Number of components |
x_data |
Data |
Allocation vector
Panagiotis Papastamoulis
using matrix inversion lemma
Collapsed Gibbs for using matrix inversion lemma
update_z2(w, mu, Lambda, SigmaINV, K, x_data)
update_z2(w, mu, Lambda, SigmaINV, K, x_data)
w |
vector with length |
mu |
|
Lambda |
|
SigmaINV |
|
K |
Number of components |
x_data |
|
Allocation vector
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and simulate allocations update_z2(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, SigmaINV = diag(1/diag(syntheticDataset$variance)), K = K, x_data = syntheticDataset$data)$z
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and simulate allocations update_z2(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, SigmaINV = diag(1/diag(syntheticDataset$variance)), K = K, x_data = syntheticDataset$data)$z
using matrix inversion lemma
Collapsed Gibbs for using matrix inversion lemma
update_z2_Sj(w, mu, Lambda, SigmaINV, K, x_data)
update_z2_Sj(w, mu, Lambda, SigmaINV, K, x_data)
w |
vector with length |
mu |
|
Lambda |
|
SigmaINV |
|
K |
Number of components |
x_data |
|
Allocation vector
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # use the real values as input and simulate allocations update_z2_Sj(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, SigmaINV = SigmaINV, K = K, x_data = syntheticDataset$data)$z
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # use the real values as input and simulate allocations update_z2_Sj(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, SigmaINV = SigmaINV, K = K, x_data = syntheticDataset$data)$z
Collapsed Gibbs for .
update_z4(w, mu, Lambda, SigmaINV, K, x_data)
update_z4(w, mu, Lambda, SigmaINV, K, x_data)
w |
vector with length |
mu |
|
Lambda |
|
SigmaINV |
|
K |
Number of components |
x_data |
|
A vector of length with the simulated allocation of each observation among the
components.
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and simulate allocations update_z4(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, SigmaINV = diag(1/diag(syntheticDataset$variance)), K = K, x_data = syntheticDataset$data)$z
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) # use the real values as input and simulate allocations update_z4(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, SigmaINV = diag(1/diag(syntheticDataset$variance)), K = K, x_data = syntheticDataset$data)$z
Collapsed Gibbs for
update_z4_Sj(w, mu, Lambda, SigmaINV, K, x_data)
update_z4_Sj(w, mu, Lambda, SigmaINV, K, x_data)
w |
vector with length |
mu |
|
Lambda |
|
SigmaINV |
|
K |
Number of components |
x_data |
|
Allocation vector
Panagiotis Papastamoulis
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # use the real values as input and simulate allocations update_z4_Sj(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, SigmaINV = SigmaINV, K = K, x_data = syntheticDataset$data)$z
library('fabMix') # simulate some data n = 8 # sample size p = 5 # number of variables q = 2 # number of factors K = 2 # true number of clusters sINV_diag = 1/((1:p)) # diagonal of inverse variance of errors set.seed(100) syntheticDataset <- simData(sameLambda=TRUE,K.true = K, n = n, q = q, p = p, sINV_values = sINV_diag) SigmaINV <- array(data = 0, dim = c(K,p,p)) for(k in 1:K){ diag(SigmaINV[k,,]) <- 1/diag(syntheticDataset$variance) + rgamma(p, shape=1, rate = 1) } # use the real values as input and simulate allocations update_z4_Sj(w = syntheticDataset$weights, mu = syntheticDataset$means, Lambda = syntheticDataset$factorLoadings, SigmaINV = SigmaINV, K = K, x_data = syntheticDataset$data)$z
A subset of 1500 randomly sampled observations from the wave dataset (version 1), available from the UCI machine learning repository. It contains 3 classes of waves (variable class
with values “1”, “2” and “3”) and 21 attributes. Each class is generated from a combination of 2 of 3 base waves with noise.
data(waveDataset1500)
data(waveDataset1500)
A data frame with 1500 rows and 22 columns. The first column denotes the class of each observation.
https://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+(Version+1)
Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Breiman,L., Friedman,J.H., Olshen,R.A. and Stone,C.J. (1984). Classification and Regression Trees. Wadsworth International Group: Belmont, California.