Title: | Bayesian Nonparametrics for Automatic Gating of Flow-Cytometry Data |
---|---|
Description: | Dirichlet process mixture of multivariate normal, skew normal or skew t-distributions modeling oriented towards flow-cytometry data preprocessing applications. Method is detailed in: Hejblum, Alkhassimn, Gottardo, Caron & Thiebaut (2019) <doi: 10.1214/18-AOAS1209>. |
Authors: | Boris P Hejblum [aut, cre], Chariff Alkhassim [aut], Francois Caron [aut] |
Maintainer: | Boris P Hejblum <[email protected]> |
License: | LGPL-3 | file LICENSE |
Version: | 0.13.5 |
Built: | 2024-12-09 07:01:56 UTC |
Source: | CRAN |
Dirichlet process mixture of multivariate normal, skew normal or skew t-distributions modeling oriented towards flow-cytometry data pre-processing applications.
Package: | NPflow |
Type: | Package |
Version: | 0.13.5 |
Date: | 2024-01-13 |
License: | LGPL-3 |
The main function in this package is DPMpost
.
Boris P. Hejblum, Chariff Alkhassim, Francois Caron — Maintainer: Boris P. Hejblum
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209> <arXiv: 1702.04407> https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
Useful links:
Report bugs at https://github.com/sistm/NPflow/issues
Utility function for burning MCMC iteration from a DPMMclust
object.
burn.DPMMclust(x, burnin = 0, thin = 1)
burn.DPMMclust(x, burnin = 0, thin = 1)
x |
a |
burnin |
the number of MCMC iterations to burn (default is |
thin |
the spacing at which MCMC iterations are kept.
Default is |
a DPMMclust
object minus the burnt iterations
Boris Hejblum
Get a point estimate of the partition using the Binder loss function.
cluster_est_binder(c, logposterior)
cluster_est_binder(c, logposterior)
c |
a list of vector of length |
logposterior |
vector of logposterior corresponding to each
partition from |
a list
:
c_est
: a vector of length n
. Point estimate of the partition
cost
: a vector of length N
. cost[j]
is the cost
associated to partition c[[j]]
similarity
: matrix of size n x n
. Similarity matrix
(see similarityMat
)
opt_ind
:the index of the optimal partition among the MCMC iterations.
Francois Caron, Boris Hejblum
F Caron, YW Teh, TB Murphy, Bayesian nonparametric Plackett-Luce models for the analysis of preferences for college degree programmes, Annals of Applied Statistics, 8(2):1145-1181, 2014.
DB Dahl, Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, Bayesian Inference for Gene Expression and Proteomics, K-A Do, P Muller, M Vannucci (Eds.), Cambridge University Press, 2006.
Get a point estimate of the partition using the F-measure as the cost function.
cluster_est_Fmeasure(c, logposterior)
cluster_est_Fmeasure(c, logposterior)
c |
a list of vector of length |
logposterior |
a vector of logposterior corresponding to each
partition from |
a list
:
c_est: |
a vector of length |
cost: |
a vector of length |
similarity: |
matrix of size |
opt_ind: |
the index of the optimal partition among the MCMC iterations. |
Francois Caron, Boris Hejblum
Get a point estimate of the partition using a modified Binder loss function for Gaussian components
cluster_est_Mbinder_norm(c, Mu, Sigma, lambda = 0, a = 1, b = a, logposterior)
cluster_est_Mbinder_norm(c, Mu, Sigma, lambda = 0, a = 1, b = a, logposterior)
c |
a list of vector of length |
Mu |
is a list of length |
Sigma |
is list of length |
lambda |
is a nonnegative tunning parameter allowing further control over the distance function. Default is 0. |
a |
nonnegative constant seen as the unit cost for pairwise misclassification. Default is 1. |
b |
nonnegative constant seen as the unit cost for the other kind of pairwise misclassification. Default is 1. |
logposterior |
vector of logposterior corresponding to each
partition from |
Note that he current implementation only allows Gaussian components.
The modified Binder loss function takes into account the distance between mixture components using #'the Bhattacharyya distance.
a list
:
c_est: |
a vector of length |
cost: |
a vector of length |
similarity: |
matrix of size |
opt_ind: |
the index of the optimal partition among the MCMC iterations. |
Chariff Alkhassim
JW Lau, PJ Green, Bayesian Model-Based Clustering Procedures, Journal of Computational and Graphical Statistics, 16(3):526-558, 2007.
DA Binder, Bayesian cluster analysis, Biometrika 65(1):31-38, 1978.
similarityMat
similarityMatC
similarityMat_nocostC
Gets a point estimate of the partition using posterior expected adjusted Rand index (PEAR)
cluster_est_pear(c)
cluster_est_pear(c)
c |
a list of vector of length |
a list
:
c_est: |
a vector of length |
pear: |
a vector of length |
similarity: |
matrix of size |
opt_ind: |
the index of the optimal partition among the MCMC iterations. |
Chariff Alkhassim
A. Fritsch, K. Ickstadt. Improved Criteria for Clustering Based on the Posterior Similarity Matrix, in Bayesian Analysis, Vol.4 : p.367-392 (2009)
Scatterplot of flow cytometry data
cytoScatter( cytomatrix, dims2plot = c(1, 2), gating = NULL, scale_log = FALSE, xlim = NULL, ylim = NULL, gg.add = list(theme()) )
cytoScatter( cytomatrix, dims2plot = c(1, 2), gating = NULL, scale_log = FALSE, xlim = NULL, ylim = NULL, gg.add = list(theme()) )
cytomatrix |
a |
dims2plot |
a vector of length at least 2, indicating of the dimensions to be plotted.
Default is |
gating |
an optional vector of length |
scale_log |
a logical Flag indicating whether the data should be plotted on the log scale.
Default is |
xlim |
a vector of length 2 to specify the x-axis limits. Only used if |
ylim |
a vector of length 2 to specify the y-axis limits. Only used if |
gg.add |
A list of instructions to add to the |
rm(list=ls()) #Number of data n <- 500 #n <- 2000 set.seed(1234) #set.seed(123) #set.seed(4321) # Sample data m <- matrix(nrow=2, ncol=4, c(-1, 1, 1.5, 2, 2, -2, -1.5, -2)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev <- array(dim=c(2,2,4)) sdev[, ,1] <- matrix(nrow=2, ncol=2, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=2, ncol=2, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=2, ncol=2, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) z <- matrix(0, nrow=2, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(2, mean = 0, sd = 1), nrow=2, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } cytoScatter(z)
rm(list=ls()) #Number of data n <- 500 #n <- 2000 set.seed(1234) #set.seed(123) #set.seed(4321) # Sample data m <- matrix(nrow=2, ncol=4, c(-1, 1, 1.5, 2, 2, -2, -1.5, -2)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev <- array(dim=c(2,2,4)) sdev[, ,1] <- matrix(nrow=2, ncol=2, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=2, ncol=2, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=2, ncol=2, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) z <- matrix(0, nrow=2, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(2, mean = 0, sd = 1), nrow=2, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } cytoScatter(z)
Slice Sampling of the Dirichlet Process Mixture Model with a prior on alpha
DPMGibbsN( z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = TRUE, ... )
DPMGibbsN( z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = TRUE, ... )
z |
data matrix |
hyperG0 |
prior mixing distribution. |
a |
shape hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet
Process. Default is |
b |
scale hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet
Process. Default is |
N |
number of MCMC iterations. |
doPlot |
logical flag indicating whether to plot MCMC iteration or not. Default to
|
nbclust_init |
number of clusters at initialization. Default to 30 (or less if there are less than 30 observations). |
plotevery |
an integer indicating the interval between plotted iterations when |
diagVar |
logical flag indicating whether the variance of each cluster is estimated as a
diagonal matrix, or as a full matrix. Default is |
use_variance_hyperprior |
logical flag indicating whether a hyperprior is added
for the variance parameter. Default is |
verbose |
logical flag indicating whether partition info is written in the console at each MCMC iteration. |
... |
additional arguments to be passed to |
a object of class DPMclust
with the following attributes:
mcmc_partitions: |
a list of length |
alpha: |
a vector of length |
listU_mu: |
a list of length |
listU_Sigma: |
a list of length |
U_SS_list: |
a list of length |
weights_list: |
a list of length |
logposterior_list: |
a list of length |
data: |
the data matrix |
nb_mcmcit: |
the number of MCMC iterations |
clust_distrib: |
the parametric distribution of the mixture component - |
hyperG0: |
the prior on the cluster location |
Boris Hejblum
rm(list=ls()) #Number of data n <- 500 d <- 4 #n <- 2000 set.seed(1234) #set.seed(123) #set.seed(4321) # Sample data m <- matrix(nrow=d, ncol=4, c(-1, 1, 1.5, 2, 2, -2, -1.5, -2)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev <- array(dim=c(d,d,4)) sdev[, ,1] <- 0.3*diag(d) sdev[, ,2] <- c(0.1, 0.3)*diag(d) sdev[, ,3] <- matrix(nrow=d, ncol=d, 0.15) diag(sdev[, ,3]) <- 0.3 sdev[, ,4] <- 0.3*diag(d) c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["mu"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["nu"]] <- d+2 hyperG0[["lambda"]] <- diag(d)/10 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # Number of iterations N <- 30 # do some plots doPlot <- TRUE nbclust_init <- 30 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Toy example Data")) p ## alpha priors plots ##################### prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white", bins=30) + geom_density(alpha=.6, fill="red", color=NA) + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) + theme_bw() ) p if(interactive()){ # Gibbs sampler for Dirichlet Process Mixtures ############################################## MCMCsample <- DPMGibbsN(z, hyperG0, a, b, N=500, doPlot, nbclust_init, plotevery=100, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) plot_ConvDPM(MCMCsample, from=2) s <- summary(MCMCsample, burnin = 200, thin=2, posterior_approx=FALSE, lossFn = "MBinderN") F <- FmeasureC(pred=s$point_estim$c_est, ref=c) postalpha <- data.frame("alpha"=MCMCsample$alpha[50:500], "distribution" = factor(rep("posterior",500-49), levels=c("prior", "posterior"))) p <- (ggplot(postalpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), binwidth=.1, colour="black", fill="white") + geom_density(alpha=.2, fill="blue") + ggtitle("Posterior distribution of alpha\n") # Ignore NA values for mean # Overlay with transparent density plot + geom_vline(aes(xintercept=mean(alpha, na.rm=TRUE)), color="red", linetype="dashed", size=1) ) p p <- (ggplot(drop=FALSE, alpha=.6) + geom_density(aes(x=alpha, fill=distribution), color=NA, alpha=.6, data=prioralpha) #+ geom_density(aes(x=alpha, fill=distribution), # color=NA, alpha=.6, # data=postalpha) + ggtitle("Prior and posterior distributions of alpha\n") + scale_fill_discrete(drop=FALSE) + theme_bw() +xlim(0,10) +ylim(0, 1.3) ) p } # k-means comparison #################### plot(x=z[1,], y=z[2,], col=kmeans(t(z), centers=4)$cluster, xlab = "d = 1", ylab= "d = 2", main="k-means with K=4 clusters") KM <- kmeans(t(z), centers=4) dataKM <- data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(KM$cluster)) dataCenters <- data.frame("X"=KM$centers[,1], "Y"=KM$centers[,2], "Cluster"=rownames(KM$centers)) p <- (ggplot(dataKM) + geom_point(aes(x=X, y=Y, col=Cluster)) + geom_point(aes(x=X, y=Y, fill=Cluster, order=Cluster), data=dataCenters, shape=22, size=5) + scale_colour_discrete(name="Cluster") + ggtitle("K-means with K=4 clusters\n")) p
rm(list=ls()) #Number of data n <- 500 d <- 4 #n <- 2000 set.seed(1234) #set.seed(123) #set.seed(4321) # Sample data m <- matrix(nrow=d, ncol=4, c(-1, 1, 1.5, 2, 2, -2, -1.5, -2)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev <- array(dim=c(d,d,4)) sdev[, ,1] <- 0.3*diag(d) sdev[, ,2] <- c(0.1, 0.3)*diag(d) sdev[, ,3] <- matrix(nrow=d, ncol=d, 0.15) diag(sdev[, ,3]) <- 0.3 sdev[, ,4] <- 0.3*diag(d) c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["mu"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["nu"]] <- d+2 hyperG0[["lambda"]] <- diag(d)/10 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # Number of iterations N <- 30 # do some plots doPlot <- TRUE nbclust_init <- 30 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Toy example Data")) p ## alpha priors plots ##################### prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white", bins=30) + geom_density(alpha=.6, fill="red", color=NA) + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) + theme_bw() ) p if(interactive()){ # Gibbs sampler for Dirichlet Process Mixtures ############################################## MCMCsample <- DPMGibbsN(z, hyperG0, a, b, N=500, doPlot, nbclust_init, plotevery=100, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) plot_ConvDPM(MCMCsample, from=2) s <- summary(MCMCsample, burnin = 200, thin=2, posterior_approx=FALSE, lossFn = "MBinderN") F <- FmeasureC(pred=s$point_estim$c_est, ref=c) postalpha <- data.frame("alpha"=MCMCsample$alpha[50:500], "distribution" = factor(rep("posterior",500-49), levels=c("prior", "posterior"))) p <- (ggplot(postalpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), binwidth=.1, colour="black", fill="white") + geom_density(alpha=.2, fill="blue") + ggtitle("Posterior distribution of alpha\n") # Ignore NA values for mean # Overlay with transparent density plot + geom_vline(aes(xintercept=mean(alpha, na.rm=TRUE)), color="red", linetype="dashed", size=1) ) p p <- (ggplot(drop=FALSE, alpha=.6) + geom_density(aes(x=alpha, fill=distribution), color=NA, alpha=.6, data=prioralpha) #+ geom_density(aes(x=alpha, fill=distribution), # color=NA, alpha=.6, # data=postalpha) + ggtitle("Prior and posterior distributions of alpha\n") + scale_fill_discrete(drop=FALSE) + theme_bw() +xlim(0,10) +ylim(0, 1.3) ) p } # k-means comparison #################### plot(x=z[1,], y=z[2,], col=kmeans(t(z), centers=4)$cluster, xlab = "d = 1", ylab= "d = 2", main="k-means with K=4 clusters") KM <- kmeans(t(z), centers=4) dataKM <- data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(KM$cluster)) dataCenters <- data.frame("X"=KM$centers[,1], "Y"=KM$centers[,2], "Cluster"=rownames(KM$centers)) p <- (ggplot(dataKM) + geom_point(aes(x=X, y=Y, col=Cluster)) + geom_point(aes(x=X, y=Y, fill=Cluster, order=Cluster), data=dataCenters, shape=22, size=5) + scale_colour_discrete(name="Cluster") + ggtitle("K-means with K=4 clusters\n")) p
Slice Sampling of the Dirichlet Process Mixture Model with a prior on alpha
DPMGibbsN_parallel( Ncpus, type_connec, z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = TRUE, monitorfile = "", ... )
DPMGibbsN_parallel( Ncpus, type_connec, z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = TRUE, monitorfile = "", ... )
Ncpus |
the number of processors available |
type_connec |
The type of connection between the processors. Supported
cluster types are |
z |
data matrix |
hyperG0 |
prior mixing distribution. |
a |
shape hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
b |
scale hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
N |
number of MCMC iterations. |
doPlot |
logical flag indicating whether to plot MCMC iteration or not.
Default to |
nbclust_init |
number of clusters at initialization. Default to 30 (or less if there are less than 30 observations). |
plotevery |
an integer indicating the interval between plotted iterations when |
diagVar |
logical flag indicating whether the variance of each cluster is
estimated as a diagonal matrix, or as a full matrix.
Default is |
use_variance_hyperprior |
logical flag indicating whether a hyperprior is added
for the variance parameter. Default is |
verbose |
logical flag indicating whether partition info is written in the console at each MCMC iteration. |
monitorfile |
a writable connections or a character string naming a file to write into,
to monitor the progress of the analysis.
Default is |
... |
additional arguments to be passed to |
a object of class DPMclust
with the following attributes:
mcmc_partitions: |
a list of length |
alpha: |
a vector of length |
listU_mu: |
a list of length |
listU_Sigma: |
a list of length |
U_SS_list: |
a list of length |
weights_list: |
a list of length |
logposterior_list: |
a list of length |
data: |
the data matrix |
nb_mcmcit: |
the number of MCMC iterations |
clust_distrib: |
the parametric distribution of the mixture component - |
hyperG0: |
the prior on the cluster location |
Boris Hejblum
# Scaling up: ---- rm(list=ls()) #Number of data n <- 2000 set.seed(1234) # Sample data d <- 3 nclust <- 5 m <- matrix(nrow=d, ncol=nclust, runif(d*nclust)*8) # p: cluster probabilities p <- runif(nclust) p <- p/sum(p) # Covariance matrix of the clusters sdev <- array(dim=c(d, d, nclust)) for (j in 1:nclust){ sdev[, ,j] <- matrix(NA, nrow=d, ncol=d) diag(sdev[, ,j]) <- abs(rnorm(n=d, mean=0.3, sd=0.1)) sdev[, ,j][lower.tri(sdev[, ,j], diag = FALSE)] <- rnorm(n=d*(d-1)/2, mean=0, sd=0.05) sdev[, ,j][upper.tri(sdev[, ,j], diag = FALSE)] <- (sdev[, ,j][ lower.tri(sdev[, ,j], diag = FALSE)]) } c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # hyperprior on the Scale parameter of DPM a <- 0.001 b <- 0.001 # Number of iterations N <- 25 # do some plots doPlot <- TRUE # Set parameters of G0 hyperG0 <- list() hyperG0[["mu"]] <- rep(0, d) hyperG0[["kappa"]] <- 0.01 hyperG0[["nu"]] <- d + 2 hyperG0[["lambda"]] <- diag(d)/10 nbclust_init <- 30 if(interactive()){ library(doParallel) MCMCsample <- DPMGibbsN_parallel(Ncpus=2, type_connec="FORK", z, hyperG0, a, b, N=1000, doPlot=FALSE, nbclust_init=30, plotevery=100, gg.add=list(ggplot2::theme_bw(), ggplot2::guides(shape = ggplot2::guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) }
# Scaling up: ---- rm(list=ls()) #Number of data n <- 2000 set.seed(1234) # Sample data d <- 3 nclust <- 5 m <- matrix(nrow=d, ncol=nclust, runif(d*nclust)*8) # p: cluster probabilities p <- runif(nclust) p <- p/sum(p) # Covariance matrix of the clusters sdev <- array(dim=c(d, d, nclust)) for (j in 1:nclust){ sdev[, ,j] <- matrix(NA, nrow=d, ncol=d) diag(sdev[, ,j]) <- abs(rnorm(n=d, mean=0.3, sd=0.1)) sdev[, ,j][lower.tri(sdev[, ,j], diag = FALSE)] <- rnorm(n=d*(d-1)/2, mean=0, sd=0.05) sdev[, ,j][upper.tri(sdev[, ,j], diag = FALSE)] <- (sdev[, ,j][ lower.tri(sdev[, ,j], diag = FALSE)]) } c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # hyperprior on the Scale parameter of DPM a <- 0.001 b <- 0.001 # Number of iterations N <- 25 # do some plots doPlot <- TRUE # Set parameters of G0 hyperG0 <- list() hyperG0[["mu"]] <- rep(0, d) hyperG0[["kappa"]] <- 0.01 hyperG0[["nu"]] <- d + 2 hyperG0[["lambda"]] <- diag(d)/10 nbclust_init <- 30 if(interactive()){ library(doParallel) MCMCsample <- DPMGibbsN_parallel(Ncpus=2, type_connec="FORK", z, hyperG0, a, b, N=1000, doPlot=FALSE, nbclust_init=30, plotevery=100, gg.add=list(ggplot2::theme_bw(), ggplot2::guides(shape = ggplot2::guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) }
Slice Sampling of Dirichlet Process Mixture of Gaussian distributions
DPMGibbsN_SeqPrior( z, prior_inform, hyperG0, N, nbclust_init, add.vagueprior = TRUE, weightnoninfo = NULL, doPlot = TRUE, plotevery = N/10, diagVar = TRUE, verbose = TRUE, ... )
DPMGibbsN_SeqPrior( z, prior_inform, hyperG0, N, nbclust_init, add.vagueprior = TRUE, weightnoninfo = NULL, doPlot = TRUE, plotevery = N/10, diagVar = TRUE, verbose = TRUE, ... )
z |
data matrix |
prior_inform |
an informative prior such as the approximation computed by |
hyperG0 |
a non informative prior component for the mixing distribution.
Only used if |
N |
number of MCMC iterations. |
nbclust_init |
number of clusters at initialization. Default to 30 (or less if there are less than 30 observations). |
add.vagueprior |
logical flag indicating whether a non informative component should
be added to the informative prior. Default is |
weightnoninfo |
a real between 0 and 1 giving the weights of the non informative component in the prior. |
doPlot |
logical flag indicating whether to plot MCMC iteration or not.
Default to |
plotevery |
an integer indicating the interval between plotted iterations when |
diagVar |
logical flag indicating whether the variance of each cluster is
estimated as a diagonal matrix, or as a full matrix.
Default is |
verbose |
logical flag indicating whether partition info is written in the console at each MCMC iteration. |
... |
additional arguments to be passed to |
a object of class DPMclust
with the following attributes:
mcmc_partitions: |
a list of length |
alpha: |
a vector of length |
listU_mu: |
a list of length |
listU_Sigma: |
a list of length |
U_SS_list: |
a list of length |
weights_list: |
|
logposterior_list: |
a list of length |
data: |
the data matrix |
nb_mcmcit: |
the number of MCMC iterations |
clust_distrib: |
the parametric distribution of the mixture component - |
hyperG0: |
the prior on the cluster location |
Boris Hejblum, Chariff Alkhassim
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209> <arXiv: 1702.04407> https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
postProcess.DPMMclust
DPMGibbsN
rm(list=ls()) library(NPflow) #Number of data n <- 1500 # Sample data #m <- matrix(nrow=2, ncol=4, c(-1, 1, 1.5, 2, 2, -2, 0.5, -2)) m <- matrix(nrow=2, ncol=4, c(-.8, .7, .5, .7, .5, -.7, -.5, -.7)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev <- array(dim=c(2,2,4)) sdev[, ,1] <- matrix(nrow=2, ncol=2, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=2, ncol=2, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=2, ncol=2, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) z <- matrix(0, nrow=2, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(2, mean = 0, sd = 1), nrow=2, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } d<-2 # Set parameters of G0 hyperG0 <- list() hyperG0[["mu"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["nu"]] <- d+2 hyperG0[["lambda"]] <- diag(d)/10 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # Number of iterations N <- 30 # do some plots doPlot <- TRUE nbclust_init <- 20 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Toy example Data")) p if(interactive()){ # Gibbs sampler for Dirichlet Process Mixtures ############################################## MCMCsample <- DPMGibbsN(z, hyperG0, a, b, N=1500, doPlot, nbclust_init, plotevery=200, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) s <- summary(MCMCsample, posterior_approx=TRUE, burnin = 1000, thin=5) F1 <- FmeasureC(pred=s$point_estim$c_est, ref=c) F1 MCMCsample2 <- DPMGibbsN_SeqPrior(z, prior_inform=s$param_posterior, hyperG0, N=1500, add.vagueprior = TRUE, doPlot=TRUE, plotevery=100, nbclust_init=nbclust_init, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) s2 <- summary(MCMCsample2, burnin = 500, thin=5) F2 <- FmeasureC(pred=s2$point_estim$c_est, ref=c) F2 }
rm(list=ls()) library(NPflow) #Number of data n <- 1500 # Sample data #m <- matrix(nrow=2, ncol=4, c(-1, 1, 1.5, 2, 2, -2, 0.5, -2)) m <- matrix(nrow=2, ncol=4, c(-.8, .7, .5, .7, .5, -.7, -.5, -.7)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev <- array(dim=c(2,2,4)) sdev[, ,1] <- matrix(nrow=2, ncol=2, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=2, ncol=2, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=2, ncol=2, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) z <- matrix(0, nrow=2, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(2, mean = 0, sd = 1), nrow=2, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } d<-2 # Set parameters of G0 hyperG0 <- list() hyperG0[["mu"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["nu"]] <- d+2 hyperG0[["lambda"]] <- diag(d)/10 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # Number of iterations N <- 30 # do some plots doPlot <- TRUE nbclust_init <- 20 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Toy example Data")) p if(interactive()){ # Gibbs sampler for Dirichlet Process Mixtures ############################################## MCMCsample <- DPMGibbsN(z, hyperG0, a, b, N=1500, doPlot, nbclust_init, plotevery=200, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) s <- summary(MCMCsample, posterior_approx=TRUE, burnin = 1000, thin=5) F1 <- FmeasureC(pred=s$point_estim$c_est, ref=c) F1 MCMCsample2 <- DPMGibbsN_SeqPrior(z, prior_inform=s$param_posterior, hyperG0, N=1500, add.vagueprior = TRUE, doPlot=TRUE, plotevery=100, nbclust_init=nbclust_init, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) s2 <- summary(MCMCsample2, burnin = 500, thin=5) F2 <- FmeasureC(pred=s2$point_estim$c_est, ref=c) F2 }
Slice Sampling of Dirichlet Process Mixture of skew normal distributions
DPMGibbsSkewN( z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = TRUE, ... )
DPMGibbsSkewN( z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = TRUE, ... )
z |
data matrix |
hyperG0 |
prior mixing distribution. |
a |
shape hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
b |
scale hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
N |
number of MCMC iterations. |
doPlot |
logical flag indicating whether to plot MCMC iteration or not.
Default to |
nbclust_init |
number of clusters at initialization. Default to 30 (or less if there are less than 30 observations). |
plotevery |
an integer indicating the interval between plotted iterations when |
diagVar |
logical flag indicating whether the variance of a cluster is a diagonal matrix.
Default is |
use_variance_hyperprior |
logical flag indicating whether a hyperprior is added
for the variance parameter. Default is |
verbose |
logical flag indicating whether partition info is written in the console at each MCMC iteration. |
... |
additional arguments to be passed to |
a object of class DPMclust
with the following attributes:
mcmc_partitions: |
a list of length |
alpha: |
a vector of length |
U_SS_list: |
a list of length |
weights_list: |
|
logposterior_list: |
a list of length |
data: |
the data matrix |
nb_mcmcit: |
the number of MCMC iterations |
clust_distrib: |
the parametric distribution of the mixture component - |
hyperG0: |
the prior on the cluster location |
Boris Hejblum
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209> <arXiv: 1702.04407> https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
rm(list=ls()) #Number of data n <- 1000 set.seed(123) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) #xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2)) xi <- matrix(nrow=d, ncol=ncl, c(-0.5, 0, 0.5, 0, 0.5, -1, -1, 1)) ##xi <- matrix(nrow=d, ncol=ncl, c(-0.3, 0, 0.5, 0.5, 0.5, -1.2, -1, 1)) psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -1.2)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rep(0,d) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.0001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d + 1 hyperG0[["lambda"]] <- diag(d) # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # do some plots doPlot <- TRUE nbclust_init <- 30 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) p c2plot <- factor(c) levels(c2plot) <- c("3", "2", "4", "1") pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot))) + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster)) + ggtitle("Slightly overlapping skew-normal simulation\n") + xlab("D1") + ylab("D2") + theme_bw() + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22)))) pp ## alpha priors plots ##################### prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="red") + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) ) p if(interactive()){ # Gibbs sampler for Dirichlet Process Mixtures ############################################## MCMCsample_sn <- DPMGibbsSkewN(z, hyperG0, a, b, N=2500, doPlot, nbclust_init, plotevery=200, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) s <- summary(MCMCsample_sn, burnin = 2000, thin=10) #cluster_est_binder(MCMCsample_sn$mcmc_partitions[1000:1500]) print(s) plot(s) #plot_ConvDPM(MCMCsample_sn, from=2) # k-means plot(x=z[1,], y=z[2,], col=kmeans(t(z), centers=4)$cluster, xlab = "d = 1", ylab= "d = 2", main="k-means with K=4 clusters") KM <- kmeans(t(z), centers=4) KMclust <- factor(KM$cluster) levels(KMclust) <- c("2", "4", "1", "3") dataKM <- data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(KMclust)) dataCenters <- data.frame("X"=KM$centers[,1], "Y"=KM$centers[,2], "Cluster"=c("2", "4", "1", "3")) p <- (ggplot(dataKM) + geom_point(aes(x=X, y=Y, col=Cluster)) + geom_point(aes(x=X, y=Y, fill=Cluster, order=Cluster), data=dataCenters, shape=22, size=5) + scale_colour_discrete(name="Cluster", guide=guide_legend(override.aes=list(size=6, shape=22))) + ggtitle("K-means with K=4 clusters\n") + theme_bw() ) p postalpha <- data.frame("alpha"=MCMCsample_sn$alpha[501:1000], "distribution" = factor(rep("posterior",1000-500), levels=c("prior", "posterior"))) postK <- data.frame("K"=sapply(lapply(postalpha$alpha, "["), function(x){sum(x/(x+0:(1000-1)))})) p <- (ggplot(postalpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), binwidth=.1, colour="black", fill="white") + geom_density(alpha=.2, fill="blue") + ggtitle("Posterior distribution of alpha\n") # Ignore NA values for mean # Overlay with transparent density plot + geom_vline(aes(xintercept=mean(alpha, na.rm=T)), color="red", linetype="dashed", size=1) ) p p <- (ggplot(postK, aes(x=K)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="blue") + ggtitle("Posterior distribution of predicted K\n") # Ignore NA values for mean # Overlay with transparent density plot + geom_vline(aes(xintercept=mean(K, na.rm=T)), color="red", linetype="dashed", size=1) #+ scale_x_continuous(breaks=c(0:6)*2, minor_breaks=c(0:6)*2+1) + scale_x_continuous(breaks=c(1:12)) ) p p <- (ggplot(drop=FALSE, alpha=.6) + geom_density(aes(x=alpha, fill=distribution), color=NA, alpha=.6, data=postalpha) + geom_density(aes(x=alpha, fill=distribution), color=NA, alpha=.6, data=prioralpha) + ggtitle("Prior and posterior distributions of alpha\n") + scale_fill_discrete(drop=FALSE) + theme_bw() + xlim(0,100) ) p #Skew Normal n=100000 xi <- 0 d <- 0.995 alpha <- d/sqrt(1-d^2) z <- rtruncnorm(n,a=0, b=Inf) e <- rnorm(n, mean = 0, sd = 1) x <- d*z + sqrt(1-d^2)*e o <- 1 y <- xi+o*x nu=1.3 w <- rgamma(n,scale=nu/2, shape=nu/2) yy <- xi+o*x/w snd <- data.frame("Y"=y,"YY"=yy) p <- (ggplot(snd)+geom_density(aes(x=Y), fill="blue", alpha=.2) + theme_bw() + ylab("Density") + ggtitle("Y~SN(0,1,10)\n") + xlim(-1,6) + ylim(0,0.8) ) p p <- (ggplot(snd)+geom_density(aes(x=YY), fill="blue", alpha=.2) + theme_bw() + ylab("Density") + ggtitle("Y~ST(0,1,10,1.3)\n") + xlim(-2,40) + ylim(0,0.8) ) p p <- (ggplot(snd) + geom_density(aes(x=Y, fill="blue"), alpha=.2) + geom_density(aes(x=YY, fill="red"), alpha=.2) + theme_bw() +theme(legend.text = element_text(size = 13), legend.position="bottom") + ylab("Density") + xlim(-2,40) + ylim(0,0.8) + scale_fill_manual(name="", labels=c("Y~SN(0,1,10) ", "Y~ST(0,1,10,1.3)"), guide="legend", values=c("blue", "red")) ) p #Variations n=100000 xi <- -1 d <- 0.995 alpha <- d/sqrt(1-d^2) z <- rtruncnorm(n,a=0, b=Inf) e <- rnorm(n, mean = 0, sd = 1) x <- d*z + sqrt(1-d^2)*e snd <- data.frame("X"=x) p <- (ggplot(snd)+geom_density(aes(x=X), fill="blue", alpha=.2) + theme_bw() + ylab("Density") + ggtitle("X~SN(10)\n") + xlim(-1.5,4) + ylim(0,1.6) ) p o <- 0.5 y <- xi+o*x snd <- data.frame("Y"=y) p <- (ggplot(snd)+geom_density(aes(x=Y), fill="blue", alpha=.2) + theme_bw() + ylab("Density") + ggtitle("Y~SN(-1,1,10)\n") + xlim(-1.5,4) + ylim(0,1.6) ) p #Simple toy example ################### n <- 500 set.seed(12345) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2)) psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -1.2)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) #' # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rep(0,d) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.0001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d + 1 hyperG0[["lambda"]] <- diag(d) c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) cat(k, "/", n, " observations simulated\n", sep="") } MCMCsample_sn_sep <- DPMGibbsSkewN(z, hyperG0, a, b, N=600, doPlot, nbclust_init, plotevery=100, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=TRUE) s <- summary(MCMCsample_sn, burnin = 400) }
rm(list=ls()) #Number of data n <- 1000 set.seed(123) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) #xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2)) xi <- matrix(nrow=d, ncol=ncl, c(-0.5, 0, 0.5, 0, 0.5, -1, -1, 1)) ##xi <- matrix(nrow=d, ncol=ncl, c(-0.3, 0, 0.5, 0.5, 0.5, -1.2, -1, 1)) psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -1.2)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rep(0,d) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.0001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d + 1 hyperG0[["lambda"]] <- diag(d) # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # do some plots doPlot <- TRUE nbclust_init <- 30 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) p c2plot <- factor(c) levels(c2plot) <- c("3", "2", "4", "1") pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot))) + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster)) + ggtitle("Slightly overlapping skew-normal simulation\n") + xlab("D1") + ylab("D2") + theme_bw() + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22)))) pp ## alpha priors plots ##################### prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="red") + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) ) p if(interactive()){ # Gibbs sampler for Dirichlet Process Mixtures ############################################## MCMCsample_sn <- DPMGibbsSkewN(z, hyperG0, a, b, N=2500, doPlot, nbclust_init, plotevery=200, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) s <- summary(MCMCsample_sn, burnin = 2000, thin=10) #cluster_est_binder(MCMCsample_sn$mcmc_partitions[1000:1500]) print(s) plot(s) #plot_ConvDPM(MCMCsample_sn, from=2) # k-means plot(x=z[1,], y=z[2,], col=kmeans(t(z), centers=4)$cluster, xlab = "d = 1", ylab= "d = 2", main="k-means with K=4 clusters") KM <- kmeans(t(z), centers=4) KMclust <- factor(KM$cluster) levels(KMclust) <- c("2", "4", "1", "3") dataKM <- data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(KMclust)) dataCenters <- data.frame("X"=KM$centers[,1], "Y"=KM$centers[,2], "Cluster"=c("2", "4", "1", "3")) p <- (ggplot(dataKM) + geom_point(aes(x=X, y=Y, col=Cluster)) + geom_point(aes(x=X, y=Y, fill=Cluster, order=Cluster), data=dataCenters, shape=22, size=5) + scale_colour_discrete(name="Cluster", guide=guide_legend(override.aes=list(size=6, shape=22))) + ggtitle("K-means with K=4 clusters\n") + theme_bw() ) p postalpha <- data.frame("alpha"=MCMCsample_sn$alpha[501:1000], "distribution" = factor(rep("posterior",1000-500), levels=c("prior", "posterior"))) postK <- data.frame("K"=sapply(lapply(postalpha$alpha, "["), function(x){sum(x/(x+0:(1000-1)))})) p <- (ggplot(postalpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), binwidth=.1, colour="black", fill="white") + geom_density(alpha=.2, fill="blue") + ggtitle("Posterior distribution of alpha\n") # Ignore NA values for mean # Overlay with transparent density plot + geom_vline(aes(xintercept=mean(alpha, na.rm=T)), color="red", linetype="dashed", size=1) ) p p <- (ggplot(postK, aes(x=K)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="blue") + ggtitle("Posterior distribution of predicted K\n") # Ignore NA values for mean # Overlay with transparent density plot + geom_vline(aes(xintercept=mean(K, na.rm=T)), color="red", linetype="dashed", size=1) #+ scale_x_continuous(breaks=c(0:6)*2, minor_breaks=c(0:6)*2+1) + scale_x_continuous(breaks=c(1:12)) ) p p <- (ggplot(drop=FALSE, alpha=.6) + geom_density(aes(x=alpha, fill=distribution), color=NA, alpha=.6, data=postalpha) + geom_density(aes(x=alpha, fill=distribution), color=NA, alpha=.6, data=prioralpha) + ggtitle("Prior and posterior distributions of alpha\n") + scale_fill_discrete(drop=FALSE) + theme_bw() + xlim(0,100) ) p #Skew Normal n=100000 xi <- 0 d <- 0.995 alpha <- d/sqrt(1-d^2) z <- rtruncnorm(n,a=0, b=Inf) e <- rnorm(n, mean = 0, sd = 1) x <- d*z + sqrt(1-d^2)*e o <- 1 y <- xi+o*x nu=1.3 w <- rgamma(n,scale=nu/2, shape=nu/2) yy <- xi+o*x/w snd <- data.frame("Y"=y,"YY"=yy) p <- (ggplot(snd)+geom_density(aes(x=Y), fill="blue", alpha=.2) + theme_bw() + ylab("Density") + ggtitle("Y~SN(0,1,10)\n") + xlim(-1,6) + ylim(0,0.8) ) p p <- (ggplot(snd)+geom_density(aes(x=YY), fill="blue", alpha=.2) + theme_bw() + ylab("Density") + ggtitle("Y~ST(0,1,10,1.3)\n") + xlim(-2,40) + ylim(0,0.8) ) p p <- (ggplot(snd) + geom_density(aes(x=Y, fill="blue"), alpha=.2) + geom_density(aes(x=YY, fill="red"), alpha=.2) + theme_bw() +theme(legend.text = element_text(size = 13), legend.position="bottom") + ylab("Density") + xlim(-2,40) + ylim(0,0.8) + scale_fill_manual(name="", labels=c("Y~SN(0,1,10) ", "Y~ST(0,1,10,1.3)"), guide="legend", values=c("blue", "red")) ) p #Variations n=100000 xi <- -1 d <- 0.995 alpha <- d/sqrt(1-d^2) z <- rtruncnorm(n,a=0, b=Inf) e <- rnorm(n, mean = 0, sd = 1) x <- d*z + sqrt(1-d^2)*e snd <- data.frame("X"=x) p <- (ggplot(snd)+geom_density(aes(x=X), fill="blue", alpha=.2) + theme_bw() + ylab("Density") + ggtitle("X~SN(10)\n") + xlim(-1.5,4) + ylim(0,1.6) ) p o <- 0.5 y <- xi+o*x snd <- data.frame("Y"=y) p <- (ggplot(snd)+geom_density(aes(x=Y), fill="blue", alpha=.2) + theme_bw() + ylab("Density") + ggtitle("Y~SN(-1,1,10)\n") + xlim(-1.5,4) + ylim(0,1.6) ) p #Simple toy example ################### n <- 500 set.seed(12345) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2)) psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -1.2)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) #' # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rep(0,d) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.0001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d + 1 hyperG0[["lambda"]] <- diag(d) c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) cat(k, "/", n, " observations simulated\n", sep="") } MCMCsample_sn_sep <- DPMGibbsSkewN(z, hyperG0, a, b, N=600, doPlot, nbclust_init, plotevery=100, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=TRUE) s <- summary(MCMCsample_sn, burnin = 400) }
If the monitorfile
argument is a character string naming a file to
write into, in the case of a new file that does not exist yet, such a new
file will be created. A line is written at each MCMC iteration.
DPMGibbsSkewN_parallel( Ncpus, type_connec, z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = FALSE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = FALSE, monitorfile = "", ... )
DPMGibbsSkewN_parallel( Ncpus, type_connec, z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = FALSE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = FALSE, monitorfile = "", ... )
Ncpus |
the number of processors available |
type_connec |
The type of connection between the processors. Supported
cluster types are |
z |
data matrix |
hyperG0 |
prior mixing distribution. |
a |
shape hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
b |
scale hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
N |
number of MCMC iterations. |
doPlot |
logical flag indicating whether to plot MCMC iteration or not.
Default to |
nbclust_init |
number of clusters at initialization. Default to 30 (or less if there are less than 30 observations). |
plotevery |
an integer indicating the interval between plotted iterations when |
diagVar |
logical flag indicating whether the variance of each cluster is
estimated as a diagonal matrix, or as a full matrix.
Default is |
use_variance_hyperprior |
logical flag indicating whether a hyperprior is added
for the variance parameter. Default is |
verbose |
logical flag indicating whether partition info is written in the console at each MCMC iteration. |
monitorfile |
a writable connections or a character string naming a file to write into,
to monitor the progress of the analysis.
Default is |
... |
additional arguments to be passed to |
a object of class DPMclust
with the following attributes:
mcmc_partitions: |
a list of length |
alpha: |
a vector of length |
U_SS_list: |
a list of length |
weights_list: |
|
logposterior_list: |
a list of length |
data: |
the data matrix |
nb_mcmcit: |
the number of MCMC iterations |
clust_distrib: |
the parametric distribution of the mixture component - |
hyperG0: |
the prior on the cluster location |
Boris Hejblum
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209> <arXiv: 1702.04407> https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
rm(list=ls()) #Number of data n <- 2000 set.seed(1234) d <- 4 ncl <- 5 # Sample data sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(runif(n=d*ncl,min=0,max=3))) psi <- matrix(nrow=d, ncol=ncl, c(runif(n=d*ncl,min=-1,max=1))) p <- runif(n=ncl) p <- p/sum(p) sdev0 <- diag(runif(n=d, min=0.05, max=0.6)) for (j in 1:ncl){ sdev[, ,j] <- invwishrnd(n = d+2, lambda = sdev0) } c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rep(0,d) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d + 1 hyperG0[["lambda"]] <- diag(d)/10 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # do some plots doPlot <- TRUE nbclust_init <- 30 z <- z*200 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) p ## alpha priors plots ##################### prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="red") + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) ) p # Gibbs sampler for Dirichlet Process Mixtures ############################################## if(interactive()){ MCMCsample_sn_par <- DPMGibbsSkewN_parallel(Ncpus=parallel::detectCores()-1, type_connec="SOCK", z, hyperG0, a, b, N=5000, doPlot, nbclust_init, plotevery=25, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45"))))) plot_ConvDPM(MCMCsample_sn_par, from=2) }
rm(list=ls()) #Number of data n <- 2000 set.seed(1234) d <- 4 ncl <- 5 # Sample data sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(runif(n=d*ncl,min=0,max=3))) psi <- matrix(nrow=d, ncol=ncl, c(runif(n=d*ncl,min=-1,max=1))) p <- runif(n=ncl) p <- p/sum(p) sdev0 <- diag(runif(n=d, min=0.05, max=0.6)) for (j in 1:ncl){ sdev[, ,j] <- invwishrnd(n = d+2, lambda = sdev0) } c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rep(0,d) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d + 1 hyperG0[["lambda"]] <- diag(d)/10 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # do some plots doPlot <- TRUE nbclust_init <- 30 z <- z*200 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) p ## alpha priors plots ##################### prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="red") + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) ) p # Gibbs sampler for Dirichlet Process Mixtures ############################################## if(interactive()){ MCMCsample_sn_par <- DPMGibbsSkewN_parallel(Ncpus=parallel::detectCores()-1, type_connec="SOCK", z, hyperG0, a, b, N=5000, doPlot, nbclust_init, plotevery=25, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45"))))) plot_ConvDPM(MCMCsample_sn_par, from=2) }
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
DPMGibbsSkewT( z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = TRUE, ... )
DPMGibbsSkewT( z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = TRUE, ... )
z |
data matrix |
hyperG0 |
parameters of the prior mixing distribution in a
|
a |
shape hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
b |
scale hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
N |
number of MCMC iterations. |
doPlot |
logical flag indicating whether to plot MCMC iteration or not.
Default to |
nbclust_init |
number of clusters at initialization. Default to 30 (or less if there are less than 30 observations). |
plotevery |
an integer indicating the interval between plotted iterations when |
diagVar |
logical flag indicating whether the variance of each cluster is
estimated as a diagonal matrix, or as a full matrix.
Default is |
use_variance_hyperprior |
logical flag indicating whether a hyperprior is added
for the variance parameter. Default is |
verbose |
logical flag indicating whether partition info is written in the console at each MCMC iteration. |
... |
additional arguments to be passed to |
a object of class DPMclust
with the following attributes:
mcmc_partitions: |
a list of length |
alpha: |
a vector of length |
U_SS_list: |
a list of length |
weights_list: |
a list of length |
logposterior_list: |
a list of length |
data: |
the data matrix |
nb_mcmcit: |
the number of MCMC iterations |
clust_distrib: |
the parametric distribution of the mixture component - |
hyperG0: |
the prior on the cluster location |
Boris Hejblum
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209> <arXiv: 1702.04407> https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
Fruhwirth-Schnatter S, Pyne S, Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions, Biostatistics, 2010.
rm(list=ls()) #Number of data n <- 2000 set.seed(4321) d <- 2 ncl <- 4 # Sample data library(truncnorm) sdev <- array(dim=c(d,d,ncl)) #xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3)) #xi <- matrix(nrow=d, ncol=ncl, c(-0.5, 0, 0.5, 0, 0.5, -1, -1, 1)) xi <- matrix(nrow=d, ncol=ncl, c(-0.2, 0.5, 2.4, 0.4, 0.6, -1.3, -0.9, -2.7)) psi <- matrix(nrow=d, ncol=4, c(0.3, -0.7, -0.8, 0, 0.3, -0.7, 0.2, 0.9)) nu <- c(100,25,8,5) p <- c(0.15, 0.05, 0.5, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) w <- rep(1,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rowMeans(z) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() #+ ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) p #pdf(height=8.5, width=8.5) c2plot <- factor(c) levels(c2plot) <- c("4", "1", "3", "2") pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot))) + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster)) #+ ggtitle("Slightly overlapping skew-normal simulation\n") + xlab("D1") + ylab("D2") + theme_bw() + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22)))) pp #pdf(height=7, width=7.5) ## alpha priors plots ##################### prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="red") + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) ) p if(interactive()){ # Gibbs sampler for Dirichlet Process Mixtures ############################################## MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=1500, doPlot=TRUE, nbclust_init=30, plotevery=100, diagVar=FALSE) s <- summary(MCMCsample_st, burnin = 1000, thin=10, lossFn = "Binder") print(s) plot(s, hm=TRUE) #pdf(height=8.5, width=10.5) #png(height=700, width=720) plot_ConvDPM(MCMCsample_st, from=2) #cluster_est_binder(MCMCsample_st$mcmc_partitions[900:1000]) }
rm(list=ls()) #Number of data n <- 2000 set.seed(4321) d <- 2 ncl <- 4 # Sample data library(truncnorm) sdev <- array(dim=c(d,d,ncl)) #xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3)) #xi <- matrix(nrow=d, ncol=ncl, c(-0.5, 0, 0.5, 0, 0.5, -1, -1, 1)) xi <- matrix(nrow=d, ncol=ncl, c(-0.2, 0.5, 2.4, 0.4, 0.6, -1.3, -0.9, -2.7)) psi <- matrix(nrow=d, ncol=4, c(0.3, -0.7, -0.8, 0, 0.3, -0.7, 0.2, 0.9)) nu <- c(100,25,8,5) p <- c(0.15, 0.05, 0.5, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) w <- rep(1,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rowMeans(z) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() #+ ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) p #pdf(height=8.5, width=8.5) c2plot <- factor(c) levels(c2plot) <- c("4", "1", "3", "2") pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot))) + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster)) #+ ggtitle("Slightly overlapping skew-normal simulation\n") + xlab("D1") + ylab("D2") + theme_bw() + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22)))) pp #pdf(height=7, width=7.5) ## alpha priors plots ##################### prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="red") + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) ) p if(interactive()){ # Gibbs sampler for Dirichlet Process Mixtures ############################################## MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=1500, doPlot=TRUE, nbclust_init=30, plotevery=100, diagVar=FALSE) s <- summary(MCMCsample_st, burnin = 1000, thin=10, lossFn = "Binder") print(s) plot(s, hm=TRUE) #pdf(height=8.5, width=10.5) #png(height=700, width=720) plot_ConvDPM(MCMCsample_st, from=2) #cluster_est_binder(MCMCsample_st$mcmc_partitions[900:1000]) }
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
DPMGibbsSkewT_parallel( Ncpus, type_connec, z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = FALSE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = FALSE, monitorfile = "", ... )
DPMGibbsSkewT_parallel( Ncpus, type_connec, z, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = FALSE, nbclust_init = 30, plotevery = N/10, diagVar = TRUE, use_variance_hyperprior = TRUE, verbose = FALSE, monitorfile = "", ... )
Ncpus |
the number of processors available |
type_connec |
The type of connection between the processors. Supported
cluster types are |
z |
data matrix |
hyperG0 |
prior mixing distribution. |
a |
shape hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
b |
scale hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
N |
number of MCMC iterations. |
doPlot |
logical flag indicating whether to plot MCMC iteration or not.
Default to |
nbclust_init |
number of clusters at initialization. Default to 30 (or less if there are less than 30 observations). |
plotevery |
an integer indicating the interval between plotted iterations when |
diagVar |
logical flag indicating whether the variance of each cluster is
estimated as a diagonal matrix, or as a full matrix.
Default is |
use_variance_hyperprior |
logical flag indicating whether a hyperprior is added
for the variance parameter. Default is |
verbose |
logical flag indicating whether partition info is written in the console at each MCMC iteration. |
monitorfile |
a writable connections or a character string naming a file to write into,
to monitor the progress of the analysis.
Default is |
... |
additional arguments to be passed to |
a object of class DPMclust
with the following attributes:
mcmc_partitions: |
a list of length |
alpha: |
a vector of length |
U_SS_list: |
a list of length |
weights_list: |
a list of length |
logposterior_list: |
a list of length |
data: |
the data matrix |
nb_mcmcit: |
the number of MCMC iterations |
clust_distrib: |
the parametric distribution of the mixture component - |
hyperG0: |
the prior on the cluster location |
Boris Hejblum
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209> <arXiv: 1702.04407> https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
rm(list=ls()) #Number of data n <- 2000 set.seed(123) #set.seed(4321) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2)) psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- (xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rep(0,d) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d + 1 hyperG0[["lambda"]] <- diag(d) # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # do some plots doPlot <- TRUE nbclust_init <- 30 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) p c2plot <- factor(c) levels(c2plot) <- c("3", "2", "4", "1") pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot))) + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster)) + ggtitle("Slightly overlapping skew-normal simulation\n") + xlab("D1") + ylab("D2") + theme_bw() + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22)))) pp ## alpha priors plots ##################### prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="red") + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) ) p if(interactive()){ # Gibbs sampler for Dirichlet Process Mixtures ############################################## MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000, doPlot, nbclust_init, plotevery=100, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) s <- summary(MCMCsample_st, burnin = 350) print(s) plot(s) plot_ConvDPM(MCMCsample_st, from=2) cluster_est_binder(MCMCsample_st$mcmc_partitions[1500:2000]) }
rm(list=ls()) #Number of data n <- 2000 set.seed(123) #set.seed(4321) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2)) psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8)) p <- c(0.2, 0.1, 0.4, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) z[,k] <- (xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rep(0,d) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d + 1 hyperG0[["lambda"]] <- diag(d) # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # do some plots doPlot <- TRUE nbclust_init <- 30 ## Data ######## library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) p c2plot <- factor(c) levels(c2plot) <- c("3", "2", "4", "1") pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot))) + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster)) + ggtitle("Slightly overlapping skew-normal simulation\n") + xlab("D1") + ylab("D2") + theme_bw() + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22)))) pp ## alpha priors plots ##################### prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="red") + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) ) p if(interactive()){ # Gibbs sampler for Dirichlet Process Mixtures ############################################## MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000, doPlot, nbclust_init, plotevery=100, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) s <- summary(MCMCsample_st, burnin = 350) print(s) plot(s) plot_ConvDPM(MCMCsample_st, from=2) cluster_est_binder(MCMCsample_st$mcmc_partitions[1500:2000]) }
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
DPMGibbsSkewT_SeqPrior( z, prior_inform, hyperG0, N, nbclust_init, add.vagueprior = TRUE, weightnoninfo = NULL, doPlot = TRUE, plotevery = N/10, diagVar = TRUE, verbose = TRUE, ... )
DPMGibbsSkewT_SeqPrior( z, prior_inform, hyperG0, N, nbclust_init, add.vagueprior = TRUE, weightnoninfo = NULL, doPlot = TRUE, plotevery = N/10, diagVar = TRUE, verbose = TRUE, ... )
z |
data matrix |
prior_inform |
an informative prior such as the approximation computed by |
hyperG0 |
prior mixing distribution. |
N |
number of MCMC iterations. |
nbclust_init |
number of clusters at initialization. Default to 30 (or less if there are less than 30 observations). |
add.vagueprior |
logical flag indicating whether a non informative component should
be added to the informative prior. Default is |
weightnoninfo |
a real between 0 and 1 giving the weights of the non informative component in the prior. |
doPlot |
logical flag indicating whether to plot MCMC iteration or not.
Default to |
plotevery |
an integer indicating the interval between plotted iterations when |
diagVar |
logical flag indicating whether the variance of each cluster is
estimated as a diagonal matrix, or as a full matrix.
Default is |
verbose |
logical flag indicating whether partition info is written in the console at each MCMC iteration. |
... |
additional arguments to be passed to |
a object of class DPMclust
with the following attributes:
mcmc_partitions: |
a list of length |
alpha: |
a vector of length |
U_SS_list: |
a list of length |
weights_list: |
a list of length |
logposterior_list: |
a list of length |
data: |
the data matrix |
nb_mcmcit: |
the number of MCMC iterations |
clust_distrib: |
the parametric distribution of the mixture component - |
hyperG0: |
the prior on the cluster location |
Boris Hejblum
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209> <arXiv: 1702.04407> https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
rm(list=ls()) #Number of data n <- 2000 set.seed(123) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2)) psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8)) nu <- c(100,15,8,5) p <- c(0.15, 0.05, 0.5, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) w <- rep(1,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rowMeans(z) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # do some plots nbclust_init <- 30 ## Plot Data library(ggplot2) q <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) q if(interactive()){ MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000, doPlot=TRUE, plotevery=250, nbclust_init, diagVar=FALSE, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45"))))) s <- summary(MCMCsample_st, burnin = 1500, thin=2, posterior_approx=TRUE) F <- FmeasureC(pred=s$point_estim$c_est, ref=c) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) cat(k, "/", n, " observations simulated\n", sep="") } MCMCsample_st2 <- DPMGibbsSkewT_SeqPrior(z, prior=s$param_posterior, hyperG0, N=2000, doPlot=TRUE, plotevery=100, nbclust_init, diagVar=FALSE, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45"))))) s2 <- summary(MCMCsample_st2, burnin = 1500, thin=5) F2 <- FmeasureC(pred=s2$point_estim$c_est, ref=c) # MCMCsample_st2_par <- DPMGibbsSkewT_SeqPrior_parallel(Ncpus= 2, type_connec="SOCK", # z, prior_inform=s$param_posterior, # hyperG0, N=2000, # doPlot=TRUE, plotevery=50, # nbclust_init, diagVar=FALSE, # gg.add=list(theme_bw(), # guides(shape=guide_legend(override.aes = list(fill="grey45")))) }
rm(list=ls()) #Number of data n <- 2000 set.seed(123) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2)) psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8)) nu <- c(100,15,8,5) p <- c(0.15, 0.05, 0.5, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) w <- rep(1,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rowMeans(z) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # do some plots nbclust_init <- 30 ## Plot Data library(ggplot2) q <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) q if(interactive()){ MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000, doPlot=TRUE, plotevery=250, nbclust_init, diagVar=FALSE, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45"))))) s <- summary(MCMCsample_st, burnin = 1500, thin=2, posterior_approx=TRUE) F <- FmeasureC(pred=s$point_estim$c_est, ref=c) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) cat(k, "/", n, " observations simulated\n", sep="") } MCMCsample_st2 <- DPMGibbsSkewT_SeqPrior(z, prior=s$param_posterior, hyperG0, N=2000, doPlot=TRUE, plotevery=100, nbclust_init, diagVar=FALSE, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45"))))) s2 <- summary(MCMCsample_st2, burnin = 1500, thin=5) F2 <- FmeasureC(pred=s2$point_estim$c_est, ref=c) # MCMCsample_st2_par <- DPMGibbsSkewT_SeqPrior_parallel(Ncpus= 2, type_connec="SOCK", # z, prior_inform=s$param_posterior, # hyperG0, N=2000, # doPlot=TRUE, plotevery=50, # nbclust_init, diagVar=FALSE, # gg.add=list(theme_bw(), # guides(shape=guide_legend(override.aes = list(fill="grey45")))) }
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
DPMGibbsSkewT_SeqPrior_parallel( Ncpus, type_connec, z, prior_inform, hyperG0, N, nbclust_init, add.vagueprior = TRUE, weightnoninfo = NULL, doPlot = FALSE, plotevery = N/10, diagVar = TRUE, verbose = TRUE, monitorfile = "", ... )
DPMGibbsSkewT_SeqPrior_parallel( Ncpus, type_connec, z, prior_inform, hyperG0, N, nbclust_init, add.vagueprior = TRUE, weightnoninfo = NULL, doPlot = FALSE, plotevery = N/10, diagVar = TRUE, verbose = TRUE, monitorfile = "", ... )
Ncpus |
the number of processors available |
type_connec |
The type of connection between the processors. Supported
cluster types are |
z |
data matrix |
prior_inform |
an informative prior such as the approximation computed by |
hyperG0 |
prior mixing distribution. |
N |
number of MCMC iterations. |
nbclust_init |
number of clusters at initialization. Default to 30 (or less if there are less than 30 observations). |
add.vagueprior |
logical flag indicating whether a non informative component should
be added to the informative prior. Default is |
weightnoninfo |
a real between 0 and 1 giving the weights of the non informative component in the prior. |
doPlot |
logical flag indicating whether to plot MCMC iteration or not.
Default to |
plotevery |
an integer indicating the interval between plotted iterations when |
diagVar |
logical flag indicating whether the variance of each cluster is
estimated as a diagonal matrix, or as a full matrix.
Default is |
verbose |
logical flag indicating whether partition info is written in the console at each MCMC iteration. |
monitorfile |
a writable connections or a character string naming a file to write into,
to monitor the progress of the analysis.
Default is |
... |
additional arguments to be passed to |
a object of class DPMclust
with the following attributes:
mcmc_partitions: |
a list of length |
alpha: |
a vector of length |
U_SS_list: |
a list of length |
weights_list: |
a list of length |
logposterior_list: |
a list of length |
data: |
the data matrix |
nb_mcmcit: |
the number of MCMC iterations |
clust_distrib: |
the parametric distribution of the mixture component - |
hyperG0: |
the prior on the cluster location |
Boris Hejblum
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209> <arXiv: 1702.04407> https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
rm(list=ls()) #Number of data n <- 2000 set.seed(123) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) #xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3)) #psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8)) xi <- matrix(nrow=d, ncol=ncl, c(-0.2, 0.5, 2.4, 0.4, 0.6, -1.3, -0.9, -2.7)) psi <- matrix(nrow=d, ncol=4, c(0.3, -0.7, -0.8, 0, 0.3, -0.7, 0.2, 0.9)) nu <- c(100,15,8,5) p <- c(0.15, 0.05, 0.5, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) w <- rep(1,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rowMeans(z) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # do some plots nbclust_init <- 30 ## Plot Data library(ggplot2) q <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) q if(interactive()){ MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000, doPlot=TRUE, plotevery=250, nbclust_init, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) s <- summary(MCMCsample_st, burnin = 1500, thin=5, posterior_approx=TRUE) F <- FmeasureC(pred=s$point_estim$c_est, ref=c) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } MCMCsample_st2 <- DPMGibbsSkewT_SeqPrior_parallel(Ncpus=2, type_connec="SOCK", z, prior_inform=s$param_posterior, hyperG0, N=3000, doPlot=TRUE, plotevery=100, nbclust_init, diagVar=FALSE, verbose=FALSE, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45"))))) s2 <- summary(MCMCsample_st2, burnin = 2000, thin=5) F2 <- FmeasureC(pred=s2$point_estim$c_est, ref=c) }
rm(list=ls()) #Number of data n <- 2000 set.seed(123) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) #xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3)) #psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8)) xi <- matrix(nrow=d, ncol=ncl, c(-0.2, 0.5, 2.4, 0.4, 0.6, -1.3, -0.9, -2.7)) psi <- matrix(nrow=d, ncol=4, c(0.3, -0.7, -0.8, 0, 0.3, -0.7, 0.2, 0.9)) nu <- c(100,15,8,5) p <- c(0.15, 0.05, 0.5, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) w <- rep(1,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rowMeans(z) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 # do some plots nbclust_init <- 30 ## Plot Data library(ggplot2) q <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Simple example in 2d data") +xlab("D1") +ylab("D2") +theme_bw()) q if(interactive()){ MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000, doPlot=TRUE, plotevery=250, nbclust_init, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45")))), diagVar=FALSE) s <- summary(MCMCsample_st, burnin = 1500, thin=5, posterior_approx=TRUE) F <- FmeasureC(pred=s$point_estim$c_est, ref=c) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } MCMCsample_st2 <- DPMGibbsSkewT_SeqPrior_parallel(Ncpus=2, type_connec="SOCK", z, prior_inform=s$param_posterior, hyperG0, N=3000, doPlot=TRUE, plotevery=100, nbclust_init, diagVar=FALSE, verbose=FALSE, gg.add=list(theme_bw(), guides(shape=guide_legend(override.aes = list(fill="grey45"))))) s2 <- summary(MCMCsample_st2, burnin = 2000, thin=5) F2 <- FmeasureC(pred=s2$point_estim$c_est, ref=c) }
Partially collapse slice Gibbs sampling for Dirichlet process mixture of multivariate normal, skew normal or skew t distributions.
DPMpost( data, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE, nbclust_init = 30, plotevery = floor(N/10), diagVar = TRUE, verbose = TRUE, distrib = c("gaussian", "skewnorm", "skewt"), ncores = 1, type_connec = "SOCK", informPrior = NULL, ... )
DPMpost( data, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE, nbclust_init = 30, plotevery = floor(N/10), diagVar = TRUE, verbose = TRUE, distrib = c("gaussian", "skewnorm", "skewt"), ncores = 1, type_connec = "SOCK", informPrior = NULL, ... )
data |
data matrix |
hyperG0 |
prior mixing distribution. |
a |
shape hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
b |
scale hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process. Default is |
N |
number of MCMC iterations. |
doPlot |
logical flag indicating whether to plot MCMC iteration or not.
Default to |
nbclust_init |
number of clusters at initialization. Default to 30 (or less if there are less than 30 observations). |
plotevery |
an integer indicating the interval between plotted iterations when |
diagVar |
logical flag indicating whether the variance of each cluster is
estimated as a diagonal matrix, or as a full matrix.
Default is |
verbose |
logical flag indicating whether partition info is written in the console at each MCMC iteration. |
distrib |
the distribution used for the clustering. Current possibilities are
|
ncores |
number of cores to use. |
type_connec |
The type of connection between the processors. Supported
cluster types are |
informPrior |
an optional informative prior such as the approximation computed
by |
... |
additional arguments to be passed to |
This function is a wrapper around the following functions:
DPMGibbsN
, DPMGibbsN_parallel
,
DPMGibbsN_SeqPrior
, DPMGibbsSkewN
, DPMGibbsSkewN_parallel
,
DPMGibbsSkewT
, DPMGibbsSkewT_parallel
,
DPMGibbsSkewT_SeqPrior
, DPMGibbsSkewT_SeqPrior_parallel
.
a object of class DPMclust
with the following attributes:
mcmc_partitions: |
a list of length |
alpha: |
a vector of length |
U_SS_list: |
a list of length |
weights_list: |
a list of length |
logposterior_list: |
a list of length |
data: |
the data matrix |
nb_mcmcit: |
the number of MCMC iterations |
clust_distrib: |
the parametric distribution of the mixture component |
hyperG0: |
the prior on the cluster location |
Boris Hejblum
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209> <arXiv: 1702.04407> https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
#rm(list=ls()) set.seed(123) # Exemple in 2 dimensions with skew-t distributions # Generate data: n <- 2000 # number of data points d <- 2 # dimensions ncl <- 4 # number of true clusters sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3)) psi <- matrix(nrow=d, ncol=4, c(0.3, -0.7, -0.8, 0, 0.3, -0.7, 0.2, 0.9)) nu <- c(100,25,8,5) proba <- c(0.15, 0.05, 0.5, 0.3) # cluster frequencies sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.2)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) w <- rep(1,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=proba)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) } # Define hyperprior hyperG0 <- list() hyperG0[["b_xi"]] <- rowMeans(z) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3 if(interactive()){ # Plot data cytoScatter(z) # Estimate posterior MCMCsample_st <- DPMpost(data=z, hyperG0=hyperG0, N=2000, distrib="skewt", gg.add=list(ggplot2::theme_bw(), ggplot2::guides(shape=ggplot2::guide_legend(override.aes = list(fill="grey45")))) ) s <- summary(MCMCsample_st, burnin = 1600, thin=5, lossFn = "Binder") s plot(s) #plot(s, hm=TRUE) # this can take a few sec... # more data plotting: library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Unsupervised data") + xlab("D1") + ylab("D2") + theme_bw() ) p c2plot <- factor(c) levels(c2plot) <- c("4", "1", "3", "2") pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot))) + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster)) + ggtitle("True clusters") + xlab("D1") + ylab("D2") + theme_bw() + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22))) ) pp } # Exemple in 2 dimensions with Gaussian distributions set.seed(1234) # Generate data n <- 2000 # number of data points d <- 2 # dimensions ncl <- 4 # number of true clusters m <- matrix(nrow=2, ncol=4, c(-1, 1, 1.5, 2, 2, -2, -1.5, -2)) # cluster means sdev <- array(dim=c(2, 2, 4)) # cluster standard-deviations sdev[, ,1] <- matrix(nrow=2, ncol=2, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=2, ncol=2, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=2, ncol=2, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) proba <- c(0.15, 0.05, 0.5, 0.3) # cluster frequencies c <- rep(0,n) z <- matrix(0, nrow=2, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=proba)!=0) z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(2, mean = 0, sd = 1), nrow=2, ncol=1) } # Define hyperprior hyperG0 <- list() hyperG0[["mu"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["nu"]] <- d+2 hyperG0[["lambda"]] <- diag(d) if(interactive()){ # Plot data cytoScatter(z) # Estimate posterior MCMCsample_n <- DPMpost(data=z, hyperG0=hyperG0, N=2000, distrib="gaussian", diagVar=FALSE, gg.add=list(ggplot2::theme_bw(), ggplot2::guides(shape=ggplot2::guide_legend(override.aes = list(fill="grey45")))) ) s <- summary(MCMCsample_n, burnin = 1500, thin=5, lossFn = "Binder") s plot(s) #plot(s, hm=TRUE) # this can take a few sec... # more data plotting: library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Unsupervised data") + xlab("D1") + ylab("D2") + theme_bw() ) p c2plot <- factor(c) levels(c2plot) <- c("4", "1", "3", "2") pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot))) + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster)) #+ ggtitle("Slightly overlapping skew-normal simulation\n") + xlab("D1") + ylab("D2") + theme_bw() + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22))) + ggtitle("True clusters") ) pp }
#rm(list=ls()) set.seed(123) # Exemple in 2 dimensions with skew-t distributions # Generate data: n <- 2000 # number of data points d <- 2 # dimensions ncl <- 4 # number of true clusters sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3)) psi <- matrix(nrow=d, ncol=4, c(0.3, -0.7, -0.8, 0, 0.3, -0.7, 0.2, 0.9)) nu <- c(100,25,8,5) proba <- c(0.15, 0.05, 0.5, 0.3) # cluster frequencies sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.2)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) w <- rep(1,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=proba)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) } # Define hyperprior hyperG0 <- list() hyperG0[["b_xi"]] <- rowMeans(z) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3 if(interactive()){ # Plot data cytoScatter(z) # Estimate posterior MCMCsample_st <- DPMpost(data=z, hyperG0=hyperG0, N=2000, distrib="skewt", gg.add=list(ggplot2::theme_bw(), ggplot2::guides(shape=ggplot2::guide_legend(override.aes = list(fill="grey45")))) ) s <- summary(MCMCsample_st, burnin = 1600, thin=5, lossFn = "Binder") s plot(s) #plot(s, hm=TRUE) # this can take a few sec... # more data plotting: library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Unsupervised data") + xlab("D1") + ylab("D2") + theme_bw() ) p c2plot <- factor(c) levels(c2plot) <- c("4", "1", "3", "2") pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot))) + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster)) + ggtitle("True clusters") + xlab("D1") + ylab("D2") + theme_bw() + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22))) ) pp } # Exemple in 2 dimensions with Gaussian distributions set.seed(1234) # Generate data n <- 2000 # number of data points d <- 2 # dimensions ncl <- 4 # number of true clusters m <- matrix(nrow=2, ncol=4, c(-1, 1, 1.5, 2, 2, -2, -1.5, -2)) # cluster means sdev <- array(dim=c(2, 2, 4)) # cluster standard-deviations sdev[, ,1] <- matrix(nrow=2, ncol=2, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=2, ncol=2, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=2, ncol=2, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) proba <- c(0.15, 0.05, 0.5, 0.3) # cluster frequencies c <- rep(0,n) z <- matrix(0, nrow=2, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=proba)!=0) z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(2, mean = 0, sd = 1), nrow=2, ncol=1) } # Define hyperprior hyperG0 <- list() hyperG0[["mu"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["nu"]] <- d+2 hyperG0[["lambda"]] <- diag(d) if(interactive()){ # Plot data cytoScatter(z) # Estimate posterior MCMCsample_n <- DPMpost(data=z, hyperG0=hyperG0, N=2000, distrib="gaussian", diagVar=FALSE, gg.add=list(ggplot2::theme_bw(), ggplot2::guides(shape=ggplot2::guide_legend(override.aes = list(fill="grey45")))) ) s <- summary(MCMCsample_n, burnin = 1500, thin=5, lossFn = "Binder") s plot(s) #plot(s, hm=TRUE) # this can take a few sec... # more data plotting: library(ggplot2) p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y)) + geom_point() + ggtitle("Unsupervised data") + xlab("D1") + ylab("D2") + theme_bw() ) p c2plot <- factor(c) levels(c2plot) <- c("4", "1", "3", "2") pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot))) + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster)) #+ ggtitle("Slightly overlapping skew-normal simulation\n") + xlab("D1") + ylab("D2") + theme_bw() + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22))) + ggtitle("True clusters") ) pp }
Evaluate the loss of a point estimate of the partition compared to a gold standard according to a given loss function
evalClustLoss(c, gs, lossFn = "F-measure", a = 1, b = 1)
evalClustLoss(c, gs, lossFn = "F-measure", a = 1, b = 1)
c |
vector of length |
gs |
vector of length |
lossFn |
character string specifying the loss function to be used. Either "F-measure" or "Binder" (see Details). Default is "F-measure". |
a |
only relevant if |
b |
only relevant if |
The cost of a point estimate partition is calculated using either a pairwise coincidence loss function (Binder), or 1-Fmeasure (F-measure).
the cost of the point estimate c
in regard of the
gold standard gs
for a given loss function.
Boris Hejblum
J.W. Lau & P.J. Green. Bayesian Model-Based Clustering Procedures, Journal of Computational and Graphical Statistics, 16(3): 526-558, 2007.
D. B. Dahl. Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, in Bayesian Inference for Gene Expression and Proteomics, K.-A. Do, P. Muller, M. Vannucci (Eds.), Cambridge University Press, 2006.
similarityMat
, cluster_est_binder
A limited version of F-measure that only takes into account small clusters
Flimited(n_small_clst, pred, ref)
Flimited(n_small_clst, pred, ref)
n_small_clst |
an integer for limit size of the small cluster |
pred |
vector of a predicted partition |
ref |
vector of a reference partition |
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209> <arXiv: 1702.04407> https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
pred <- c(rep(1, 5),rep(2, 8),rep(3,10)) ref <- c(rep(1, 5),rep(c(2,3), 4),rep(c(3,2),5)) FmeasureC(pred, ref) Flimited(6, pred, ref)
pred <- c(rep(1, 5),rep(2, 8),rep(3,10)) ref <- c(rep(1, 5),rep(c(2,3), 4),rep(c(3,2),5)) FmeasureC(pred, ref) Flimited(6, pred, ref)
C++ implementation of multiple cost computations with the F-measure as the loss function using the Armadillo library
Fmeasure_costC(c)
Fmeasure_costC(c)
c |
a matrix where each column is one MCMC partition |
a list with the following elements:
Fmeas: |
TODO |
cost: |
TODO |
library(NPflow) c <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3),c(2,2,1,1,1,1)) #Fmeasure_costC(sapply(c, "[")) if(interactive()){ c2 <- list() for(i in 1:100){ c2 <- c(c2, list(rmultinom(n=1, size=2000, prob=rexp(n=2000)))) } Fmeasure_costC(sapply(c2, "[")) }
library(NPflow) c <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3),c(2,2,1,1,1,1)) #Fmeasure_costC(sapply(c, "[")) if(interactive()){ c2 <- list() for(i in 1:100){ c2 <- c(c2, list(rmultinom(n=1, size=2000, prob=rexp(n=2000)))) } Fmeasure_costC(sapply(c2, "[")) }
C++ implementation of the F-measure computation
FmeasureC(pred, ref)
FmeasureC(pred, ref)
pred |
vector of a predicted partition |
ref |
vector of a reference partition |
pred <- c(1,1,2,3,2,3) ref <- c(2,2,1,1,1,3) FmeasureC(pred, ref)
pred <- c(1,1,2,3,2,3) ref <- c(2,2,1,1,1,3) FmeasureC(pred, ref)
Aghaeepour in FlowCAP 1 ignore the reference class labeled "0"
FmeasureC_no0(pred, ref)
FmeasureC_no0(pred, ref)
pred |
vector of a predicted partition |
ref |
vector of a reference partition |
N Aghaeepour, G Finak, H Hoos, TR Mosmann, RR Brinkman, R Gottardo, RH Scheuermann, Critical assessment of automated flow cytometry data analysis techniques, Nature Methods, 10(3):228-38, 2013.
library(NPflow) pred <- c(1,1,2,3,2,3) ref <- c(2,2,0,0,0,3) FmeasureC(pred, ref) FmeasureC_no0(pred, ref)
library(NPflow) pred <- c(1,1,2,3,2,3) ref <- c(2,2,0,0,0,3) FmeasureC(pred, ref) FmeasureC_no0(pred, ref)
Multivariate log gamma function
lgamma_mv(x, p)
lgamma_mv(x, p)
x |
strictly positive real number |
p |
integer |
Maximum A Posteriori (MAP) estimation of mixture of Normal inverse Wishart distributed observations with an EM algorithm
MAP_sNiW_mmEM( xi_list, psi_list, S_list, hyperG0, init = NULL, K, maxit = 100, tol = 0.1, doPlot = TRUE, verbose = TRUE ) MAP_sNiW_mmEM_weighted( xi_list, psi_list, S_list, obsweight_list, hyperG0, K, maxit = 100, tol = 0.1, doPlot = TRUE, verbose = TRUE ) MAP_sNiW_mmEM_vague( xi_list, psi_list, S_list, hyperG0, K = 10, maxit = 100, tol = 0.1, doPlot = TRUE, verbose = TRUE )
MAP_sNiW_mmEM( xi_list, psi_list, S_list, hyperG0, init = NULL, K, maxit = 100, tol = 0.1, doPlot = TRUE, verbose = TRUE ) MAP_sNiW_mmEM_weighted( xi_list, psi_list, S_list, obsweight_list, hyperG0, K, maxit = 100, tol = 0.1, doPlot = TRUE, verbose = TRUE ) MAP_sNiW_mmEM_vague( xi_list, psi_list, S_list, hyperG0, K = 10, maxit = 100, tol = 0.1, doPlot = TRUE, verbose = TRUE )
xi_list |
a list of length |
psi_list |
a list of length |
S_list |
a list of length |
hyperG0 |
prior mixing distribution used if |
init |
a list for initializing the algorithm with the following elements: |
K |
integer giving the number of mixture components. |
maxit |
integer giving the maximum number of iteration for the EM algorithm.
Default is |
tol |
real number giving the tolerance for the stopping of the EM algorithm.
Default is |
doPlot |
a logical flag indicating whether the algorithm progression should be plotted.
Default is |
verbose |
logical flag indicating whether plot should be drawn. Default is |
obsweight_list |
a list of length |
MAP_sNiW_mmEM
provides an estimation for the MAP of mixtures of
Normal inverse Wishart distributed observations. MAP_sNiW_mmEM_vague
provides
an estimates incorporating a vague component in the mixture.
MAP_sNiW_mmEM_weighted
provides a weighted version of the algorithm.
Boris Hejblum, Chariff Alkhassim
set.seed(1234) hyperG0 <- list() hyperG0$b_xi <- c(0.3, -1.5) hyperG0$b_psi <- c(0, 0) hyperG0$kappa <- 0.001 hyperG0$D_xi <- 100 hyperG0$D_psi <- 100 hyperG0$nu <- 20 hyperG0$lambda <- diag(c(0.25,0.35)) hyperG0 <- list() hyperG0$b_xi <- c(1, -1.5) hyperG0$b_psi <- c(0, 0) hyperG0$kappa <- 0.1 hyperG0$D_xi <- 1 hyperG0$D_psi <- 1 hyperG0$nu <- 2 hyperG0$lambda <- diag(c(0.25,0.35)) xi_list <- list() psi_list <- list() S_list <- list() w_list <- list() for(k in 1:200){ NNiW <- rNNiW(hyperG0, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] w_list [[k]] <- 0.75 } hyperG02 <- list() hyperG02$b_xi <- c(-1, 2) hyperG02$b_psi <- c(-0.1, 0.5) hyperG02$kappa <- 0.1 hyperG02$D_xi <- 1 hyperG02$D_psi <- 1 hyperG02$nu <- 4 hyperG02$lambda <- 0.5*diag(2) for(k in 201:400){ NNiW <- rNNiW(hyperG02, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] w_list [[k]] <- 0.25 } map <- MAP_sNiW_mmEM(xi_list, psi_list, S_list, hyperG0, K=2, tol=0.1)
set.seed(1234) hyperG0 <- list() hyperG0$b_xi <- c(0.3, -1.5) hyperG0$b_psi <- c(0, 0) hyperG0$kappa <- 0.001 hyperG0$D_xi <- 100 hyperG0$D_psi <- 100 hyperG0$nu <- 20 hyperG0$lambda <- diag(c(0.25,0.35)) hyperG0 <- list() hyperG0$b_xi <- c(1, -1.5) hyperG0$b_psi <- c(0, 0) hyperG0$kappa <- 0.1 hyperG0$D_xi <- 1 hyperG0$D_psi <- 1 hyperG0$nu <- 2 hyperG0$lambda <- diag(c(0.25,0.35)) xi_list <- list() psi_list <- list() S_list <- list() w_list <- list() for(k in 1:200){ NNiW <- rNNiW(hyperG0, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] w_list [[k]] <- 0.75 } hyperG02 <- list() hyperG02$b_xi <- c(-1, 2) hyperG02$b_psi <- c(-0.1, 0.5) hyperG02$kappa <- 0.1 hyperG02$D_xi <- 1 hyperG02$D_psi <- 1 hyperG02$nu <- 4 hyperG02$lambda <- 0.5*diag(2) for(k in 201:400){ NNiW <- rNNiW(hyperG02, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] w_list [[k]] <- 0.25 } map <- MAP_sNiW_mmEM(xi_list, psi_list, S_list, hyperG0, K=2, tol=0.1)
Maximum likelihood estimation of Gamma distributed observations distribution parameters
MLE_gamma(g)
MLE_gamma(g)
g |
a list of Gamma distributed observation. |
g_list <- list() for(i in 1:1000){ g_list <- c(g_list, rgamma(1, shape=100, rate=5)) } mle <- MLE_gamma(g_list) mle
g_list <- list() for(i in 1:1000){ g_list <- c(g_list, rgamma(1, shape=100, rate=5)) } mle <- MLE_gamma(g_list) mle
Maximum likelihood estimation of mixture of Normal inverse Wishart distributed observations with an EM algorithm
MLE_NiW_mmEM( mu_list, S_list, hyperG0, K, maxit = 100, tol = 0.1, doPlot = TRUE )
MLE_NiW_mmEM( mu_list, S_list, hyperG0, K, maxit = 100, tol = 0.1, doPlot = TRUE )
mu_list |
a list of length |
S_list |
a list of length |
hyperG0 |
prior mixing distribution used for randomly initializing the algorithm. |
K |
integer giving the number of mixture components. |
maxit |
integer giving the maximum number of iteration for the EM algorithm.
Default is |
tol |
real number giving the tolerance for the stopping of the EM algorithm.
Default is |
doPlot |
a logical flag indicating whether the algorithm progression should be plotted. Default is |
set.seed(123) U_mu <- list() U_Sigma <- list() U_nu<-list() U_kappa<-list() d <- 2 hyperG0 <- list() hyperG0[["mu"]] <- rep(1,d) hyperG0[["kappa"]] <- 0.01 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(d) for(k in 1:200){ NiW <- rNiW(hyperG0, diagVar=FALSE) U_mu[[k]] <-NiW[["mu"]] U_Sigma[[k]] <-NiW[["S"]] } hyperG02 <- list() hyperG02[["mu"]] <- rep(2,d) hyperG02[["kappa"]] <- 1 hyperG02[["nu"]] <- d+10 hyperG02[["lambda"]] <- diag(d)/10 for(k in 201:400){ NiW <- rNiW(hyperG02, diagVar=FALSE) U_mu[[k]] <-NiW[["mu"]] U_Sigma[[k]] <-NiW[["S"]] } mle <- MLE_NiW_mmEM( U_mu, U_Sigma, hyperG0, K=2) hyperG0[["mu"]] hyperG02[["mu"]] mle$U_mu hyperG0[["lambda"]] hyperG02[["lambda"]] mle$U_lambda hyperG0[["nu"]] hyperG02[["nu"]] mle$U_nu hyperG0[["kappa"]] hyperG02[["kappa"]] mle$U_kappa
set.seed(123) U_mu <- list() U_Sigma <- list() U_nu<-list() U_kappa<-list() d <- 2 hyperG0 <- list() hyperG0[["mu"]] <- rep(1,d) hyperG0[["kappa"]] <- 0.01 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(d) for(k in 1:200){ NiW <- rNiW(hyperG0, diagVar=FALSE) U_mu[[k]] <-NiW[["mu"]] U_Sigma[[k]] <-NiW[["S"]] } hyperG02 <- list() hyperG02[["mu"]] <- rep(2,d) hyperG02[["kappa"]] <- 1 hyperG02[["nu"]] <- d+10 hyperG02[["lambda"]] <- diag(d)/10 for(k in 201:400){ NiW <- rNiW(hyperG02, diagVar=FALSE) U_mu[[k]] <-NiW[["mu"]] U_Sigma[[k]] <-NiW[["S"]] } mle <- MLE_NiW_mmEM( U_mu, U_Sigma, hyperG0, K=2) hyperG0[["mu"]] hyperG02[["mu"]] mle$U_mu hyperG0[["lambda"]] hyperG02[["lambda"]] mle$U_lambda hyperG0[["nu"]] hyperG02[["nu"]] mle$U_nu hyperG0[["kappa"]] hyperG02[["kappa"]] mle$U_kappa
Maximum likelihood estimation of Normal inverse Wishart distributed observations
MLE_sNiW(xi_list, psi_list, S_list, doPlot = TRUE)
MLE_sNiW(xi_list, psi_list, S_list, doPlot = TRUE)
xi_list |
a list of length |
psi_list |
a list of length |
S_list |
a list of length |
doPlot |
a logical flag indicating whether the algorithm progression should be plotted.
Default is |
Boris Hejblum, Chariff Alkhassim
hyperG0 <- list() hyperG0$b_xi <- c(0.3, -1.5) hyperG0$b_psi <- c(0, 0) hyperG0$kappa <- 0.001 hyperG0$D_xi <- 100 hyperG0$D_psi <- 100 hyperG0$nu <- 35 hyperG0$lambda <- diag(c(0.25,0.35)) xi_list <- list() psi_list <- list() S_list <- list() for(k in 1:1000){ NNiW <- rNNiW(hyperG0, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] } mle <- MLE_sNiW(xi_list, psi_list, S_list) mle
hyperG0 <- list() hyperG0$b_xi <- c(0.3, -1.5) hyperG0$b_psi <- c(0, 0) hyperG0$kappa <- 0.001 hyperG0$D_xi <- 100 hyperG0$D_psi <- 100 hyperG0$nu <- 35 hyperG0$lambda <- diag(c(0.25,0.35)) xi_list <- list() psi_list <- list() S_list <- list() for(k in 1:1000){ NNiW <- rNNiW(hyperG0, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] } mle <- MLE_sNiW(xi_list, psi_list, S_list) mle
Maximum likelihood estimation of mixture of Normal inverse Wishart distributed observations with an EM algorithm
MLE_sNiW_mmEM( xi_list, psi_list, S_list, hyperG0, K, init = NULL, maxit = 100, tol = 0.1, doPlot = TRUE, verbose = TRUE )
MLE_sNiW_mmEM( xi_list, psi_list, S_list, hyperG0, K, init = NULL, maxit = 100, tol = 0.1, doPlot = TRUE, verbose = TRUE )
xi_list |
a list of length |
psi_list |
a list of length |
S_list |
a list of length |
hyperG0 |
prior mixing distribution used if |
K |
integer giving the number of mixture components. |
init |
a list for initializing the algorithm with the following elements: |
maxit |
integer giving the maximum number of iteration for the EM algorithm.
Default is |
tol |
real number giving the tolerance for the stopping of the EM algorithm.
Default is |
doPlot |
a logical flag indicating whether the algorithm progression should be plotted. Default is |
verbose |
logical flag indicating whether plot should be drawn. Default is |
Boris Hejblum, Chariff Alkhassim
set.seed(1234) hyperG0 <- list() hyperG0$b_xi <- c(0.3, -1.5) hyperG0$b_psi <- c(0, 0) hyperG0$kappa <- 0.001 hyperG0$D_xi <- 100 hyperG0$D_psi <- 100 hyperG0$nu <- 3 hyperG0$lambda <- diag(c(0.25,0.35)) xi_list <- list() psi_list <- list() S_list <- list() for(k in 1:200){ NNiW <- rNNiW(hyperG0, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] } hyperG02 <- list() hyperG02$b_xi <- c(-1, 2) hyperG02$b_psi <- c(-0.1, 0.5) hyperG02$kappa <- 0.001 hyperG02$D_xi <- 10 hyperG02$D_psi <- 10 hyperG02$nu <- 3 hyperG02$lambda <- 0.5*diag(2) for(k in 201:400){ NNiW <- rNNiW(hyperG02, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] } mle <- MLE_sNiW_mmEM(xi_list, psi_list, S_list, hyperG0, K=2)
set.seed(1234) hyperG0 <- list() hyperG0$b_xi <- c(0.3, -1.5) hyperG0$b_psi <- c(0, 0) hyperG0$kappa <- 0.001 hyperG0$D_xi <- 100 hyperG0$D_psi <- 100 hyperG0$nu <- 3 hyperG0$lambda <- diag(c(0.25,0.35)) xi_list <- list() psi_list <- list() S_list <- list() for(k in 1:200){ NNiW <- rNNiW(hyperG0, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] } hyperG02 <- list() hyperG02$b_xi <- c(-1, 2) hyperG02$b_psi <- c(-0.1, 0.5) hyperG02$kappa <- 0.001 hyperG02$D_xi <- 10 hyperG02$D_psi <- 10 hyperG02$nu <- 3 hyperG02$lambda <- 0.5*diag(2) for(k in 201:400){ NNiW <- rNNiW(hyperG02, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] } mle <- MLE_sNiW_mmEM(xi_list, psi_list, S_list, hyperG0, K=2)
multivariate Normal inverse Wishart probability density function for multiple inputs
mmNiWpdf(mu, Sigma, U_mu0, U_kappa0, U_nu0, U_lambda0, Log = TRUE)
mmNiWpdf(mu, Sigma, U_mu0, U_kappa0, U_nu0, U_lambda0, Log = TRUE)
mu |
data matrix of dimension |
Sigma |
list of length |
U_mu0 |
mean vectors matrix of dimension |
U_kappa0 |
vector of length |
U_nu0 |
vector of length |
U_lambda0 |
list of length |
Log |
logical flag for returning the log of the probability density
function. Defaults is |
matrix of densities of dimension K x n
C++ implementation of multivariate Normal inverse Wishart probability density function for multiple inputs
mmNiWpdfC(Mu, Sigma, U_Mu0, U_Kappa0, U_Nu0, U_Sigma0, Log = TRUE)
mmNiWpdfC(Mu, Sigma, U_Mu0, U_Kappa0, U_Nu0, U_Sigma0, Log = TRUE)
Mu |
data matrix of dimension |
Sigma |
list of length |
U_Mu0 |
mean vectors matrix of dimension |
U_Kappa0 |
vector of length |
U_Nu0 |
vector of length |
U_Sigma0 |
list of length |
Log |
logical flag for returning the log of the probability density
function. Defaults is |
matrix of densities of dimension K x n
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209>. <arXiv: 1702.04407>. https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
Probability density function of structured Normal inverse Wishart (sNiW) for multiple inputs, on the log scale.
mmsNiWlogpdf(U_xi, U_psi, U_Sigma, U_xi0, U_psi0, U_B0, U_Sigma0, U_df0)
mmsNiWlogpdf(U_xi, U_psi, U_Sigma, U_xi0, U_psi0, U_B0, U_Sigma0, U_df0)
U_xi |
a list of length n of observed mean vectors, each of dimension p |
U_psi |
a list of length n of observed skew vectors of dimension p |
U_Sigma |
a list of length n of observed covariance matrices, each of dimension p x p |
U_xi0 |
a list of length K of mean vector parameters for sNiW, each of dimension p |
U_psi0 |
a list of length K of mean vector parameters for sNiW, each of dimension p |
U_B0 |
a list of length K of structuring matrix parameters for sNiW, each of dimension 2 x 2 |
U_Sigma0 |
a list of length K of covariance matrix parameters for sNiW, each of dimension p x p |
U_df0 |
a list of length K of degrees of freedom parameters for sNiW, each of dimension p x p |
hyperG0 <- list() hyperG0$b_xi <- c(-1.6983129, -0.4819131) hyperG0$b_psi <- c(-0.0641866, -0.7606068) hyperG0$kappa <- 0.001 hyperG0$D_xi <- 16.951313 hyperG0$D_psi <- 1.255192 hyperG0$nu <- 27.67656 hyperG0$lambda <- matrix(c(2.3397761, -0.3975259,-0.3975259, 1.9601773), ncol=2) xi_list <- list() psi_list <- list() S_list <- list() for(k in 1:1000){ NNiW <- rNNiW(hyperG0, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] } mmsNiWlogpdf(U_xi=xi_list, U_psi=psi_list, U_Sigma=S_list, U_xi0=list(hyperG0$b_xi), U_psi0=list(hyperG0$b_psi) , U_B0=list(diag(c(hyperG0$D_xi, hyperG0$D_psi))) , U_Sigma0=list(hyperG0$lambda), U_df0=list(hyperG0$nu)) hyperG0 <- list() hyperG0$b_xi <- c(-1.6983129) hyperG0$b_psi <- c(-0.0641866) hyperG0$kappa <- 0.001 hyperG0$D_xi <- 16.951313 hyperG0$D_psi <- 1.255192 hyperG0$nu <- 27.67656 hyperG0$lambda <- matrix(c(2.3397761), ncol=1) #'xi_list <- list() psi_list <- list() S_list <- list() for(k in 1:1000){ NNiW <- rNNiW(hyperG0, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] } mmsNiWlogpdf(U_xi=xi_list, U_psi=psi_list, U_Sigma=S_list, U_xi0=list(hyperG0$b_xi), U_psi0=list(hyperG0$b_psi) , U_B0=list(diag(c(hyperG0$D_xi, hyperG0$D_psi))) , U_Sigma0=list(hyperG0$lambda), U_df0=list(hyperG0$nu))
hyperG0 <- list() hyperG0$b_xi <- c(-1.6983129, -0.4819131) hyperG0$b_psi <- c(-0.0641866, -0.7606068) hyperG0$kappa <- 0.001 hyperG0$D_xi <- 16.951313 hyperG0$D_psi <- 1.255192 hyperG0$nu <- 27.67656 hyperG0$lambda <- matrix(c(2.3397761, -0.3975259,-0.3975259, 1.9601773), ncol=2) xi_list <- list() psi_list <- list() S_list <- list() for(k in 1:1000){ NNiW <- rNNiW(hyperG0, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] } mmsNiWlogpdf(U_xi=xi_list, U_psi=psi_list, U_Sigma=S_list, U_xi0=list(hyperG0$b_xi), U_psi0=list(hyperG0$b_psi) , U_B0=list(diag(c(hyperG0$D_xi, hyperG0$D_psi))) , U_Sigma0=list(hyperG0$lambda), U_df0=list(hyperG0$nu)) hyperG0 <- list() hyperG0$b_xi <- c(-1.6983129) hyperG0$b_psi <- c(-0.0641866) hyperG0$kappa <- 0.001 hyperG0$D_xi <- 16.951313 hyperG0$D_psi <- 1.255192 hyperG0$nu <- 27.67656 hyperG0$lambda <- matrix(c(2.3397761), ncol=1) #'xi_list <- list() psi_list <- list() S_list <- list() for(k in 1:1000){ NNiW <- rNNiW(hyperG0, diagVar=FALSE) xi_list[[k]] <- NNiW[["xi"]] psi_list[[k]] <- NNiW[["psi"]] S_list[[k]] <- NNiW[["S"]] } mmsNiWlogpdf(U_xi=xi_list, U_psi=psi_list, U_Sigma=S_list, U_xi0=list(hyperG0$b_xi), U_psi0=list(hyperG0$b_psi) , U_B0=list(diag(c(hyperG0$D_xi, hyperG0$D_psi))) , U_Sigma0=list(hyperG0$lambda), U_df0=list(hyperG0$nu))
C++ implementation of multivariate structured Normal inverse Wishart probability density function for multiple inputs
mmsNiWpdfC(xi, psi, Sigma, U_xi0, U_psi0, U_B0, U_Sigma0, U_df0, Log = TRUE)
mmsNiWpdfC(xi, psi, Sigma, U_xi0, U_psi0, U_B0, U_Sigma0, U_df0, Log = TRUE)
xi |
data matrix of dimensions |
psi |
data matrix of dimensions |
Sigma |
list of length |
U_xi0 |
mean vectors matrix of dimension |
U_psi0 |
skew parameter vectors matrix of dimension |
U_B0 |
list of length |
U_Sigma0 |
list of length |
U_df0 |
vector of length |
Log |
logical flag for returning the log of the probability density
function. Defaults is |
matrix of densities of dimension K x n
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi: 10.1214/18-AOAS1209>. <arXiv: 1702.04407>. https://arxiv.org/abs/1702.04407 doi:10.1214/18-AOAS1209
C++ implementation of multivariate Normal probability density function for multiple inputs
mmvnpdfC(x, mean, varcovM, Log = TRUE)
mmvnpdfC(x, mean, varcovM, Log = TRUE)
x |
data matrix of dimension |
mean |
mean vectors matrix of dimension |
varcovM |
list of length |
Log |
logical flag for returning the log of the probability density
function. Defaults is |
matrix of densities of dimension K x n
.
if(require(microbenchmark)){ library(microbenchmark) microbenchmark(mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE), mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE), mmvnpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), Log=FALSE), times=1000L) microbenchmark(mvnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=c(-0.2, 0.3), varcovM=matrix(c(2, 0.2, 0.2, 2), ncol=2), Log=FALSE), mvnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=c(-0.2, 0.3), varcovM=matrix(c(2, 0.2, 0.2, 2), ncol=2), Log=FALSE), mmvnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=matrix(c(-0.2, 0.3), nrow=2, ncol=1), varcovM=list(matrix(c(2, 0.2, 0.2, 2), ncol=2)), Log=FALSE), times=1000L) microbenchmark(mvnpdf(x=matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2), mean=list(c(0,0),c(-1,-1), c(1.5,1.5)), varcovM=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE), mmvnpdfC(matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2), mean=matrix(c(0,0,-1,-1, 1.5,1.5), nrow=2, ncol=3), varcovM=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE), times=1000L) }else{ cat("package 'microbenchmark' not available\n") }
if(require(microbenchmark)){ library(microbenchmark) microbenchmark(mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE), mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE), mmvnpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), Log=FALSE), times=1000L) microbenchmark(mvnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=c(-0.2, 0.3), varcovM=matrix(c(2, 0.2, 0.2, 2), ncol=2), Log=FALSE), mvnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=c(-0.2, 0.3), varcovM=matrix(c(2, 0.2, 0.2, 2), ncol=2), Log=FALSE), mmvnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=matrix(c(-0.2, 0.3), nrow=2, ncol=1), varcovM=list(matrix(c(2, 0.2, 0.2, 2), ncol=2)), Log=FALSE), times=1000L) microbenchmark(mvnpdf(x=matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2), mean=list(c(0,0),c(-1,-1), c(1.5,1.5)), varcovM=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE), mmvnpdfC(matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2), mean=matrix(c(0,0,-1,-1, 1.5,1.5), nrow=2, ncol=3), varcovM=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE), times=1000L) }else{ cat("package 'microbenchmark' not available\n") }
C++ implementation of multivariate skew Normal probability density function for multiple inputs
mmvsnpdfC(x, xi, psi, sigma, Log = TRUE)
mmvsnpdfC(x, xi, psi, sigma, Log = TRUE)
x |
data matrix of dimension |
xi |
mean vectors matrix of dimension |
psi |
skew parameter vectors matrix of dimension |
sigma |
list of length K of variance-covariance matrices,
each of dimensions |
Log |
logical flag for returning the log of the probability density
function. Default is |
matrix of densities of dimension K x n
.
Boris Hejblum
mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)), Log=FALSE ) mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)) ) if(require(microbenchmark)){ library(microbenchmark) microbenchmark(mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), Log=FALSE), mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)), Log=FALSE), times=1000L ) microbenchmark(mvsnpdf(x=matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2), xi=list(c(0,0),c(-1,-1), c(1.5,1.5)), psi=list(c(0.1,0.1),c(-0.1,-1), c(0.5,-1.5)), sigma=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE), mmvsnpdfC(matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2), xi=matrix(c(0,0,-1,-1, 1.5,1.5), nrow=2, ncol=3), psi=matrix(c(0.1,0.1,-0.1,-1, 0.5,-1.5), nrow=2, ncol=3), sigma=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE), times=1000L) }else{ cat("package 'microbenchmark' not available\n") }
mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)), Log=FALSE ) mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)) ) if(require(microbenchmark)){ library(microbenchmark) microbenchmark(mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), Log=FALSE), mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)), Log=FALSE), times=1000L ) microbenchmark(mvsnpdf(x=matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2), xi=list(c(0,0),c(-1,-1), c(1.5,1.5)), psi=list(c(0.1,0.1),c(-0.1,-1), c(0.5,-1.5)), sigma=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE), mmvsnpdfC(matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2), xi=matrix(c(0,0,-1,-1, 1.5,1.5), nrow=2, ncol=3), psi=matrix(c(0.1,0.1,-0.1,-1, 0.5,-1.5), nrow=2, ncol=3), sigma=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE), times=1000L) }else{ cat("package 'microbenchmark' not available\n") }
C++ implementation of multivariate Normal probability density function for multiple inputs
mmvstpdfC(x, xi, psi, sigma, df, Log = TRUE)
mmvstpdfC(x, xi, psi, sigma, df, Log = TRUE)
x |
data matrix of dimension |
xi |
mean vectors matrix of dimension |
psi |
skew parameter vectors matrix of dimension |
sigma |
list of length |
df |
vector of length K of degree of freedom parameters. |
Log |
logical flag for returning the log of the probability density
function. Defaults is |
matrix of densities of dimension K x n
.
Boris Hejblum
mmvstpdfC(x = matrix(c(3.399890,-5.936962), ncol=1), xi=matrix(c(0.2528859,-2.4234067)), psi=matrix(c(11.20536,-12.51052), ncol=1), sigma=list(matrix(c(0.2134011, -0.0382573, -0.0382573, 0.2660086), ncol=2)), df=c(7.784106) ) mvstpdf(x = matrix(c(3.399890,-5.936962), ncol=1), xi=c(0.2528859,-2.4234067), psi=c(11.20536,-12.51052), sigma=matrix(c(0.2134011, -0.0382573, -0.0382573, 0.2660086), ncol=2), df=c(7.784106) ) #skew-normal limit mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)) ) mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), df=100000000 ) mmvstpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)), df=100000000 ) #non-skewed limit mmvtpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=matrix(c(0, 0)), varcovM=list(diag(2)), df=10 ) mmvstpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(0, 0),ncol=1), sigma=list(diag(2)), df=10 ) if(require(microbenchmark)){ library(microbenchmark) microbenchmark(mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), df=10), mmvstpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)), df=10), times=1000L) }else{ cat("package 'microbenchmark' not available\n") }
mmvstpdfC(x = matrix(c(3.399890,-5.936962), ncol=1), xi=matrix(c(0.2528859,-2.4234067)), psi=matrix(c(11.20536,-12.51052), ncol=1), sigma=list(matrix(c(0.2134011, -0.0382573, -0.0382573, 0.2660086), ncol=2)), df=c(7.784106) ) mvstpdf(x = matrix(c(3.399890,-5.936962), ncol=1), xi=c(0.2528859,-2.4234067), psi=c(11.20536,-12.51052), sigma=matrix(c(0.2134011, -0.0382573, -0.0382573, 0.2660086), ncol=2), df=c(7.784106) ) #skew-normal limit mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)) ) mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), df=100000000 ) mmvstpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)), df=100000000 ) #non-skewed limit mmvtpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=matrix(c(0, 0)), varcovM=list(diag(2)), df=10 ) mmvstpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(0, 0),ncol=1), sigma=list(diag(2)), df=10 ) if(require(microbenchmark)){ library(microbenchmark) microbenchmark(mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), df=10), mmvstpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)), df=10), times=1000L) }else{ cat("package 'microbenchmark' not available\n") }
C++ implementation of multivariate Normal probability density function for multiple inputs
mmvtpdfC(x, mean, varcovM, df, Log = TRUE)
mmvtpdfC(x, mean, varcovM, df, Log = TRUE)
x |
data matrix of dimension |
mean |
mean vectors matrix of dimension |
varcovM |
list of length |
df |
vector of length |
Log |
logical flag for returning the log of the probability density
function. Defaults is |
matrix of densities of dimension K x n
.
Boris Hejblum
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE) mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10000000, Log=FALSE) mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=10000000, Log=FALSE) mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1)) mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10000000) mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=10000000) mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10) mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=10) if(require(microbenchmark)){ library(microbenchmark) microbenchmark(mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=1, Log=FALSE), mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=c(1), Log=FALSE), times=10000L) }else{ cat("package 'microbenchmark' not available\n") }
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE) mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10000000, Log=FALSE) mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=10000000, Log=FALSE) mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1)) mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10000000) mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=10000000) mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10) mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=10) if(require(microbenchmark)){ library(microbenchmark) microbenchmark(mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=1, Log=FALSE), mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=c(1), Log=FALSE), times=10000L) }else{ cat("package 'microbenchmark' not available\n") }
C++ implementation of multivariate Normal probability density function for multiple inputs
mvnlikC(x, c, clustval, mu, sigma, loglik = TRUE)
mvnlikC(x, c, clustval, mu, sigma, loglik = TRUE)
x |
data matrix of dimension p x n, p being the dimension of the data and n the number of data points |
c |
integer vector of cluster allocations with values from 1 to K |
clustval |
vector of unique values from c in the order corresponding to
the storage of cluster parameters in |
mu |
mean vectors matrix of dimension p x K, K being the number of clusters |
sigma |
list of length |
loglik |
logical flag or returning the log-likelihood instead of the likelihood.
Default is |
a list:
"indiv": |
vector of likelihood of length n; |
"clust": |
vector of likelihood of length K; |
"total": |
total (log)-likelihood; |
Boris Hejblum
multivariate-Normal probability density function
mvnpdf(x, mean, varcovM, Log = TRUE)
mvnpdf(x, mean, varcovM, Log = TRUE)
x |
p x n data matrix with n the number of observations and p the number of dimensions |
mean |
mean vector or list of mean vectors (either a vector, a matrix or a list) |
varcovM |
variance-covariance matrix or list of variance-covariance matrices (either a matrix or a list) |
Log |
logical flag for returning the log of the probability density
function. Defaults is |
Boris P. Hejblum
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE) dnorm(1.96) mvnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=c(0, 0), varcovM=diag(2), Log=FALSE )
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE) dnorm(1.96) mvnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=c(0, 0), varcovM=diag(2), Log=FALSE )
Based on the implementation from Nino Hardt and Dicko Ahmadou https://gallery.rcpp.org/articles/dmvnorm_arma/ (accessed in August 2014)
mvnpdfC(x, mean, varcovM, Log = TRUE)
mvnpdfC(x, mean, varcovM, Log = TRUE)
x |
data matrix |
mean |
mean vector |
varcovM |
variance covariance matrix |
Log |
logical flag for returning the log of the probability density
function. Default is |
vector of densities
Boris P. Hejblum
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE) mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE) mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1)) mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1)) if(require(microbenchmark)){ library(microbenchmark) microbenchmark(dnorm(1.96), mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE), mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE), times=10000L) }else{ cat("package 'microbenchmark' not available\n") }
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE) mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE) mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1)) mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1)) if(require(microbenchmark)){ library(microbenchmark) microbenchmark(dnorm(1.96), mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE), mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE), times=10000L) }else{ cat("package 'microbenchmark' not available\n") }
C++ implementation of multivariate skew normal likelihood function for multiple inputs
mvsnlikC(x, c, clustval, xi, psi, sigma, loglik = TRUE)
mvsnlikC(x, c, clustval, xi, psi, sigma, loglik = TRUE)
x |
data matrix of dimension p x n, p being the dimension of the data and n the number of data points |
c |
integer vector of cluster allocations with values from 1 to K |
clustval |
vector of unique values from c in the order corresponding to
the storage of cluster parameters in |
xi |
mean vectors matrix of dimension p x K, K being the number of clusters |
psi |
skew parameter vectors matrix of dimension |
sigma |
list of length |
loglik |
logical flag or returning the log-likelihood instead of the likelihood.
Default is |
a list:
"indiv": |
vector of likelihood of length n; |
"clust": |
vector of likelihood of length K; |
"total": |
total (log)-likelihood; |
Boris Hejblum
multivariate Skew-Normal probability density function
mvsnpdf(x, xi, sigma, psi, Log = TRUE)
mvsnpdf(x, xi, sigma, psi, Log = TRUE)
x |
p x n data matrix with n the number of observations and p the number of dimensions |
xi |
mean vector or list of mean vectors (either a vector, a matrix or a list) |
sigma |
variance-covariance matrix or list of variance-covariance matrices (either a matrix or a list) |
psi |
skew parameter vector or list of skew parameter vectors (either a vector, a matrix or a list) |
Log |
logical flag for returning the log of the probability density
function. Defaults is |
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE) dnorm(1.96) mvsnpdf(x=matrix(rep(1.96,1), nrow=1, ncol=1), xi=c(0), psi=c(0), sigma=diag(1), Log=FALSE ) mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2) ) N=50000#00 Yn <- rnorm(n=N, mean=0, sd=1) Z <- rtruncnorm(n=N, a=0, b=Inf, mean=0, sd=1) eps <- rnorm(n=N, mean=0, sd=1) psi <- 10 Ysn <- psi*Z + eps nu <- 1.5 W <- rgamma(n=N, shape=nu/2, rate=nu/2) Yst=Ysn/sqrt(W) library(reshape2) library(ggplot2) data2plot <- melt(cbind.data.frame(Ysn, Yst)) #pdf(file="ExSNST.pdf", height=5, width=4) p <- (ggplot(data=data2plot) + geom_density(aes(x=value, fill=variable, alpha=variable), col="black")#, lwd=1.1) + theme_bw() + xlim(-15,100) + theme(legend.position="bottom") + scale_fill_manual(values=alpha(c("#F8766D", "#00B0F6"),c(0.2,0.45)), name =" ", labels=c("Y~SN(0,1,10) ", "Y~ST(0,1,10,1.5)") ) + scale_alpha_manual(guide=FALSE, values=c(0.25, 0.45)) + xlab("Y") + ylim(0,0.08) + ylab("Density") + guides(fill = guide_legend(override.aes = list(colour = NULL))) + theme(legend.key = element_rect(colour = "black")) ) p #dev.off()
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE) dnorm(1.96) mvsnpdf(x=matrix(rep(1.96,1), nrow=1, ncol=1), xi=c(0), psi=c(0), sigma=diag(1), Log=FALSE ) mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2) ) N=50000#00 Yn <- rnorm(n=N, mean=0, sd=1) Z <- rtruncnorm(n=N, a=0, b=Inf, mean=0, sd=1) eps <- rnorm(n=N, mean=0, sd=1) psi <- 10 Ysn <- psi*Z + eps nu <- 1.5 W <- rgamma(n=N, shape=nu/2, rate=nu/2) Yst=Ysn/sqrt(W) library(reshape2) library(ggplot2) data2plot <- melt(cbind.data.frame(Ysn, Yst)) #pdf(file="ExSNST.pdf", height=5, width=4) p <- (ggplot(data=data2plot) + geom_density(aes(x=value, fill=variable, alpha=variable), col="black")#, lwd=1.1) + theme_bw() + xlim(-15,100) + theme(legend.position="bottom") + scale_fill_manual(values=alpha(c("#F8766D", "#00B0F6"),c(0.2,0.45)), name =" ", labels=c("Y~SN(0,1,10) ", "Y~ST(0,1,10,1.5)") ) + scale_alpha_manual(guide=FALSE, values=c(0.25, 0.45)) + xlab("Y") + ylim(0,0.08) + ylab("Density") + guides(fill = guide_legend(override.aes = list(colour = NULL))) + theme(legend.key = element_rect(colour = "black")) ) p #dev.off()
C++ implementation of multivariate skew t likelihood function for multiple inputs
mvstlikC(x, c, clustval, xi, psi, sigma, df, loglik = TRUE)
mvstlikC(x, c, clustval, xi, psi, sigma, df, loglik = TRUE)
x |
data matrix of dimension p x n, p being the dimension of the data and n the number of data points |
c |
integer vector of cluster allocations with values from 1 to K |
clustval |
vector of unique values from c in the order corresponding to
the storage of cluster parameters in |
xi |
mean vectors matrix of dimension p x K, K being the number of clusters |
psi |
skew parameter vectors matrix of dimension |
sigma |
list of length |
df |
vector of length |
loglik |
logical flag or returning the log-likelihood instead of the likelihood.
Default is |
a list:
"indiv": |
vector of likelihood of length n; |
"clust": |
vector of likelihood of length K; |
"total": |
total (log)-likelihood; |
Boris Hejblum
multivariate skew-t probability density function
mvstpdf(x, xi, sigma, psi, df, Log = TRUE)
mvstpdf(x, xi, sigma, psi, df, Log = TRUE)
x |
|
xi |
mean vector or list of mean vectors (either a vector, a matrix or a list) |
sigma |
variance-covariance matrix or list of variance-covariance matrices (either a matrix or a list) |
psi |
skew parameter vector or list of skew parameter vectors (either a vector, a matrix or a list) |
df |
a numeric vector or a list of the degrees of freedom (either a vector or a list) |
Log |
logical flag for returning the log of the probability density
function. Defaults is |
mvtpdf
, mvsnpdf
, mmvstpdfC
, mvstlikC
mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), df=100000000, Log=FALSE ) mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), Log=FALSE ) mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), df=100000000 ) mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2) )
mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), df=100000000, Log=FALSE ) mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), Log=FALSE ) mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2), df=100000000 ) mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1), sigma=diag(2) )
multivariate Student's t-distribution probability density function
mvtpdf(x, mean, varcovM, df, Log = TRUE)
mvtpdf(x, mean, varcovM, df, Log = TRUE)
x |
|
mean |
mean vector or list of mean vectors (either a vector, a matrix or a list) |
varcovM |
variance-covariance matrix or list of variance-covariance matrices (either a matrix or a list) |
df |
a numeric vector or a list of the degrees of freedom (either a vector or a list) |
Log |
logical flag for returning the log of the probability density
function. Defaults is |
mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10000000) mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1)) mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10) mvtpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=c(0, 0), varcovM=diag(2), df=10 )
mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10000000) mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1)) mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10) mvtpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=c(0, 0), varcovM=diag(2), df=10 )
C++ implementation of similarity matrix computation using pre-computed distances
NuMatParC(c, d)
NuMatParC(c, d)
c |
an MCMC partitions of length |
d |
a symmetric |
Boris Hejblum, Chariff Alkhassim
c <- c(1,1,2,3,2,3) d <- matrix(runif(length(c)^2),length(c)) NuMatParC(c,d)
c <- c(1,1,2,3,2,3) d <- matrix(runif(length(c)^2),length(c)) NuMatParC(c,d)
Convergence diagnostic plots
plot_ConvDPM( MCMCsample, from = 1, to = length(MCMCsample$logposterior_list), shift = 0, thin = 1, ... )
plot_ConvDPM( MCMCsample, from = 1, to = length(MCMCsample$logposterior_list), shift = 0, thin = 1, ... )
MCMCsample |
a |
from |
the MCMC iteration from which the plot should start.
Default is |
to |
the MCMC iteration up until which the plot should stop.
Default is |
shift |
a number of initial iterations not to be displayed. Default is |
thin |
integer giving the spacing at which MCMC iterations are kept.
Default is |
... |
further arguments passed to or from other methods |
Plot of a Dirichlet process mixture of gaussian distribution partition
plot_DPM( z, U_mu = NULL, U_Sigma = NULL, m, c, i, alpha = "?", U_SS = NULL, dims2plot = 1:nrow(z), ellipses = ifelse(length(dims2plot) < 3, TRUE, FALSE), gg.add = list(theme()) )
plot_DPM( z, U_mu = NULL, U_Sigma = NULL, m, c, i, alpha = "?", U_SS = NULL, dims2plot = 1:nrow(z), ellipses = ifelse(length(dims2plot) < 3, TRUE, FALSE), gg.add = list(theme()) )
z |
data matrix |
U_mu |
either a list or a matrix containing the current estimates of mean vectors
of length |
U_Sigma |
either a list or an array containing the |
m |
vector of length |
c |
allocation vector of length |
i |
current MCMC iteration number. |
alpha |
current value of the DP concentration parameter. |
U_SS |
a list containing |
dims2plot |
index vector, subset of |
ellipses |
a logical flag indicating whether ellipses should be drawn around clusters. Default
is |
gg.add |
a list of instructions to add to the |
Boris Hejblum
Plot of a Dirichlet process mixture of skew normal distribution partition
plot_DPMsn( z, c, i = "", alpha = "?", U_SS, dims2plot = 1:nrow(z), ellipses = ifelse(length(dims2plot) < 3, TRUE, FALSE), gg.add = list(theme()), nbsim_dens = 1000 )
plot_DPMsn( z, c, i = "", alpha = "?", U_SS, dims2plot = 1:nrow(z), ellipses = ifelse(length(dims2plot) < 3, TRUE, FALSE), gg.add = list(theme()), nbsim_dens = 1000 )
z |
data matrix |
c |
allocation vector of length |
i |
current MCMC iteration number. |
alpha |
current value of the DP concentration parameter. |
U_SS |
a list containing |
dims2plot |
index vector, subset of |
ellipses |
a logical flag indicating whether ellipses should be drawn around clusters. Default
is |
gg.add |
A list of instructions to add to the |
nbsim_dens |
number of simulated points used for computing clusters density contours in 2D
plots. Default is |
Boris Hejblum
Plot of a Dirichlet process mixture of skew t-distribution partition
plot_DPMst( z, c, i = "", alpha = "?", U_SS, dims2plot = 1:nrow(z), ellipses = ifelse(length(dims2plot) < 3, TRUE, FALSE), gg.add = list(theme()), nbsim_dens = 1000, nice = FALSE )
plot_DPMst( z, c, i = "", alpha = "?", U_SS, dims2plot = 1:nrow(z), ellipses = ifelse(length(dims2plot) < 3, TRUE, FALSE), gg.add = list(theme()), nbsim_dens = 1000, nice = FALSE )
z |
data matrix |
c |
allocation vector of length |
i |
current MCMC iteration number. |
alpha |
current value of the DP concentration parameter. |
U_SS |
a list containing |
dims2plot |
index vector, subset of |
ellipses |
a logical flag indicating whether ellipses should be drawn around clusters. Default
is |
gg.add |
A list of instructions to add to the |
nbsim_dens |
number of simulated points used for computing clusters density contours in 2D
plots. Default is |
nice |
logical flag changing the plot looks. Default is |
Boris Hejblum
Post-processing Dirichlet Process Mixture Models results to get a mixture distribution of the posterior locations
postProcess.DPMMclust( x, burnin = 0, thin = 1, gs = NULL, lossFn = "F-measure", K = 10, ... )
postProcess.DPMMclust( x, burnin = 0, thin = 1, gs = NULL, lossFn = "F-measure", K = 10, ... )
x |
a |
burnin |
integer giving the number of MCMC iterations to burn (defaults is half) |
thin |
integer giving the spacing at which MCMC iterations are kept.
Default is |
gs |
optional vector of length |
lossFn |
character string specifying the loss function to be used. Either "F-measure" or "Binder" (see Details). Default is "F-measure". |
K |
integer giving the number of mixture components. Default is |
... |
further arguments passed to or from other methods |
The cost of a point estimate partition is calculated using either a pairwise coincidence loss function (Binder), or 1-Fmeasure (F-measure).
a list
:
burnin: |
an integer passing along the |
thin: |
an integer passing along the |
lossFn: |
a character string passing along the |
point_estim: |
|
loss: |
|
index_estim: |
Boris Hejblum
similarityMat
summary.DPMMclust
DPMMclust
objectMethods for a summary of a DPMMclust
object
## S3 method for class 'summaryDPMMclust' print(x, ...) ## S3 method for class 'summaryDPMMclust' plot( x, hm = FALSE, nbsim_densities = 5000, hm_subsample = NULL, hm_order_by_clust = TRUE, gg.add = list(theme_bw()), ... )
## S3 method for class 'summaryDPMMclust' print(x, ...) ## S3 method for class 'summaryDPMMclust' plot( x, hm = FALSE, nbsim_densities = 5000, hm_subsample = NULL, hm_order_by_clust = TRUE, gg.add = list(theme_bw()), ... )
x |
a |
... |
further arguments passed to or from other methods |
hm |
logical flag to plot the heatmap of the similarity matrix.
Default is |
nbsim_densities |
the number of simulated observations to be used to plot the density lines of the clusters in the point estimate partition plot |
hm_subsample |
a integer designating the number of observations to use when plotting the heatmap.
Used only if |
hm_order_by_clust |
logical flag indicating whether observations should be ordered according to
the point estimate first. Used only if |
gg.add |
a list of instructions to add to the |
Boris Hejblum
Construction of an Empirical based prior
priormix(sDPMclust, nu0add = 5)
priormix(sDPMclust, nu0add = 5)
sDPMclust |
an object of class |
nu0add |
an additional value integer added to hyperprior parameter nu (increase to avoid non positive definite matrix sampling) |
summary.DPMMclust
rm(list=ls()) #Number of data n <- 2000 set.seed(123) #set.seed(4321) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3)) #xi <- matrix(nrow=d, ncol=ncl, c(-0.5, 0, 0.5, 0, 0.5, -1, -1, 1)) psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8)) nu <- c(100,15,8,5) p <- c(0.15, 0.05, 0.5, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) w <- rep(1,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rowMeans(z) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 nbclust_init <- 30 if(interactive()){ MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000, doPlot=FALSE, nbclust_init, diagVar=FALSE) s <- summary(MCMCsample_st, burnin = 1500, thin=5, posterior_approx=TRUE) pmix <- priormix(s) }
rm(list=ls()) #Number of data n <- 2000 set.seed(123) #set.seed(4321) d <- 2 ncl <- 4 # Sample data sdev <- array(dim=c(d,d,ncl)) xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3)) #xi <- matrix(nrow=d, ncol=ncl, c(-0.5, 0, 0.5, 0, 0.5, -1, -1, 1)) psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8)) nu <- c(100,15,8,5) p <- c(0.15, 0.05, 0.5, 0.3) # frequence des clusters sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3)) sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3)) sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3)) sdev[, ,4] <- .3*diag(2) c <- rep(0,n) w <- rep(1,n) z <- matrix(0, nrow=d, ncol=n) for(k in 1:n){ c[k] = which(rmultinom(n=1, size=1, prob=p)!=0) w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2) z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) + (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1) #cat(k, "/", n, " observations simulated\n", sep="") } # Set parameters of G0 hyperG0 <- list() hyperG0[["b_xi"]] <- rowMeans(z) hyperG0[["b_psi"]] <- rep(0,d) hyperG0[["kappa"]] <- 0.001 hyperG0[["D_xi"]] <- 100 hyperG0[["D_psi"]] <- 100 hyperG0[["nu"]] <- d+1 hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3 # hyperprior on the Scale parameter of DPM a <- 0.0001 b <- 0.0001 nbclust_init <- 30 if(interactive()){ MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000, doPlot=FALSE, nbclust_init, diagVar=FALSE) s <- summary(MCMCsample_st, burnin = 1500, thin=5, posterior_approx=TRUE) pmix <- priormix(s) }
Generating cluster data from the Chinese Restaurant Process
rCRP(n = 1000, alpha = 2, hyperG0, verbose = TRUE)
rCRP(n = 1000, alpha = 2, hyperG0, verbose = TRUE)
n |
number of observations |
alpha |
concentration parameter |
hyperG0 |
base distribution hyperparameter |
verbose |
logical flag indicating whether info is written in the console. |
rm(list=ls()) d=2 hyperG0 <- list() hyperG0[["NNiW"]] <- list() hyperG0[["NNiW"]][["b_xi"]] <- rep(0,d) hyperG0[["NNiW"]][["b_psi"]] <- rep(0,d) hyperG0[["NNiW"]][["D_xi"]] <- 100 hyperG0[["NNiW"]][["D_psi"]] <- 8 hyperG0[["NNiW"]][["nu"]] <- d+1 hyperG0[["NNiW"]][["lambda"]] <- diag(c(1,1)) hyperG0[["scale"]] <- list() set.seed(4321) N <- 200 alph <- runif(n=1,0.2,2) GvHD_sims <- rCRP(n=2*N, alpha=alph, hyperG0=hyperG0) library(ggplot2) q <- (ggplot(data=cbind.data.frame("D1"=GvHD_sims$data[1,], "D2"=GvHD_sims$data[2,], "Cluster"=GvHD_sims$cluster), aes(x=D1, y=D2)) + geom_point(aes(colour=Cluster), alpha=0.6) + theme_bw() ) q #q + stat_density2d(alpha=0.15, geom="polygon") if(interactive()){ MCMCy1 <- DPMGibbsSkewT(z=GvHD_sims$data[,1:N], hyperG0$NNiW, a=0.0001, b=0.0001, N=5000, doPlot=TRUE, nbclust_init=64, plotevery=500, gg.add=list(theme_bw()), diagVar=FALSE) s1 <- summary(MCMCy1, burnin=4000, thin=5, posterior_approx=TRUE) F1 <- FmeasureC(ref=GvHD_sims$cluster[1:N], pred=s1$point_estim$c_est) # s <- summary(MCMCy1, burnin=4000, thin=5, # posterior_approx=TRUE, K=1) # s2 <- summary(MCMCy1, burnin=4000, thin=5, # posterior_approx=TRUE, K=2) # MCMCy2_seqPost<- DPMGibbsSkewT(z=GvHD_sims$data[,(N+1):(2*N)], # hyperG0=s1$param_post$parameters, # a=s1$param_post$alpha_param$shape, # b=s1$param_post$alpha_param$rate, # N=5000, doPlot=TRUE, nbclust_init=64, plotevery=500, # gg.add=list(theme_bw()), diagVar=FALSE) MCMCy2_seqPost <- DPMGibbsSkewT_SeqPrior(z=GvHD_sims$data[,(N+1):(2*N)], prior=s1$param_post, hyperG0=hyperG0$NNiW, , N=1000, doPlot=TRUE, nbclust_init=10, plotevery=100, gg.add=list(theme_bw()), diagVar=FALSE) s2_seqPost <- summary(MCMCy2_seqPost, burnin=600, thin=2) F2_seqPost <- FmeasureC(ref=GvHD_sims$cluster[(N+1):(2*N)], pred=s2_seqPost$point_estim$c_est) MCMCy2 <- DPMGibbsSkewT(z=GvHD_sims$data[,(N+1):(2*N)], hyperG0$NNiW, a=0.0001, b=0.0001, N=5000, doPlot=TRUE, nbclust_init=64, plotevery=500, gg.add=list(theme_bw()), diagVar=FALSE) s2 <- summary(MCMCy2, burnin=4000, thin=5) F2 <- FmeasureC(ref=GvHD_sims$cluster[(N+1):(2*N)], pred=s2$point_estim$c_est) MCMCtot <- DPMGibbsSkewT(z=GvHD_sims$data, hyperG0$NNiW, a=0.0001, b=0.0001, N=5000, doPlot=TRUE, nbclust_init=10, plotevery=500, gg.add=list(theme_bw()), diagVar=FALSE) stot <- summary(MCMCtot, burnin=4000, thin=5) F2tot <- FmeasureC(ref=GvHD_sims$cluster[(N+1):(2*N)], pred=stot$point_estim$c_est[(N+1):(2*N)]) c(F1, F2, F2_seqPost, F2tot) }
rm(list=ls()) d=2 hyperG0 <- list() hyperG0[["NNiW"]] <- list() hyperG0[["NNiW"]][["b_xi"]] <- rep(0,d) hyperG0[["NNiW"]][["b_psi"]] <- rep(0,d) hyperG0[["NNiW"]][["D_xi"]] <- 100 hyperG0[["NNiW"]][["D_psi"]] <- 8 hyperG0[["NNiW"]][["nu"]] <- d+1 hyperG0[["NNiW"]][["lambda"]] <- diag(c(1,1)) hyperG0[["scale"]] <- list() set.seed(4321) N <- 200 alph <- runif(n=1,0.2,2) GvHD_sims <- rCRP(n=2*N, alpha=alph, hyperG0=hyperG0) library(ggplot2) q <- (ggplot(data=cbind.data.frame("D1"=GvHD_sims$data[1,], "D2"=GvHD_sims$data[2,], "Cluster"=GvHD_sims$cluster), aes(x=D1, y=D2)) + geom_point(aes(colour=Cluster), alpha=0.6) + theme_bw() ) q #q + stat_density2d(alpha=0.15, geom="polygon") if(interactive()){ MCMCy1 <- DPMGibbsSkewT(z=GvHD_sims$data[,1:N], hyperG0$NNiW, a=0.0001, b=0.0001, N=5000, doPlot=TRUE, nbclust_init=64, plotevery=500, gg.add=list(theme_bw()), diagVar=FALSE) s1 <- summary(MCMCy1, burnin=4000, thin=5, posterior_approx=TRUE) F1 <- FmeasureC(ref=GvHD_sims$cluster[1:N], pred=s1$point_estim$c_est) # s <- summary(MCMCy1, burnin=4000, thin=5, # posterior_approx=TRUE, K=1) # s2 <- summary(MCMCy1, burnin=4000, thin=5, # posterior_approx=TRUE, K=2) # MCMCy2_seqPost<- DPMGibbsSkewT(z=GvHD_sims$data[,(N+1):(2*N)], # hyperG0=s1$param_post$parameters, # a=s1$param_post$alpha_param$shape, # b=s1$param_post$alpha_param$rate, # N=5000, doPlot=TRUE, nbclust_init=64, plotevery=500, # gg.add=list(theme_bw()), diagVar=FALSE) MCMCy2_seqPost <- DPMGibbsSkewT_SeqPrior(z=GvHD_sims$data[,(N+1):(2*N)], prior=s1$param_post, hyperG0=hyperG0$NNiW, , N=1000, doPlot=TRUE, nbclust_init=10, plotevery=100, gg.add=list(theme_bw()), diagVar=FALSE) s2_seqPost <- summary(MCMCy2_seqPost, burnin=600, thin=2) F2_seqPost <- FmeasureC(ref=GvHD_sims$cluster[(N+1):(2*N)], pred=s2_seqPost$point_estim$c_est) MCMCy2 <- DPMGibbsSkewT(z=GvHD_sims$data[,(N+1):(2*N)], hyperG0$NNiW, a=0.0001, b=0.0001, N=5000, doPlot=TRUE, nbclust_init=64, plotevery=500, gg.add=list(theme_bw()), diagVar=FALSE) s2 <- summary(MCMCy2, burnin=4000, thin=5) F2 <- FmeasureC(ref=GvHD_sims$cluster[(N+1):(2*N)], pred=s2$point_estim$c_est) MCMCtot <- DPMGibbsSkewT(z=GvHD_sims$data, hyperG0$NNiW, a=0.0001, b=0.0001, N=5000, doPlot=TRUE, nbclust_init=10, plotevery=500, gg.add=list(theme_bw()), diagVar=FALSE) stot <- summary(MCMCtot, burnin=4000, thin=5) F2tot <- FmeasureC(ref=GvHD_sims$cluster[(N+1):(2*N)], pred=stot$point_estim$c_est[(N+1):(2*N)]) c(F1, F2, F2_seqPost, F2tot) }
Sampler updating the concentration parameter of a Dirichlet process given
the number of observations and a Gamma(a
, b
) prior, following the augmentation
strategy of West, and of Escobar and West.
sample_alpha(alpha_old, n, K, a = 1e-04, b = 1e-04)
sample_alpha(alpha_old, n, K, a = 1e-04, b = 1e-04)
alpha_old |
the current value of alpha |
n |
the number of data points |
K |
current number of cluster |
a |
shape hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process.
Default is |
b |
scale hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process.
Default is |
A Gamma prior is used.
M West, Hyperparameter estimation in Dirichlet process mixture models, Technical Report, Duke University, 1992.
MD Escobar, M West, Bayesian Density Estimation and Inference Using Mixtures Journal of the American Statistical Association, 90(430):577-588, 1995.
#Test with a fixed K #################### alpha_init <- 1000 N <- 10000 #n=500 n=10000 K <- 80 a <- 0.0001 b <- a alphas <- numeric(N) alphas[1] <- alpha_init for (i in 2:N){ alphas[i] <- sample_alpha(alpha_old = alphas[i-1], n=n, K=K, a=a, b=b) } postalphas <- alphas[floor(N/2):N] alphaMMSE <- mean(postalphas) alphaMAP <- density(postalphas)$x[which.max(density(postalphas)$y)] expK <- sum(alphaMMSE/(alphaMMSE+0:(n-1))) round(expK) prioralpha <- data.frame("alpha"=rgamma(n=5000, a,1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) library(ggplot2) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="red") + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) ) p postalpha.df <- data.frame("alpha"=postalphas, "distribution" = factor(rep("posterior",length(postalphas)), levels=c("prior", "posterior"))) p <- (ggplot(postalpha.df, aes(x=alpha)) + geom_histogram(aes(y=..density..), binwidth=.1, colour="black", fill="white") + geom_density(alpha=.2, fill="blue") + ggtitle("Posterior distribution of alpha\n") # Ignore NA values for mean # Overlay with transparent density plot + geom_vline(aes(xintercept=mean(alpha, na.rm=TRUE)), color="red", linetype="dashed", size=1) ) p
#Test with a fixed K #################### alpha_init <- 1000 N <- 10000 #n=500 n=10000 K <- 80 a <- 0.0001 b <- a alphas <- numeric(N) alphas[1] <- alpha_init for (i in 2:N){ alphas[i] <- sample_alpha(alpha_old = alphas[i-1], n=n, K=K, a=a, b=b) } postalphas <- alphas[floor(N/2):N] alphaMMSE <- mean(postalphas) alphaMAP <- density(postalphas)$x[which.max(density(postalphas)$y)] expK <- sum(alphaMMSE/(alphaMMSE+0:(n-1))) round(expK) prioralpha <- data.frame("alpha"=rgamma(n=5000, a,1/b), "distribution" =factor(rep("prior",5000), levels=c("prior", "posterior"))) library(ggplot2) p <- (ggplot(prioralpha, aes(x=alpha)) + geom_histogram(aes(y=..density..), colour="black", fill="white") + geom_density(alpha=.2, fill="red") + ggtitle(paste("Prior distribution on alpha: Gamma(", a, ",", b, ")\n", sep="")) ) p postalpha.df <- data.frame("alpha"=postalphas, "distribution" = factor(rep("posterior",length(postalphas)), levels=c("prior", "posterior"))) p <- (ggplot(postalpha.df, aes(x=alpha)) + geom_histogram(aes(y=..density..), binwidth=.1, colour="black", fill="white") + geom_density(alpha=.2, fill="blue") + ggtitle("Posterior distribution of alpha\n") # Ignore NA values for mean # Overlay with transparent density plot + geom_vline(aes(xintercept=mean(alpha, na.rm=TRUE)), color="red", linetype="dashed", size=1) ) p
Computes the co-clustering (or similarity) matrix
similarityMat(c, step = 1)
similarityMat(c, step = 1)
c |
a list of vector of length |
step |
provide co-clustering every |
A matrix of size n x n
whose term [i,j]
is the proportion of MCMC iterations where observation i
and
observations j
are allocated to the same cluster.
Boris Hejblum
C++ implementation
similarityMat_nocostC(cc)
similarityMat_nocostC(cc)
cc |
a matrix whose columns each represents a ()MCMC) partition |
c <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3),c(2,2,1,1,1,1)) similarityMat_nocostC(sapply(c, "[")) c2 <- list() for(i in 1:10){ c2 <- c(c2, list(rmultinom(n=1, size=1000, prob=rexp(n=1000)))) } c3 <- sapply(c2, "[") if(require(microbenchmark)){ library(microbenchmark) microbenchmark(similarityMat(c3), similarityMat_nocostC(c3), times=2L) }else{ cat("package 'microbenchmark' not available\n") }
c <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3),c(2,2,1,1,1,1)) similarityMat_nocostC(sapply(c, "[")) c2 <- list() for(i in 1:10){ c2 <- c(c2, list(rmultinom(n=1, size=1000, prob=rexp(n=1000)))) } c3 <- sapply(c2, "[") if(require(microbenchmark)){ library(microbenchmark) microbenchmark(similarityMat(c3), similarityMat_nocostC(c3), times=2L) }else{ cat("package 'microbenchmark' not available\n") }
C++ implementation
similarityMatC(cc)
similarityMatC(cc)
cc |
a matrix whose columns each represents a (MCMC) partition |
c <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3),c(2,2,1,1,1,1)) similarityMatC(sapply(c, "[")) c2 <- list() for(i in 1:10){ c2 <- c(c2, list(rmultinom(n=1, size=200, prob=rexp(n=200)))) } similarityMatC(sapply(c2, "["))
c <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3),c(2,2,1,1,1,1)) similarityMatC(sapply(c, "[")) c2 <- list() for(i in 1:10){ c2 <- c(c2, list(rmultinom(n=1, size=200, prob=rexp(n=200)))) } similarityMatC(sapply(c2, "["))
Summary methods for DPMMclust
objects.
## S3 method for class 'DPMMclust' summary( object, burnin = 0, thin = 1, gs = NULL, lossFn = "Binder", posterior_approx = FALSE, ... )
## S3 method for class 'DPMMclust' summary( object, burnin = 0, thin = 1, gs = NULL, lossFn = "Binder", posterior_approx = FALSE, ... )
object |
a |
burnin |
integer giving the number of MCMC iterations to burn (defaults is half) |
thin |
integer giving the spacing at which MCMC iterations are kept.
Default is |
gs |
optional vector of length |
lossFn |
character string specifying the loss function to be used. Either "F-measure" or "Binder" (see Details). Default is "Binder". |
posterior_approx |
logical flag whether a parametric approximation of the posterior should be
computed. Default is |
... |
further arguments passed to or from other methods |
The cost of a point estimate partition is calculated using either a pairwise coincidence loss function (Binder), or 1-Fmeasure (F-measure).
The number of retained sampled partitions is m = (N - burnin)/thin
a list
containing the following elements:
nb_mcmcit
: an integer giving the value of m
, the number of retained
sampled partitions, i.e. (N - burnin)/thin
burnin
: an integer passing along the burnin
argument
thin
: an integer passing along the thin
argument
lossFn
: a character string passing along the lossFn
argument
clust_distrib
: a character string passing along the clust_distrib
argument
point_estim
: a list
containing:
c_est
: a vector of length n
containing the point estimated clustering for each observations
cost
: a vector of length m
containing the cost of each sampled partition
Fmeas
: if lossFn
is 'F-measure'
, the m x m
matrix of total F-measures for each pair of sampled partitions
opt_ind
: the index of the point estimate partition among the m
sampled
loss
: the loss for the point estimate. NA
if lossFn
is not 'Binder'
param_posterior
:a list containing the parametric approximation of the posterior, suitable to be plugged in as prior for a new MCMC algorithm run
mcmc_partitions
: a list containing the m
sampled partitions
alpha
: a vector of length m
with the values of the alpha
DP parameter
index_estim
: the index of the point estimate partition among the m
sampled
hyperG0
: a list passing along the prior, i.e. the hyperG0
argument
logposterior_list
: a list of length m
containing the logposterior and its decomposition, for each sampled partition
U_SS_list
: a list of length m
containing the containing the lists of sufficient statistics for all the mixture components,
for each sampled partition
data
: a d x n
matrix containing the clustered data
Boris Hejblum
C++ implementation
vclust2mcoclustC(c)
vclust2mcoclustC(c)
c |
is an MCMC partition |
Chariff Alkhassim
cc <- c(1,1,2,3,2,3) vclust2mcoclustC(cc)
cc <- c(1,1,2,3,2,3) vclust2mcoclustC(cc)