Title: | Model-Based Tensor Clustering |
---|---|
Description: | Performs model-based tensor clustering methods including Tensor Gaussian Mixture Model (TGMM), Tensor Envelope Mixture Model (TEMM) by Deng and Zhang (2021) <DOI: 10.1111/biom.13486>, Doubly-Enhanced EM (DEEM) algorithm by Mai, Zhang, Pan and Deng (2021) <DOI: 10.1080/01621459.2021.1904959>. |
Authors: | Kai Deng [aut, cre], Yuqing Pan [aut], Xin Zhang [aut], Qing Mai [aut] |
Maintainer: | Kai Deng <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.2 |
Built: | 2024-12-25 06:37:49 UTC |
Source: | CRAN |
Doubly-enhanced EM algorithm for tensor clustering
DEEM(X, nclass, niter = 100, lambda = NULL, dfmax = n, pmax = nvars, pf = rep(1, nvars), eps = 1e-04, maxit = 1e+05, sml = 1e-06, verbose = FALSE, ceps = 0.1, initial = TRUE, vec_x = NULL)
DEEM(X, nclass, niter = 100, lambda = NULL, dfmax = n, pmax = nvars, pf = rep(1, nvars), eps = 1e-04, maxit = 1e+05, sml = 1e-06, verbose = FALSE, ceps = 0.1, initial = TRUE, vec_x = NULL)
X |
Input tensor (or matrix) list of length |
nclass |
Number of clusters. |
niter |
Maximum iteration times for EM algorithm. Default value is 100. |
lambda |
A user-specified |
dfmax |
The maximum number of selected variables in the model. Default is the number of observations |
pmax |
The maximum number of potential selected variables during iteration. In middle step, the algorithm can select at most |
pf |
Weight of lasso penalty. Default is a vector of value |
eps |
Convergence threshold for coordinate descent difference between iterations. Default value is |
maxit |
Maximum iteration times for coordinate descent for all lambda. Default value is |
sml |
Threshold for ratio of loss function change after each iteration to old loss function value. Default value is |
verbose |
Indicates whether print out lambda during iteration or not. Default value is |
ceps |
Convergence threshold for cluster mean difference between iterations. Default value is |
initial |
Whether to initialize algorithm with K-means clustering. Default value is |
vec_x |
Vectorized tensor data. Default value is |
The DEEM
function implements the Doubly-Enhanced EM algorithm (DEEM) for tensor clustering. The observations are assumed to be following the tensor normal mixture model (TNMM) with common covariances across different clusters:
where is the prior probability for
to be in the
-th cluster such that
,
is the cluster mean of the
-th cluster and
are the common covariances across different clusters. Under the TNMM framework, the optimal clustering rule can be showed as
where . In the enhanced E-step,
DEEM
imposes sparsity directly on the optimal clustering rule as a flexible alternative to popular low-rank assumptions on tensor coefficients as
where is a tuning parameter. In the enhanced M-step,
DEEM
employs a new estimator for the tensor correlation structure, which facilitates both the computation and the theoretical studies.
pi |
A vector of estimated prior probabilities for clusters. |
mu |
A list of estimated cluster means. |
sigma |
A list of estimated covariance matrices. |
gamma |
A |
y |
A vector of estimated labels. |
iter |
Number of iterations until convergence. |
df |
Average zero elements in beta over iterations. |
beta |
A matrix of vectorized |
Kai Deng, Yuqing Pan, Xin Zhang and Qing Mai
Mai, Q., Zhang, X., Pan, Y. and Deng, K. (2021). A Doubly-Enhanced EM Algorithm for Model-Based Tensor Clustering. Journal of the American Statistical Association.
dimen = c(5,5,5) nvars = prod(dimen) K = 2 n = 100 sigma = array(list(),3) sigma[[1]] = sigma[[2]] = sigma[[3]] = diag(5) B2=array(0,dim=dimen) B2[1:3,1,1]=2 y = c(rep(1,50),rep(2,50)) M = array(list(),K) M[[1]] = array(0,dim=dimen) M[[2]] = B2 vec_x=matrix(rnorm(n*prod(dimen)),ncol=n) X=array(list(),n) for (i in 1:n){ X[[i]] = array(vec_x[,i],dim=dimen) X[[i]] = M[[y[i]]] + X[[i]] } myfit = DEEM(X, nclass=2, lambda=0.05)
dimen = c(5,5,5) nvars = prod(dimen) K = 2 n = 100 sigma = array(list(),3) sigma[[1]] = sigma[[2]] = sigma[[3]] = diag(5) B2=array(0,dim=dimen) B2[1:3,1,1]=2 y = c(rep(1,50),rep(2,50)) M = array(list(),K) M[[1]] = array(0,dim=dimen) M[[2]] = B2 vec_x=matrix(rnorm(n*prod(dimen)),ncol=n) X=array(list(),n) for (i in 1:n){ X[[i]] = array(vec_x[,i],dim=dimen) X[[i]] = M[[y[i]]] + X[[i]] } myfit = DEEM(X, nclass=2, lambda=0.05)
Fit the Tensor Envelope Mixture Model (TEMM)
TEMM(Xn, u, K, initial = "kmeans", iter.max = 500, stop = 1e-3, trueY = NULL, print = FALSE)
TEMM(Xn, u, K, initial = "kmeans", iter.max = 500, stop = 1e-3, trueY = NULL, print = FALSE)
Xn |
The tensor for clustering, should be array type, the last dimension is the sample size |
u |
A vector of envelope dimension |
K |
Number of clusters, greater than or equal to |
initial |
Initialization meth0d for the regularized EM algorithm. Default value is "kmeans". |
iter.max |
Maximum number of iterations. Default value is |
stop |
Convergence threshold of relative change in cluster means. Default value is |
trueY |
A vector of true cluster labels of each observation. Default value is NULL. |
print |
Whether to print information including current iteration number, relative change in cluster means
and clustering error ( |
The TEMM
function fits the Tensor Envelope Mixture Model (TEMM) through a subspace-regularized EM algorithm. For mode , let
be an orthogonal matrix where
,
, represents the material part. Specifically, the material part
follows a tensor normal mixture distribution, while the immaterial part
is unimodal, independent of the material part and hence can be eliminated without loss of clustering information. Dimension reduction is achieved by focusing on the material part
. Collectively, the joint reduction from each mode is
where and
are the dimension-reduced clustering parameters and
does not vary with cluster index
. In the E-step, the membership weights are evaluated as
where denotes the conditional probability density function of
within the
-th cluster. In the subspace-regularized M-step, the envelope subspace is iteratively estimated through a Grassmann manifold optimization that minimize the following log-likelihood-based objective function:
where and
are given by
The intermediate estimators can be viewed the mode-
conditional variation estimate of
and
is the mode-
marginal variation estimate of
.
id |
A vector of estimated labels. |
pi |
A vector of estimated prior probabilities for clusters. |
eta |
A |
Mu.est |
A list of estimated cluster means. |
SIG.est |
A list of estimated covariance matrices. |
Mm |
Estimation of |
Nm |
Estimation of |
Gamma.est |
A list of estimated envelope basis. |
PGamma.est |
A list of envelope projection matrices. |
Kai Deng, Yuqing Pan, Xin Zhang and Qing Mai
Deng, K. and Zhang, X. (2021). Tensor Envelope Mixture Model for Simultaneous Clustering and Multiway Dimension Reduction. Biometrics.
TGMM
, tune_u_sep
, tune_u_joint
A = array(c(rep(1,20),rep(2,20))+rnorm(40),dim=c(2,2,10)) myfit = TEMM(A,u=c(2,2),K=2)
A = array(c(rep(1,20),rep(2,20))+rnorm(40),dim=c(2,2,10)) myfit = TEMM(A,u=c(2,2),K=2)
Fit the Tensor Gaussian Mixture Model (TGMM)
TGMM(Xn, K, shape = "shared", initial = "kmeans", iter.max = 500, stop = 1e-3, trueY = NULL, print = FALSE)
TGMM(Xn, K, shape = "shared", initial = "kmeans", iter.max = 500, stop = 1e-3, trueY = NULL, print = FALSE)
Xn |
The tensor for clustering, should be array type, the last dimension is the sample size |
K |
Number of clusters, greater than or equal to |
shape |
"shared" if assume common covariance across mixtures, "distinct" if allow different covariance structures. Default value is "shared". |
initial |
Initialization meth0d for the regularized EM algorithm. Default value is "kmeans". |
iter.max |
Maximum number of iterations. Default value is |
stop |
Convergence threshold of relative change in cluster means. Default value is |
trueY |
A vector of true cluster labels of each observation. Default value is NULL. |
print |
Whether to print information including current iteration number, relative change in cluster means
and clustering error ( |
The TGMM
function fits the Tensor Gaussian Mixture Model (TGMM) through the classical EM algorithm. TGMM assumes the following tensor normal mixture distribution of M-way tensor data :
where is the prior probability for
to be in the
-th cluster such that
,
is the mean of the
-th cluster,
is the set of covariances of the
-th cluster. If
's are the same for
, call
TGMM
with argument shape="shared"
.
id |
A vector of estimated labels. |
pi |
A vector of estimated prior probabilities for clusters. |
eta |
A |
Mu.est |
A list of estimated cluster means. |
SIG.est |
A list of estimated covariance matrices. |
Kai Deng, Yuqing Pan, Xin Zhang and Qing Mai
Deng, K. and Zhang, X. (2021). Tensor Envelope Mixture Model for Simultaneous Clustering and Multiway Dimension Reduction. Biometrics.
Tait, P. A. and McNicholas, P. D. (2019). Clustering higher order data: Finite mixtures of multidimensional arrays. arXiv:1907.08566.
A = array(c(rep(1,20),rep(2,20))+rnorm(40),dim=c(2,2,10)) myfit = TGMM(A,K=2,shape="shared")
A = array(c(rep(1,20),rep(2,20))+rnorm(40),dim=c(2,2,10)) myfit = TGMM(A,K=2,shape="shared")
K
in DEEMSelect the number of clusters K
along with tuning parameter lambda
through BIC in DEEM.
tune_K(X, seqK, seqlamb, initial = TRUE, vec_x = NULL)
tune_K(X, seqK, seqlamb, initial = TRUE, vec_x = NULL)
X |
Input tensor (or matrix) list of length |
seqK |
A sequence of user-specified number of clusters. |
seqlamb |
A sequence of user-specified |
initial |
Whether to initialize algorithm with K-means clustering. Default value is |
vec_x |
Vectorized tensor data. Default value is |
The tune_K
function runs tune_lamb
function length(seqK)
times to choose the tuning parameter and number of clusters
simultaneously. Let
be the output of
DEEM
with the tuning parameter and number of clusters fixed at and
respectively,
tune_K
looks for the values of and
that minimizes
where is the set of nonzero elements in
. The
tune_K
function intrinsically selects the initial point and return the optimal estimated labels.
opt_K |
Selected number of clusters that leads to optimal BIC. |
opt_lamb |
Tuned |
Krank |
A selection summary. |
Kai Deng, Yuqing Pan, Xin Zhang and Qing Mai
Mai, Q., Zhang, X., Pan, Y. and Deng, K. (2021). A Doubly-Enhanced EM Algorithm for Model-Based Tensor Clustering. Journal of the American Statistical Association.
dimen = c(5,5,5) nvars = prod(dimen) K = 2 n = 100 sigma = array(list(),3) sigma[[1]] = sigma[[2]] = sigma[[3]] = diag(5) B2=array(0,dim=dimen) B2[1:3,1,1]=2 y = c(rep(1,50),rep(2,50)) M = array(list(),K) M[[1]] = array(0,dim=dimen) M[[2]] = B2 vec_x=matrix(rnorm(n*prod(dimen)),ncol=n) X=array(list(),n) for (i in 1:n){ X[[i]] = array(vec_x[,i],dim=dimen) X[[i]] = M[[y[i]]] + X[[i]] } mytune = tune_K(X, seqK=2:4, seqlamb=seq(0.01,0.1,by=0.01))
dimen = c(5,5,5) nvars = prod(dimen) K = 2 n = 100 sigma = array(list(),3) sigma[[1]] = sigma[[2]] = sigma[[3]] = diag(5) B2=array(0,dim=dimen) B2[1:3,1,1]=2 y = c(rep(1,50),rep(2,50)) M = array(list(),K) M[[1]] = array(0,dim=dimen) M[[2]] = B2 vec_x=matrix(rnorm(n*prod(dimen)),ncol=n) X=array(list(),n) for (i in 1:n){ X[[i]] = array(vec_x[,i],dim=dimen) X[[i]] = M[[y[i]]] + X[[i]] } mytune = tune_K(X, seqK=2:4, seqlamb=seq(0.01,0.1,by=0.01))
Perform parameter tuning through BIC in DEEM.
tune_lamb(X, K, seqlamb, initial = TRUE, vec_x = NULL)
tune_lamb(X, K, seqlamb, initial = TRUE, vec_x = NULL)
X |
Input tensor (or matrix) list of length |
K |
Number of clusters. |
seqlamb |
A sequence of user-specified |
initial |
Whether to initialize algorithm with K-means clustering. Default value is |
vec_x |
Vectorized tensor data. Default value is |
The tune_lamb
function adopts a BIC-type criterion to select the tuning parameter in the enhanced E-step. Let
be the output of
DEEM
with the tuning parameter fixed at ,
tune_lamb
looks for the value of that minimizes
where is the set of nonzero elements in
. The
tune_lamb
function intrinsically selects the initial point and return the optimal estimated labels.
opt_lamb |
Tuned |
opt_bic |
BIC value. |
opt_y |
Estimated labels fitted by DEEM with tuned |
Kai Deng, Yuqing Pan, Xin Zhang and Qing Mai
Mai, Q., Zhang, X., Pan, Y. and Deng, K. (2021). A Doubly-Enhanced EM Algorithm for Model-Based Tensor Clustering. Journal of the American Statistical Association.
dimen = c(5,5,5) nvars = prod(dimen) K = 2 n = 100 sigma = array(list(),3) sigma[[1]] = sigma[[2]] = sigma[[3]] = diag(5) B2=array(0,dim=dimen) B2[1:3,1,1]=2 y = c(rep(1,50),rep(2,50)) M = array(list(),K) M[[1]] = array(0,dim=dimen) M[[2]] = B2 vec_x=matrix(rnorm(n*prod(dimen)),ncol=n) X=array(list(),n) for (i in 1:n){ X[[i]] = array(vec_x[,i],dim=dimen) X[[i]] = M[[y[i]]] + X[[i]] } mytune = tune_lamb(X, K=2, seqlamb=seq(0.01,0.1,by=0.01))
dimen = c(5,5,5) nvars = prod(dimen) K = 2 n = 100 sigma = array(list(),3) sigma[[1]] = sigma[[2]] = sigma[[3]] = diag(5) B2=array(0,dim=dimen) B2[1:3,1,1]=2 y = c(rep(1,50),rep(2,50)) M = array(list(),K) M[[1]] = array(0,dim=dimen) M[[2]] = B2 vec_x=matrix(rnorm(n*prod(dimen)),ncol=n) X=array(list(),n) for (i in 1:n){ X[[i]] = array(vec_x[,i],dim=dimen) X[[i]] = M[[y[i]]] + X[[i]] } mytune = tune_lamb(X, K=2, seqlamb=seq(0.01,0.1,by=0.01))
Tuning envelope dimension jointly by BIC in TEMM.
tune_u_joint(u_candi, K, X, iter.max = 500, stop = 0.001, trueY = NULL)
tune_u_joint(u_candi, K, X, iter.max = 500, stop = 0.001, trueY = NULL)
u_candi |
A list of length |
K |
Number of clusters, greater than or equal to |
X |
The tensor for clustering, should be array type, the last dimension is the sample size |
iter.max |
Maximum number of iterations. Default value is |
stop |
Convergence threshold of relative change in cluster means. Default value is |
trueY |
A vector of true cluster labels of each observation. Default value is NULL. |
The tune_u_joint
function searches over all the combinations of in the neighborhood of
,
, that minimizes
In the above BIC, is the total number of parameters in TEMM,
and
are the estimated parameters with envelope dimension fixed at
. The
tune_u_joint
function intrinsically selects the initial point and return the optimal estimated labels.
opt.u |
Optimal envelope dimension selected. |
opt.id |
Estimated labels fitted by TEMM with the optimal envelope dimension. |
opt.Mu |
Estimated cluster means fitted by TEMM with the optimal envelope dimension. |
bic |
BIC value. |
Kai Deng, Yuqing Pan, Xin Zhang and Qing Mai
Deng, K. and Zhang, X. (2021). Tensor Envelope Mixture Model for Simultaneous Clustering and Multiway Dimension Reduction. Biometrics.
A = array(c(rep(1,20),rep(2,20))+rnorm(40),dim=c(2,2,10)) mytune = tune_u_joint(u_candi=list(1:2,1:2),K=2,A)
A = array(c(rep(1,20),rep(2,20))+rnorm(40),dim=c(2,2,10)) mytune = tune_u_joint(u_candi=list(1:2,1:2),K=2,A)
Tuning envelope dimension separately by BIC in TEMM.
tune_u_sep(m, u_candi, K, X, C = 1, oneD = TRUE, iter.max = 500, stop = 0.001, trueY = NULL)
tune_u_sep(m, u_candi, K, X, C = 1, oneD = TRUE, iter.max = 500, stop = 0.001, trueY = NULL)
m |
The tensor mode to be tuned, can take value in |
u_candi |
A vector of candidate envelope dimension. |
K |
Number of clusters, greater than or equal to |
X |
The tensor for clustering, should be array type, the last dimension is the sample size |
C |
Constant in separate BIC criterion. Default value is |
oneD |
Whether to apply 1D-BIC tuning. Default value is TRUE. |
iter.max |
Maximum number of iterations. Default value is |
stop |
Convergence threshold of relative change in cluster means. Default value is |
trueY |
A vector of true cluster labels of each observation. Default value is NULL. |
For tensor mode , the
tune_u_sep
function selects the envelope dimension by minimizing the following BIC-type criterion over the set
,
This separate selection over each mode is less sensitive to the complex interrelationships of each mode of the tensor. The default constant
is set as
as suggested by Zhang and Mai (2018).
opt.u |
Optimal envelope dimension selected. |
bic |
BIC value. |
Kai Deng, Yuqing Pan, Xin Zhang and Qing Mai
Deng, K. and Zhang, X. (2021). Tensor Envelope Mixture Model for Simultaneous Clustering and Multiway Dimension Reduction. Biometrics.
Zhang, X. and Mai, Q. (2018). Model-free envelope dimension selection. Electronic Journal of Statistics 12, 2193-2216.
A = array(c(rep(1,20),rep(2,20))+rnorm(40),dim=c(2,2,10)) mytune = tune_u_sep(1,1:2,K=2,A)
A = array(c(rep(1,20),rep(2,20))+rnorm(40),dim=c(2,2,10)) mytune = tune_u_sep(1,1:2,K=2,A)