Title: | Visualizing Class Specific Heterogeneous Tendencies in Categorical Data |
---|---|
Description: | Performs multiple-class cluster correspondence analysis (MCCCA). The main functions are create.MCCCAdata(), which creates the list that MCCCA takes as input, MCCCA(), which applies MCCCA, and plot.mccca(), which visualizes the MCCCA result. The methods used in the package are described in Mariko Takagishi and Michel van de Velden (2022) <doi:10.1080/10618600.2022.2035737>. |
Authors: | Mariko Takagishi [aut, cre] |
Maintainer: | Mariko Takagishi <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1.0.1 |
Built: | 2024-11-20 06:29:44 UTC |
Source: | CRAN |
Creates a list (named mcccadata.list) to be passed to MCCCA.
create.MCCCAdata(dat,ext.mat=ext.mat,clstr0.vec=NULL)
dat |
An (NxJ) matrix of categorical data (N: the number of observations, J: the number of variables). If |
ext.mat |
An (NxH) external variable matrix (H: the number of external variables). |
clstr0.vec |
An integer vector of length N giving each observation's true cluster. |
Returns a list with the following elements.
data.mat |
data matrix same as |
data.list |
A list of C (NxJ) categorical data matrices for each class (C:the number of classes). |
clstr0.list |
A list of C vectors where each vector indicates the true cluster (given in |
N.vec |
A vector of length C giving the number of observations in each class. |
Ktrue.vec |
A vector of length C giving the true number of clusters in each class (NULL if |
q.vec |
A vector of length J giving the number of categories in each of J categorical variables. |
class.n.vec |
An integer (from 1:C) vector of length N giving the class index of each observation. |
classname.n.vec |
A character vector of length N giving the class label of the class each observation belongs to. |
classlabel |
A character vector of length C giving the class label of each class. |
classlab.mat |
A (Cx(H+1)) table showing which combination of external-variable categories each class index and class name corresponds to. The first H columns give the categories of each of the H external variables, and the last ((H+1)th) column gives the corresponding class label (same as |
oriindex.list |
A list of length C, where each list element corresponds to a row (observation) in data.list, indicating which row of observations (in |
Mariko Takagishi & Michel van de Velden (2022): Visualizing Class Specific Heterogeneous Tendencies in Categorical Data, Journal of Computational and Graphical Statistics, DOI: 10.1080/10618600.2022.2035737
# setting
N <- 100; J <- 5; Ktrue <- 2; q.vec <- rep(5, J); noise.prop <- 0.2
extcate.vec <- c(2, 3) # the number of categories for each external variable

# generate categorical variable data
catedata.list <- generate.onedata(N = N, J = J, Ktrue = Ktrue, q.vec = q.vec, noise.prop = noise.prop)
data.cate <- catedata.list$data.mat
clstr0.vec <- catedata.list$clstr0.vec

# generate external variable data
data.ext <- generate.ext(N, extcate.vec = extcate.vec)

# create the list to be passed to the MCCCA function
mccca.data <- create.MCCCAdata(data.cate, ext.mat = data.ext, clstr0.vec = clstr0.vec)

# check which class each observation belongs to (given by class name)
mccca.data$classname.n.vec

# table showing which combination of external-variable categories
# each class index and class name corresponds to
mccca.data$classlab.mat
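The relationship between the external variables and the class-related elements of the returned list can be illustrated with a small base-R sketch. It is not the package's implementation: it only assumes that a class is one observed combination of external-variable categories, and it reuses the element names from the table above (class.n.vec, classlabel, data.list, N.vec, oriindex.list) purely for illustration. The toy variables sex and age are hypothetical.

```r
# Base-R sketch (NOT the package's code): derive classes from external variables.
set.seed(1)
N <- 12
dat <- data.frame(v1 = sample(1:3, N, replace = TRUE),
                  v2 = sample(1:3, N, replace = TRUE))
# Hypothetical external variables: H = 2, with 2 and 3 categories.
ext.mat <- data.frame(sex = rep(c("F", "M"), each = 6),
                      age = rep(c("young", "mid", "old"), times = 4))

# ASSUMPTION: one class per observed combination of external categories.
class.fac <- interaction(ext.mat, drop = TRUE)
class.n.vec <- as.integer(class.fac)   # class index (1:C) of each observation
classlabel <- levels(class.fac)        # one label per class
C <- length(classlabel)

# Split the rows by class while remembering the original row indices.
oriindex.list <- split(seq_len(N), class.n.vec)
data.list <- lapply(oriindex.list, function(idx) dat[idx, , drop = FALSE])
N.vec <- lengths(oriindex.list)        # number of observations per class
```

Here every combination of sex and age occurs, so the sketch yields C = 2 x 3 classes, and unlist(oriindex.list) recovers each original row exactly once.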
Creates a list of length J giving the category proportions for each cluster.
create.prop( J = J, q.vec = q.vec, Ktrue = Ktrue, strongprop = 0.8, which.noise = NULL )
J |
The number of active variables. |
q.vec |
A vector of length J giving the number of categories for each active variable. |
Ktrue |
The number of clusters in J active variables. |
strongprop |
A numeric value giving the strongest proportion of categories (common for all J active variables). |
which.noise |
A vector of length at most J giving the indices of the noise variables among the J active variables. NULL indicates that no variable is a noise variable. |
Returns a list of length J, each element of which is a (Ktrue x qj) matrix giving the proportion of each of the qj categories in each of the Ktrue clusters.
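To make the shape of the returned object concrete, here is a minimal base-R sketch that builds a list with the same structure. The function sketch_prop is hypothetical and is not create.prop itself. It assumes that each cluster has one dominant category with proportion strongprop, that the remaining categories share the rest equally, and that noise variables have uniform proportions in every cluster. How the package actually assigns proportions is not stated in this page.

```r
# Hypothetical helper, NOT the package's create.prop().
sketch_prop <- function(J, q.vec, Ktrue, strongprop = 0.8, which.noise = NULL) {
  lapply(seq_len(J), function(j) {
    qj <- q.vec[j]
    if (j %in% which.noise) {
      # ASSUMPTION: a noise variable has identical, uniform rows.
      return(matrix(1 / qj, nrow = Ktrue, ncol = qj))
    }
    # Non-dominant categories share the remaining mass equally.
    P <- matrix((1 - strongprop) / (qj - 1), nrow = Ktrue, ncol = qj)
    # ASSUMPTION: cluster k's dominant category is ((k - 1) mod qj) + 1.
    dominant <- (seq_len(Ktrue) - 1) %% qj + 1
    P[cbind(seq_len(Ktrue), dominant)] <- strongprop
    P
  })
}

prop.J.list <- sketch_prop(J = 3, q.vec = c(3, 4, 5), Ktrue = 2, which.noise = 3)
sapply(prop.J.list, dim)      # each element is Ktrue x qj
sapply(prop.J.list, rowSums)  # every row is a probability vector
```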
Generates an (NxJ) categorical data matrix from the category proportions given in prop.list.
generate.cate.list(N = N, prop.list = prop.list)
N |
The number of observations. |
prop.list |
A list of length J, each element of which is a vector of length qj giving the proportion of each category. |
An (NxJ) categorical data matrix.
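The idea of drawing one column per variable from its category proportions can be sketched in base R as follows. This is an illustration using sample(), not the body of generate.cate.list. The toy prop.list is an assumption.

```r
# Base-R sketch: draw each variable independently from its category proportions.
set.seed(123)
N <- 200
prop.list <- list(c(0.7, 0.2, 0.1),  # variable 1: q1 = 3 categories
                  c(0.5, 0.5))       # variable 2: q2 = 2 categories

cate.mat <- sapply(prop.list, function(p) {
  sample(seq_along(p), size = N, replace = TRUE, prob = p)
})

dim(cate.mat)                     # N rows, J columns
prop.table(table(cate.mat[, 1]))  # empirical proportions of variable 1
```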
Generates an (NxJ) clustered categorical data matrix from prop.J.list and the true cluster allocation.
generate.catecls( N = N, J = J, q.vec = q.vec, Ktrue = Ktrue, prop.J.list = prop.J.list, clstr.vec = clstr.vec )
N |
The number of observations. |
J |
The number of active variables. |
q.vec |
A vector of length J giving the number of categories for each active variable. |
Ktrue |
An integer giving the number of true clusters. |
prop.J.list |
A list of length J, where each element is a (Ktrue x qj) matrix giving the proportion of each of the qj categories in each of the |
clstr.vec |
A vector of length N giving the true cluster of each observation. |
An (NxJ) clustered categorical data matrix.
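A minimal base-R sketch of cluster-dependent sampling is given below, assuming that an observation in cluster k draws variable j from row k of the j-th proportion matrix. It illustrates the inputs and output described above and is not the package's generate.catecls. The toy proportions are made up.

```r
# Base-R sketch: sample each variable from the row of its proportion matrix
# that corresponds to the observation's true cluster.
set.seed(42)
N <- 100; Ktrue <- 2
clstr.vec <- rep(seq_len(Ktrue), each = N / Ktrue)  # true cluster per observation

prop.J.list <- list(
  rbind(c(0.8, 0.1, 0.1), c(0.1, 0.1, 0.8)),  # variable 1: Ktrue x 3
  rbind(c(0.9, 0.1),      c(0.1, 0.9))        # variable 2: Ktrue x 2
)

cls.mat <- sapply(prop.J.list, function(P) {
  x <- integer(N)
  for (k in seq_len(nrow(P))) {
    idx <- which(clstr.vec == k)
    x[idx] <- sample(ncol(P), size = length(idx), replace = TRUE, prob = P[k, ])
  }
  x
})

# Cross-tabulate the first variable against the true clusters.
table(cluster = clstr.vec, category = cls.mat[, 1])
```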
Generates an artificial (NxH) external variable matrix.
generate.ext(N,extcate.vec=extcate.vec,unbala.cate=FALSE)
N |
The number of observations. |
extcate.vec |
A vector of length H, where each element gives the number of categories of the corresponding external variable. |
unbala.cate |
A logical value. If TRUE, the category proportions of each external variable are unbalanced. The default is FALSE. |
An (NxH) external variable matrix.
### data setting
N <- 30; extcate.vec <- c(2, 3)
ext.mat <- generate.ext(N, extcate.vec = extcate.vec)
Generates an (NxJ) categorical data matrix.
generate.onedata(N=100,J=5,Ktrue=3,q.vec=rep(3,5),noise.prop=0.3)
N |
The number of observations. Default is 100. |
J |
The number of active variables. Default is 5. |
Ktrue |
The number of true clusters. Default is 3. |
q.vec |
A vector of length J giving the number of categories for each active variable. Default is rep(3,5). |
noise.prop |
A numeric value between 0 and 1 indicating the proportion of noise variables among J variables. Default is 0.3. |
Returns a list with the following elements.
data.mat |
An (NxJ) data frame of categorical data. |
clstr0.vec |
An integer vector (values in 1:Ktrue) of length N giving the cluster to which each observation is allocated. |
### data setting
N <- 30; J <- 10; Ktrue <- 2; q.vec <- rep(5, J); noise.prop <- 0.3
datagene <- generate.onedata(N = N, J = J, Ktrue = Ktrue, q.vec = q.vec, noise.prop = noise.prop)
Applies MCCCA to mcccadata.list.
MCCCA( mccca.data, K.vec = K.vec, known.vec = NULL, knowncluster.list = NULL, nstart = 3, maxit = 50, p = 2, tol = 1e-08, verbose = TRUE, remove.miss = TRUE, kmeans.initial = TRUE )
mccca.data |
A list created in |
K.vec |
An integer vector of length C (the number of classes). Each element corresponds to the number of clusters in each class specified for estimation. |
known.vec |
A vector of length C giving logical values indicating whether a cluster allocation in each class is known or not. The default is all |
knowncluster.list |
A list of C vectors giving the known cluster allocation of each class; vectors for classes whose allocation is unknown can be NA. The default is NULL. |
nstart |
An integer indicating the number of random initial values. |
maxit |
An integer indicating the maximum number of iterations. |
p |
An integer indicating the dimension of the quantification. The default is 2. |
tol |
A numeric value indicating the absolute convergence tolerance. |
verbose |
A logical value indicating. If |
remove.miss |
A logical value indicating whether categories that no one chooses are removed or not. The default is |
kmeans.initial |
A logical value indicating whether the 1st initial value for indicator matrix is generated by kmeans or not. The default is |
Bg, Gg and Qg are scaled versions of B, G and Q respectively, such that the average squared deviation from the origin of the row and column points is the same (see Section 2.3 of the paper).
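The scaling property stated above can be illustrated with a short base-R sketch. It assumes a single scalar gamma that multiplies the cluster coordinates and divides the category coordinates. That form is an assumption chosen to satisfy the stated property. The exact definition used by the package, including how Q is scaled, is the one in Section 2.3 of the paper. The matrices G and B below are random toy data.

```r
# Sketch of a symmetric rescaling (ASSUMED form, see the paper for the actual one).
set.seed(1)
K <- 4; Q <- 10; p <- 2
G <- matrix(rnorm(K * p), K, p)          # toy cluster coordinates (K x p)
B <- matrix(rnorm(Q * p, sd = 5), Q, p)  # toy category coordinates (Q x p)

# Solve gamma^2 * sum(G^2) / K == sum(B^2) / (gamma^2 * Q) for gamma.
gamma <- ((K / Q) * sum(B^2) / sum(G^2))^(1 / 4)
Gg <- gamma * G
Bg <- B / gamma

# Average squared distance from the origin is now equal for both point sets.
c(cluster = mean(rowSums(Gg^2)), category = mean(rowSums(Bg^2)))
```

Because one matrix is multiplied by gamma and the other divided by it, the inner products between cluster and category points are unchanged by this rescaling.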
If you want to specify the cluster allocation for some or all classes, prepare the following two objects.
- knowncluster.list: A list of C vectors. The length of each vector in the list should equal the number of rows of the corresponding matrix in data.list (i.e. length(knowncluster.list[[c]]) = nrow(data.list[[c]]) for c = 1, ..., C). For example, suppose that data.list is a list of 4 matrices (meaning C=4), the cluster assignment is known only for the second class, and the assignments in the other classes are to be estimated. In this case, the second vector of knowncluster.list should be the vector of cluster indices to which the observations in each row of data.list[[2]] belong, with length nrow(data.list[[2]]), and the other vectors (1, 3, and 4) in the list can be set to NA. For each vector in knowncluster.list, the cluster indices should start from 1 and there should be no skipped numbers.
- known.vec: A logical vector of length C. For example, if C=4 and the cluster assignment is known only for the second class, it should be known.vec=c(FALSE,TRUE,FALSE,FALSE).
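The requirements on knowncluster.list and known.vec can be checked before calling MCCCA with a small helper such as the one below. The function check_known is hypothetical and not part of the package. It only encodes the rules stated above: matching lengths, and cluster indices 1, 2, ... without gaps for every class marked as known.

```r
# Hypothetical pre-flight check, NOT a package function.
check_known <- function(data.list, knowncluster.list, known.vec) {
  stopifnot(is.logical(known.vec),
            length(known.vec) == length(data.list),
            length(knowncluster.list) == length(data.list))
  for (i in which(known.vec)) {
    cl <- knowncluster.list[[i]]
    if (length(cl) != nrow(data.list[[i]]))
      stop("class ", i, ": one cluster index is needed per row of data.list[[", i, "]]")
    if (anyNA(cl) || any(cl < 1) || !setequal(cl, seq_len(max(cl))))
      stop("class ", i, ": cluster indices must be 1, 2, ... with no skipped numbers")
  }
  invisible(TRUE)
}

# Toy setting with C = 4 classes; only the 2nd class has a known allocation.
data.list <- list(matrix(1, 3, 2), matrix(1, 5, 2), matrix(1, 4, 2), matrix(1, 2, 2))
knowncluster.list <- rep(list(NA), 4)
knowncluster.list[[2]] <- c(1, 1, 2, 2, 2)
known.vec <- c(FALSE, TRUE, FALSE, FALSE)

check_known(data.list, knowncluster.list, known.vec)
```

A vector such as c(1, 1, 3, 3, 3) for the second class would be rejected by this check, because index 2 is skipped.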
Returns a list with the following elements.
G |
A (Kxp) quantification matrix for all clusters (K= |
Gg |
Scaled |
B |
A (Qxp) quantification matrix for all categories (Q= |
Bg |
Scaled |
Q |
A (Nxp) quantification matrix for all observations. |
Qg |
Scaled |
clses.list |
A list of C vectors, giving the estimated cluster index for each observation in each class. |
clses.vec |
A vector of length N, where each element represents the cluster index to which the observations in the rows of |
optval |
A numeric value giving the optimized value of the objective function that is the smallest among all initial values. |
optval.vec |
A numeric vector of length |
stepconv |
An integer giving the number of iterations until convergence at the initial value where the objective function was the smallest. |
stepconv.vec |
An integer vector of length |
catename.vec |
A character vector of length |
catename.vari.vec |
A character vector of length |
cate.removed |
If there is a category that no one chooses and |
cluster.vec |
An integer vector of length K, where each index in the |
q.vec |
A vector of length J, same as the one given in |
K.vec |
A vector of length C, which is used as an input in this |
classlabel |
A character vector of length C, same as the one given in |
Mariko Takagishi & Michel van de Velden (2022): Visualizing Class Specific Heterogeneous Tendencies in Categorical Data, Journal of Computational and Graphical Statistics, DOI: 10.1080/10618600.2022.2035737
# setting
N <- 100; J <- 5; Ktrue <- 2; q.vec <- rep(5, J); noise.prop <- 0.2
extcate.vec <- c(2, 3) # the number of categories for each external variable

# generate categorical variable data
catedata.list <- generate.onedata(N = N, J = J, Ktrue = Ktrue, q.vec = q.vec, noise.prop = noise.prop)
data.cate <- catedata.list$data.mat
clstr0.vec <- catedata.list$clstr0.vec

# generate external variable data
data.ext <- generate.ext(N, extcate.vec = extcate.vec)

# create mccca.list to be applied to MCCCA function
mccca.data <- create.MCCCAdata(data.cate, ext.mat = data.ext, clstr0.vec = clstr0.vec)

# specify the number of clusters for each of the C classes
C <- length(mccca.data$data.list)
K.vec <- rep(2, C)

# apply MCCCA
mccca.res <- MCCCA(mccca.data, K.vec = K.vec)

# plot MCCCA result
plot(mccca.res)

# if you want to specify cluster allocation in the 2nd class:
knowncluster.list <- rep(list(NA), C)
# specify cluster index for the 2nd class
N2 <- nrow(mccca.data$data.list[[2]])
knowncluster.list[[2]] <- rep(c(1, 2), times = c(2, N2 - 2))
known.vec <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
mccca.res <- MCCCA(mccca.data, K.vec = K.vec, known.vec = known.vec, knowncluster.list = knowncluster.list)
Plots an mccca object.
## S3 method for class 'mccca' plot( x, main = "MCCCA result", catelabel = NULL, classlabel = NULL, classlabel.legend = NULL, xlim = NULL, ylim = NULL, sort.clssize = TRUE, break.size = NULL, output.coord = FALSE, connect.cord = TRUE, include.variname = TRUE, scale.gamma = TRUE, scatter.level = 2, plot.setting = list(alp.point = 0.3, alp.seg = 0.8, txtsize = 3, txtsize.legend = 10), ... )
x |
An object of class |
main |
A character string giving the title of the biplot. |
catelabel |
A character vector of length Q giving labels for all categories to be displayed on the biplot (Q= |
classlabel |
A character vector of length C (C: the number of classes) giving labels for all classes to be displayed on the biplot. If |
classlabel.legend |
A character vector of length C giving labels for all classes to be used on the legend (this can be longer). If |
xlim |
A numeric vector of length 2 giving the range of plot on the x (horizontal) axis. If NULL, the range is automatically determined. |
ylim |
A numeric vector of length 2 for the y (vertical) axis (same role as |
sort.clssize |
If |
break.size |
An integer vector that adjusts the size of bubble displayed on the legend. |
output.coord |
If |
connect.cord |
If |
include.variname |
If |
scale.gamma |
If |
scatter.level |
A numeric value that adjusts the scatter of points in the biplot. The higher the value, the more scattered the points are. The default is 2. |
plot.setting |
A list of biplot settings. See details. |
... |
Additional arguments passed to |
Parameters in plot.setting are as follows:
- alp.point: A numeric value from 0 to 1 that adjusts the transparency of the bubble points. The default is 0.3.
- alp.seg: A numeric value from 0 to 1 that adjusts the transparency of the segments between texts and points. The default is 0.8.
- txtsize: A numeric value that adjusts the text size on the biplot. The default is 3.
- txtsize.legend: A numeric value that adjusts the text size of the legend on the biplot. The default is 10.
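When only some of these parameters need changing, one cautious approach is to start from the full default list shown in the usage and override individual entries, so that a complete list is always passed. The sketch below uses utils::modifyList for this. Whether plot.mccca would also accept a partial list is not stated here, so passing all four entries is the assumption made. The plot call is commented out because it needs a fitted MCCCA result (mccca.res).

```r
# Defaults copied from the usage of the plot method.
default.setting <- list(alp.point = 0.3, alp.seg = 0.8,
                        txtsize = 3, txtsize.legend = 10)

# Override two entries and keep the others at their defaults.
my.setting <- utils::modifyList(default.setting,
                                list(alp.point = 0.5, txtsize = 4))
str(my.setting)

# plot(mccca.res, plot.setting = my.setting)  # mccca.res: output of MCCCA()
```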
If output.coord is TRUE, returns a list with the following elements.
Cocls.mat |
A (Kx4) coordinate matrix of clusters, where the last two columns are the coordinates estimated by MCCCA, and the first two columns are the coordinates moved from the estimated coordinates to prevent overlap. |
Cocate.mat |
A (Kx4) coordinate matrix of categories (each column plays the same role as |
Mariko Takagishi & Michel van de Velden (2022): Visualizing Class Specific Heterogeneous Tendencies in Categorical Data, Journal of Computational and Graphical Statistics, DOI: 10.1080/10618600.2022.2035737
# setting
N <- 100; J <- 5; Ktrue <- 2; q.vec <- rep(5, J); noise.prop <- 0.2
extcate.vec <- c(2, 3) # the number of categories for each external variable

# generate categorical variable data
catedata.list <- generate.onedata(N = N, J = J, Ktrue = Ktrue, q.vec = q.vec, noise.prop = noise.prop)
data.cate <- catedata.list$data.mat
clstr0.vec <- catedata.list$clstr0.vec

# generate external variable data
data.ext <- generate.ext(N, extcate.vec = extcate.vec)

# create mccca.list to be applied to MCCCA function
mccca.data <- create.MCCCAdata(data.cate, ext.mat = data.ext, clstr0.vec = clstr0.vec)

# specify the number of clusters for each of the C classes
C <- length(mccca.data$data.list)
K.vec <- rep(2, C)

# apply MCCCA
mccca.res <- MCCCA(mccca.data, K.vec = K.vec)

# plot MCCCA result
plot(mccca.res)