Title: | Random Cluster Generation (with Specified Degree of Separation) |
---|---|
Description: | We developed the clusterGeneration package to provide functions for generating random clusters, generating random covariance/correlation matrices, calculating a separation index (data and population version) for pairs of clusters or cluster distributions, and 1-D and 2-D projection plots to visualize clusters. The package also contains a function to generate random clusters based on factorial designs with factors such as degree of separation, number of clusters, number of variables, and number of noisy variables. |
Authors: | Weiliang Qiu <[email protected]>, Harry Joe <[email protected]>. |
Maintainer: | Weiliang Qiu <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.3.8 |
Built: | 2024-12-24 06:39:41 UTC |
Source: | CRAN |
Generate an orthogonal matrix with given dimension.
genOrthogonal(dim)
dim |
integer. Dimension of the orthogonal matrix. |
An orthogonal matrix with dimension dim.
set.seed(12345)
Q = genOrthogonal(3)
print(Q)
# Q %*% t(Q) should be (numerically) the identity matrix
A = Q %*% t(Q)
print(A)
Generate a positive definite matrix/covariance matrix.
genPositiveDefMat(
  dim,
  covMethod = c("eigen", "onion", "c-vine", "unifcorrmat"),
  eigenvalue = NULL,
  alphad = 1,
  eta = 1,
  rangeVar = c(1, 10),
  lambdaLow = 1,
  ratioLambda = 10)
dim |
Dimension of the matrix to be generated. |
covMethod |
Method to generate positive definite matrices/covariance matrices. Choices are “eigen”, “onion”, “c-vine”, or “unifcorrmat”; see details below. |
eigenvalue |
numeric. User-specified eigenvalues when covMethod = "eigen". |
alphad |
Parameter for the “unifcorrmat” method to generate a random correlation matrix. |
eta |
Parameter for the “c-vine” and “onion” methods to generate a random correlation matrix. |
rangeVar |
Range for variances of a covariance matrix (see details). The default range is [1, 10]. |
lambdaLow |
Lower bound on the eigenvalues of cluster covariance matrices. If the argument covMethod = "eigen", the eigenvalues are randomly generated from the interval [lambdaLow, lambdaLow * ratioLambda]. |
ratioLambda |
The ratio of the upper bound of the eigenvalues to the lower bound of the eigenvalues of cluster covariance matrices. See lambdaLow. |
The current version of the function genPositiveDefMat implements four methods to generate random covariance matrices. The first method, denoted by “eigen”, first randomly generates eigenvalues ($\lambda_1, \ldots, \lambda_p$) for the covariance matrix ($\Sigma$), then uses columns of a randomly generated orthogonal matrix ($Q = (\alpha_1, \ldots, \alpha_p)$) as eigenvectors. The covariance matrix $\Sigma$ is then constructed as $Q \, \mathrm{diag}(\lambda_1, \ldots, \lambda_p) \, Q^T$.
The remaining methods, denoted as “onion”, “c-vine”, and “unifcorrmat” respectively, first generate a random correlation matrix ($R$) via a method mentioned or proposed in Joe (2006), then randomly generate variances ($\sigma_1^2, \ldots, \sigma_p^2$) from an interval specified by the argument rangeVar. The covariance matrix $\Sigma$ is then constructed as $\mathrm{diag}(\sigma_1, \ldots, \sigma_p) \, R \, \mathrm{diag}(\sigma_1, \ldots, \sigma_p)$.
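The two constructions described above can be sketched in base R without the package. This is only an illustration under stated assumptions: the eigenvalues and variances are drawn uniformly from their intervals, the orthogonal matrix comes from a QR decomposition of a Gaussian matrix, and the correlation matrix is a simple stand-in rather than one of the samplers from Joe (2006). None of this is claimed to match the internals of genPositiveDefMat.

```r
# Sketch only: base R stand-ins for the "eigen" and correlation-based constructions.
set.seed(1)
p <- 3

# "eigen"-style construction: Sigma = Q diag(lambda) Q^T
lambda <- runif(p, min = 1, max = 10)            # assumed: uniform on [lambdaLow, lambdaLow * ratioLambda]
Q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))        # orthonormal columns from a QR decomposition
Sigma_eigen <- Q %*% diag(lambda) %*% t(Q)

# Correlation-based construction: Sigma = D R D with D = diag(sigma_1, ..., sigma_p)
R <- cov2cor(crossprod(matrix(rnorm(p * p), p, p)))  # stand-in correlation matrix, not Joe's (2006) sampler
sigma2 <- runif(p, min = 1, max = 10)                 # assumed: variances uniform on rangeVar
D <- diag(sqrt(sigma2))
Sigma_corr <- D %*% R %*% D

# Checks: eigenvalues and variances are recovered up to rounding error
sort(eigen(Sigma_eigen, symmetric = TRUE)$values)
sort(lambda)
diag(Sigma_corr)
sigma2
```

In the first construction the eigenvalues of the result are exactly the values drawn, which is why the “eigen” method gives direct control over cluster diameters. In the second, the diagonal of the result equals the drawn variances.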
egvalues |
eigenvalues of Sigma |
Sigma |
positive definite matrix/covariance matrix |
Weiliang Qiu [email protected]
Harry Joe [email protected]
Joe, H. (2006) Generating Random Correlation Matrices Based on Partial Correlations. Journal of Multivariate Analysis, 97, 2177–2189.
Ghosh, S., Henderson, S. G. (2003). Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation (TOMACS), 13(3), 276–294.
Kurowicka and Cooke, 2006. Uncertainty Analysis with High Dimensional Dependence Modelling, Wiley, 2006.
genPositiveDefMat(dim = 4, covMethod = "unifcorrmat")

aa <- genPositiveDefMat(
  dim = 3, covMethod = "eigen", eigenvalue = c(3, 2, 1))
print(aa)
print(eigen(aa$Sigma))
Generate cluster data sets with specified degree of separation. The separation between any cluster and its nearest neighboring cluster can be set to a specified value. The covariance matrices of clusters can have arbitrary diameters, shapes and orientations.
genRandomClust(
  numClust,
  sepVal = 0.01,
  numNonNoisy = 2,
  numNoisy = 0,
  numOutlier = 0,
  numReplicate = 3,
  fileName = "test",
  clustszind = 2,
  clustSizeEq = 50,
  rangeN = c(50, 200),
  clustSizes = NULL,
  covMethod = c("eigen", "onion", "c-vine", "unifcorrmat"),
  eigenvalue = NULL,
  rangeVar = c(1, 10),
  lambdaLow = 1,
  ratioLambda = 10,
  alphad = 1,
  eta = 1,
  rotateind = TRUE,
  iniProjDirMethod = c("SL", "naive"),
  projDirMethod = c("newton", "fixedpoint"),
  alpha = 0.05,
  ITMAX = 20,
  eps = 1.0e-10,
  quiet = TRUE,
  outputDatFlag = TRUE,
  outputLogFlag = TRUE,
  outputEmpirical = TRUE,
  outputInfo = TRUE)
numClust |
Number of clusters in a data set. |
sepVal |
Desired value of the separation index between a cluster
and its nearest neighboring cluster. Theoretically, |
numNonNoisy |
Number of non-noisy variables. |
numNoisy |
Number of noisy variables.
The default values of |
numOutlier |
Number or ratio of outliers. If |
numReplicate |
Number of data sets to be generated for the same cluster structure specified
by the other arguments of the function |
fileName |
The first part of the names of data files that record the generated data sets
and associated information, such as cluster membership of data points, labels
of noisy variables, separation index matrix, projection directions, etc.
(see details). The default value of |
clustszind |
Cluster size indicator.
|
clustSizeEq |
Cluster size.
If the argument |
rangeN |
The range of cluster sizes.
If |
clustSizes |
The sizes of clusters.
If |
covMethod |
Method to generate covariance matrices for clusters (see details). The default method is 'eigen' so that the user can directly specify the range of the diameters of clusters. |
eigenvalue |
numeric. User-specified eigenvalues when covMethod = "eigen". |
rangeVar |
Range for variances of a covariance matrix (see details). The default range is [1, 10]. |
lambdaLow |
Lower bound of the eigenvalues of cluster covariance matrices. If the argument covMethod = "eigen", we need to generate eigenvalues for cluster covariance matrices. The eigenvalues are randomly generated from the interval [lambdaLow, lambdaLow * ratioLambda]. |
ratioLambda |
The ratio of the upper bound of the eigenvalues to the lower bound of the eigenvalues of cluster covariance matrices. See lambdaLow. |
alphad |
Parameter for the “unifcorrmat” method to generate a random correlation matrix. |
eta |
Parameter for the “c-vine” and “onion” methods to generate a random correlation matrix. |
rotateind |
Rotation indicator.
|
iniProjDirMethod |
Indicating the method to get initial projection direction when calculating
the separation index between a pair of clusters (c.f. Qiu and Joe,
2006a, 2006b). |
projDirMethod |
Indicating the method to get the optimal projection direction when calculating
the separation index between a pair of clusters (c.f. Qiu and Joe,
2006a, 2006b). |
alpha |
Tuning parameter reflecting the percentage in the two
tails of a projected cluster that might be outlying.
We set |
ITMAX |
Maximum iteration allowed when iteratively calculating the optimal projection direction. The actual number of iterations is usually much less than the default value 20. |
eps |
Convergence threshold. A small positive number to check if a quantity |
quiet |
A flag to switch on/off the outputs of intermediate results and/or possible warning messages. The default value is TRUE. |
outputDatFlag |
Indicates if data set should be output to file. |
outputLogFlag |
Indicates if log info should be output to file. |
outputEmpirical |
Indicates if empirical separation indices and projection directions should be
calculated. This option is useful when generating clusters with sizes which
are not large enough so that the sample covariance matrices may be singular.
Hence, by default, |
outputInfo |
Indicates if theoretical and empirical separation information data frames
should be output to a file with format |
The function genRandomClust is an implementation of the random cluster generation method proposed in Qiu and Joe (2006a), which improves the cluster generation method proposed in Milligan (1985) so that the degree of separation between any cluster and its nearest neighboring cluster can be set to a specified value while the cluster covariance matrices can be arbitrary positive definite matrices, and so that the generated clusters are not necessarily visible in pairwise scatterplots of variables. The separation between a pair of clusters is measured by the separation index proposed in Qiu and Joe (2006b).
Among the methods that genRandomClust offers to generate covariance matrices for clusters (see covMethod), the first, denoted by “eigen”, first randomly generates eigenvalues ($\lambda_1, \ldots, \lambda_p$) for the covariance matrix ($\Sigma$), then uses columns of a randomly generated orthogonal matrix ($Q = (\alpha_1, \ldots, \alpha_p)$) as eigenvectors. The covariance matrix $\Sigma$ is then constructed as $Q \, \mathrm{diag}(\lambda_1, \ldots, \lambda_p) \, Q^T$.
Another method, denoted as “unifcorrmat”, first generates a random correlation matrix ($R$) via the method proposed in Joe (2006), then randomly generates variances ($\sigma_1^2, \ldots, \sigma_p^2$) from an interval specified by the argument rangeVar. The covariance matrix $\Sigma$ is then constructed as $\mathrm{diag}(\sigma_1, \ldots, \sigma_p) \, R \, \mathrm{diag}(\sigma_1, \ldots, \sigma_p)$.
For each data set generated, the function genRandomClust outputs four files: data file, log file, membership file, and noisy set file. All four files have the same name format: [fileName]_[i].[extension], where $i$ indicates the replicate number, and ‘extension’ can be ‘dat’, ‘log’, ‘mem’, and ‘noisy’.
The data file with file extension ‘dat’ contains $n+1$ rows and $p$ columns, where $n$ is the number of data points and $p$ is the number of variables. The first row is the variable names.
The log file with file extension ‘log’ contains information such as cluster sizes, mean vectors, covariance matrices, projection directions, separation index matrices, etc.
The membership file with file extension ‘mem’ contains $n$ rows and one column of cluster memberships for data points.
The noisy set file with file extension ‘noisy’ contains a row of labels of noisy variables.
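As a small illustration of the naming scheme just described, the following base R sketch builds the four file names for one replicate. The helper `outputFileNames` is hypothetical and is not part of the package; it only mirrors the [fileName]_[i].[extension] pattern stated in the text.

```r
# Hypothetical helper (not a clusterGeneration function): builds the
# four output file names for replicate i from the fileName prefix.
outputFileNames <- function(fileName, i) {
  ext <- c("dat", "log", "mem", "noisy")
  setNames(paste0(fileName, "_", i, ".", ext), ext)
}

# File names for the first replicate with the default prefix "test"
outputFileNames("test", 1)
```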
When generating clusters, population covariance matrices are all positive definite. However, sample covariance matrices might be only positive semi-definite (that is, singular) due to small cluster sizes. In this case, the function genRandomClust will automatically use the “fixedpoint” method to search for the optimal projection direction.
The current version of the function genPositiveDefMat implements four methods to generate random covariance matrices. The first method, denoted by “eigen”, first randomly generates eigenvalues ($\lambda_1, \ldots, \lambda_p$) for the covariance matrix ($\Sigma$), then uses columns of a randomly generated orthogonal matrix ($Q = (\alpha_1, \ldots, \alpha_p)$) as eigenvectors. The covariance matrix $\Sigma$ is then constructed as $Q \, \mathrm{diag}(\lambda_1, \ldots, \lambda_p) \, Q^T$.
The remaining methods, denoted as “onion”, “c-vine”, and “unifcorrmat” respectively, first generate a random correlation matrix ($R$) via a method mentioned or proposed in Joe (2006), then randomly generate variances ($\sigma_1^2, \ldots, \sigma_p^2$) from an interval specified by the argument rangeVar. The covariance matrix $\Sigma$ is then constructed as $\mathrm{diag}(\sigma_1, \ldots, \sigma_p) \, R \, \mathrm{diag}(\sigma_1, \ldots, \sigma_p)$.
The function outputs four files for each data set (see details).
This function also returns separation information data frames infoFrameTheory and infoFrameData based on population and empirical mean vectors and covariance matrices of clusters for all the data sets generated. Both infoFrameTheory and infoFrameData contain the following seven columns:
Column 1: |
Labels of clusters (1, 2, ..., numClust). |
Column 2: |
Labels of the corresponding nearest neighbors. |
Column 3: |
Separation indices of the clusters to their nearest neighboring clusters. |
Column 4: |
Labels of the corresponding farthest neighboring clusters. |
Column 5: |
Separation indices of the clusters to their farthest neighbors. |
Column 6: |
Median separation indices of the clusters to their neighbors. |
Column 7: |
Data file names with format |
The function also returns three lists: datList, memList, and noisyList.
datList: |
a list of data matrices for generated data sets. |
memList: |
a list of cluster memberships for data points for generated data sets. |
noisyList: |
a list of sets of noisy variables for generated data sets. |
This function might take a while to complete.
Weiliang Qiu [email protected]
Harry Joe [email protected]
Joe, H. (2006) Generating Random Correlation Matrices Based on Partial Correlations. Journal of Multivariate Analysis, 97, 2177–2189.
Milligan G. W. (1985) An Algorithm for Generating Artificial Test Clusters. Psychometrika 50, 123–127.
Qiu, W.-L. and Joe, H. (2006a) Generation of Random Clusters with Specified Degree of Separation. Journal of Classification, 23(2), 315–334.
Qiu, W.-L. and Joe, H. (2006b) Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50, 585–603.
Su, J. Q. and Liu, J. S. (1993) Linear Combinations of Multiple Diagnostic Markers. Journal of the American Statistical Association, 88, 1350–1355.
Ghosh, S., Henderson, S. G. (2003). Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation (TOMACS), 13(3), 276–294.
Kurowicka and Cooke, 2006. Uncertainty Analysis with High Dimensional Dependence Modelling, Wiley, 2006.
## Not run: 
tmp1 <- genRandomClust(
  numClust = 7, sepVal = 0.3,
  numNonNoisy = 5, numNoisy = 3, numOutlier = 5,
  numReplicate = 2, fileName = "chk1")
## End(Not run)

## Not run: 
tmp2 <- genRandomClust(
  numClust = 7, sepVal = 0.3,
  numNonNoisy = 5, numNoisy = 3, numOutlier = 5,
  numReplicate = 2, covMethod = "unifcorrmat", fileName = "chk2")
## End(Not run)

## Not run: 
tmp3 <- genRandomClust(
  numClust = 2, sepVal = -0.1,
  numNonNoisy = 2, numNoisy = 6, numOutlier = 30,
  numReplicate = 1, clustszind = 1, clustSizeEq = 80,
  rangeVar = c(10, 20), covMethod = "unifcorrmat",
  iniProjDirMethod = "naive", projDirMethod = "fixedpoint",
  fileName = "chk3")
## End(Not run)
Optimal projection direction and corresponding separation index for pairs of clusters.
getSepProjTheory(
  muMat,
  SigmaArray,
  iniProjDirMethod = c("SL", "naive"),
  projDirMethod = c("newton", "fixedpoint"),
  alpha = 0.05,
  ITMAX = 20,
  eps = 1.0e-10,
  quiet = TRUE)

getSepProjData(
  y,
  cl,
  iniProjDirMethod = c("SL", "naive"),
  projDirMethod = c("newton", "fixedpoint"),
  alpha = 0.05,
  ITMAX = 20,
  eps = 1.0e-10,
  quiet = TRUE)
muMat |
Matrix of mean vectors. Rows correspond to mean vectors for clusters. |
SigmaArray |
Array of covariance matrices. |
y |
Data matrix. Rows correspond to observations. Columns correspond to variables. |
cl |
Cluster membership vector. |
iniProjDirMethod |
Indicating the method to get initial projection direction when calculating
the separation index between a pair of clusters (c.f. Qiu and Joe,
2006a, 2006b). |
projDirMethod |
Indicating the method to get the optimal projection direction when calculating
the separation index between a pair of clusters (c.f. Qiu and Joe,
2006a, 2006b). |
alpha |
Tuning parameter reflecting the percentage in the two
tails of a projected cluster that might be outlying.
We set |
ITMAX |
Maximum number of iterations allowed when iteratively calculating the optimal projection direction. The actual number of iterations is usually much less than the default value 20. |
eps |
Convergence threshold. A small positive number to check if a quantity |
quiet |
A flag to switch on/off the outputs of intermediate results and/or possible warning messages. The default value is TRUE. |
When calculating the optimal projection direction and the corresponding optimal separation index for a pair of clusters, if one or both cluster covariance matrices are singular, the ‘newton’ method cannot be used. In this case, the functions getSepProjTheory and getSepProjData will automatically use the ‘fixedpoint’ method to search for the optimal projection direction, even if the user specifies the value of the argument projDirMethod as ‘newton’. Also, multiple initial projection directions will be evaluated.
Specifically, $2+2p$ projection directions will be evaluated, where $p$ is the number of variables. The first projection direction is the “naive” direction $\mu_2 - \mu_1$. The second projection direction is the “SL” projection direction $(\Sigma_1 + \Sigma_2)^{-1}(\mu_2 - \mu_1)$. The next $p$ projection directions are the $p$ eigenvectors of the covariance matrix of the first cluster. The remaining $p$ projection directions are the $p$ eigenvectors of the covariance matrix of the second cluster.
Each of these $2+2p$ projection directions is in turn used as the initial projection direction for the ‘fixedpoint’ algorithm to obtain the optimal projection direction and the corresponding optimal separation index. We also obtain $2+2p$ separation indices by projecting the two clusters along each of these $2+2p$ projection directions.
Finally, the projection direction with the largest separation index among the $2(2+2p)$ optimal separation indices is chosen as the optimal projection direction. The corresponding separation index is chosen as the optimal separation index.
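The candidate set described above can be sketched in base R. This is an illustration under explicit assumptions, not the package's code: the population parameters are made-up values, `sepIndexDir` is a hypothetical helper, the naive and SL directions are taken to be the mean difference and the mean difference premultiplied by the inverse of the summed covariance matrices, and the separation index is assumed to have the normal-theory form (gap between the projected clusters divided by their total span). The ‘fixedpoint’ refinement is not reproduced; each candidate is only scored directly.

```r
# Illustrative population parameters for two 2-D cluster distributions
mu1 <- c(0, 0);  Sigma1 <- matrix(c(2, 1, 1, 5), 2, 2)
mu2 <- c(10, 0); Sigma2 <- matrix(c(5, -1, -1, 2), 2, 2)
p <- length(mu1)
alpha <- 0.05

# Hypothetical helper: assumed normal-theory separation index along direction a
sepIndexDir <- function(a) {
  d <- abs(sum(a * (mu2 - mu1)))   # distance between projected means (sign of a is irrelevant)
  s <- sqrt(drop(t(a) %*% Sigma1 %*% a)) + sqrt(drop(t(a) %*% Sigma2 %*% a))
  z <- qnorm(1 - alpha / 2)
  (d - z * s) / (d + z * s)
}

# The 2 + 2p candidate directions, one per column
cand <- cbind(
  naive = mu2 - mu1,
  SL    = solve(Sigma1 + Sigma2, mu2 - mu1),
  eigen(Sigma1, symmetric = TRUE)$vectors,
  eigen(Sigma2, symmetric = TRUE)$vectors
)

# Score every candidate and keep the one with the largest index
sepVals <- apply(cand, 2, sepIndexDir)
best <- which.max(sepVals)
sepVals
cand[, best]
```

In this particular example the two covariance matrices sum to a multiple of the identity, so the naive and SL candidates point in the same direction and receive the same score.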
sepValMat |
Separation index matrix |
projDirArray |
Array of projection directions for each pair of clusters |
Weiliang Qiu [email protected]
Harry Joe [email protected]
Qiu, W.-L. and Joe, H. (2006a) Generation of Random Clusters with Specified Degree of Separation. Journal of Classification, 23(2), 315–334.
Qiu, W.-L. and Joe, H. (2006b) Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50, 585–603.
Su, J. Q. and Liu, J. S. (1993) Linear Combinations of Multiple Diagnostic Markers. Journal of the American Statistical Association, 88, 1350–1355.
n1 <- 50
mu1 <- c(0, 0)
Sigma1 <- matrix(c(2, 1, 1, 5), 2, 2)
n2 <- 100
mu2 <- c(10, 0)
Sigma2 <- matrix(c(5, -1, -1, 2), 2, 2)
projDir <- c(1, 0)
muMat <- rbind(mu1, mu2)
SigmaArray <- array(0, c(2, 2, 2))
SigmaArray[, , 1] <- Sigma1
SigmaArray[, , 2] <- Sigma2

a <- getSepProjTheory(
  muMat = muMat, SigmaArray = SigmaArray, iniProjDirMethod = "SL")
# separation index for cluster distributions 1 and 2
a$sepValMat[1, 2]
# projection direction for cluster distributions 1 and 2
a$projDirArray[1, 2, ]

library(MASS)
y1 <- mvrnorm(n1, mu1, Sigma1)
y2 <- mvrnorm(n2, mu2, Sigma2)
y <- rbind(y1, y2)
cl <- rep(1:2, c(n1, n2))

b <- getSepProjData(
  y = y, cl = cl, iniProjDirMethod = "SL", projDirMethod = "newton")
# separation index for clusters 1 and 2
b$sepValMat[1, 2]
# projection direction for clusters 1 and 2
b$projDirArray[1, 2, ]
Separation information matrix containing the nearest neighbor and farthest neighbor of each cluster.
nearestNeighborSepVal(sepValMat)
sepValMat |
a |
This function returns a separation information matrix containing K rows and the following six columns, where K is the number of clusters.
Column 1: |
Labels of clusters (1, 2, ..., K). |
Column 2: |
Labels of the corresponding nearest neighbors. |
Column 3: |
Separation indices of the clusters to their nearest neighboring clusters. |
Column 4: |
Labels of the corresponding farthest neighboring clusters. |
Column 5: |
Separation indices of the clusters to their farthest neighbors. |
Column 6: |
Median separation indices of the clusters to their neighbors. |
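The six columns listed above can be mimicked with a few lines of base R. The helper `neighborInfo` below is hypothetical and only illustrates the bookkeeping; it takes the nearest neighbor of a cluster to be the cluster with the smallest separation index to it, ignores the diagonal of the input matrix, and uses made-up index values.

```r
# Hypothetical helper (not the package function): for each cluster, find the
# nearest and farthest neighbors and the median separation index.
neighborInfo <- function(sepValMat) {
  K <- nrow(sepValMat)
  t(sapply(seq_len(K), function(i) {
    others <- setdiff(seq_len(K), i)   # all clusters except cluster i
    s <- sepValMat[i, others]          # separation indices to the other clusters
    c(cluster     = i,
      nearest     = others[which.min(s)],
      nearestSep  = min(s),
      farthest    = others[which.max(s)],
      farthestSep = max(s),
      medianSep   = median(s))
  }))
}

# Made-up symmetric separation index matrix for K = 3 clusters;
# the diagonal entries are placeholders that the sketch never reads.
sepValMat <- matrix(c(-1,  0.2, 0.5,
                      0.2, -1,  0.1,
                      0.5, 0.1, -1), 3, 3, byrow = TRUE)
neighborInfo(sepValMat)
```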
Weiliang Qiu [email protected]
Harry Joe [email protected]
Qiu, W.-L. and Joe, H. (2006a) Generation of Random Clusters with Specified Degree of Separation. Journal of Classification, 23(2), 315–334.
Qiu, W.-L. and Joe, H. (2006b) Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50, 585–603.
n1 <- 50
mu1 <- c(0, 0)
Sigma1 <- matrix(c(2, 1, 1, 5), 2, 2)
n2 <- 100
mu2 <- c(10, 0)
Sigma2 <- matrix(c(5, -1, -1, 2), 2, 2)
n3 <- 30
mu3 <- c(10, 10)
Sigma3 <- matrix(c(3, 1.5, 1.5, 1), 2, 2)
projDir <- c(1, 0)
muMat <- rbind(mu1, mu2, mu3)
SigmaArray <- array(0, c(2, 2, 3))
SigmaArray[, , 1] <- Sigma1
SigmaArray[, , 2] <- Sigma2
SigmaArray[, , 3] <- Sigma3

tmp <- getSepProjTheory(
  muMat = muMat, SigmaArray = SigmaArray, iniProjDirMethod = "SL")
sepValMat <- tmp$sepValMat
nearestNeighborSepVal(sepValMat = sepValMat)
Plot a pair of clusters and their density estimates, which are projected along a specified 1-D projection direction.
plot1DProjection(
  y1,
  y2,
  projDir,
  sepValMethod = c("normal", "quantile"),
  bw = "nrd0",
  xlim = NULL,
  ylim = NULL,
  xlab = "1-D projected clusters",
  ylab = "density estimates",
  title = "1-D Projected Clusters and their density estimates",
  font = 2,
  font.lab = 2,
  cex = 1.2,
  cex.lab = 1.2,
  cex.main = 1.5,
  lwd = 4,
  lty1 = 1,
  lty2 = 2,
  pch1 = 18,
  pch2 = 19,
  col1 = 2,
  col2 = 4,
  type = "l",
  alpha = 0.05,
  eps = 1.0e-10,
  quiet = TRUE)
y1 |
Data matrix of cluster 1. Rows correspond to observations. Columns correspond to variables. |
y2 |
Data matrix of cluster 2. Rows correspond to observations. Columns correspond to variables. |
projDir |
1-D projection direction along which two clusters will be projected. |
sepValMethod |
Method to calculate separation index for a pair of clusters projected onto a
1-D space. |
bw |
The smoothing bandwidth to be used by the function density. |
xlim |
Range of X axis. |
ylim |
Range of Y axis. |
xlab |
X axis label. |
ylab |
Y axis label. |
title |
Title of the plot. |
font |
An integer which specifies which font to use for text (see par). |
font.lab |
The font to be used for x and y labels (see par). |
cex |
A numerical value giving the amount by which plotting text and symbols should be scaled relative to the default (see par). |
cex.lab |
The magnification to be used for x and y labels relative to the current setting of 'cex' (see par). |
cex.main |
The magnification to be used for main titles relative to the current setting of 'cex' (see par). |
lwd |
The line width, a positive number (see par). The default here is 4. |
lty1 |
Line type for cluster 1 (see par). |
lty2 |
Line type for cluster 2 (see par). |
pch1 |
Either an integer specifying a symbol or a single character to be used as the default in plotting points for cluster 1 (see par). |
pch2 |
Either an integer specifying a symbol or a single character to be used as the default in plotting points for cluster 2 (see par). |
col1 |
Color used to indicate cluster 1. |
col2 |
Color used to indicate cluster 2. |
type |
What type of plot should be drawn (see plot). |
alpha |
Tuning parameter reflecting the percentage in the two tails of a projected cluster that might be outlying. |
eps |
A small positive number to check if a quantity |
quiet |
A flag to switch on/off the outputs of intermediate results and/or possible warning messages. The default value is TRUE. |
The ticks along the X axis indicate the positions of the points of the two projected clusters. The positions of $L_i$ and $U_i$, $i = 1, 2$, are also indicated on the X axis, where $L_i$ and $U_i$ are the lower and upper $\alpha/2$ sample percentiles of cluster $i$ if sepValMethod="quantile".
If sepValMethod="normal", $L_i = \bar{x}_i - z_{\alpha/2} s_i$ and $U_i = \bar{x}_i + z_{\alpha/2} s_i$, where $\bar{x}_i$ and $s_i$ are the sample mean and standard deviation of cluster $i$, and $z_{\alpha/2}$ is the upper $\alpha/2$ percentile of the standard normal distribution.
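The two ways of locating the cluster limits can be sketched in base R on already-projected data. The helpers `limits` and `sepIndex1D` are hypothetical names, the data are simulated for illustration, and the index is assumed to be the gap between the two projected clusters divided by their total span, with cluster 1 lying to the left of cluster 2. This is not the package's implementation.

```r
set.seed(1234)
x1 <- rnorm(50, mean = 0, sd = 1.5)    # projected cluster 1 (illustrative data)
x2 <- rnorm(100, mean = 10, sd = 2)    # projected cluster 2 (illustrative data)

# Hypothetical helper: lower and upper limits (L, U) of one projected cluster
limits <- function(x, method = c("normal", "quantile"), alpha = 0.05) {
  method <- match.arg(method)
  if (method == "quantile") {
    # lower and upper alpha/2 sample percentiles
    unname(quantile(x, probs = c(alpha / 2, 1 - alpha / 2)))
  } else {
    # sample mean minus/plus z_{alpha/2} sample standard deviations
    z <- qnorm(1 - alpha / 2)
    c(mean(x) - z * sd(x), mean(x) + z * sd(x))
  }
}

# Hypothetical helper: assumed form of the index, (L2 - U1) / (U2 - L1)
sepIndex1D <- function(x1, x2, method = "normal", alpha = 0.05) {
  lu1 <- limits(x1, method, alpha)
  lu2 <- limits(x2, method, alpha)
  (lu2[1] - lu1[2]) / (lu2[2] - lu1[1])
}

sepIndex1D(x1, x2, "normal")
sepIndex1D(x1, x2, "quantile")
```

The "quantile" version makes no distributional assumption but needs enough points in each cluster for the tail percentiles to be stable, while the "normal" version relies only on the sample mean and standard deviation.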
sepVal |
value of the separation index for the projected two clusters along
the projection direction |
projDir |
projection direction. To make sure the projected cluster 1 is on the
left-hand side of the projected cluster 2, the input |
Weiliang Qiu [email protected]
Harry Joe [email protected]
Qiu, W.-L. and Joe, H. (2006) Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50, 585–603.
n1 <- 50
mu1 <- c(0, 0)
Sigma1 <- matrix(c(2, 1, 1, 5), 2, 2)
n2 <- 100
mu2 <- c(10, 0)
Sigma2 <- matrix(c(5, -1, -1, 2), 2, 2)
projDir <- c(1, 0)

library(MASS)
set.seed(1234)
y1 <- mvrnorm(n1, mu1, Sigma1)
y2 <- mvrnorm(n2, mu2, Sigma2)
y <- rbind(y1, y2)
cl <- rep(1:2, c(n1, n2))

b <- getSepProjData(
  y = y, cl = cl, iniProjDirMethod = "SL", projDirMethod = "newton")
# projection direction for clusters 1 and 2
projDir <- b$projDirArray[1, 2, ]

plot1DProjection(y1 = y1, y2 = y2, projDir = projDir)
Plot a pair of clusters along a 2-D projection space.
plot2DProjection(
  y1,
  y2,
  projDir,
  sepValMethod = c("normal", "quantile"),
  iniProjDirMethod = c("SL", "naive"),
  projDirMethod = c("newton", "fixedpoint"),
  xlim = NULL,
  ylim = NULL,
  xlab = "1st projection direction",
  ylab = "2nd projection direction",
  title = "Scatter plot of 2-D Projected Clusters",
  font = 2,
  font.lab = 2,
  cex = 1.2,
  cex.lab = 1,
  cex.main = 1.5,
  lwd = 4,
  lty1 = 1,
  lty2 = 2,
  pch1 = 18,
  pch2 = 19,
  col1 = 2,
  col2 = 4,
  alpha = 0.05,
  ITMAX = 20,
  eps = 1.0e-10,
  quiet = TRUE)
y1 |
Data matrix of cluster 1. Rows correspond to observations. Columns correspond to variables. |
y2 |
Data matrix of cluster 2. Rows correspond to observations. Columns correspond to variables. |
projDir |
1-D projection direction along which two clusters will be projected. |
sepValMethod |
Method to calculate separation index for a pair of clusters projected onto a
1-D space. |
iniProjDirMethod |
Indicating the method to get initial projection direction when calculating
the separation index between a pair of clusters (c.f. Qiu and Joe,
2006a, 2006b). |
projDirMethod |
Indicating the method to get the optimal projection direction when calculating
the separation index between a pair of clusters (c.f. Qiu and Joe,
2006a, 2006b). |
xlim |
Range of X axis. |
ylim |
Range of Y axis. |
xlab |
X axis label. |
ylab |
Y axis label. |
title |
Title of the plot. |
font |
An integer which specifies which font to use for text (see par). |
font.lab |
The font to be used for x and y labels (see par). |
cex |
A numerical value giving the amount by which plotting text and symbols should be scaled relative to the default (see par). |
cex.lab |
The magnification to be used for x and y labels relative to the current setting of 'cex' (see par). |
cex.main |
The magnification to be used for main titles relative to the current setting of 'cex' (see par). |
lwd |
The line width, a positive number (see par). The default here is 4. |
lty1 |
Line type for cluster 1 (see par). |
lty2 |
Line type for cluster 2 (see par). |
pch1 |
Either an integer specifying a symbol or a single character to be used as the default in plotting points for cluster 1 (see par). |
pch2 |
Either an integer specifying a symbol or a single character to be used as the default in plotting points for cluster 2 (see par). |
col1 |
Color used to indicate cluster 1. |
col2 |
Color used to indicate cluster 2. |
alpha |
Tuning parameter reflecting the percentage in the two tails of a projected cluster that might be outlying. |
ITMAX |
Maximum number of iterations allowed when iteratively calculating the optimal projection direction. The actual number of iterations is usually much less than the default value 20. |
eps |
A small positive number to check if a quantity |
quiet |
A flag to switch on/off the outputs of intermediate results and/or possible warning messages. The default value is TRUE. |
To get the second projection direction, we first construct an orthogonal matrix with first column projDir. Then we rotate the data points according to this orthogonal matrix. Next, we remove the first dimension of the rotated data points, and obtain the optimal projection direction projDir2 for the rotated data points in the remaining dimensions. Finally, we rotate the vector projDir3=(0, projDir2) back to the original space. The vector projDir3 is the second projection direction.
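The steps above can be sketched in base R as follows. This is a minimal illustration under assumptions, not the package's code: the data are simulated, the orthogonal matrix is built with a QR decomposition, and the direction in the remaining dimensions is taken to be the mean difference premultiplied by the inverse of the summed sample covariance matrices, as a stand-in for whatever optimal direction search the package performs.

```r
set.seed(1234)
p <- 3
y1 <- matrix(rnorm(50 * p), 50, p)                              # cluster 1 (illustrative data)
y2 <- sweep(matrix(rnorm(80 * p), 80, p), 2, c(6, 3, 0), "+")   # cluster 2, shifted mean
projDir <- c(1, 0, 0)                                           # assumed first projection direction

# Step 1: orthogonal matrix whose first column is projDir (normalised)
a <- projDir / sqrt(sum(projDir^2))
Q <- qr.Q(qr(cbind(a, diag(p))))
if (sum(Q[, 1] * a) < 0) Q <- -Q            # the QR step may flip the sign

# Steps 2 and 3: rotate the data points, then drop the first coordinate
z1 <- (y1 %*% Q)[, -1, drop = FALSE]
z2 <- (y2 %*% Q)[, -1, drop = FALSE]

# Step 4: direction in the remaining dimensions (stand-in for the optimal search)
projDir2 <- solve(cov(z1) + cov(z2), colMeans(z2) - colMeans(z1))
projDir2 <- projDir2 / sqrt(sum(projDir2^2))

# Step 5: pad with a leading zero and rotate back to the original space
projDir3 <- drop(Q %*% c(0, projDir2))
projDir3
```

Because the padded vector has a zero in the first position, the rotated-back direction is orthogonal to projDir by construction, which is what makes the two directions usable as axes of a 2-D scatter plot.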
The ticks along the X axis indicate the positions of the points of the two projected clusters. The positions of $L_i$ and $U_i$, $i = 1, 2$, are also indicated on the X axis, where $L_i$ and $U_i$ are the lower and upper $\alpha/2$ sample percentiles of cluster $i$ if sepValMethod="quantile".
If sepValMethod="normal", $L_i = \bar{x}_i - z_{\alpha/2} s_i$ and $U_i = \bar{x}_i + z_{\alpha/2} s_i$, where $\bar{x}_i$ and $s_i$ are the sample mean and standard deviation of cluster $i$, and $z_{\alpha/2}$ is the upper $\alpha/2$ percentile of the standard normal distribution.
sepValx |
value of the separation index for the projected two clusters along the 1st projection direction. |
sepValy |
value of the separation index for the projected two clusters along the 2nd projection direction. |
Q2 |
1st column is the 1st projection direction. 2nd column is the 2nd projection direction. |
Weiliang Qiu
Harry Joe
Qiu, W.-L. and Joe, H. (2006a) Generation of Random Clusters with Specified Degree of Separation. Journal of Classification, 23(2), 315–334.
Qiu, W.-L. and Joe, H. (2006b) Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50, 585–603.
    library(clusterGeneration)
    library(MASS)

    # two bivariate normal clusters
    n1 <- 50
    mu1 <- c(0, 0)
    Sigma1 <- matrix(c(2, 1, 1, 5), 2, 2)
    n2 <- 100
    mu2 <- c(10, 0)
    Sigma2 <- matrix(c(5, -1, -1, 2), 2, 2)
    projDir <- c(1, 0)

    set.seed(1234)
    y1 <- mvrnorm(n1, mu1, Sigma1)
    y2 <- mvrnorm(n2, mu2, Sigma2)
    y <- rbind(y1, y2)
    cl <- rep(1:2, c(n1, n2))

    b <- getSepProjData(
      y = y, cl = cl,
      iniProjDirMethod = "SL", projDirMethod = "newton")
    # projection direction for clusters 1 and 2
    projDir <- b$projDirArray[1,2,]

    par(mfrow = c(2,1))
    plot1DProjection(y1 = y1, y2 = y2, projDir = projDir)
    plot2DProjection(y1 = y1, y2 = y2, projDir = projDir)
Generate a random correlation matrix based on random partial correlations.
    rcorrmatrix(d, alphad = 1)
d |
Dimension of the matrix. |
alphad |
|
A correlation matrix.
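To illustrate the idea of building a correlation matrix from partial correlations, here is a minimal 3 x 3 sketch in base R. The helper corr_from_partials is hypothetical, and the uniform draws are a simplification; they are not necessarily the distribution that rcorrmatrix uses.

```r
# Hypothetical helper: 3 x 3 correlation matrix from two correlations
# (r12, r23) and the partial correlation of variables 1 and 3 given 2.
corr_from_partials <- function(r12, r23, r13_2) {
  # Standard conversion from a partial correlation to a correlation.
  r13 <- r13_2 * sqrt((1 - r12^2) * (1 - r23^2)) + r12 * r23
  matrix(c(1,   r12, r13,
           r12, 1,   r23,
           r13, r23, 1), 3, 3)
}

set.seed(1)
# ASSUMPTION: uniform draws on (-1, 1), chosen only for illustration.
u <- runif(3, -1, 1)
R <- corr_from_partials(u[1], u[2], u[3])
```

Any values strictly inside (-1, 1) give a positive definite matrix here, since the determinant equals the product of one minus each squared (partial) correlation. That is what makes the partial-correlation parameterization convenient for random generation.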
Weiliang Qiu
Harry Joe
Joe, H. (2006) Generating Random Correlation Matrices Based on Partial Correlations. Journal of Multivariate Analysis, 97, 2177–2189.
    library(clusterGeneration)

    rcorrmatrix(3)
    rcorrmatrix(5)
    rcorrmatrix(5, alphad = 2.5)
Measure the magnitude of the gap or sparse area between a pair of clusters (or cluster distributions) along the specified projection direction.
    sepIndexTheory(
      projDir, mu1, Sigma1, mu2, Sigma2,
      alpha = 0.05, eps = 1.0e-10, quiet = TRUE)

    sepIndexData(
      projDir, y1, y2,
      alpha = 0.05, eps = 1.0e-10, quiet = TRUE)
projDir |
Projection direction. |
mu1 |
Mean vector of cluster 1. |
Sigma1 |
Covariance matrix of cluster 1. |
mu2 |
Mean vector of cluster 2. |
Sigma2 |
Covariance matrix of cluster 2. |
y1 |
Data matrix of cluster 1. Rows correspond to observations. Columns correspond to variables. |
y2 |
Data matrix of cluster 2. Rows correspond to observations. Columns correspond to variables. |
alpha |
Tuning parameter reflecting the percentage in the two
tails of a projected cluster that might be outlying.
We set |
eps |
Convergence threshold. A small positive number to check if a quantity |
quiet |
A flag to switch on/off the outputs of intermediate results and/or possible warning messages. The default value is |
The value of the separation index defined in Qiu and Joe (2006).
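As a rough illustration of how such an index can be computed for two normal cluster distributions, the sketch below assumes the index takes the form J = (L2 - U1) / (U2 - L1), where L and U are the mean minus/plus z_{alpha/2} standard deviations of each projected cluster and cluster 1 is the one on the left. This form is an assumption stated here for illustration; the helper sep_index_normal is hypothetical and is not the package's sepIndexTheory.

```r
# Hypothetical helper, not the package implementation.
sep_index_normal <- function(projDir, mu1, Sigma1, mu2, Sigma2, alpha = 0.05) {
  a <- projDir
  z <- qnorm(1 - alpha / 2)
  # Projected means and standard deviations.
  m1 <- sum(a * mu1)
  m2 <- sum(a * mu2)
  s1 <- sqrt(drop(t(a) %*% Sigma1 %*% a))
  s2 <- sqrt(drop(t(a) %*% Sigma2 %*% a))
  # Order the clusters so that "lo" lies to the left of "hi".
  lo <- if (m1 <= m2) c(m1, s1) else c(m2, s2)
  hi <- if (m1 <= m2) c(m2, s2) else c(m1, s1)
  L1 <- lo[1] - z * lo[2]; U1 <- lo[1] + z * lo[2]
  L2 <- hi[1] - z * hi[2]; U2 <- hi[1] + z * hi[2]
  # ASSUMPTION: gap between the clusters relative to their total span.
  (L2 - U1) / (U2 - L1)
}

J <- sep_index_normal(
  projDir = c(1, 0),
  mu1 = c(0, 0),  Sigma1 = matrix(c(2, 1, 1, 5), 2, 2),
  mu2 = c(10, 0), Sigma2 = matrix(c(5, -1, -1, 2), 2, 2))
```

Under this form the index lies between -1 and 1: positive values indicate a gap between the two projected clusters, and negative values indicate overlap.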
Weiliang Qiu
Harry Joe
Qiu, W.-L. and Joe, H. (2006) Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50, 585–603.
    library(clusterGeneration)
    library(MASS)

    n1 <- 50
    mu1 <- c(0, 0)
    Sigma1 <- matrix(c(2, 1, 1, 5), 2, 2)
    n2 <- 100
    mu2 <- c(10, 0)
    Sigma2 <- matrix(c(5, -1, -1, 2), 2, 2)
    projDir <- c(1, 0)

    # population version of the separation index
    sepIndexTheory(projDir, mu1, Sigma1, mu2, Sigma2)

    # data version of the separation index
    set.seed(1234)
    y1 <- mvrnorm(n1, mu1, Sigma1)
    y2 <- mvrnorm(n2, mu2, Sigma2)
    sepIndexData(projDir = projDir, y1 = y1, y2 = y2)
Generate data sets via a factorial design with four factors: degree of separation, number of clusters, number of non-noisy variables, and number of noisy variables. The separation between any cluster and its nearest neighboring cluster can be set to a specified value. The covariance matrices of clusters can have arbitrary diameters, shapes and orientations.
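To make the factorial structure concrete, the sketch below enumerates the design points for three of the factors, using the default levels from the usage signature. The number of noisy variables is left out because its default is NULL; the data frame design and its column sepLabel are illustrative names only.

```r
# Default levels of three of the design factors.
sepVal <- c(0.01, 0.21, 0.342)
sepLabels <- c("L", "M", "H")

# One row per combination of factor levels.
design <- expand.grid(
  numClust = c(3, 6, 9),
  sepVal = sepVal,
  numNonNoisy = c(4, 8, 20))

# Attach the separation labels to the corresponding sepVal levels.
design$sepLabel <- sepLabels[match(design$sepVal, sepVal)]
head(design)
```

Each row corresponds to one cluster structure, for which numReplicate data sets would then be generated.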
    simClustDesign(
      numClust = c(3,6,9),
      sepVal = c(0.01, 0.21, 0.342),
      sepLabels = c("L", "M", "H"),
      numNonNoisy = c(4,8,20),
      numNoisy = NULL,
      numOutlier = 0,
      numReplicate = 3,
      fileName = "test",
      clustszind = 2,
      clustSizeEq = 50,
      rangeN = c(50,200),
      clustSizes = NULL,
      covMethod = c("eigen", "onion", "c-vine", "unifcorrmat"),
      eigenvalue = NULL,
      rangeVar = c(1, 10),
      lambdaLow = 1,
      ratioLambda = 10,
      alphad = 1,
      eta = 1,
      rotateind = TRUE,
      iniProjDirMethod = c("SL", "naive"),
      projDirMethod = c("newton", "fixedpoint"),
      alpha = 0.05,
      ITMAX = 20,
      eps = 1.0e-10,
      quiet = TRUE,
      outputDatFlag = TRUE,
      outputLogFlag = TRUE,
      outputEmpirical = TRUE,
      outputInfo = TRUE)
numClust |
Vector of the number of clusters for data sets in the design. |
sepVal |
Vector of desired values of the separation index between clusters
and their nearest neighboring clusters. Each element of |
sepLabels |
Labels for "close", "separated", and "well-separated" cluster structures. By default, "L" (low) means "close", "M" (medium) means "separated", "H" (high) means "well-separated". |
numNonNoisy |
Vector of the number of non-noisy variables. |
numNoisy |
Vector of the number of noisy variables. The default value of |
numOutlier |
The number or ratio of outliers. If |
numReplicate |
Number of data sets to be generated for the same cluster structure specified
by the other arguments of the function |
fileName |
The first part of the names of data files that record the generated data sets
and associated information, such as cluster membership of data points, labels
of noisy variables, separation index matrix, projection directions, etc.
(see details). The default value of |
clustszind |
Cluster size indicator.
|
clustSizeEq |
Cluster size.
If the argument |
rangeN |
The range of cluster sizes.
If |
clustSizes |
The sizes of clusters.
If |
covMethod |
Method to generate covariance matrices for clusters (see details). The default method is 'eigen' so that the user can directly specify the range of the diameters of clusters. |
eigenvalue |
numeric. user-specified eigenvalues when |
rangeVar |
Range for variances of a covariance matrix (see details).
The default range is |
lambdaLow |
Lower bound of the eigenvalues of cluster covariance matrices.
If the argument |
ratioLambda |
The ratio of the upper bound of the eigenvalues to the lower bound of the
eigenvalues of cluster covariance matrices.
If the argument |
alphad |
parameter for unifcorrmat method to generate random correlation matrix
|
eta |
parameter for “c-vine” and “onion” methods to generate random correlation matrix
|
rotateind |
Rotation indicator.
|
iniProjDirMethod |
Indicates the method to get the initial projection direction when calculating
the separation index between a pair of clusters (cf. Qiu and Joe,
2006a, 2006b). |
projDirMethod |
Indicates the method to get the optimal projection direction when calculating
the separation index between a pair of clusters (cf. Qiu and Joe,
2006a, 2006b). |
alpha |
Tuning parameter reflecting the percentage in the two
tails of a projected cluster that might be outlying.
We set |
ITMAX |
Maximum number of iterations allowed when iteratively calculating the optimal projection direction. The actual number of iterations is usually much less than the default value 20. |
eps |
Convergence threshold. A small positive number to check if a quantity |
quiet |
A flag to switch on/off the outputs of intermediate results and/or possible warning messages. The default value is |
outputDatFlag |
Indicates if data set should be output to file. |
outputLogFlag |
Indicates if log info should be output to file. |
outputEmpirical |
Indicates if empirical separation indices and projection directions should be
calculated. This option is useful when generating clusters with sizes which
are not large enough so that the sample covariance matrices may be singular.
Hence, by default, |
outputInfo |
Indicates if theoretical and empirical separation information data frames
should be output to a file with format |
The function simClustDesign is an implementation of the design for
generating random clusters proposed in Qiu and Joe (2006a). In the design,
the degree of separation between any cluster and its nearest neighboring
cluster can be set to a specified value while the cluster covariance
matrices can be arbitrary positive definite matrices, so that the generated
clusters might not be visible in pairwise scatterplots of variables.
The separation between a pair of clusters is measured by the separation index
proposed in Qiu and Joe (2006b).
The current version of the function simClustDesign implements two
methods to generate covariance matrices for clusters. The first method,
denoted by eigen, first randomly generates eigenvalues
($\lambda_1,\ldots,\lambda_p$) for the covariance matrix ($\Sigma$), then
uses columns of a randomly generated orthogonal matrix
($Q=(\alpha_1,\ldots,\alpha_p)$) as eigenvectors. The covariance matrix
$\Sigma$ is then constructed as
$Q\,\mathrm{diag}(\lambda_1,\ldots,\lambda_p)\,Q^T$.
The second method, denoted as unifcorrmat, first generates a random
correlation matrix ($R$) via the method proposed in Joe (2006),
then randomly generates variances ($\sigma_1^2,\ldots,\sigma_p^2$) from
an interval specified by the argument rangeVar. The covariance matrix
$\Sigma$ is then constructed as
$\mathrm{diag}(\sigma_1,\ldots,\sigma_p)\,R\,\mathrm{diag}(\sigma_1,\ldots,\sigma_p)$.
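Both constructions can be sketched in a few lines of base R. The sketch makes several assumptions: eigenvalues and variances are drawn uniformly from their ranges, the orthogonal matrix comes from a QR decomposition of a Gaussian matrix (a stand-in for genOrthogonal), and the correlation matrix is a simple stand-in rather than the output of rcorrmatrix.

```r
set.seed(1)
p <- 4

# Method "eigen": orthogonal matrix times diagonal eigenvalues times its transpose.
lambdaLow <- 1
ratioLambda <- 10
# ASSUMPTION: uniform eigenvalues on [lambdaLow, lambdaLow * ratioLambda].
lambda <- runif(p, lambdaLow, lambdaLow * ratioLambda)
Q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))   # random orthogonal matrix
Sigma_eigen <- Q %*% diag(lambda) %*% t(Q)

# Method "unifcorrmat": scale a correlation matrix by standard deviations.
A <- matrix(rnorm(p * p), p, p)
R <- cov2cor(crossprod(A))                  # stand-in correlation matrix
rangeVar <- c(1, 10)
# ASSUMPTION: uniform variances on rangeVar.
sigma2 <- runif(p, rangeVar[1], rangeVar[2])
Sigma_corr <- diag(sqrt(sigma2)) %*% R %*% diag(sqrt(sigma2))
```

In the first construction the user controls the spectrum (and hence the cluster diameters) directly. In the second, the user controls the marginal variances while the correlation structure is random.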
For each data set generated, the function simClustDesign outputs
four files: data file, log file, membership file, and noisy set file.
All four files have the same format: [fileName]J[j]G[g]v[p1]nv[p2]out[numOutlier]_[numReplicate].[extension]
where ‘extension’ can be ‘dat’, ‘log’, ‘mem’, or
‘noisy’. ‘J’ indicates separation index, with ‘j’
indicating the level of the factor ‘separation index’;
‘G’ indicates number of clusters, with ‘g’ indicating the
level of the factor ‘number of clusters’; ‘v’ indicates
the number of non-noisy variables, with ‘p1’ indicating the level
of the factor ‘number of non-noisy variables’; ‘nv’ indicates
the number of noisy variables, with ‘p2’ indicating the level of
the factor ‘number of noisy variables’; ‘out’ indicates
number of outliers, with ‘numOutlier’ indicating the value of the
argument numOutlier of the function simClustDesign;
‘numReplicate’ indicates the value of the argument numReplicate
of the function simClustDesign.
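The naming pattern can be reproduced with a small helper, shown here only to make the pattern easier to read. The function make_file_name is hypothetical, and the exact strings written by the package may differ in detail.

```r
# Hypothetical helper that assembles a name following the pattern above.
make_file_name <- function(fileName, j, g, p1, p2,
                           numOutlier, numReplicate, extension) {
  paste0(fileName, "J", j, "G", g, "v", p1, "nv", p2,
         "out", numOutlier, "_", numReplicate, ".", extension)
}

# First separation level, second number-of-clusters level, and so on.
nm <- make_file_name("test", j = 1, g = 2, p1 = 1, p2 = 3,
                     numOutlier = 0, numReplicate = 3, extension = "dat")
nm
# "testJ1G2v1nv3out0_3.dat"
```

Calling the helper with the four extensions in turn would give the names of the four files that belong to the same data set.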
The data file with file extension ‘dat’ contains $n+1$ rows and $p$
columns, where $n$ is the number of data points and $p$ is
the number of variables. The first row is the variable names. The log file
with file extension ‘log’ contains information such as cluster sizes,
mean vectors, covariance matrices, projection directions, separation index
matrices, etc. The membership file with file extension ‘mem’ contains
$n$ rows and one column of cluster memberships for data points. The noisy
set file with file extension ‘noisy’ contains a row of labels of noisy
variables.
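A data file and a membership file with this layout could be read back as sketched below. The sketch writes its own mock files to temporary paths and assumes whitespace-delimited text, which the description above does not state explicitly; all object names are illustrative.

```r
set.seed(1)
n <- 5
p <- 3
dat <- matrix(rnorm(n * p), n, p,
              dimnames = list(NULL, paste0("x", 1:p)))
mem <- rep(1:2, length.out = n)

# Mock files in the described layout.
# ASSUMPTION: plain text, whitespace-delimited.
datFile <- tempfile(fileext = ".dat")
memFile <- tempfile(fileext = ".mem")
write.table(dat, datFile, row.names = FALSE, quote = FALSE)
write.table(mem, memFile, row.names = FALSE, col.names = FALSE)

# The header row supplies the variable names, leaving one row per data point.
y <- as.matrix(read.table(datFile, header = TRUE))
cl <- scan(memFile, quiet = TRUE)
```

The pair (y, cl) then has the same form as the y and cl arguments used by the plotting functions in this package.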
When generating clusters, population covariance matrices are all
positive definite. However, sample covariance matrices might be only
positive semi-definite due to small cluster sizes. In this case, the
function genRandomClust will automatically use the
“fixedpoint” method to search for the optimal projection direction.
The function outputs four data files for each data set (see details).
This function also returns separation information data frames
infoFrameTheory and infoFrameData based on population
and empirical mean vectors and covariance matrices of clusters for all
the data sets generated. Both infoFrameTheory and infoFrameData
contain the following seven columns:
Column 1: |
Labels of clusters ( |
Column 2: |
Labels of the corresponding nearest neighbors. |
Column 3: |
Separation indices of the clusters to their nearest neighboring clusters. |
Column 4: |
Labels of the corresponding farthest neighboring clusters. |
Column 5: |
Separation indices of the clusters to their farthest neighbors. |
Column 6: |
Median separation indices of the clusters to their neighbors. |
Column 7: |
Data file names with format
|
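The first six columns can be illustrated with a toy matrix of pairwise separation indices. The values and the column names below are made up, and the sketch assumes that the nearest neighbor of a cluster is the one with the smallest separation index.

```r
# Toy symmetric matrix of pairwise separation indices (made-up values).
sepMat <- matrix(c(  NA, 0.21, 0.60,
                   0.21,   NA, 0.35,
                   0.60, 0.35,   NA), 3, 3, byrow = TRUE)

# Hypothetical column names; which.min and which.max skip the NA diagonal.
info <- data.frame(
  cluster     = 1:3,
  nearest     = apply(sepMat, 1, which.min),
  sepNearest  = apply(sepMat, 1, min, na.rm = TRUE),
  farthest    = apply(sepMat, 1, which.max),
  sepFarthest = apply(sepMat, 1, max, na.rm = TRUE),
  sepMedian   = apply(sepMat, 1, median, na.rm = TRUE))
info
```

In a design that targets a given degree of separation, the sepNearest column is the one expected to match the requested sepVal.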
The function also returns three lists: datList, memList, and noisyList.
datList: |
a list of lists of data matrices for generated data sets. |
memList: |
a list of lists of cluster memberships for data points for generated data sets. |
noisyList: |
a list of lists of sets of noisy variables for generated data sets. |
This function might be slow.
Weiliang Qiu
Harry Joe
Joe, H. (2006) Generating Random Correlation Matrices Based on Partial Correlations. Journal of Multivariate Analysis, 97, 2177–2189.
Milligan G. W. (1985) An Algorithm for Generating Artificial Test Clusters. Psychometrika 50, 123–127.
Qiu, W.-L. and Joe, H. (2006a) Generation of Random Clusters with Specified Degree of Separation. Journal of Classification, 23(2), 315–334.
Qiu, W.-L. and Joe, H. (2006b) Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50, 585–603.
Su, J. Q. and Liu, J. S. (1993) Linear Combinations of Multiple Diagnostic Markers. Journal of the American Statistical Association, 88, 1350–1355.
    ## Not run:
    tmp <- simClustDesign(
      numClust = 3,
      sepVal = c(0.01, 0.21),
      sepLabels = c("L", "M"),
      numNonNoisy = 4,
      numOutlier = 0,
      numReplicate = 2,
      clustszind = 2)
    ## End(Not run)
Plot all clusters in a 2-D projection space.
    viewClusters(
      y, cl,
      outlierLabel = 0,
      projMethod = "Eigen",
      xlim = NULL,
      ylim = NULL,
      xlab = "1st projection direction",
      ylab = "2nd projection direction",
      title = "Scatter plot of 2-D Projected Clusters",
      font = 2,
      font.lab = 2,
      cex = 1.2,
      cex.lab = 1.2)
y |
Data matrix. Rows correspond to observations. Columns correspond to variables. |
cl |
Cluster membership vector. |
outlierLabel |
Label for outliers. Outliers are not involved in calculating the projection
directions. Outliers will be represented by red triangles in the plot.
By default, |
projMethod |
Method to construct 2-D projection directions.
|
xlim |
Range of X axis. |
ylim |
Range of Y axis. |
xlab |
X axis label. |
ylab |
Y axis label. |
title |
Title of the plot. |
font |
An integer which specifies which font to use for text (see |
font.lab |
The font to be used for x and y labels (see |
cex |
A numerical value giving the amount by which plotting text
and symbols should be scaled relative to the default (see |
cex.lab |
The magnification to be used for x and y labels relative
to the current setting of 'cex' (see |
B |
Between cluster distance matrix measuring the between cluster variation. |
Q |
Columns of |
proj |
Projected clusters in the 2-D space spanned by the first 2 columns of
the matrix |
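The relationship between the three returned objects can be sketched in base R. The definition of B used here (a sum of outer products of pairwise differences of cluster means) is an assumption made for illustration; viewClusters may define B differently for each projMethod.

```r
set.seed(1)
# Three toy clusters in 3 dimensions.
y <- rbind(matrix(rnorm(60), ncol = 3),
           matrix(rnorm(90, mean = 4), ncol = 3),
           matrix(rnorm(45, mean = -4), ncol = 3))
cl <- rep(1:3, c(20, 30, 15))

# Cluster means, one row per cluster.
means <- rowsum(y, cl) / as.vector(table(cl))

# ASSUMPTION: B accumulates outer products of pairwise mean differences.
p <- ncol(y)
B <- matrix(0, p, p)
for (i in 1:2) for (j in (i + 1):3) {
  d <- means[i, ] - means[j, ]
  B <- B + tcrossprod(d)
}

# Eigenvectors of B, ordered by decreasing eigenvalue.
Q <- eigen(B, symmetric = TRUE)$vectors
# Projected clusters in the plane spanned by the first two columns of Q.
proj <- y %*% Q[, 1:2]
```

A call such as plot(proj, col = cl) would then draw the projected clusters, with the first two eigenvectors capturing most of the between-cluster variation.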
Weiliang Qiu
Harry Joe
Dhillon, I. S., Modha, D. S. and Spangler, W. S. (2002) Class visualization of high-dimensional data with applications. Computational Statistics and Data Analysis, 41, 59–90.
Qiu, W.-L. and Joe, H. (2006) Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50, 585–603.
plot1DProjection
plot2DProjection
    library(clusterGeneration)
    library(MASS)

    n1 <- 50
    mu1 <- c(0, 0)
    Sigma1 <- matrix(c(2, 1, 1, 5), 2, 2)
    n2 <- 100
    mu2 <- c(10, 0)
    Sigma2 <- matrix(c(5, -1, -1, 2), 2, 2)
    n3 <- 30
    mu3 <- c(10, 10)
    Sigma3 <- matrix(c(3, 1.5, 1.5, 1), 2, 2)
    # the fourth group plays the role of outliers (label 0)
    n4 <- 10
    mu4 <- c(0, 0)
    Sigma4 <- 50*diag(2)

    set.seed(1234)
    y1 <- mvrnorm(n1, mu1, Sigma1)
    y2 <- mvrnorm(n2, mu2, Sigma2)
    y3 <- mvrnorm(n3, mu3, Sigma3)
    y4 <- mvrnorm(n4, mu4, Sigma4)
    y <- rbind(y1, y2, y3, y4)
    cl <- rep(c(1:3, 0), c(n1, n2, n3, n4))

    par(mfrow=c(2,1))
    viewClusters(y = y, cl = cl)
    viewClusters(y = y, cl = cl, projMethod = "DMS")