Title: | Block Structure Detection Using Singular Vectors |
---|---|
Description: | Performs block diagonal covariance matrix detection using singular vectors (BD-SVD), which can be extended to hierarchical variable clustering (HC-SVD). The methods are described in Bauer (202Xa) <doi:10.48550/arXiv.2211.16155> and Bauer (202Xb) <doi:10.48550/arXiv.2308.06820>. |
Authors: | Jan O. Bauer [aut, cre] , Ron Holzapfel [aut] |
Maintainer: | Jan O. Bauer <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2.0 |
Built: | 2024-12-04 07:07:24 UTC |
Source: | CRAN |
Performs BD-SVD iteratively to reveal the block structure. Splits the data matrix into one (i.e., no split)
or two submatrices, depending on the structure of the first sparse loading (which is a sparse approximation of the
first right singular vector, i.e., a vector with many zero values) that mirrors the shape of the covariance matrix. This
procedure is continued iteratively until the block diagonal structure has been revealed.
The data matrix ordered according to this revealed block diagonal structure can be obtained by bdsvd.structure.
bdsvd(X, dof.lim, anp = "2", standardize = TRUE, max.iter, trace = FALSE)
bdsvd(X, dof.lim, anp = "2", standardize = TRUE, max.iter, trace = FALSE)
X |
Data matrix of dimension |
dof.lim |
Interval limits for the number of non-zero components in the sparse loading (degrees of freedom).
If |
anp |
Which regularization function should be used for the HBIC. |
standardize |
Standardize the data to have unit variance. Default is |
max.iter |
How many iterations should be performed for computing the sparse loading.
Default is |
trace |
Print out progress as iterations are performed. Default is |
The sparse loadings are computed using the method by Shen & Huang (2008), implemented in
the irlba
package.
A list containing the feature names of the submatrices of X
. The length of the list equals
the number of submatrices.
Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.
Wang, H., B. Li, and C. Leng (2009). Shrinkage tuning parameter selection with a diverging number of parameters, J. R. Stat. Soc. B 71 (3), 671–683.
Wang, L., Y. Kim, and R. Li (2013). Calibrating nonconvex penalized regression in ultra-high dimension, Ann. Stat. 41 (5), 2505–2536.
bdsvd.structure, bdsvd.ht, single.bdsvd
#Replicate the simulation study (c) from Bauer (202Xa). ## Not run: p <- 500 #Number of variables n <- 250 #Number of observations b <- 10 #Number of blocks design <- "c" #Simulation design "a", "b", "c", or "d". #Simulate data matrix X set.seed(1) Sigma <- bdsvd.cov.sim(p = p, b = b, design = design) X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma) colnames(X) <- seq_len(p) bdsvd(X, standardize = FALSE) ## End(Not run)
#Replicate the simulation study (c) from Bauer (202Xa). ## Not run: p <- 500 #Number of variables n <- 250 #Number of observations b <- 10 #Number of blocks design <- "c" #Simulation design "a", "b", "c", or "d". #Simulate data matrix X set.seed(1) Sigma <- bdsvd.cov.sim(p = p, b = b, design = design) X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma) colnames(X) <- seq_len(p) bdsvd(X, standardize = FALSE) ## End(Not run)
This function generates covariance matrices based on the simulation studies described in Bauer (202Xa).
bdsvd.cov.sim(p = p, b, design = design)
bdsvd.cov.sim(p = p, b, design = design)
p |
Number of variables. |
b |
Number of blocks. Only required for simulation design "c" and "d". |
design |
Simulation design "a", "b", "c", or "d". |
A covariance matrix according to the chosen simulation design.
Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.
#The covariance matrix for simulation design (a) is given by Sigma <- bdsvd.cov.sim(p = 500, b = 500, design = "a")
#The covariance matrix for simulation design (a) is given by Sigma <- bdsvd.cov.sim(p = 500, b = 500, design = "a")
Finds the number of non-zero elements of the sparse loading according to the high-dimensional Bayesian information criterion (HBIC).
bdsvd.ht(X, dof.lim, standardize = TRUE, anp = "2", max.iter)
bdsvd.ht(X, dof.lim, standardize = TRUE, anp = "2", max.iter)
X |
Data matrix of dimension |
dof.lim |
Interval limits for the number of non-zero components in the sparse loading (degrees of freedom).
If |
standardize |
Standardize the data to have unit variance. Default is |
anp |
Which regularization function should be used for the HBIC. |
max.iter |
How many iterations should be performed for computing the sparse loading.
Default is |
The sparse loadings are computed using the method by Shen & Huang (2008), implemented in
the irlba
package. The computation of the HBIC is outlined in Bauer (202Xa).
dof |
The optimal number of nonzero components (degrees of freedom) according to the HBIC. |
BIC |
The HBIC for the different numbers of nonzero components. |
Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.
Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation, J. Multivar. Anal. 99, 1015–1034.
Wang, H., B. Li, and C. Leng (2009). Shrinkage tuning parameter selection with a diverging number of parameters, J. R. Stat. Soc. B 71 (3), 671–683.
Wang, L., Y. Kim, and R. Li (2013). Calibrating nonconvex penalized regression in ultra-high dimension, Ann. Stat. 41 (5), 2505–2536.
#Replicate the illustrative example from Bauer (202Xa). p <- 300 #Number of variables. In Bauer (202Xa), p = 3000 n <- 500 #Number of observations b <- 3 #Number of blocks design <- "c" #Simulate data matrix X set.seed(1) Sigma <- bdsvd.cov.sim(p = p, b = b, design = design) X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma) colnames(X) <- seq_len(p) ht <- bdsvd.ht(X) plot(0:(p-1), ht$BIC[,1], xlab = "|S|", ylab = "HBIC", main = "", type = "l") single.bdsvd(X, dof = ht$dof, standardize = FALSE)
#Replicate the illustrative example from Bauer (202Xa). p <- 300 #Number of variables. In Bauer (202Xa), p = 3000 n <- 500 #Number of observations b <- 3 #Number of blocks design <- "c" #Simulate data matrix X set.seed(1) Sigma <- bdsvd.cov.sim(p = p, b = b, design = design) X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma) colnames(X) <- seq_len(p) ht <- bdsvd.ht(X) plot(0:(p-1), ht$BIC[,1], xlab = "|S|", ylab = "HBIC", main = "", type = "l") single.bdsvd(X, dof = ht$dof, standardize = FALSE)
Either sorts the data matrix according to the detected block structure
, ordered by the number
of variables that the blocks contain. Or returns the detected submatrices each individually in a list object.
bdsvd.structure(X, block.structure, output = "matrix", block.order)
bdsvd.structure(X, block.structure, output = "matrix", block.order)
X |
Data matrix of dimension |
block.structure |
Output of |
output |
Should the output be the data matrix ordered according to the blocks ( |
block.order |
A vector that contains the order of the blocks detected by |
Either the data matrix X
with columns sorted according to the detected blocks, or a list containing the detected
submatrices.
Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.
#Toying with the illustrative example from Bauer (202Xa). p <- 150 #Number of variables. In Bauer (202Xa), p = 3000. n <- 500 #Number of observations b <- 3 #Number of blocks design <- "c" #Simulate data matrix X set.seed(1) Sigma <- bdsvd.cov.sim(p = p, b = b, design = design) X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma) colnames(X) <- seq_len(p) #Compute iterative BD-SVD bdsvd.obj <- bdsvd(X, standardize = FALSE) #Obtain the data matrix X, sorted by the detected blocks colnames(bdsvd.structure(X, bdsvd.obj, output = "matrix") ) colnames(bdsvd.structure(X, bdsvd.obj, output = "matrix", block.order = c(2,1,3)) ) #Obtain the detected submatrices X_1, X_2, and X_3 colnames(bdsvd.structure(X, bdsvd.obj, output = "submatrices")[[1]] ) colnames(bdsvd.structure(X, bdsvd.obj, output = "submatrices")[[2]] ) colnames(bdsvd.structure(X, bdsvd.obj, output = "submatrices")[[3]] )
#Toying with the illustrative example from Bauer (202Xa). p <- 150 #Number of variables. In Bauer (202Xa), p = 3000. n <- 500 #Number of observations b <- 3 #Number of blocks design <- "c" #Simulate data matrix X set.seed(1) Sigma <- bdsvd.cov.sim(p = p, b = b, design = design) X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma) colnames(X) <- seq_len(p) #Compute iterative BD-SVD bdsvd.obj <- bdsvd(X, standardize = FALSE) #Obtain the data matrix X, sorted by the detected blocks colnames(bdsvd.structure(X, bdsvd.obj, output = "matrix") ) colnames(bdsvd.structure(X, bdsvd.obj, output = "matrix", block.order = c(2,1,3)) ) #Obtain the detected submatrices X_1, X_2, and X_3 colnames(bdsvd.structure(X, bdsvd.obj, output = "submatrices")[[1]] ) colnames(bdsvd.structure(X, bdsvd.obj, output = "submatrices")[[2]] ) colnames(bdsvd.structure(X, bdsvd.obj, output = "submatrices")[[3]] )
Class used within the package to store the structure and information about the detected blocks.
features
numeric vector that contains the the variables corresponding to this block.
block.columns
numeric vector that contains the indices of the singular vectors corresponding to this block.
This function returns the block structure of a matrix.
detect.blocks(V, threshold = 0)
detect.blocks(V, threshold = 0)
V |
Numeric matrix which either contains the loadings or is a covariance matrix. |
threshold |
All absolute values of |
An object of class Block
containing the features and columns indices corresponding to each detected block.
Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.
#In the first example, we replicate the simulation study for the ad hoc procedure #Est_0.1 from Bauer (202Xa). In the second example, we manually compute the first step #of BD-SVD, which can be done using the bdsvd() and/or single.bdsvd(), for constructed #sparse loadings #Example 1: Replicate the simulation study (a) from Bauer (202Xa) for the ad hoc #procedure Est_0.1. p <- 500 #Number of variables n <- 125 #Number of observations b <- 500 #Number of blocks design <- "a" #Simulate data matrix X set.seed(1) Sigma <- bdsvd.cov.sim(p = p, b = b, design = design) X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma) colnames(X) <- 1:p #Perform the ad hoc procedure detect.blocks(cvCovEst::scadEst(dat = X, lambda = 0.2), threshold = 0) #Example 2: Manually compute the first step of BD-SVD #for some loadings V that mirror the two blocks #("A", "B") and c("C", "D"). V <- matrix(c(1,0, 1,0, 0,1, 0,1), 4, 2, byrow = TRUE) rownames(V) <- c("A", "B", "C", "D") detected.blocks <- detect.blocks(V) #Variables in block one with corresponding column index: detected.blocks[[1]]@features detected.blocks[[1]]@block.columns #Variables in block two with corresponding column index: detected.blocks[[2]]@features detected.blocks[[2]]@block.columns
#In the first example, we replicate the simulation study for the ad hoc procedure #Est_0.1 from Bauer (202Xa). In the second example, we manually compute the first step #of BD-SVD, which can be done using the bdsvd() and/or single.bdsvd(), for constructed #sparse loadings #Example 1: Replicate the simulation study (a) from Bauer (202Xa) for the ad hoc #procedure Est_0.1. p <- 500 #Number of variables n <- 125 #Number of observations b <- 500 #Number of blocks design <- "a" #Simulate data matrix X set.seed(1) Sigma <- bdsvd.cov.sim(p = p, b = b, design = design) X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma) colnames(X) <- 1:p #Perform the ad hoc procedure detect.blocks(cvCovEst::scadEst(dat = X, lambda = 0.2), threshold = 0) #Example 2: Manually compute the first step of BD-SVD #for some loadings V that mirror the two blocks #("A", "B") and c("C", "D"). V <- matrix(c(1,0, 1,0, 0,1, 0,1), 4, 2, byrow = TRUE) rownames(V) <- c("A", "B", "C", "D") detected.blocks <- detect.blocks(V) #Variables in block one with corresponding column index: detected.blocks[[1]]@features detected.blocks[[1]]@block.columns #Variables in block two with corresponding column index: detected.blocks[[2]]@features detected.blocks[[2]]@block.columns
Performs HC-SVD to reveal the hierarchical variable structure as descried in Bauer (202Xb). For this divise approach, each cluster is split into two clusters iteratively. Potential splits are identified by the first sparse loadings (which are sparse approximations of the first right singular vectors, i.e., vectors with many zero values) that mirror the masked shape of the correlation matrix. This procedure is continued until each variable lies in a single cluster.
hcsvd(X, k = "all", linkage = "single", reliability, R, max.iter, trace = TRUE)
hcsvd(X, k = "all", linkage = "single", reliability, R, max.iter, trace = TRUE)
X |
Data matrix of dimension |
k |
Number of sparse loadings to be used. This should be |
linkage |
The linkage function to be used. This should be one of |
reliability |
By default, the value of each cluster equals the distance calculated by the chosen linkage function.
If preferred, the value of each cluster can be assigned by its reliability. When |
R |
Sample correlation matrix of |
max.iter |
How many iterations should be performed for computing the sparse loadings.
Default is |
trace |
Print out progress as |
The sparse loadings are computed using the method by Shen & Huang (2008), implemented in
the irlba
package.
A list with two components:
dist.matrix |
The ultrametric distance matrix (cophenetic matrix) of the HC-SVD structure as an object of class |
u.cor |
The ultrametric correlation matrix of |
k.p |
A vector of length |
Bauer, J.O. (202Xb). Hierarchical variable clustering using singular vectors.
Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation, J. Multivar. Anal. 99, 1015–1034.
#We replicate the simulation study in Bauer (202Xb) ## Not run: p <- 100 n <- 300 b <- 5 design <- "a" Rho <- hcsvd.cor.sim(p = p, b = b, design = "a") X <- scale(mvtnorm::rmvnorm(300, mean=rep(0,100), sigma=Rho, checkSymmetry = FALSE)) colnames(X) = 1:ncol(X) hcsvd.obj <- hcsvd(X, k = "Kaiser") #The dendrogram can be obtained from the ultrametric distance matrix: plot(hclust(hcsvd.obj$dist.matrix)) ## End(Not run)
#We replicate the simulation study in Bauer (202Xb) ## Not run: p <- 100 n <- 300 b <- 5 design <- "a" Rho <- hcsvd.cor.sim(p = p, b = b, design = "a") X <- scale(mvtnorm::rmvnorm(300, mean=rep(0,100), sigma=Rho, checkSymmetry = FALSE)) colnames(X) = 1:ncol(X) hcsvd.obj <- hcsvd(X, k = "Kaiser") #The dendrogram can be obtained from the ultrametric distance matrix: plot(hclust(hcsvd.obj$dist.matrix)) ## End(Not run)
This function generates correlation matrices based on the simulation studies described in Bauer (202Xb).
hcsvd.cor.sim(p = p, b = b, design = design)
hcsvd.cor.sim(p = p, b = b, design = design)
p |
Number of variables. |
b |
Number of blocks. |
design |
Simulation design "a" or "b". |
A correlation matrix according to the chosen simulation design.
Bauer, J.O. (202Xb). Hierarchical variable clustering using singular vectors.
#The correlation matrix for simulation design (a) is given by #R <- hcsvd.cov.sim(p = 100, b = 5, design = "a")
#The correlation matrix for simulation design (a) is given by #R <- hcsvd.cov.sim(p = 100, b = 5, design = "a")
Performs a single iteration of BD-SVD: splits the data matrix into one (i.e., no split)
or two submatrices, depending on the structure of the first sparse loading (which is a sparse
approximation of the first right singular vector, i.e., a vector with many zero values) that mirrors the
shape of the covariance matrix.
single.bdsvd(X, dof, standardize = TRUE, max.iter)
single.bdsvd(X, dof, standardize = TRUE, max.iter)
X |
Data matrix of dimension |
dof |
Number of non-zero components in the sparse loading (degrees of freedom). If
|
standardize |
Standardize the data to have unit variance. Default is |
max.iter |
How many iterations should be performed for computing the sparse loading.
Default is |
The sparse loadings are computed using the method by Shen & Huang (2008), implemented in
the irlba
package.
A list containing the feature names of the submatrices of X
. It is either of length one (no
split) or length two (split into two submatrices).
Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.
Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation, J. Multivar. Anal. 99, 1015–1034.
#Replicate the illustrative example from Bauer (202Xa). ## Not run: p <- 300 #Number of variables. In Bauer (202Xa), p = 3000. n <- 500 #Number of observations b <- 3 #Number of blocks design <- "c" #Simulate data matrix X set.seed(1) Sigma <- bdsvd.cov.sim(p = p, b = b, design = design) X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma) colnames(X) <- 1:p ht <- bdsvd.ht(X) plot(0:(p-1), ht$BIC[,1], xlab = "|S|", ylab = "HBIC", main = "", type = "l") single.bdsvd(X, dof = ht$dof, standardize = FALSE) ## End(Not run)
#Replicate the illustrative example from Bauer (202Xa). ## Not run: p <- 300 #Number of variables. In Bauer (202Xa), p = 3000. n <- 500 #Number of observations b <- 3 #Number of blocks design <- "c" #Simulate data matrix X set.seed(1) Sigma <- bdsvd.cov.sim(p = p, b = b, design = design) X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma) colnames(X) <- 1:p ht <- bdsvd.ht(X) plot(0:(p-1), ht$BIC[,1], xlab = "|S|", ylab = "HBIC", main = "", type = "l") single.bdsvd(X, dof = ht$dof, standardize = FALSE) ## End(Not run)