Package 'bdsvd'

Title: Block Structure Detection Using Singular Vectors
Description: Performs block diagonal covariance matrix detection using singular vectors (BD-SVD), which can be extended to hierarchical variable clustering (HC-SVD). The methods are described in Bauer (202Xa) <doi:10.48550/arXiv.2211.16155> and Bauer (202Xb) <doi:10.48550/arXiv.2308.06820>.
Authors: Jan O. Bauer [aut, cre] , Ron Holzapfel [aut]
Maintainer: Jan O. Bauer <[email protected]>
License: GPL (>= 2)
Version: 0.2.0
Built: 2024-11-04 06:28:18 UTC
Source: CRAN

Help Index


Block Detection Using Singular Vectors (BD-SVD).

Description

Performs BD-SVD iteratively to reveal the block structure. Splits the data matrix into one (i.e., no split) or two submatrices, depending on the structure of the first sparse loading vv (which is a sparse approximation of the first right singular vector, i.e., a vector with many zero values) that mirrors the shape of the covariance matrix. This procedure is continued iteratively until the block diagonal structure has been revealed.

The data matrix ordered according to this revealed block diagonal structure can be obtained by bdsvd.structure.

Usage

bdsvd(X, dof.lim, anp = "2", standardize = TRUE, max.iter, trace = FALSE)

Arguments

X

Data matrix of dimension nxpn x p with possibly p>>np >> n.

dof.lim

Interval limits for the number of non-zero components in the sparse loading (degrees of freedom). If SS denotes the support of vv, then the cardinality of the support, S|S|, corresponds to the degrees of freedom. Default is dof.lim <- c(0, p-1) which is highly recommended to check for all levels of sparsity.

anp

Which regularization function should be used for the HBIC. anp = "1" implements anp=1a_{np} = 1 which corresponds to the BIC, anp = "2" implements anp=1/2log(np)a_{np} = 1/2 log(np) which corresponds to the regularization used by Bauer (202Xa), and anp = "3" implements anp=log(log(np))a_{np} = log(log(np)) which corresponds to the regularization used by Wang et al. (2009) and Wang et al. (2013).

standardize

Standardize the data to have unit variance. Default is TRUE.

max.iter

How many iterations should be performed for computing the sparse loading. Default is 200.

trace

Print out progress as iterations are performed. Default is TRUE.

Details

The sparse loadings are computed using the method by Shen & Huang (2008), implemented in the irlba package.

Value

A list containing the feature names of the submatrices of X. The length of the list equals the number of submatrices.

References

Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.

Wang, H., B. Li, and C. Leng (2009). Shrinkage tuning parameter selection with a diverging number of parameters, J. R. Stat. Soc. B 71 (3), 671–683.

Wang, L., Y. Kim, and R. Li (2013). Calibrating nonconvex penalized regression in ultra-high dimension, Ann. Stat. 41 (5), 2505–2536.

See Also

bdsvd.structure, bdsvd.ht, single.bdsvd

Examples

#Replicate the simulation study (c) from Bauer (202Xa).

## Not run: 
p <- 500 #Number of variables
n <- 250 #Number of observations
b <- 10  #Number of blocks
design <- "c" #Simulation design "a", "b", "c", or "d".

#Simulate data matrix X
set.seed(1)
Sigma <- bdsvd.cov.sim(p = p, b = b, design = design)
X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma)
colnames(X) <- seq_len(p)

bdsvd(X, standardize = FALSE)

## End(Not run)

Covariance Matrix Simulation for BD-SVD

Description

This function generates covariance matrices based on the simulation studies described in Bauer (202Xa).

Usage

bdsvd.cov.sim(p = p, b, design = design)

Arguments

p

Number of variables.

b

Number of blocks. Only required for simulation design "c" and "d".

design

Simulation design "a", "b", "c", or "d".

Value

A covariance matrix according to the chosen simulation design.

References

Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.

Examples

#The covariance matrix for simulation design (a) is given by
Sigma <- bdsvd.cov.sim(p = 500, b = 500, design = "a")

Hyperparameter Tuning for BD-SVD

Description

Finds the number of non-zero elements of the sparse loading according to the high-dimensional Bayesian information criterion (HBIC).

Usage

bdsvd.ht(X, dof.lim, standardize = TRUE, anp = "2", max.iter)

Arguments

X

Data matrix of dimension nxpn x p with possibly p>>np >> n.

dof.lim

Interval limits for the number of non-zero components in the sparse loading (degrees of freedom). If SS denotes the support of vv, then the cardinality of the support, S|S|, corresponds to the degrees of freedom. Default is dof.lim <- c(0, p-1) which is highly recommended to check for all levels of sparsity.

standardize

Standardize the data to have unit variance. Default is TRUE.

anp

Which regularization function should be used for the HBIC. anp = "1" implements anp=1a_{np} = 1 which corresponds to the BIC, anp = "2" implements anp=1/2log(np)a_{np} = 1/2 log(np) which corresponds to the regularization used by Bauer (202Xa), and anp = "3" implements anp=log(log(np))a_{np} = log(log(np)) which corresponds to the regularization used by Wang et al. (2009) and Wang et al. (2013).

max.iter

How many iterations should be performed for computing the sparse loading. Default is 200.

Details

The sparse loadings are computed using the method by Shen & Huang (2008), implemented in the irlba package. The computation of the HBIC is outlined in Bauer (202Xa).

Value

dof

The optimal number of nonzero components (degrees of freedom) according to the HBIC.

BIC

The HBIC for the different numbers of nonzero components.

References

Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.

Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation, J. Multivar. Anal. 99, 1015–1034.

Wang, H., B. Li, and C. Leng (2009). Shrinkage tuning parameter selection with a diverging number of parameters, J. R. Stat. Soc. B 71 (3), 671–683.

Wang, L., Y. Kim, and R. Li (2013). Calibrating nonconvex penalized regression in ultra-high dimension, Ann. Stat. 41 (5), 2505–2536.

See Also

bdsvd, single.bdsvd

Examples

#Replicate the illustrative example from Bauer (202Xa).


p <- 300 #Number of variables. In Bauer (202Xa), p = 3000
n <- 500 #Number of observations
b <- 3   #Number of blocks
design <- "c"

#Simulate data matrix X
set.seed(1)
Sigma <- bdsvd.cov.sim(p = p, b = b, design = design)
X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma)
colnames(X) <- seq_len(p)

ht <- bdsvd.ht(X)
plot(0:(p-1), ht$BIC[,1], xlab = "|S|", ylab = "HBIC", main = "", type = "l")
single.bdsvd(X, dof = ht$dof, standardize = FALSE)

Data Matrix Structure According to the Detected Block Structure.

Description

Either sorts the data matrix XX according to the detected block structure X1,...,XbX_1 , ... , X_b, ordered by the number of variables that the blocks contain. Or returns the detected submatrices each individually in a list object.

Usage

bdsvd.structure(X, block.structure, output = "matrix", block.order)

Arguments

X

Data matrix of dimension nxpn x p with possibly p>>np >> n.

block.structure

Output of bdsvd() or single.bdsvd() which identified the block structure.

output

Should the output be the data matrix ordered according to the blocks ("matrix"), or a list containing the submatrices ("submatrices"). Default is "matrix".

block.order

A vector that contains the order of the blocks detected by bdsvd() or single.bdsvd(). The vector must contain the index of each blocks exactly once. Default is 1:b where b is the total number of blocks.

Value

Either the data matrix X with columns sorted according to the detected blocks, or a list containing the detected submatrices.

References

Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.

See Also

bdsvd, single.bdsvd

Examples

#Toying with the illustrative example from Bauer (202Xa).


p <- 150 #Number of variables. In Bauer (202Xa), p = 3000.
n <- 500 #Number of observations
b <- 3   #Number of blocks
design <- "c"

#Simulate data matrix X
set.seed(1)
Sigma <- bdsvd.cov.sim(p = p, b = b, design = design)
X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma)
colnames(X) <- seq_len(p)

#Compute iterative BD-SVD
bdsvd.obj <- bdsvd(X, standardize = FALSE)

#Obtain the data matrix X, sorted by the detected blocks
colnames(bdsvd.structure(X, bdsvd.obj, output = "matrix") )
colnames(bdsvd.structure(X, bdsvd.obj, output = "matrix", block.order = c(2,1,3)) )

#Obtain the detected submatrices X_1, X_2, and X_3
colnames(bdsvd.structure(X, bdsvd.obj, output = "submatrices")[[1]] )
colnames(bdsvd.structure(X, bdsvd.obj, output = "submatrices")[[2]] )
colnames(bdsvd.structure(X, bdsvd.obj, output = "submatrices")[[3]] )

Block

Description

Class used within the package to store the structure and information about the detected blocks.

Slots

features

numeric vector that contains the the variables corresponding to this block.

block.columns

numeric vector that contains the indices of the singular vectors corresponding to this block.


Block Detection

Description

This function returns the block structure of a matrix.

Usage

detect.blocks(V, threshold = 0)

Arguments

V

Numeric matrix which either contains the loadings or is a covariance matrix.

threshold

All absolute values of V below the threshold are set to zero.

Value

An object of class Block containing the features and columns indices corresponding to each detected block.

References

Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.

See Also

bdsvd, single.bdsvd

Examples

#In the first example, we replicate the simulation study for the ad hoc procedure
#Est_0.1 from Bauer (202Xa). In the second example, we manually compute the first step
#of BD-SVD, which can be done using the bdsvd() and/or single.bdsvd(), for constructed
#sparse loadings

#Example 1: Replicate the simulation study (a) from Bauer (202Xa) for the ad hoc
#procedure Est_0.1.

p <- 500 #Number of variables
n <- 125 #Number of observations
b <- 500 #Number of blocks
design <- "a"

#Simulate data matrix X
set.seed(1)
Sigma <- bdsvd.cov.sim(p = p, b = b, design = design)
X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma)
colnames(X) <- 1:p

#Perform the ad hoc procedure
detect.blocks(cvCovEst::scadEst(dat = X, lambda = 0.2), threshold = 0)

#Example 2: Manually compute the first step of BD-SVD
#for some loadings V that mirror the two blocks
#("A", "B") and c("C", "D").

V <- matrix(c(1,0,
              1,0,
              0,1,
              0,1), 4, 2, byrow = TRUE)

rownames(V) <- c("A", "B", "C", "D")
detected.blocks <- detect.blocks(V)

#Variables in block one with corresponding column index:
detected.blocks[[1]]@features
detected.blocks[[1]]@block.columns

#Variables in block two with corresponding column index:
detected.blocks[[2]]@features
detected.blocks[[2]]@block.columns

Hierarchical Variable Clustering Using Singular Vectors (HC-SVD).

Description

Performs HC-SVD to reveal the hierarchical variable structure as descried in Bauer (202Xb). For this divise approach, each cluster is split into two clusters iteratively. Potential splits are identified by the first sparse loadings (which are sparse approximations of the first right singular vectors, i.e., vectors with many zero values) that mirror the masked shape of the correlation matrix. This procedure is continued until each variable lies in a single cluster.

Usage

hcsvd(X, k = "all", linkage = "single", reliability, R, max.iter, trace = TRUE)

Arguments

X

Data matrix of dimension nxpn x p. The data matrix is standardized during the analysis by hcsvd.

k

Number of sparse loadings to be used. This should be "all" for all sparse loadings, or "Kaiser" for as many sparse loadings as there are eigenvalues larger or equal to one (see Bauer (202Xb) for details). Selecting "Kaiser" reduces computation time.

linkage

The linkage function to be used. This should be one of "average", "single", or "RV" (for RV-coefficient).

reliability

By default, the value of each cluster equals the distance calculated by the chosen linkage function. If preferred, the value of each cluster can be assigned by its reliability. When reliability = spectral, the reliability is calculated by the averaged spectral norm.

R

Sample correlation matrix of X. By default, R <- cov(X).

max.iter

How many iterations should be performed for computing the sparse loadings. Default is 200.

trace

Print out progress as p1p-1 iterations for divisive hierarchical clustering are performed. Default is TRUE.

Details

The sparse loadings are computed using the method by Shen & Huang (2008), implemented in the irlba package.

Value

A list with two components:

dist.matrix

The ultrametric distance matrix (cophenetic matrix) of the HC-SVD structure as an object of class dist.

u.cor

The ultrametric correlation matrix of XX obtained by HC-SVD as an object of class matrix.

k.p

A vector of length p1p-1 containing the ratio ki/pik_i/p_i of the kik_i sparse loadings used relative to all sparse loadings pip_i for the split of each cluster. The ratio is set to NA if the cluster contains only two variables as the search for sparse loadings that reflect the split is not required in this case.

References

Bauer, J.O. (202Xb). Hierarchical variable clustering using singular vectors.

Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation, J. Multivar. Anal. 99, 1015–1034.

Examples

#We replicate the simulation study in Bauer (202Xb)

## Not run: 
p <- 100
n <- 300
b <- 5
design <- "a"

Rho <- hcsvd.cor.sim(p = p, b = b, design = "a")
X <- scale(mvtnorm::rmvnorm(300, mean=rep(0,100), sigma=Rho, checkSymmetry = FALSE))
colnames(X) = 1:ncol(X)
hcsvd.obj <- hcsvd(X, k = "Kaiser")

#The dendrogram can be obtained from the ultrametric distance matrix:
plot(hclust(hcsvd.obj$dist.matrix))

## End(Not run)

Correlation Matrix Simulation for HC-SVD

Description

This function generates correlation matrices based on the simulation studies described in Bauer (202Xb).

Usage

hcsvd.cor.sim(p = p, b = b, design = design)

Arguments

p

Number of variables.

b

Number of blocks.

design

Simulation design "a" or "b".

Value

A correlation matrix according to the chosen simulation design.

References

Bauer, J.O. (202Xb). Hierarchical variable clustering using singular vectors.

Examples

#The correlation matrix for simulation design (a) is given by
#R <- hcsvd.cov.sim(p = 100, b = 5, design = "a")

Single Iteration of Block Detection Using Singular Vectors (BD-SVD).

Description

Performs a single iteration of BD-SVD: splits the data matrix into one (i.e., no split) or two submatrices, depending on the structure of the first sparse loading vv (which is a sparse approximation of the first right singular vector, i.e., a vector with many zero values) that mirrors the shape of the covariance matrix.

Usage

single.bdsvd(X, dof, standardize = TRUE, max.iter)

Arguments

X

Data matrix of dimension nxpn x p with possibly p>>np >> n.

dof

Number of non-zero components in the sparse loading (degrees of freedom). If SS denotes the support of vv, then the cardinality of the support, S|S|, corresponds to the degrees of freedom.

standardize

Standardize the data to have unit variance. Default is TRUE.

max.iter

How many iterations should be performed for computing the sparse loading. Default is 200.

Details

The sparse loadings are computed using the method by Shen & Huang (2008), implemented in the irlba package.

Value

A list containing the feature names of the submatrices of X. It is either of length one (no split) or length two (split into two submatrices).

References

Bauer, J.O. (202Xa). High-dimensional block diagonal covariance structure detection using singular vectors.

Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation, J. Multivar. Anal. 99, 1015–1034.

See Also

bdsvd, bdsvd.ht

Examples

#Replicate the illustrative example from Bauer (202Xa).

## Not run: 

p <- 300 #Number of variables. In Bauer (202Xa), p = 3000.
n <- 500 #Number of observations
b <- 3   #Number of blocks
design <- "c"

#Simulate data matrix X
set.seed(1)
Sigma <- bdsvd.cov.sim(p = p, b = b, design = design)
X <- mvtnorm::rmvnorm(n, mean=rep(0, p), sigma=Sigma)
colnames(X) <- 1:p

ht <- bdsvd.ht(X)
plot(0:(p-1), ht$BIC[,1], xlab = "|S|", ylab = "HBIC", main = "", type = "l")
single.bdsvd(X, dof = ht$dof, standardize = FALSE)


## End(Not run)