Introduction
VsusP provides robust methods for variable selection
within Gaussian linear models, utilizing shrinkage priors to enhance the
analysis of high-dimensional data. The main approach involves a two-step
process where the number of significant variables is first determined by
clustering coefficients, followed by selection based on posterior
medians.
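The posterior-median selection in the second step can be sketched as follows (a simplified illustration on a synthetic posterior-sample matrix; the variable names and the top-H rule here are for exposition, not the package's internal code):

```r
# Simplified second step: given N posterior draws of p coefficients and an
# estimated number of signals H, keep the H variables whose absolute
# coefficients have the largest posterior medians (illustrative only).
set.seed(7)
N <- 200; p <- 5; H <- 2
Beta <- matrix(rnorm(N * p, sd = 0.1), N, p)   # noise draws for all variables
Beta[, c(2, 4)] <- Beta[, c(2, 4)] + 3         # shift two columns: true signals
med <- apply(abs(Beta), 2, median)             # posterior median of |beta_j|
selected <- order(med, decreasing = TRUE)[1:H] # indices of the H largest
selected
```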
Installation
You can install the most recent version of the ‘VsusP’ package from GitHub
using the following commands:
# Plain installation
devtools::install_github("nilson01/VsusP")
# For installation with vignette
devtools::install_github("nilson01/VsusP", build_vignettes = TRUE)
# load the library
library(VsusP)
Methodology
Theoretical Background
Variable selection in high-dimensional models has garnered
significant interest, especially for its applications in complex
biological and environmental research. Traditional methods using
spike-and-slab priors face computational challenges in high dimensions,
motivating the use of continuous shrinkage priors. These priors
facilitate computation and interpretability by allowing for a mixture of
global and local shrinkage effects, thus handling high-dimensional
sparse vectors more efficiently.
In the context of Gaussian linear models: Y = Xβ + ϵ, ϵ ∼ N(0, σ²Iₙ),
where Y is the n × 1 response vector, X is the n × p covariate matrix,
and β is the p × 1 coefficient vector. Continuous shrinkage priors, such
as the horseshoe prior, have shown promise in variable selection by
providing a balance between sparsity and signal retention.
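To illustrate the kind of shrinkage such priors induce, the following sketch draws coefficients from the horseshoe prior's normal scale-mixture representation (illustrative only; VsusP works with posterior samples obtained via MCMC, not draws from the prior):

```r
# Illustrative draw from the horseshoe prior (not VsusP code):
#   beta_j | lambda_j, tau ~ N(0, lambda_j^2 * tau^2)
#   lambda_j ~ C+(0, 1) (local scales), tau ~ C+(0, 1) (global scale)
set.seed(42)
p <- 1000
tau <- abs(rcauchy(1))      # global scale: a small tau shrinks all coefficients
lambda <- abs(rcauchy(p))   # heavy-tailed local scales let true signals escape
beta <- rnorm(p, mean = 0, sd = lambda * tau)
# Most draws sit near zero while a few are very large:
# sparsity with signal retention.
summary(abs(beta))
```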
Main Results
The VsusP
package implements a novel post-MCMC variable
selection method using shrinkage priors. This method consists primarily
of two approaches: the 2-Means (2-M) and Sequential 2-Means (S2M)
variable selection. Both methods are designed to address
high-dimensional challenges by efficiently identifying significant
variables without the need for extensive tuning parameters.
Sequential 2-Means (S2M) Variable Selection
The S2M method is a refined approach that aims to minimize masking
errors, which often occur when signal strengths are heterogeneous across
variables. It involves iterative clustering of the absolute values of
the coefficients, adjusting clusters based on a dynamic threshold
parameter b, which is crucial for distinguishing signal from noise.
Algorithm
- Cluster the absolute values of the coefficients using the k-means
algorithm.
- Continuously adjust the clustering by recalculating the mean of each
cluster and updating memberships based on the threshold b.
- Determine the set of significant variables based on the stability of
cluster memberships across iterations.
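A minimal sketch of how such threshold-guided clustering can count signals for a single coefficient draw and threshold b (the helper `count_signals` and its stopping rule are illustrative simplifications, not the package's internal implementation):

```r
# Illustrative signal count for one coefficient draw and threshold b:
# repeatedly 2-means-cluster the tentative signal set and stop once the
# two cluster means are within b of each other (simplified sketch).
count_signals <- function(beta, b) {
  a <- abs(beta)
  km <- kmeans(a, centers = 2)                 # split |beta| into two clusters
  big <- which(km$cluster == which.max(km$centers))
  repeat {
    if (length(big) < 2) break
    km2 <- kmeans(a[big], centers = 2)
    if (abs(diff(as.vector(km2$centers))) < b) break  # no longer separable
    big <- big[km2$cluster == which.max(km2$centers)]
  }
  length(big)                                  # estimated number of signals
}

set.seed(1)
beta_draw <- c(rnorm(3, mean = 5), rnorm(7, sd = 0.1))  # 3 signals, 7 noise
count_signals(beta_draw, b = 1)
```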
Choosing Tuning Parameter b: The tuning parameter
b is chosen to balance the
masking and swamping errors. Visual inspection of the plot of b.i vs. the
number of significant variables helps identify the optimal value
of b.
Note: The parameter values in the examples are for demonstration
purposes only; choose parameters appropriate to the problem you are
dealing with.
Example
This example demonstrates how to use VsusP to perform
variable selection on simulated data:
# Install MASS if it is not already available
if (!requireNamespace("MASS", quietly = TRUE)) {
install.packages("MASS")
}
library(MASS)
# Set seed for reproducibility
set.seed(123)
# Simulate data
sim.XY <- function(n, p, beta) {
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% beta + rnorm(n)
return(list(X = X, Y = Y, beta = beta))
}
n <- 10
p <- 5
beta <- exp(rnorm(p))
data <- sim.XY(n, p, beta)
Applying Sequential2Means
Sequential2Means is used to cluster the coefficients and
estimate the number of significant variables:
b.i <- seq(0, 1, by = 0.05)
S2M <- VsusP::Sequential2Means(X = data$X, Y = as.vector(data$Y), b.i = b.i, prior = "horseshoe+", n.samples = 300, burnin = 100)
Beta <- S2M$Beta
H.b.i <- S2M$H.b.i
Identifying Significant Variables
Using the OptimalHbi
function, the optimal number of
significant variables is determined:
VsusP::OptimalHbi(bi = b.i, Hbi = H.b.i)
optimal_b_i <- b.i[which.min(S2M$H.b.i)]
optimal_Hbi <- min(S2M$H.b.i)
cat("Optimal b.i: \n", optimal_b_i, "\n")
#> Optimal b.i:
#> 0
cat("Optimal H.b.i: \n", optimal_Hbi, "\n")
#> Optimal H.b.i:
#> 3
H <- optimal_Hbi
# Variable selection
impVariablesGLM <- VsusP::S2MVarSelection(Beta, H)
impVariablesGLM
#> [1] 3 4 1
Detailed Function Descriptions
Sequential2Means
Purpose: Implements sequential two-means
clustering to segregate significant variables from noise, crucial for
high-dimensional data settings. Given the design matrix and response,
the function draws posterior samples of the coefficients under the
chosen shrinkage prior and, for each value of the tuning parameter,
clusters the absolute values of the posterior draws to estimate how many
variables play a genuine role in explaining the response variable in a
Gaussian linear model setting.
Parameters:
- X: Design matrix (n x p), where n is the number of observations and p
is the number of variables.
- Y: Response vector (n x 1).
- b.i: Vector of tuning parameters.
- prior: Type of shrinkage prior (e.g., “ridge”, “lasso”, “horseshoe”,
“horseshoe+”).
- n.samples: Number of MCMC samples.
- burnin: Number of burn-in samples.
Output: Returns a list containing:
- Beta: Coefficient matrix across all iterations.
- b.i: Tested values of the tuning parameter.
- H.b.i: Estimated number of significant variables for each tuning
parameter value.
Process: Iteratively refines variable classification
into signal or noise based on the Bayesian posterior distributions.
set.seed(123)
n <- 10
p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% exp(rnorm(p)) + rnorm(n)
b.i <- seq(0, 1, length.out = 20)
result <- VsusP::Sequential2Means(X, as.vector(Y), b.i, prior = "horseshoe+", n.samples = 300, burnin = 100)
cat("Beta: \n")
#> Beta:
print(result$Beta[1:5, ])
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.03209725 1.3798302 0.6598352 6.105249 0.37699363
#> [2,] -0.03093031 1.2403271 0.3906860 5.769710 0.15751401
#> [3,] 0.06496785 1.2439758 0.9864951 6.784788 0.11080311
#> [4,] 1.52096234 0.2004912 -0.4766526 3.566845 0.63325843
#> [5,] -0.02993973 1.1910556 0.4649243 6.338999 0.02854982
cat("\n\n")
cat("H.b.i: \n")
#> H.b.i:
print(result$H.b.i)
#> [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
OptimalHbi
Purpose: Determines the optimal number of
significant variables (‘H’) by assessing stability across different
values of the tuning parameter, ensuring the model neither overfits nor
underfits. The function plots ‘bi’ vs ‘Hbi’; the user should select ‘H’
at the point of sharpest change in the plot, which indicates the most
stable variable selection and the best compromise between model
complexity and fitting accuracy.
Parameters:
- bi: Vector of tuning parameters.
- Hbi: Estimated number of signals for each tuning parameter.
Output: A plot from which to select the optimal ‘H’ value, indicating
the most appropriate number of significant variables to prevent
overfitting or underfitting.
Process: Select the ‘H’ at the point of the sharpest
change in the plot of ‘bi’ vs ‘Hbi’, indicative of stable variable
selection.
VsusP::OptimalHbi(b.i, result$H.b.i)
S2MVarSelection
Purpose: Extracts the subset of significant variables
based on the highest posterior medians of absolute coefficients from the
Sequential2Means output. It leverages detailed information from the
posterior samples, specifically the median absolute values of the
coefficients, to prioritize variables: the H variables with the largest
posterior medians are selected, ensuring that the most consistently
relevant predictors are chosen while minimizing both type I and type II
errors.
Parameters:
- Beta: Matrix of posterior samples.
- H: Number of significant variables estimated.
Output: Indices of significant variables, directly identifying key
predictors.
Process: Selects variables whose coefficients show
consistent importance across samples, minimizing error rates.
optimal_Hbi <- min(result$H.b.i)
H <- optimal_Hbi
significantVariables <- VsusP::S2MVarSelection(result$Beta, H)
cat("Significant Variables: \n")
#> Significant Variables:
print(significantVariables)
#> [1] 4 2 3
Sequential2MeansBeta
Purpose: Extends the functionality of
Sequential2Means by operating directly on a matrix of posterior samples
and allowing precise control over the range and density of the tuning
parameter values to be tested, accommodating scenarios where the
distribution of signals is hypothesized to be non-uniform. Like
Sequential2Means, it returns a list with p (number of variables), b.i
(tested tuning parameters), and H.b.i (estimated number of significant
variables for each tuning parameter), facilitating detailed analysis of
variable significance across a customized range of model complexities.
Parameters:
- Beta: Matrix of N by p dimensions consisting of N posterior samples of
p variables.
- lower: Lower bound of the tuning parameter values.
- upper: Upper bound of the tuning parameter values.
- l: Number of points within the tuning parameter range.
Output: Similar to Sequential2Means, it returns a list that includes:
- p: Number of variables.
- b.i: Vector of tested tuning parameters.
- H.b.i: Estimated number of significant variables for each tuning
parameter.
Process: Evaluates the significance of coefficients
across a user-defined range of tuning parameters, enhancing the ability
to discern the underlying signal structure.
Beta <- result$Beta
lower <- 0
upper <- 1
l <- 20
S2MBeta <- VsusP::Sequential2MeansBeta(Beta, lower, upper, l)
cat("p : \n")
#> p :
print(S2MBeta$p)
#> [1] 5
cat("\n\n")
cat("bi : \n")
#> bi :
print(S2MBeta$b.i)
#> [1] 0.00000000 0.05263158 0.10526316 0.15789474 0.21052632 0.26315789
#> [7] 0.31578947 0.36842105 0.42105263 0.47368421 0.52631579 0.57894737
#> [13] 0.63157895 0.68421053 0.73684211 0.78947368 0.84210526 0.89473684
#> [19] 0.94736842 1.00000000
cat("\n\n")
cat("Hbi : \n")
#> Hbi :
print(S2MBeta$H.b.i)
#> [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Simulation example
library(MASS)
library(VsusP)
set.seed(12345)
n <- 20
p <- 5
r <- 3
rho <- 0.95
Sigma <- diag(1, p)
block_size <- 5
num_blocks <- p / block_size
for (b in 0:(num_blocks - 1)) {
block_start <- b * block_size + 1
block_end <- min(p, block_start + block_size - 1)
Sigma[block_start:block_end, block_start:block_end] <- rho
diag(Sigma[block_start:block_end, block_start:block_end]) <- 1
}
Sigma <- Sigma + diag(0.001, p) # Ensure positive definiteness
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
beta_true <- c(6, rep(3, 2), 1, 0)
Y <- as.vector(X %*% beta_true + rnorm(n))
var_y <- var(Y)
b_i_range <- seq(0.5 * var_y, 10 * var_y, length.out = 20) # Check the scale of var(Y)
results <- Sequential2Means(X = X, Y = Y, b.i = b_i_range, prior = "horseshoe+", n.samples = 300, burnin = 100)
plot(b_i_range, results$H.b.i, type = 'b', xlab = "Tuning parameter b.i", ylab = "Estimated number of signals (H.b.i)",
main = "Plot of b.i vs H.b.i for Simulation 2")
optimal_b_i <- b_i_range[which.min(results$H.b.i)]
optimal_Hbi <- min(results$H.b.i)
cat("Optimal b.i:", optimal_b_i, "\n")
#> Optimal b.i: 60.33691
cat("Optimal H.b.i:", optimal_Hbi, "\n")
#> Optimal H.b.i: 2
H <- optimal_Hbi
# Variable selection
Beta <- results$Beta
impVariablesGLM <- S2MVarSelection(Beta, H)
impVariablesGLM
#> [1] 1 2