---
title: "Variable Selection using Shrinkage Priors (VsusP)"
author: "Nilson Chapagain, Debdeep Pati"
date: "`r Sys.Date()`"
output:
  rmarkdown::pdf_document:
    df_print: paged
    number_sections: true
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Variable Selection using Shrinkage Priors (VsusP)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
header-includes:
  - \usepackage{fvextra}
  - \DefineVerbatimEnvironment{Highlighting}{Verbatim}{breaklines,commandchars=\\\{\}}
  - \usepackage{titlesec}
  - \titlespacing*{\section}{0pt}{1em}{0.5em}
  - \titlespacing*{\subsection}{0pt}{0.75em}{0.5em}
  - \titlespacing*{\subsubsection}{0pt}{0.5em}{0.25em}
  - \usepackage{anyfontsize}
  - \usepackage{lmodern}
  - \renewcommand{\normalsize}{\fontsize{12}{14}\selectfont}
  - \renewcommand{\large}{\fontsize{14}{16}\selectfont}
  - \renewcommand{\Large}{\fontsize{16}{18}\selectfont}
  - \renewcommand{\LARGE}{\fontsize{18}{20}\selectfont}
  - \renewcommand{\huge}{\fontsize{20}{24}\selectfont}
  - \renewcommand{\Huge}{\fontsize{22}{26}\selectfont}
  - \usepackage{tocloft}
  - \renewcommand{\cftsecnumwidth}{2.5em}
  - \renewcommand{\cftsubsecnumwidth}{3em}
  - \renewcommand{\cftsubsubsecnumwidth}{3.5em}
  - \renewcommand{\cftsecfont}{\normalfont\bfseries}
  - \renewcommand{\cftsubsecfont}{\normalfont}
  - \renewcommand{\cftsubsubsecfont}{\normalfont}
  - \setcounter{secnumdepth}{3}
  - \setcounter{tocdepth}{3}
  - \renewcommand{\thesection}{\arabic{section}}
  - \renewcommand{\thesubsection}{\thesection.\arabic{subsection}}
  - \renewcommand{\thesubsubsection}{\thesubsection.\arabic{subsubsection}}
  - \setcounter{section}{0}
  - \usepackage{setspace}
  - \doublespacing
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>")
```

\newpage

# Introduction

**VsusP** provides robust methods for variable selection in Gaussian linear models, using shrinkage priors to handle high-dimensional data. The main approach is a two-step procedure: the number of significant variables is first determined by clustering the coefficients, and the variables themselves are then selected based on their posterior medians.

## Installation

You can install the most recent version of the 'VsusP' package from [GitHub](https://github.com/nilson01/VsusP-variable-selection-using-shrinkage-priors) with the following commands:

```{r, eval = FALSE}
# Plain installation
devtools::install_github("nilson01/VsusP")

# Installation with the vignette built
devtools::install_github("nilson01/VsusP", build_vignettes = TRUE)

# Load the library
library(VsusP)
```

## Methodology

### Theoretical Background

Variable selection in high-dimensional models has garnered significant interest, especially for its applications in complex biological and environmental research. Traditional methods based on spike-and-slab priors face computational challenges in high dimensions, motivating the use of continuous shrinkage priors. These priors aid computation and interpretability by mixing global and local shrinkage effects, and thus handle high-dimensional sparse coefficient vectors more efficiently.

We work in the context of the Gaussian linear model
\[
Y = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I_n),
\]
where \( Y \) is the \( n \times 1 \) response vector, \( X \) is the \( n \times p \) covariate matrix, and \( \beta \) is the \( p \times 1 \) coefficient vector. Continuous shrinkage priors, such as the horseshoe prior, have shown promise for variable selection by balancing sparsity against signal retention.
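For concreteness, the horseshoe prior can be written hierarchically as
\[
\beta_j \mid \lambda_j, \tau \sim N(0, \lambda_j^2 \tau^2), \qquad
\lambda_j \sim C^{+}(0, 1), \qquad
\tau \sim C^{+}(0, 1),
\]
where \( \tau \) controls global shrinkage toward zero and the local scales \( \lambda_j \) let individual signals escape that shrinkage. The horseshoe+ prior used in the examples below adds a further half-Cauchy mixing layer on the local scales.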
### Main Results

The `VsusP` package implements a novel post-MCMC variable selection method using shrinkage priors. The method comprises two approaches: 2-Means (2-M) and Sequential 2-Means (S2M) variable selection. Both are designed for high-dimensional problems and identify significant variables efficiently without extensive tuning.

### Sequential 2-Means (S2M) Variable Selection

The S2M method is a refinement that aims to minimize masking errors, which often occur when signal strengths are heterogeneous across variables. It iteratively clusters the absolute values of the coefficients, adjusting the clusters according to a threshold parameter `b` that is crucial for distinguishing signal from noise.

#### Algorithm

1. Cluster the absolute values of the coefficients with the k-means algorithm (see the sketch below).
2. Iteratively refine the clustering by recalculating each cluster mean and reassigning members according to the threshold `b`.
3. Determine the set of significant variables from the stability of cluster memberships across iterations.

**Choosing the tuning parameter \( b \):** the tuning parameter \( b \) is chosen to balance masking and swamping errors. Visual inspection of the plot of \( b_i \) against the estimated number of significant variables helps identify a suitable value.

Note: the parameter values in the examples below are for demonstration only; choose values appropriate to the problem you are dealing with.
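To build intuition for step 1, here is a minimal sketch, assuming only base R's `stats::kmeans`, of a single 2-means step on one coefficient vector: the cluster with the larger center is treated as tentative signal. This illustrates the clustering idea only; it is not the package's implementation, which operates on posterior samples and applies the sequential threshold `b`.

```{r two-means-sketch}
# Illustration only (not the VsusP implementation): a single 2-means step.
# Coefficients falling in the cluster with the larger center are treated
# as tentative signals; S2M repeats and thresholds this kind of split.
set.seed(1)
two_means_signals <- function(beta_vec) {
  km <- stats::kmeans(abs(beta_vec), centers = 2)
  signal_cluster <- which.max(km$centers)
  which(km$cluster == signal_cluster)
}

# Two clear signals (positions 1 and 2) among three near-zero coefficients
two_means_signals(c(6, 3, 0.05, -0.02, 0.01))
```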
## Example

This example demonstrates how to use `VsusP` to perform variable selection on simulated data:

```{r example-setup}
# Install MASS if it is not already available
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}
library(MASS)

# Set seed for reproducibility
set.seed(123)

# Simulate data from a Gaussian linear model
sim.XY <- function(n, p, beta) {
  X <- matrix(rnorm(n * p), n, p)
  Y <- X %*% beta + rnorm(n)
  return(list(X = X, Y = Y, beta = beta))
}

n <- 10
p <- 5
beta <- exp(rnorm(p))
data <- sim.XY(n, p, beta)
```

### Applying Sequential2Means

`Sequential2Means` clusters the posterior samples of the coefficients and estimates the number of significant variables for each tuning parameter value:

```{r s2m-apply}
b.i <- seq(0, 1, by = 0.05)
S2M <- VsusP::Sequential2Means(
  X = data$X, Y = as.vector(data$Y), b.i = b.i,
  prior = "horseshoe+", n.samples = 300, burnin = 100
)
Beta <- S2M$Beta
H.b.i <- S2M$H.b.i
```

### Identifying Significant Variables

The `OptimalHbi` function plots `b.i` against `H.b.i` to help choose the number of significant variables; this demo then simply takes the minimum of `H.b.i`:

```{r s2m-results}
VsusP::OptimalHbi(bi = b.i, Hbi = H.b.i)

# Simple heuristic for the demo: take the smallest estimated number of
# signals across the tuning grid
optimal_b_i <- b.i[which.min(S2M$H.b.i)]
optimal_Hbi <- min(S2M$H.b.i)
cat("Optimal b.i:", optimal_b_i, "\n")
cat("Optimal H.b.i:", optimal_Hbi, "\n")
```

```{r s2m-important-variables}
H <- optimal_Hbi

# Variable selection
impVariablesGLM <- VsusP::S2MVarSelection(Beta, H)
impVariablesGLM
```

# Detailed Function Descriptions

## Sequential2Means

**Purpose:** Implements sequential 2-means clustering to segregate significant variables from noise, which is crucial in high-dimensional settings. The function fits the Gaussian linear model under the chosen shrinkage prior and uses the posterior distribution of the coefficients to determine, for each tuning parameter value, how many variables play a genuine role in explaining the response. Downstream selection (via `S2MVarSelection`) is then based on the posterior medians of the absolute coefficients, ensuring that the most consistently relevant variables are chosen while keeping both type I and type II errors low.

**Parameters:**

- **X**: Design matrix (n x p), where n is the number of observations and p the number of variables.
- **Y**: Response vector (n x 1).
- **b.i**: Vector of tuning parameter values.
- **prior**: Type of shrinkage prior (e.g., "ridge", "lasso", "horseshoe").
- **n.samples**: Number of MCMC samples.
- **burnin**: Number of burn-in samples.

**Output:** Returns a list containing:

- **Beta**: Matrix of posterior coefficient samples across all iterations.
- **b.i**: Tested values of the tuning parameter.
- **H.b.i**: Estimated number of significant variables for each tuning parameter value.

**Process:** Iteratively refines the classification of variables into signal or noise based on the Bayesian posterior distributions.

```{r sequential2means-example}
set.seed(123)
n <- 10
p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% exp(rnorm(p)) + rnorm(n)
b.i <- seq(0, 1, length.out = 20)

result <- VsusP::Sequential2Means(
  X, as.vector(Y), b.i,
  prior = "horseshoe+", n.samples = 300, burnin = 100
)

cat("Beta: \n")
print(result$Beta[1:5, ])
cat("\n\n")
cat("H.b.i: \n")
print(result$H.b.i)
```

## OptimalHbi

**Purpose:** Determines the optimal number of significant variables ('H') by assessing stability across the range of tuning parameter values, ensuring the model neither overfits nor underfits. The function plots 'bi' against 'Hbi'; the user should select 'H' at the point of sharpest change in the plot, which indicates the most stable variable selection and the best compromise between model complexity and fitting accuracy.

**Parameters:**

- **bi**: Vector of tuning parameter values.
- **Hbi**: Estimated number of signals for each tuning parameter value.

**Output:** A plot of 'bi' vs 'Hbi' from which to select the optimal 'H', the most appropriate number of significant variables to prevent overfitting or underfitting.

**Process:** Select 'H' at the point of the sharpest change in the plot of 'bi' vs 'Hbi', indicative of stable variable selection.

```{r optimal-hbi-example}
VsusP::OptimalHbi(b.i, result$H.b.i)
```
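`OptimalHbi` intentionally leaves the choice of 'H' to visual inspection. If a numeric stand-in is convenient, one simple possibility, shown below as a hypothetical helper that is not part of VsusP, is to take the value of `H.b.i` just after its largest one-step drop as the "sharpest change":

```{r elbow-heuristic}
# Hypothetical helper (not part of VsusP): approximate the visual
# "sharpest change" rule by locating the largest one-step drop in H.b.i.
sharpest_change_H <- function(b.i, H.b.i) {
  drops <- -diff(H.b.i)     # positive entries mark decreases in H.b.i
  elbow <- which.max(drops) # position of the largest drop
  c(b = b.i[elbow + 1], H = H.b.i[elbow + 1])
}

sharpest_change_H(b.i, result$H.b.i)
```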
## S2MVarSelection

**Purpose:** Extracts the subset of significant variables from the `Sequential2Means` output, based on the posterior medians of the absolute coefficients. It leverages the detailed clustering information, specifically the median absolute values of the coefficients, to prioritize variables, making the method responsive to subtle variations in data structure and signal intensity.

**Parameters:**

- **Beta**: Matrix of posterior samples (N x p).
- **H**: Estimated number of significant variables.

**Output:** Indices of the significant variables, directly identifying the key predictors.

**Process:** Selects variables whose coefficients show consistent importance across the posterior samples, minimizing error rates.

```{r s2mvar-selection-example}
optimal_Hbi <- min(result$H.b.i)
H <- optimal_Hbi

significantVariables <- VsusP::S2MVarSelection(result$Beta, H)
cat("Significant Variables: \n")
print(significantVariables)
```

## Sequential2MeansBeta

**Purpose:** Extends the functionality of `Sequential2Means` by allowing the specification of a range for the tuning parameters, enabling more flexible model testing. Unlike `Sequential2Means`, it operates directly on a matrix of posterior samples and generates the tuning grid from a user-specified range and density, accommodating scenarios where the distribution of signals is hypothesized to be non-uniform.

**Parameters:**

- **Beta**: Matrix of N by p dimensions consisting of N posterior samples of p variables.
- **lower**: Lower bound of the tuning parameter range.
- **upper**: Upper bound of the tuning parameter range.
- **l**: Number of grid points within the tuning parameter range.

**Output:** As for `Sequential2Means`, it returns a list that includes:

- **p**: Number of variables.
- **b.i**: Vector of tested tuning parameter values.
- **H.b.i**: Estimated number of significant variables for each tuning parameter value.

**Process:** Evaluates the significance of coefficients across a user-defined range of tuning parameters, enhancing the ability to discern the underlying signal structure.

```{r sequential2meansbeta-example}
Beta <- result$Beta
lower <- 0
upper <- 1
l <- 20

S2MBeta <- VsusP::Sequential2MeansBeta(Beta, lower, upper, l)

cat("p : \n")
print(S2MBeta$p)
cat("\n\n")
cat("bi : \n")
print(S2MBeta$b.i)
cat("\n\n")
cat("Hbi : \n")
print(S2MBeta$H.b.i)
```

# Simulation example

```{r Simulation}
library(MASS)
library(VsusP)

set.seed(12345)

n <- 20
p <- 5
r <- 3      # number of strong signals in beta_true
rho <- 0.95 # within-block correlation

# Block-correlated design matrix
Sigma <- diag(1, p)
block_size <- 5
num_blocks <- p / block_size
for (b in 0:(num_blocks - 1)) {
  block_start <- b * block_size + 1
  block_end <- min(p, block_start + block_size - 1)
  Sigma[block_start:block_end, block_start:block_end] <- rho
  diag(Sigma[block_start:block_end, block_start:block_end]) <- 1
}
Sigma <- Sigma + diag(0.001, p) # Ensure positive definiteness

X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
beta_true <- c(6, rep(3, 2), 1, 0)
Y <- as.vector(X %*% beta_true + rnorm(n))

# Scale the tuning grid by the variance of Y
var_y <- var(Y)
b_i_range <- seq(0.5 * var_y, 10 * var_y, length.out = 20)

results <- Sequential2Means(X = X, Y = Y, b.i = b_i_range,
                            prior = "horseshoe+", n.samples = 300, burnin = 100)

plot(b_i_range, results$H.b.i, type = "b",
     xlab = "Tuning parameter b.i",
     ylab = "Estimated number of signals (H.b.i)",
     main = "b.i vs H.b.i for the simulation example")

optimal_b_i <- b_i_range[which.min(results$H.b.i)]
optimal_Hbi <- min(results$H.b.i)
cat("Optimal b.i:", optimal_b_i, "\n")
cat("Optimal H.b.i:", optimal_Hbi, "\n")

H <- optimal_Hbi

# Variable selection
Beta <- results$Beta
impVariablesGLM <- S2MVarSelection(Beta, H)
impVariablesGLM
```
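Because the data are simulated, the selection can be checked against the truth. The snippet below is an illustrative sanity check; it assumes, as documented above, that `S2MVarSelection` returns the indices of the selected variables.

```{r sim-check}
# Illustrative sanity check: compare the selected indices with the indices
# of the truly nonzero entries of beta_true used to generate the data.
true_signals <- which(beta_true != 0)
cat("True signal indices:", true_signals, "\n")
cat("Selected indices:   ", impVariablesGLM, "\n")
```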
\newpage

# References

- **Li, H., & Pati, D. (2017).** "Variable Selection Using Shrinkage Priors." *Computational Statistics & Data Analysis*, 107, 107–119.
- **Makalic, E., & Schmidt, D. F. (2016).** "High-Dimensional Bayesian Regularised Regression with the BayesReg Package." arXiv:1611.06649.
- **Bhattacharya, A., Pati, D., Pillai, N. S., & Dunson, D. B. (2015).** "Dirichlet-Laplace Priors for Optimal Shrinkage." *Journal of the American Statistical Association*, 110(512), 1479–1490.
- **Rosenwald, A., et al. (2002).** "The Use of Molecular Profiling to Predict Survival After Chemotherapy for Diffuse Large-B-Cell Lymphoma." *New England Journal of Medicine*, 346(25), 1937–1947.
- **Bhattacharya, A., & Dunson, D. B. (2011).** "Sparse Bayesian Infinite Factor Models." *Biometrika*, 98(2), 291–306.

For further information and updates, visit the VsusP package repository at <https://github.com/nilson01/VsusP-variable-selection-using-shrinkage-priors>.