--- title: "The Terminating-LARS (T-LARS) Method: Usage and Simulations" author: | | Jasin Machkour^\#^, Simon Tien^\#^, Daniel P. Palomar^\*^, Michael Muma^\#^ | | ^\#^Technische Universität Darmstadt | ^\*^The Hong Kong University of Science and Technology date: "`r Sys.Date()`" output: html_document: theme: flatly highlight: pygments toc: yes toc_depth: 1 toc_float: yes css: vignette_styles.css mathjax: "https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-AMS_CHTML.js" prettydoc::html_pretty: theme: tactile highlight: vignette toc: yes toc_depth: 2 toc-title: "Table of Contents" csl: ieee.csl bibliography: refs.bib nocite: | @machkour2022terminating, @efron2004least vignette: | %\VignetteKeyword{T-LARS, T-Rex selector, false discovery rate (FDR) control, high-dimensional variable selection, martingale theory, genome-wide association studies (GWAS)} %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{The Terminating-LARS (T-LARS) Method: Usage and Simulations} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} # Store user's options() old_options <- options() library(knitr) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", fig.retina = 2, out.width = "85%", dpi = 96 # pngquant = "--speed=1" ) options(width = 80) ``` ----------- # Motivation This package implements the **Terminating-LARS (T-LARS)** algorithm, i.e., it computes the solution path of the T-LARS algorithm. The T-LARS algorithm appends dummy predictors to the original predictor matrix and terminates the forward-selection process after a pre-defined number of dummy variables has been selected. In contrast, the original LARS algorithm computes the entire solution path without stopping early. However, in certain applications there exists very little or no useful information in later steps of the solution path and, therefore, stopping early allows for huge improvements in terms of computation time and no loss in the variable selection accuracy. Especially the **T-Rex selector** ([Paper](https://arxiv.org/abs/2110.06048) and [R package](https://CRAN.R-project.org/package=TRexSelector)) requires terminating multiple solution paths early after a pre-defined number of dummy variables has been included. The T-LARS algorithm is a major building block of the **T-Rex selector**. The T-Rex selector performs terminated-random experiments (T-Rex) using the T-LARS algorithm and fuses the selected active sets of all random experiments to obtain a final set of selected variables. The T-Rex selector provably controls the false discovery rate (FDR), i.e., the expected fraction of selected false positives among all selected variables, at the user-defined target level while maximizing the number of selected variables. In the following, we show how to use the package and give you an idea of why terminating the solution path early is a reasonable approach in high-dimensional and sparse variable selection: In many applications, most active variables enter the solution path early! $$ \DeclareMathOperator{\FDP}{FDP} \DeclareMathOperator{\FDR}{FDR} \DeclareMathOperator{\TPP}{TPP} \DeclareMathOperator{\TPR}{TPR} \newcommand{\A}{\mathcal{A}} \newcommand{\coloneqq}{\mathrel{\vcenter{:}}=} $$ # Installation You can install the 'tlars' package (stable version) from [CRAN](https://CRAN.R-project.org/package=tlars) with ``` r install.packages("tlars") library(tlars) ``` You can install the 'tlars' package (developer version) from [GitHub](https://github.com/jasinmachkour/tlars) with ``` r install.packages("devtools") devtools::install_github("jasinmachkour/tlars") ``` You can open the help pages with ```{r, eval=FALSE} library(tlars) help(package = "tlars") ?tlars ?tlars_model ?tlars_cpp ?plot.Rcpp_tlars_cpp ?print.Rcpp_tlars_cpp ?Gauss_data ``` To cite the package 'tlars' in publications use ```{r, eval=FALSE} citation("tlars") ``` # Quick Start In the following, we illustrate the basic usage of the 'tlars' package for performing variable selection in sparse and high-dimensional regression settings using the T-LARS algorithm: 1. **First**, we generate a high-dimensional Gaussian data set with sparse support: ```{r} library(tlars) # Setup n <- 150 # Number of observations p <- 300 # Number of variables num_act <- 5 # Number of true active variables beta <- c(rep(1, times = num_act), rep(0, times = p - num_act)) # Coefficient vector true_actives <- which(beta > 0) # Indices of true active variables num_dummies <- p # Number of dummy predictors (or dummies) # Generate Gaussian data set.seed(123) X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p) y <- X %*% beta + stats::rnorm(n) ``` 2. **Second**, we generate a dummy matrix containing n rows and num_dummies dummy predictors that are sampled from the standard normal distribution and append it to the original predictor matrix: ```{r} set.seed(1234) dummies <- matrix(stats::rnorm(n * num_dummies), nrow = n, ncol = num_dummies) XD <- cbind(X, dummies) ``` 3. **Third**, we generate an object of the C++ class 'tlars_cpp' and supply the information that the last num_dummies predictors in XD are dummy predictors: ```{r} mod_tlars <- tlars_model(X = XD, y = y, num_dummies = num_dummies) ``` 4. **Finally**, we perform one T-LARS step on 'mod_tlars', i.e., the T-LARS algorithm is run until **T_stop = 1** dummy has entered the solution path and stops there: ```{r terminated_solution_path_T_3, echo=TRUE, fig.align='center', message=TRUE, fig.width = 10, fig.height = 6, out.width = "90%"} tlars(model = mod_tlars, T_stop = 1, early_stop = TRUE) # Perform one T-LARS step on object "mod_tlars" print(mod_tlars) # Print information about the results of the performed T-LARS steps plot(mod_tlars) # Plot the terminated solution path ``` # T-LARS Warm Starts The object "mod_tlars" stores the results and, therefore, allows for warm starts. That is, after performing a T-LARS step with, e.g., "T_stop = 1", we can perform another T-LARS step with, e.g., "T_stop = 2, 3, ...", by continuing to build the solution path from its last T-LARS step. ****** * **T_stop = 2**: ```{r terminated_solution_path_T_5, echo=TRUE, fig.align='center', message=TRUE, fig.width = 10, fig.height = 6, out.width = "90%"} # Numerical zero eps <- .Machine$double.eps # Perform one additional T-LARS step (going from T_stop = 1 to T_stop = 2) on object "mod_tlars" tlars(model = mod_tlars, T_stop = 2, early_stop = TRUE) print(mod_tlars) plot(mod_tlars) # Coefficient vector corresponding to original and dummy variables at the terminal T-LARS step beta_hat <- mod_tlars$get_beta() selected_var <- which(abs(beta_hat[seq(p)]) > eps) # Indices of selected original variables selected_dummies <- p + which(abs(beta_hat[seq(p + 1, ncol(XD))]) > eps) # Indices of selected dummy variables FDP <- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var)) # False discovery proportion (FDP) TPP <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives)) # True positive proportion (TPP) selected_var selected_dummies FDP TPP ``` ****** * **T_stop = 5**: ```{r terminated_solution_path_T_10, echo=TRUE, fig.align='center', message=TRUE, fig.width = 10, fig.height = 6, out.width = "90%"} # Perform three additional T-LARS steps (going from T_stop = 2 to T_stop = 5) on object "mod_tlars" tlars(model = mod_tlars, T_stop = 5, early_stop = TRUE) print(mod_tlars) plot(mod_tlars) # Coefficient vector corresponding to original and dummy variables at the terminal T-LARS step beta_hat <- mod_tlars$get_beta() selected_var <- which(abs(beta_hat[seq(p)]) > eps) # Indices of selected original variables selected_dummies <- p + which(abs(beta_hat[seq(p + 1, ncol(XD))]) > eps) # Indices of selected dummy variables FDP <- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var)) # False discovery proportion (FDP) TPP <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives)) # True positive proportion (TPP) selected_var selected_dummies FDP TPP ``` ****** # FDR and TPR ## False discovery rate (FDR) and true positive rate (TPR) From a statistical point of view, it is desirable to use a variable selection method that allows for controlling the expected value of the FDP at a user-defined target level $\alpha \in [0, 1]$ while maximizing the number of selected variables. These type of methods exist and are called false discovery rate (FDR)-controlling methods. For example, the **T-Rex selector** ([Paper](https://arxiv.org/abs/2110.06048) and [R package](https://CRAN.R-project.org/package=TRexSelector)) is a fast and FDR-controlling variable/feature selection framework for large-scale high-dimensional settings that relies on the T-LARS method. **Definitions** (FDR and TPR) Let $\widehat{\A} \subseteq \lbrace 1, \ldots, p \rbrace$ be the set of selected variables, $\A \subseteq \lbrace 1, \ldots, p \rbrace$ the set of true active variables, $| \widehat{\A} |$ the cardinality of $\widehat{\A}$, and define $1 \lor a \coloneqq \max\lbrace 1, a \rbrace$, $a \in \mathbb{R}$. Then, the false discovery rate (FDR) and the true positive rate (TPR) are defined by $$ \FDR \coloneqq \mathbb{E} \big[ \FDP \big] \coloneqq \mathbb{E} \left[ \dfrac{\big| \widehat{\A} \backslash \A \big|}{1 \lor \big| \widehat{\A} \big|} \right] $$ and $$ \TPR \coloneqq \mathbb{E} \big[ \TPP \big] \coloneqq \mathbb{E} \left[ \dfrac{| \A \cap \widehat{\A} |}{1 \lor | \A |} \right], $$ respectively. # Simulations We conduct Monte Carlo simulations and plot the resulting averaged FDP and TPP over the number of included dummies T. Note that the averaged FDP and TPP are estimates of the FDR and TPR, respectively. ```{r} # Setup n <- 100 # number of observations p <- 300 # number of variables # Parameters num_act <- 10 # number of true active variables beta <- rep(0, times = p) # coefficient vector (all zeros first) beta[sample(seq(p), size = num_act, replace = FALSE)] <- 3 # coefficient vector (active variables with non-zero coefficients) true_actives <- which(beta > 0) # indices of true active variables num_dummies <- p # number of dummies T_vec <- c(1, 2, 5, 10, 20, 50, 100) # stopping points, i.e, number of included dummies before terminating the solution path MC <- 500 # number of Monte Carlo runs per stopping point # Initialize results vectors FDP <- matrix(NA, nrow = MC, ncol = length(T_vec)) TPP <- matrix(NA, nrow = MC, ncol = length(T_vec)) # Numerical zero eps <- .Machine$double.eps # Seed set.seed(12345) # Run simulations for (t in seq_along(T_vec)) { for (mc in seq(MC)) { # Generate Gaussian data X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p) y <- X %*% beta + stats::rnorm(n) # Generate dummy matrix and append it to X dummies <- matrix(stats::rnorm(n * p), nrow = n, ncol = num_dummies) XD <- cbind(X, dummies) # Create object of class tlars_cpp mod_tlars <- tlars_model(X = XD, y = y, num_dummies = num_dummies, type = "lar", info = FALSE) # Run T-LARS steps tlars(model = mod_tlars, T_stop = t, early_stop = TRUE, info = FALSE) beta_hat <- mod_tlars$get_beta() selected_var <- which(abs(beta_hat[seq(p)]) > eps) # Results FDP[mc, t] <- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var)) TPP[mc, t] <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives)) } } # Compute estimates of FDR and TPR by averaging FDP and TPP over MC Monte Carlo runs FDR <- colMeans(FDP) TPR <- colMeans(TPP) ``` ```{r FDR_and_TPR, echo=FALSE, fig.align='center', message=FALSE, fig.width = 10, fig.height = 5, out.width = "90%"} # Plot results library(ggplot2) library(patchwork) plot_data = data.frame(T_vec = T_vec, FDR = 100 * FDR, TPR = 100 * TPR) # data frame containing data to be plotted (FDR and TPR in %) # FDR vs. T FDR_vs_T = ggplot(plot_data, aes(x = T_vec, y = FDR)) + labs(x = "T", y = "FDR") + scale_x_continuous(breaks = T_vec[-2], minor_breaks = c(2), limits = c(T_vec[1], T_vec[length(T_vec)])) + scale_y_continuous(breaks = seq(0, 100, by = 10), minor_breaks = c(), limits = c(0, 100)) + geom_line(linewidth = 1.5, colour = "#336C68") + geom_point(size = 2.5, colour = "#336C68") + theme_bw(base_size = 16) + theme(panel.background = element_rect(fill = "white", color = "black", linewidth = 1)) + coord_fixed(ratio = 0.85 * T_vec[length(T_vec)] / (100 - 0)) # TPR vs. T TPR_vs_T = ggplot(plot_data, aes(x = T_vec, y = TPR)) + labs(x = "T", y = "TPR") + scale_x_continuous(breaks = T_vec[-2], minor_breaks = c(2), limits = c(T_vec[1], T_vec[length(T_vec)])) + scale_y_continuous(breaks = seq(0, 100, by = 10), minor_breaks = c(), limits = c(0, 100)) + geom_line(linewidth = 1.5, colour = "#336C68") + geom_point(size = 2.5, colour = "#336C68") + theme_bw(base_size = 16) + theme(panel.background = element_rect(fill = "white", color = "black", linewidth = 1)) + coord_fixed(ratio = 0.85 * T_vec[length(T_vec)] / (100 - 0)) FDR_vs_T + TPR_vs_T # TPR vs. FDR TPR_vs_FDR = ggplot(plot_data, aes(x = FDR, y = TPR)) + labs(x = "FDR", y = "TPR") + geom_line(linewidth = 1.5, colour = "#336C68") + geom_point(size = 2.5, colour = "#336C68") + theme_bw(base_size = 16) + theme(panel.background = element_rect(fill = "white", color = "black", linewidth = 1)) TPR_vs_FDR ``` We observe that with growing number of included dummies $T$ the FDR and TPR increase. Moreover, we see that there seems to be a trade-off between FDR and TPR. For more details and discussions on these observations, we refer the interested reader to [@machkour2022terminating]. # Outlook The T-LARS algorithm is a major building block of the T-Rex selector ([Paper](https://arxiv.org/abs/2110.06048) and [R package](https://CRAN.R-project.org/package=TRexSelector)). The T-Rex selector performs terminated-random experiments (T-Rex) using the T-LARS algorithm and fuses the selected active sets of all random experiments to obtain a final set of selected variables. The T-Rex selector provably controls the FDR at the user-defined target level while maximizing the number of selected variables. If you are working in genomics, financial engineering, or any other field that requires a fast and FDR-controlling variable/feature selection method for large-scale high-dimensional settings, then this is for you. Check it out! # References {-} ```{r, include = FALSE} # Reset user's options() options(old_options) ```