---
title: "Getting Started with ivgls"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with ivgls}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
set.seed(42)
```

## Overview

Estimating causal effects from high-dimensional, network-structured exposures is a core challenge in neuroscience, genomics, and environmental science. Standard penalized regression recovers associations, not causal effects: when unmeasured confounders jointly influence both the exposures and the outcome, ordinary regression coefficients are biased. Instrumental variables (IVs) break this bias by exploiting external perturbations, such as genetic variants in brain-imaging studies, that shift exposures without directly affecting the outcome.

The **ivgls** package combines IV identification with graph-structured regularization. When the exposures are nodes in a known network (e.g., brain regions linked by white-matter tracts, genes linked by a pathway graph), borrowing strength across connected nodes improves support recovery. In high-dimensional genetic applications, some candidate instruments may violate the exclusion restriction through horizontal pleiotropy; **ivgls** corrects for this via a sisVIVE-style alternating update.

The package provides three estimators:

| Estimator  | Graph penalty | Invalid-IV correction | Requires `glmgraph` |
|------------|:-------------:|:---------------------:|:-------------------:|
| `iv_lasso` | No            | No                    | No                  |
| `ivgl`     | Yes           | No                    | Yes                 |
| `ivgl_s`   | Yes           | Yes                   | Yes                 |

`iv_lasso` is a pure-`glmnet` two-stage LASSO baseline. `ivgl` adds a graph-Laplacian penalty in the second stage via `glmgraph`. `ivgl_s` further estimates and removes direct instrument-to-outcome effects.

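
To make the idea of a graph-Laplacian penalty concrete, the following base-R sketch computes the quadratic form $\beta^\top L \beta$ on a small chain graph. It assumes the unnormalized Laplacian $L = D - A$; whether **ivgls** or `glmgraph` use this or a normalized variant is not stated here. All object names (`A_toy`, `L_toy`, `laplacian_penalty`, `edge_sum`) are hypothetical and not part of the package.

```r
# Hypothetical toy example in base R; none of these objects come from ivgls.
p <- 6

# Adjacency matrix of an unweighted chain graph: node j is linked to node j + 1.
A_toy <- matrix(0, p, p)
for (j in seq_len(p - 1)) {
  A_toy[j, j + 1] <- 1
  A_toy[j + 1, j] <- 1
}

# Unnormalized Laplacian L = D - A, with D the diagonal degree matrix
# (an assumption made for this sketch).
L_toy <- diag(rowSums(A_toy)) - A_toy

# Quadratic form b' L b, the quantity a Laplacian penalty is built on.
laplacian_penalty <- function(b, L) as.numeric(t(b) %*% L %*% b)

# Two coefficient vectors with the same sum of squares but different layouts.
beta_smooth <- c(0, 1, 1, 1, 0, 0)  # contiguous block of equal effects
beta_rough  <- c(1, 0, 1, 0, 1, 0)  # isolated, alternating effects

pen_smooth <- laplacian_penalty(beta_smooth, L_toy)
pen_rough  <- laplacian_penalty(beta_rough, L_toy)

# On a chain, b' L b equals the sum of squared differences along the edges.
edge_sum <- function(b) sum(diff(b)^2)

cat("Smooth penalty:", pen_smooth, "| edge-wise sum:", edge_sum(beta_smooth), "\n")
cat("Rough penalty: ", pen_rough, "| edge-wise sum:", edge_sum(beta_rough), "\n")
```

The contiguous vector incurs a smaller penalty than the alternating one, which illustrates why a Laplacian penalty tends to favor effects that vary little across connected nodes.
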
## Installation

```{r install, eval=FALSE}
# Step 1: install glmgraph (not on CRAN; required for ivgl and ivgl_s)
devtools::install_github("cran/glmgraph")

# Step 2: install ivgls
install.packages("ivgls")
```

`iv_lasso()` and all graph construction, simulation, and evaluation utilities work without `glmgraph`. Calling `ivgl()` or `ivgl_s()` without it installed produces an informative error with the install command above.

## Graph construction

```{r graphs}
library(ivgls)

A <- make_graph(p = 30, type = "chain")
L <- get_laplacian(A)
cat("Number of edges:", sum(A) / 2, "\n")
```

## Generating a causal coefficient vector

`generate_beta()` places $s_0$ active nodes on the graph according to one of three signal patterns. The **smooth** pattern places active nodes in a spatially contiguous cluster with similarly-signed effects.

```{r beta}
set.seed(1)
bobj <- generate_beta(A, s2 = 4, signal = 2, pattern = "smooth")
cat("Active nodes:", sort(bobj$active_set), "\n")
cat("True effects at active nodes:\n")
print(round(bobj$beta_true[bobj$active_set], 3))
```

## Simulating data

`generate_data()` draws instruments $Z \in \mathbb{R}^{n \times q}$, constructs exposures with latent confounding, and introduces `s_alpha` invalid instruments whose direct effects on $Y$ violate the exclusion restriction.

```{r data}
dat <- generate_data(
  n = 100, p = 30, q = 100,
  s_alpha = 5, alpha_strength = 3,
  beta_true = bobj$beta_true
)
cat("Y:", length(dat$Y),
    "| X:", nrow(dat$X), "x", ncol(dat$X),
    "| Z:", nrow(dat$Z), "x", ncol(dat$Z), "\n")
cat("Invalid instrument indices:", which(dat$alpha_true != 0), "\n")
```

## Fitting the estimators

### IV-LASSO

```{r ivlasso}
beta_ivl <- iv_lasso(dat$Y, dat$X, dat$Z)
cat("IV-LASSO selected nodes:", which(abs(beta_ivl) > 1e-4), "\n")
cat("IV-LASSO MCC:",
    round(get_mcc(bobj$active_set, which(abs(beta_ivl) > 1e-4), p = 30), 3), "\n")
```

### IVGL and IVGL-S

`ivgl()` and `ivgl_s()` require the `glmgraph` package.
Install it with: `devtools::install_github("cran/glmgraph")`.

```{r glmgraph-estimators}
if (requireNamespace("glmgraph", quietly = TRUE)) {
  beta_ivgl <- ivgl(dat$Y, dat$X, dat$Z, L)
  fit_s <- ivgl_s(dat$Y, dat$X, dat$Z, L, max_iter = 10)

  cat("IVGL selected nodes:", which(abs(beta_ivgl) > 1e-4), "\n")
  cat("IVGL-S selected nodes:", which(abs(fit_s$beta) > 1e-4), "\n")
  cat("IVGL MCC:",
      round(get_mcc(bobj$active_set, which(abs(beta_ivgl) > 1e-4), p = 30), 3), "\n")
  cat("IVGL-S MCC:",
      round(get_mcc(bobj$active_set, which(abs(fit_s$beta) > 1e-4), p = 30), 3), "\n")
  cat("Detected invalid instruments:", which(abs(fit_s$alpha) > 1e-4), "\n")
} else {
  message(
    "Package 'glmgraph' is not installed. ",
    "ivgl() and ivgl_s() are unavailable.\n",
    "Install with: devtools::install_github(\"cran/glmgraph\")"
  )
}
```

## Performance evaluation

`eval_support()` returns MCC, TPR, FPR, and number of selected variables.

```{r eval}
supp_ivl <- which(abs(beta_ivl) > 1e-4)
metrics <- eval_support(bobj$active_set, supp_ivl, p = 30)
print(round(metrics, 3))
```

## Graph misspecification

`corrupt_graph()` randomly swaps a fraction of edges, useful for sensitivity analysis when the graph is estimated rather than known exactly.

```{r corrupt}
set.seed(2)
A_corrupted <- corrupt_graph(A, corruption_rate = 0.20)
cat("Original edges:", sum(A) / 2, "\n")
cat("Corrupted edges:", sum(A_corrupted) / 2, "\n")
```

## Simulation study

`run_simulation()` runs $B$ independent replicates and returns a long data frame with one row per method per replicate.

```{r simulation}
if (requireNamespace("glmgraph", quietly = TRUE)) {
  res <- run_simulation(
    B = 5, n = 100, p = 30, q = 100,
    graph_type = "chain", signal_pattern = "smooth",
    s2 = 4, signal = 2,
    s_alpha = 5, alpha_strength = 3
  )
  aggregate(cbind(MCC, MSE, TPR, FPR) ~ Method, data = res, FUN = mean)
} else {
  message(
    "Package 'glmgraph' is not installed. ",
    "run_simulation() is unavailable.\n",
    "Install with: devtools::install_github(\"cran/glmgraph\")"
  )
}
```

## Application context

The methods in this package were developed for neuroimaging-genetics studies using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Brain region volumes from T1-weighted MRI serve as the network-structured exposures; SNPs from genome-wide genotyping serve as candidate instruments; and MMSE score serves as the cognitive outcome. The DKT atlas defines a graph over 62 cortical and subcortical regions of interest.

In that setting, `ivgl_s()` identifies a conservative set of eight causal ROIs robust to pleiotropic SNP effects, while `iv_lasso()` and `ivgl()` select broader sets that include confounded associations. See the companion paper for full details.