Getting Started with ivgls

Overview

Estimating causal effects from high-dimensional, network-structured exposures is a core challenge in neuroscience, genomics, and environmental science. Standard penalized regression recovers associations, not causal effects: when unmeasured confounders jointly influence both the exposures and the outcome, ordinary regression coefficients are biased. Instrumental variables (IVs) break this bias by exploiting external perturbations — such as genetic variants in brain-imaging studies — that shift exposures without directly affecting the outcome.

The ivgls package combines IV identification with graph-structured regularization. When the exposures are nodes in a known network (e.g., brain regions linked by white-matter tracts, genes linked by a pathway graph), borrowing strength across connected nodes improves support recovery. In high-dimensional genetic applications, some candidate instruments may violate the exclusion restriction through horizontal pleiotropy; ivgls corrects for this via a sisVIVE-style alternating update.

The package provides three estimators:

Estimator Graph penalty Invalid-IV correction Requires glmgraph
iv_lasso No No No
ivgl Yes No Yes
ivgl_s Yes Yes Yes

iv_lasso is a pure-glmnet two-stage LASSO baseline. ivgl adds a graph-Laplacian penalty in the second stage via glmgraph. ivgl_s further estimates and removes direct instrument-to-outcome effects.

Installation

# Step 1: install glmgraph (not on CRAN; required for ivgl and ivgl_s)
devtools::install_github("cran/glmgraph")

# Step 2: install ivgls
install.packages("ivgls")

iv_lasso() and all graph construction, simulation, and evaluation utilities work without glmgraph. Calling ivgl() or ivgl_s() without it installed produces an informative error with the install command above.

Graph construction

library(ivgls)

A <- make_graph(p = 30, type = "chain")
L <- get_laplacian(A)

cat("Number of edges:", sum(A) / 2, "\n")
#> Number of edges: 29

Generating a causal coefficient vector

generate_beta() places \(s_0\) active nodes on the graph according to one of three signal patterns. The smooth pattern places active nodes in a spatially contiguous cluster with similarly-signed effects.

set.seed(1)

bobj <- generate_beta(A, s2 = 4, signal = 2, pattern = "smooth")

cat("Active nodes:", sort(bobj$active_set), "\n")
#> Active nodes: 23 24 25 26
cat("True effects at active nodes:\n")
#> True effects at active nodes:
print(round(bobj$beta_true[bobj$active_set], 3))
#> [1] 2.037 1.833 2.319 2.066

Simulating data

generate_data() draws instruments \(Z \in \mathbb{R}^{n \times q}\), constructs exposures with latent confounding, and introduces s_alpha invalid instruments whose direct effects on \(Y\) violate the exclusion restriction.

dat <- generate_data(
  n              = 100,
  p              = 30,
  q              = 100,
  s_alpha        = 5,
  alpha_strength = 3,
  beta_true      = bobj$beta_true
)

cat("Y:", length(dat$Y),
    "| X:", nrow(dat$X), "x", ncol(dat$X),
    "| Z:", nrow(dat$Z), "x", ncol(dat$Z), "\n")
#> Y: 100 | X: 100 x 30 | Z: 100 x 100
cat("Invalid instrument indices:", which(dat$alpha_true != 0), "\n")
#> Invalid instrument indices: 1 2 3 4 5

Fitting the estimators

IV-LASSO

beta_ivl <- iv_lasso(dat$Y, dat$X, dat$Z)

cat("IV-LASSO selected nodes:", which(abs(beta_ivl) > 1e-4), "\n")
#> IV-LASSO selected nodes: 2 3 4 5 6 8 10 11 12 13 14 15 16 18 19 20 21 22 23 24 25 26 27 28
cat("IV-LASSO MCC:",
    round(get_mcc(bobj$active_set, which(abs(beta_ivl) > 1e-4), p = 30), 3), "\n")
#> IV-LASSO MCC: 0.196

IVGL and IVGL-S

ivgl() and ivgl_s() require the glmgraph package. Install it with: devtools::install_github("cran/glmgraph").

if (requireNamespace("glmgraph", quietly = TRUE)) {

  beta_ivgl <- ivgl(dat$Y, dat$X, dat$Z, L)
  fit_s     <- ivgl_s(dat$Y, dat$X, dat$Z, L, max_iter = 10)

  cat("IVGL selected nodes:", which(abs(beta_ivgl) > 1e-4), "\n")
  cat("IVGL-S selected nodes:", which(abs(fit_s$beta) > 1e-4), "\n")

  cat("IVGL MCC:",
      round(get_mcc(bobj$active_set, which(abs(beta_ivgl) > 1e-4), p = 30), 3), "\n")
  cat("IVGL-S MCC:",
      round(get_mcc(bobj$active_set, which(abs(fit_s$beta) > 1e-4), p = 30), 3), "\n")

  cat("Detected invalid instruments:", which(abs(fit_s$alpha) > 1e-4), "\n")

} else {
  message(
    "Package 'glmgraph' is not installed. ",
    "ivgl() and ivgl_s() are unavailable.\n",
    "Install with: devtools::install_github(\"cran/glmgraph\")"
  )
}
#> IVGL selected nodes: 2 3 4 5 6 16 18 20 23 24 25 26 28 
#> IVGL-S selected nodes: 23 24 25 26 28 
#> IVGL MCC: 0.449 
#> IVGL-S MCC: 0.877 
#> Detected invalid instruments: 1 2 3 4 5 6 32 37 42 57 59 60 78 79 84 97

Performance evaluation

eval_support() returns MCC, TPR, FPR, and number of selected variables.

supp_ivl <- which(abs(beta_ivl) > 1e-4)
metrics  <- eval_support(bobj$active_set, supp_ivl, p = 30)
print(round(metrics, 3))
#>      MCC      TPR      FPR Selected 
#>    0.196    1.000    0.769   24.000

Graph misspecification

corrupt_graph() randomly swaps a fraction of edges, useful for sensitivity analysis when the graph is estimated rather than known exactly.

set.seed(2)
A_corrupted <- corrupt_graph(A, corruption_rate = 0.20)

cat("Original edges:", sum(A) / 2, "\n")
#> Original edges: 29
cat("Corrupted edges:", sum(A_corrupted) / 2, "\n")
#> Corrupted edges: 29

Simulation study

run_simulation() runs \(B\) independent replicates and returns a long data frame with one row per method per replicate.

if (requireNamespace("glmgraph", quietly = TRUE)) {

  res <- run_simulation(
    B              = 5,
    n              = 100,
    p              = 30,
    q              = 100,
    graph_type     = "chain",
    signal_pattern = "smooth",
    s2             = 4,
    signal         = 2,
    s_alpha        = 5,
    alpha_strength = 3
  )

  aggregate(cbind(MCC, MSE, TPR, FPR) ~ Method, data = res, FUN = mean)

} else {
  message(
    "Package 'glmgraph' is not installed. ",
    "run_simulation() is unavailable.\n",
    "Install with: devtools::install_github(\"cran/glmgraph\")"
  )
}
#>     Method       MCC         MSE TPR       FPR
#> 1 IV_LASSO 0.4268386 0.119422153   1 0.4692308
#> 2     IVGL 0.5894756 0.082105051   1 0.2461538
#> 3   IVGL_S 0.7632763 0.003185131   1 0.1153846

Application context

The methods in this package were developed for neuroimaging-genetics studies using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Brain region volumes from T1-weighted MRI serve as the network-structured exposures; SNPs from genome-wide genotyping serve as candidate instruments; and MMSE score serves as the cognitive outcome. The DKT atlas defines a graph over 62 cortical and subcortical regions of interest. In that setting, ivgl_s() identifies a conservative set of eight causal ROIs robust to pleiotropic SNP effects, while iv_lasso() and ivgl() select broader sets that include confounded associations. See the companion paper for full details.