---
title: "Getting Started with ivgls"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with ivgls}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
set.seed(42)
```

## Overview

Estimating causal effects from high-dimensional, network-structured exposures
is a core challenge in neuroscience, genomics, and environmental science.
Standard penalized regression recovers associations, not causal effects: when
unmeasured confounders jointly influence both the exposures and the outcome,
ordinary regression coefficients are biased. Instrumental variables (IVs)
remove this bias by exploiting external perturbations, such as genetic
variants in brain-imaging studies, that shift the exposures without directly
affecting the outcome.
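
The bias described above is easy to reproduce in a toy setting. The sketch below uses base R only and a single exposure; the variable names, sample size, and effect sizes are illustrative assumptions and have nothing to do with the defaults of **ivgls**.

```r
# Toy simulation (all numbers are illustrative assumptions)
set.seed(1)
n <- 5000
z <- rnorm(n)                      # instrument: affects y only through x
u <- rnorm(n)                      # unmeasured confounder of x and y
x <- z + u + rnorm(n)              # exposure
y <- 1 * x + 2 * u + rnorm(n)      # outcome; the true causal effect of x is 1

# Ordinary least squares absorbs the confounder's contribution
beta_ols <- unname(coef(lm(y ~ x))["x"])

# Two-stage least squares: regress x on z, then y on the fitted exposure
x_hat   <- fitted(lm(x ~ z))
beta_iv <- unname(coef(lm(y ~ x_hat))["x_hat"])

round(c(ols = beta_ols, iv = beta_iv), 2)
```

Under this design the population OLS slope is $1 + 2/3$, whereas the two-stage estimate targets the true effect of 1. The standard errors reported by the second `lm()` fit are not valid for inference; the sketch is meant only to show the point estimates.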

The **ivgls** package combines IV identification with graph-structured
regularization. When the exposures are nodes in a known network (e.g., brain
regions linked by white-matter tracts, genes linked by a pathway graph),
borrowing strength across connected nodes can improve support recovery. In
high-dimensional genetic applications, some candidate instruments may violate
the exclusion restriction through horizontal pleiotropy, that is, a direct
effect on the outcome that bypasses the exposures; **ivgls** corrects
for this via a sisVIVE-style alternating update.

The package provides three estimators:

| Estimator   | Graph penalty | Invalid-IV correction | Requires `glmgraph` |
|-------------|:------------:|:---------------------:|:-------------------:|
| `iv_lasso`  | No           | No                    | No                  |
| `ivgl`      | Yes          | No                    | Yes                 |
| `ivgl_s`    | Yes          | Yes                   | Yes                 |

`iv_lasso` is a pure-`glmnet` two-stage LASSO baseline. `ivgl` adds a
graph-Laplacian penalty in the second stage via `glmgraph`. `ivgl_s` further
estimates and removes direct instrument-to-outcome effects.
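
One plausible way to write the three estimators as optimization problems is sketched below. This is an assumption based on the description above, not a statement of the package's exact objective: the scaling constants, the symbols $\lambda_1$, $\lambda_2$, $\lambda_\alpha$, and the form of the first stage are illustrative, and the parameterization used by `glmnet` and `glmgraph` may differ. Here $\hat{X}$ denotes the first-stage prediction of the exposures $X$ from the instruments $Z$, and $L$ is the graph Laplacian.

$$
\hat{\beta}_{\text{ivgl}} = \arg\min_{\beta}\;
\frac{1}{2n}\,\lVert Y - \hat{X}\beta \rVert_2^2
+ \lambda_1 \lVert \beta \rVert_1
+ \frac{\lambda_2}{2}\,\beta^\top L \beta
$$

$$
(\hat{\alpha}, \hat{\beta})_{\text{ivgl\_s}} = \arg\min_{\alpha,\,\beta}\;
\frac{1}{2n}\,\lVert Y - Z\alpha - \hat{X}\beta \rVert_2^2
+ \lambda_1 \lVert \beta \rVert_1
+ \frac{\lambda_2}{2}\,\beta^\top L \beta
+ \lambda_\alpha \lVert \alpha \rVert_1
$$

In this reading, `iv_lasso` corresponds to $\lambda_2 = 0$ and $\alpha = 0$, and the alternating update in `ivgl_s` would minimize over $\beta$ with $\alpha$ fixed, then over $\alpha$ with $\beta$ fixed, until the estimates stabilize.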

## Installation

```{r install, eval=FALSE}
# Step 1: install glmgraph (required for ivgl and ivgl_s). It is not currently
# on CRAN, so this installs it from the GitHub mirror of CRAN sources.
devtools::install_github("cran/glmgraph")

# Step 2: install ivgls
install.packages("ivgls")
```

`iv_lasso()` and all graph construction, simulation, and evaluation utilities
work without `glmgraph`. Calling `ivgl()` or `ivgl_s()` without it installed
produces an informative error with the install command above.

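## Graph construction

`make_graph()` builds the adjacency matrix of a $p$-node graph (here a chain),
and `get_laplacian()` converts it to the graph Laplacian that the graph
penalty in `ivgl()` and `ivgl_s()` takes as input.

```{r graphs}
library(ivgls)

# Adjacency matrix of a 30-node chain graph and its Laplacian
A <- make_graph(p = 30, type = "chain")
L <- get_laplacian(A)

# A is symmetric, so every undirected edge is counted twice in sum(A)
cat("Number of edges:", sum(A) / 2, "\n")
```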
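
To see why a Laplacian penalty encourages connected nodes to share similar coefficients, the sketch below builds a small chain graph by hand in base R. It assumes the unnormalized (combinatorial) Laplacian $L = D - A$; `get_laplacian()` may return a normalized variant, so treat this as an illustration of the idea rather than of the package's output. The names `A_chain`, `L_chain`, and `beta` are local to this example.

```r
# Hand-built chain graph on p nodes (independent of make_graph())
p <- 6
A_chain <- matrix(0, p, p)
idx <- cbind(1:(p - 1), 2:p)       # edges (1,2), (2,3), ..., (p-1,p)
A_chain[idx] <- 1
A_chain[idx[, 2:1]] <- 1           # mirror the entries to make A symmetric

# Combinatorial Laplacian: degree matrix minus adjacency matrix
L_chain <- diag(rowSums(A_chain)) - A_chain

# The quadratic form equals the sum of squared differences across edges
beta <- c(2, 2, 2, 0, 0, 0)
quad_form <- drop(t(beta) %*% L_chain %*% beta)
edge_sum  <- sum((beta[idx[, 1]] - beta[idx[, 2]])^2)
c(quad_form = quad_form, edge_sum = edge_sum)
```

Both quantities equal 4 here, because only the edge between nodes 3 and 4 joins coefficients that differ. A coefficient vector that is constant within a connected cluster therefore pays the penalty only at the cluster boundary.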

## Generating a causal coefficient vector

`generate_beta()` places $s_0$ active nodes on the graph according to one
of three signal patterns. The **smooth** pattern, used here, places the active
nodes in a spatially contiguous cluster with similarly signed effects.

```{r beta}
set.seed(1)

bobj <- generate_beta(A, s2 = 4, signal = 2, pattern = "smooth")

cat("Active nodes:", sort(bobj$active_set), "\n")
cat("True effects at active nodes:\n")
print(round(bobj$beta_true[bobj$active_set], 3))
```
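
For intuition about what a contiguous, same-signed cluster looks like, here is a minimal base-R sketch that grows a connected active set from a random seed node. The helper `smooth_beta_sketch()` and its arguments are hypothetical and written only for this illustration; `generate_beta()` may construct its smooth pattern differently.

```r
# Hypothetical helper, not part of ivgls: grow a connected active set
smooth_beta_sketch <- function(A, s0, signal) {
  p <- nrow(A)
  active <- sample.int(p, 1)                       # random seed node
  while (length(active) < s0) {
    # nodes adjacent to the current cluster that are not yet active
    nbrs <- setdiff(which(colSums(A[active, , drop = FALSE]) > 0), active)
    if (length(nbrs) == 0) break                   # connected component exhausted
    active <- c(active, nbrs[sample.int(length(nbrs), 1)])
  }
  beta <- numeric(p)
  beta[active] <- signal                           # same sign and magnitude
  list(beta = beta, active = sort(active))
}

# Chain graph on 30 nodes, built by hand
p <- 30
A_chain <- matrix(0, p, p)
A_chain[cbind(1:(p - 1), 2:p)] <- 1
A_chain <- A_chain + t(A_chain)

set.seed(1)
sk <- smooth_beta_sketch(A_chain, s0 = 4, signal = 2)
sk$active
```

On a chain, any set grown this way is a run of consecutive node indices, which is the simplest instance of a spatially contiguous cluster.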

## Simulating data

`generate_data()` draws instruments $Z \in \mathbb{R}^{n \times q}$,
constructs exposures with latent confounding, and introduces `s_alpha`
invalid instruments whose direct effects on $Y$ violate the exclusion
restriction.

```{r data}
dat <- generate_data(
  n              = 100,
  p              = 30,
  q              = 100,
  s_alpha        = 5,
  alpha_strength = 3,
  beta_true      = bobj$beta_true
)

cat("Y:", length(dat$Y),
    "| X:", nrow(dat$X), "x", ncol(dat$X),
    "| Z:", nrow(dat$Z), "x", ncol(dat$Z), "\n")
cat("Invalid instrument indices:", which(dat$alpha_true != 0), "\n")
```
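
The ingredients listed above (instruments, latent confounding, and a few invalid instruments) can be written out as a small linear structural model. The base-R sketch below is one generic model of that kind; the dimensions, the matrix `Gamma`, the single confounder `u`, and all effect sizes are assumptions made for illustration and are not taken from the internals of `generate_data()`.

```r
# Illustrative data-generating sketch; every constant below is an assumption
set.seed(3)
n <- 200; p <- 5; q <- 8

Z     <- matrix(rnorm(n * q), n, q)              # instruments
Gamma <- matrix(rnorm(q * p, sd = 0.5), q, p)    # instrument-to-exposure effects
u     <- rnorm(n)                                # latent confounder

# Exposures: driven by the instruments, the confounder, and noise
X <- Z %*% Gamma + u %o% rep(1, p) + matrix(rnorm(n * p), n, p)

beta  <- c(2, 2, 0, 0, 0)                        # causal effects of the exposures
alpha <- c(3, 3, rep(0, q - 2))                  # first two instruments are invalid

# Outcome: the Z %*% alpha term is the exclusion-restriction violation
Y <- drop(X %*% beta + Z %*% alpha + u + rnorm(n))

which(alpha != 0)
```

Because `u` enters both `X` and `Y`, regressing `Y` on `X` directly is confounded, and because `alpha` is nonzero for two instruments, treating all of `Z` as valid is also misspecified. These are the two problems the estimators in the next section address.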

## Fitting the estimators

### IV-LASSO

```{r ivlasso}
beta_ivl <- iv_lasso(dat$Y, dat$X, dat$Z)

# Treat coefficients with magnitude above 1e-4 as selected
supp_ivl <- which(abs(beta_ivl) > 1e-4)

cat("IV-LASSO selected nodes:", supp_ivl, "\n")
cat("IV-LASSO MCC:",
    round(get_mcc(bobj$active_set, supp_ivl, p = 30), 3), "\n")
```

### IVGL and IVGL-S

`ivgl()` and `ivgl_s()` require the `glmgraph` package, which can be installed
with `devtools::install_github("cran/glmgraph")`. The chunk below runs only if
`glmgraph` is available and prints a message otherwise.

```{r glmgraph-estimators}
if (requireNamespace("glmgraph", quietly = TRUE)) {

  beta_ivgl <- ivgl(dat$Y, dat$X, dat$Z, L)
  fit_s     <- ivgl_s(dat$Y, dat$X, dat$Z, L, max_iter = 10)

  # Selected supports, using the same 1e-4 threshold as for IV-LASSO
  supp_ivgl <- which(abs(beta_ivgl) > 1e-4)
  supp_s    <- which(abs(fit_s$beta) > 1e-4)

  cat("IVGL selected nodes:", supp_ivgl, "\n")
  cat("IVGL-S selected nodes:", supp_s, "\n")

  cat("IVGL MCC:",
      round(get_mcc(bobj$active_set, supp_ivgl, p = 30), 3), "\n")
  cat("IVGL-S MCC:",
      round(get_mcc(bobj$active_set, supp_s, p = 30), 3), "\n")

  # Instruments with a nonzero estimated direct effect on Y
  cat("Detected invalid instruments:", which(abs(fit_s$alpha) > 1e-4), "\n")

} else {
  message(
    "Package 'glmgraph' is not installed. ",
    "ivgl() and ivgl_s() are unavailable.\n",
    "Install with: devtools::install_github(\"cran/glmgraph\")"
  )
}
```

## Performance evaluation

`eval_support()` returns the Matthews correlation coefficient (MCC), the true
positive rate (TPR), the false positive rate (FPR), and the number of selected
variables.

```{r eval}
supp_ivl <- which(abs(beta_ivl) > 1e-4)
metrics  <- eval_support(bobj$active_set, supp_ivl, p = 30)
print(round(metrics, 3))
```
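
The support-recovery metrics follow directly from the confusion counts between the true and selected index sets. The base-R helper below, `support_metrics_sketch()`, is hypothetical and written for illustration; it uses the standard definitions, and its convention of returning an MCC of 0 when the denominator vanishes is an assumption that `eval_support()` and `get_mcc()` may not share.

```r
# Hypothetical helper, not part of ivgls
support_metrics_sketch <- function(truth, selected, p) {
  is_true <- seq_len(p) %in% truth
  is_sel  <- seq_len(p) %in% selected

  # Confusion counts, stored as doubles to avoid integer overflow in the product
  tp <- as.numeric(sum(is_true & is_sel))
  fp <- as.numeric(sum(!is_true & is_sel))
  fn <- as.numeric(sum(is_true & !is_sel))
  tn <- as.numeric(sum(!is_true & !is_sel))

  denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  c(MCC        = if (denom == 0) 0 else (tp * tn - fp * fn) / denom,
    TPR        = tp / (tp + fn),
    FPR        = fp / (fp + tn),
    n_selected = sum(is_sel))
}

m <- support_metrics_sketch(truth = 1:4, selected = 3:5, p = 10)
round(m, 3)
```

In this example two of the four true nodes are recovered and one of the six null nodes is selected, so TPR is 0.5, FPR is 1/6, and MCC is $8/\sqrt{504} \approx 0.356$.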

## Graph misspecification

`corrupt_graph()` randomly swaps a fraction of edges, which is useful for
sensitivity analysis when the graph is estimated rather than known exactly.

```{r corrupt}
set.seed(2)
A_corrupted <- corrupt_graph(A, corruption_rate = 0.20)

cat("Original edges:", sum(A) / 2, "\n")
cat("Corrupted edges:", sum(A_corrupted) / 2, "\n")
```
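
One way to read "swapping" edges is to delete a fraction of the true edges and add the same number of spurious ones, which keeps the total edge count fixed. The base-R sketch below implements that reading; the helper `corrupt_sketch()` is hypothetical, and `corrupt_graph()` may use a different scheme.

```r
# Hypothetical helper, not part of ivgls
corrupt_sketch <- function(A, rate) {
  upper     <- which(upper.tri(A))          # linear indices above the diagonal
  edges     <- upper[A[upper] != 0]
  non_edges <- upper[A[upper] == 0]
  k <- min(round(rate * length(edges)), length(non_edges))

  B <- A * upper.tri(A)                     # work on the upper triangle only
  B[edges[sample.int(length(edges), k)]]         <- 0   # delete k true edges
  B[non_edges[sample.int(length(non_edges), k)]] <- 1   # add k spurious edges
  B + t(B)                                  # restore symmetry
}

# Chain graph on 30 nodes, built by hand
p <- 30
A_chain <- matrix(0, p, p)
A_chain[cbind(1:(p - 1), 2:p)] <- 1
A_chain <- A_chain + t(A_chain)

set.seed(2)
A_bad <- corrupt_sketch(A_chain, rate = 0.20)
c(original = sum(A_chain) / 2, corrupted = sum(A_bad) / 2)
```

With 29 edges and a rate of 0.20, six edges are removed and six are added, so both counts are 29 while 12 node pairs change status.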

## Simulation study

`run_simulation()` runs $B$ independent replicates and returns a long
data frame with one row per method per replicate.

```{r simulation}
if (requireNamespace("glmgraph", quietly = TRUE)) {

  res <- run_simulation(
    B              = 5,
    n              = 100,
    p              = 30,
    q              = 100,
    graph_type     = "chain",
    signal_pattern = "smooth",
    s2             = 4,
    signal         = 2,
    s_alpha        = 5,
    alpha_strength = 3
  )

  aggregate(cbind(MCC, MSE, TPR, FPR) ~ Method, data = res, FUN = mean)

} else {
  message(
    "Package 'glmgraph' is not installed. ",
    "run_simulation() is unavailable.\n",
    "Install with: devtools::install_github(\"cran/glmgraph\")"
  )
}
```

## Application context

The methods in this package were developed for neuroimaging-genetics studies
using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
Brain region volumes from T1-weighted MRI serve as the network-structured
exposures, SNPs from genome-wide genotyping serve as candidate instruments,
and the Mini-Mental State Examination (MMSE) score serves as the cognitive
outcome. The Desikan-Killiany-Tourville (DKT) atlas defines a graph over 62
cortical and subcortical regions of interest (ROIs). In that setting,
`ivgl_s()` identifies a conservative set of eight causal ROIs robust to
pleiotropic SNP effects, while `iv_lasso()` and `ivgl()` select broader
sets that include confounded associations. See the companion paper for full
details.