---
title: "Calibrating with a Weakly-Informative, Biased LLM"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Calibrating with a Weakly-Informative, Biased LLM}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", echo = FALSE)
library(knitr)

# Fits at N = 100,000 take several seconds each, so this vignette CANNOT compute
# live. Results are precomputed by data-raw/precompute_largeN.R, which writes
# vignettes/largeN_results.rds. We load that if present, else fall back to the
# cached numbers below (so the vignette always renders).
load_largeN <- function() {
  for (p in c("largeN_results.rds",
              file.path("..", "vignettes", "largeN_results.rds"))) {
    if (file.exists(p)) {
      r <- tryCatch(readRDS(p), error = function(e) NULL)
      if (!is.null(r)) return(r)
    }
  }
  NULL
}

res <- load_largeN()
if (is.null(res)) {
  lam <- seq(0, 1, by = 0.1)
  res <- list(
    n = 500, N = 100000, n_reps = 16,
    opt_lambda = 0.275, grid_argmin = 0.10,
    per_rep_opt = c(0.275, 0.226, 0.052, 0.005, 0.190, 0.086, 0.175, 0.186,
                    0.139, 0.173, 0.211, 0.170, 0.027, 0.119, 0.147, 0.131),
    naive_bias = c(a = -0.119, d = 0.248),
    human_bias = c(a = 0.000, d = 0.001),
    curve = data.frame(
      lambda    = lam,
      bias_a    = c(0.0002, 0.0008, 0.0019, 0.0034, 0.0053, 0.0077, 0.0105, 0.0139, 0.0178, 0.0222, 0.0273),
      bias_a_se = c(0.0190, 0.0187, 0.0185, 0.0186, 0.0187, 0.0190, 0.0195, 0.0200, 0.0207, 0.0215, 0.0225),
      bias_d    = c(0.0011, 0.0008, 0.0006, 0.0004, 0.0003, 0.0003, 0.0003, 0.0005, 0.0007, 0.0010, 0.0014),
      bias_d_se = c(0.0130, 0.0120, 0.0111, 0.0103, 0.0098, 0.0095, 0.0094, 0.0096, 0.0101, 0.0108, 0.0116),
      risk      = c(0.0389, 0.0381, 0.0382, 0.0391, 0.0409, 0.0435, 0.0470, 0.0514, 0.0568, 0.0631, 0.0705),
      risk_se   = c(0.0014, 0.0013, 0.0013, 0.0014, 0.0015, 0.0016, 0.0018, 0.0021, 0.0025, 0.0029, 0.0034)
    )
  )
}
```

This vignette treats the regime prediction-powered inference is built for: a smaller human sample (here `n = 500`) alongside a much larger synthetic / LLM sample (`N = 100000`). The LLM here is biased, in that its item parameters are systematically off, making them only weakly informative about the human responses.

Importantly, the mixed-subjects (PPI) estimator is asymptotically unbiased for the true human parameters at every $\lambda$. Tuning $\lambda$ is an efficiency knob, not a bias knob. A naive fit that pools the human and LLM responses has no such protection: the `n = 500` humans are outvoted by the `N = 100000` rows of LLM-generated responses, and the estimate inherits the LLM's biased data generating process.

```{r note-precompute, echo = FALSE, results = "asis"}
cat(sprintf(
  "_All numbers are precomputed (`data-raw/precompute_largeN.R`): n = %d, N = %d, %d Monte Carlo replications._",
  res$n, res$N, res$n_reps))
```

## The setup

Human responses come from a true 8-item 2PL model (`a ∈ [0.8, 1.6]`, `d ∈ [-1.1, 1.1]`). The LLM is a shifted version with discriminations attenuated by 10% and intercepts shifted up by 0.25. This makes the response structure plausible but biased:

```r
true_pars <- data.frame(item = paste0("Item", 1:8),
                        a = seq(0.8, 1.6, length.out = 8),
                        d = seq(-1.1, 1.1, length.out = 8))
llm   <- true_pars
llm$a <- 0.9 * true_pars$a       # ~10% attenuated discriminations
llm$d <- true_pars$d + 0.25      # +0.25 intercept shift

theta     <- rnorm(500)
observed  <- simulate_2pl(theta, true_pars)             # n = 500 human
predicted <- simulate_2pl(theta, llm)                   # paired LLM (same people)
generated <- simulate_2pl(rnorm(100000), llm)           # N = 100000 unlabeled LLM
```

## Naive pooling inherits the bias

The obvious move is to pool everything and fit one model:

```r
naive <- fit_2pl(rbind(observed, generated))   # 500 human + 100000 LLM rows
```

The 500 humans are in the fit, but against 100,000 LLM rows their information is washed out, and the estimate is dragged onto the LLM's shifted parameters:

```{r naive-line, echo = FALSE, results = "asis"}
cat(sprintf(
  "Averaged over the replications, the naive estimator's item-parameter bias is **%+.3f** in the slopes and **%+.3f** in the intercepts — essentially the LLM's shift (−0.1·a, +0.25). Because `N = 100000`, that wrong answer is estimated *very precisely* (a tiny standard error); more LLM data only sharpens the bias.",
  res$naive_bias[["a"]], res$naive_bias[["d"]]))
```

## $\lambda$ moves efficiency, not bias

The mixed-subjects estimator minimizes the loss

$$L_o^{\mathrm{marg}}(\gamma) \;+\; \lambda\bigl[L_g^{\mathrm{marg}}(\gamma) - L_p^{\mathrm{marg}}(\gamma)\bigr].$$

At the true parameters the human loss is mean-zero and the paired correction `L_g − L_p` is also mean-zero, so the estimating equation is mean-zero for every $\lambda$. Unbiasedness comes from this structure, not from a specific value of $\lambda$. To see it directly, we fit the estimator across a grid of $\lambda$ values and track two things: the item-parameter bias (Monte Carlo mean of `estimate − truth`) and the model-based ability-score risk $\mathbb{E}\big[g'\Sigma_\gamma(\lambda) g\big]$ (the quantity the tuner actually minimizes).

```{r bias-curve, echo = FALSE, fig.width = 7, fig.height = 3.3, fig.alt = "Item-parameter bias of the mixed-subjects estimator is flat near zero across all lambda, far from the naive pooled bias shown as dashed reference lines."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  cv <- res$curve
  bd <- rbind(
    data.frame(lambda = cv$lambda, bias = cv$bias_a, se = cv$bias_a_se,
               parameter = "discrimination (a)"),
    data.frame(lambda = cv$lambda, bias = cv$bias_d, se = cv$bias_d_se,
               parameter = "difficulty (d)")
  )
  refs <- data.frame(
    parameter = c("discrimination (a)", "difficulty (d)"),
    naive = c(res$naive_bias[["a"]], res$naive_bias[["d"]])
  )
  ggplot(bd, aes(lambda, bias)) +
    geom_hline(yintercept = 0, colour = "grey70") +
    geom_hline(data = refs, aes(yintercept = naive), linetype = 2, colour = "firebrick") +
    geom_ribbon(aes(ymin = bias - 2 * se, ymax = bias + 2 * se), alpha = 0.15) +
    geom_line() + geom_point(size = 1.6) +
    facet_wrap(~ parameter) +
    labs(x = expression(lambda), y = "item-parameter bias",
         title = "Bias is flat at ~0 for every λ (dashed red = naive pooled bias)") +
    theme_minimal()
}
```

The mixed-subjects bias sits on zero across the entire range of $\lambda$ (the shaded band is $\pm 2$ Monte Carlo SE); the dashed red lines mark the naive pooled bias. Tuning $\lambda$ changes efficiency:

```{r risk-curve, echo = FALSE, fig.width = 5.4, fig.height = 3.3, fig.alt = "Model-based ability-score risk as a function of lambda, with a shallow minimum near the optimized lambda."}
if (requireNamespace("ggplot2", quietly = TRUE)) {
  cv <- res$curve
  vline <- cv$lambda[which.min(cv$risk)]   # minimum of the (averaged) risk curve
  rug <- data.frame(lambda = res$per_rep_opt)
  ggplot(cv, aes(lambda, risk)) +
    geom_ribbon(aes(ymin = risk - 2 * risk_se, ymax = risk + 2 * risk_se), alpha = 0.15) +
    geom_line() + geom_point(size = 1.6) +
    geom_vline(xintercept = vline, linetype = 3, colour = "steelblue") +
    geom_rug(data = rug, aes(x = lambda), inherit.aes = FALSE, sides = "b",
             colour = "firebrick", alpha = 0.6) +
    annotate("text", x = vline, y = max(cv$risk), hjust = -0.1,
             label = sprintf("avg-curve min ~ λ = %.1f", vline),
             colour = "steelblue", size = 3.2) +
    labs(x = expression(lambda), y = "model-based ability-score risk",
         title = "Efficiency curve (shallow); red ticks = per-dataset optima") +
    theme_minimal()
}
```

```{r efficiency-line, echo = FALSE, results = "asis"}
cv <- res$curve
argmin <- cv$lambda[which.min(cv$risk)]
gain   <- (cv$risk[cv$lambda == 0] - min(cv$risk)) / cv$risk[cv$lambda == 0]
cat(sprintf(
  paste0("For this weakly-informative LLM the averaged risk curve is shallow ",
  "and rises for larger λ: leaning on a poorly-correlated predictor *adds* ",
  "measurement error to latent ability. Its minimum sits near λ ≈ %.1f, only",
  "about %.0f%% below the λ = 0 (human-only) risk — almost no efficiency to be",
  "gained. Because the curve is so flat, each individual dataset's optimum ",
  "scatters around this value (the red ticks); see the next section. Every point",
  "on the curve is unbiased."), argmin, 100 * gain))
```

## Choosing $\lambda$

The curve above was sampled on a grid only to draw the surface using `tune_lambda_ability_risk(..., method = "grid")`. To choose an operating $\lambda$ you do not need a grid at all. By default, `tune_lambda_ability_risk()` selects $\lambda$ by direct optimization of the risk over `[0, 1]` (`stats::optimize()`):

```r
# Direct optimization is the default (method = "optimize").
tuned <- tune_lambda_ability_risk(
  observed = observed, predicted = predicted, generated = generated,
  target_resp = observed, initial_pars = human_start$pars,
  fit_fn = fit_mixed_subjects_mml, n_quad = 11
)
tuned$best_lambda            # continuous lambda

# Pass method = "grid" (and a lambda_grid) to scan instead -- how the curve
# above was drawn. lambda_grid otherwise just bounds the optimizer's search.
```

```{r tune-line, echo = FALSE, results = "asis"}
pr <- res$per_rep_opt
cat(sprintf(
  paste0("The optimizer returns the minimizer of this dataset's risk surface. ",
  "Here, λ = %.2f. Every dataset has its own (noisy) risk surface, so its  ",
  "optimal λ varies. Across the %d replications the per-dataset optimum ",
  "averaged %.2f and ranged [%.1f, %.1f], scattering around the minimum of the ",
  "*averaged* curve (≈ %.1f). (These are not the same point — the minimum of the ",
  "average risk is not the average of the per-dataset minima.) The scatter is wide ",
  "here because the surface is shallow; informative predictions sharpens it."),
  res$opt_lambda, res$n_reps, mean(pr), min(pr), max(pr), res$grid_argmin))
```

(The 2-fold cross-fitted tuner, `tune_lambda_ability_risk_crossfit()`, lands at the same place: at $N \gg n$ the cross-fit $\lambda$-inflation $N/(N + n/2)$ vs $N/(N + n)$ is negligible, so cross-fitting does not change the selected $\lambda$.)

## Takeaways

1. The mixed-subjects estimator is unbiased for the true human parameters at every $\lambda$; pooling lets a large biased LLM sample outvote the human anchor and inherits its bias.
2. $\lambda$ tuning is performed directly and efficiently. `tune_lambda_ability_risk()` selects $\lambda$ by direct 1-D optimization by default; a grid (`method = "grid"`) is just a
   convenient way to visualize the whole risk surface.

## Reproducing

`data-raw/precompute_largeN.R` runs the Monte Carlo over the λ grid and the direct
optimization, and writes the cached results
(`Rscript data-raw/precompute_largeN.R [n_reps] [cores] [N]`). At `N = 100000` each
fit takes several seconds, so it is run once offline rather than during vignette
knitting; pass a larger `N` to confirm the picture is unchanged.