--- title: "Calibrating with a Weakly-Informative, Biased LLM" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Calibrating with a Weakly-Informative, Biased LLM} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", echo = FALSE) library(knitr) # Fits at N = 100,000 take several seconds each, so this vignette CANNOT compute # live. Results are precomputed by data-raw/precompute_largeN.R, which writes # vignettes/largeN_results.rds. We load that if present, else fall back to the # cached numbers below (so the vignette always renders). load_largeN <- function() { for (p in c("largeN_results.rds", file.path("..", "vignettes", "largeN_results.rds"))) { if (file.exists(p)) { r <- tryCatch(readRDS(p), error = function(e) NULL) if (!is.null(r)) return(r) } } NULL } res <- load_largeN() if (is.null(res)) { lam <- seq(0, 1, by = 0.1) res <- list( n = 500, N = 100000, n_reps = 16, opt_lambda = 0.275, grid_argmin = 0.10, per_rep_opt = c(0.275, 0.226, 0.052, 0.005, 0.190, 0.086, 0.175, 0.186, 0.139, 0.173, 0.211, 0.170, 0.027, 0.119, 0.147, 0.131), naive_bias = c(a = -0.119, d = 0.248), human_bias = c(a = 0.000, d = 0.001), curve = data.frame( lambda = lam, bias_a = c(0.0002, 0.0008, 0.0019, 0.0034, 0.0053, 0.0077, 0.0105, 0.0139, 0.0178, 0.0222, 0.0273), bias_a_se = c(0.0190, 0.0187, 0.0185, 0.0186, 0.0187, 0.0190, 0.0195, 0.0200, 0.0207, 0.0215, 0.0225), bias_d = c(0.0011, 0.0008, 0.0006, 0.0004, 0.0003, 0.0003, 0.0003, 0.0005, 0.0007, 0.0010, 0.0014), bias_d_se = c(0.0130, 0.0120, 0.0111, 0.0103, 0.0098, 0.0095, 0.0094, 0.0096, 0.0101, 0.0108, 0.0116), risk = c(0.0389, 0.0381, 0.0382, 0.0391, 0.0409, 0.0435, 0.0470, 0.0514, 0.0568, 0.0631, 0.0705), risk_se = c(0.0014, 0.0013, 0.0013, 0.0014, 0.0015, 0.0016, 0.0018, 0.0021, 0.0025, 0.0029, 0.0034) ) ) } ``` This vignette treats the regime prediction-powered inference is built for: a smaller human sample (here `n = 500`) alongside a much larger synthetic / LLM sample (`N = 100000`). The LLM here is biased, in that its item parameters are systematically off, making them only weakly informative about the human responses. Importantly, the mixed-subjects (PPI) estimator is asymptotically unbiased for the true human parameters at every $\lambda$. Tuning $\lambda$ is an efficiency knob, not a bias knob. A naive fit that pools the human and LLM responses has no such protection: the `n = 500` humans are outvoted by the `N = 100000` rows of LLM-generated responses, and the estimate inherits the LLM's biased data generating process. ```{r note-precompute, echo = FALSE, results = "asis"} cat(sprintf( "_All numbers are precomputed (`data-raw/precompute_largeN.R`): n = %d, N = %d, %d Monte Carlo replications._", res$n, res$N, res$n_reps)) ``` ## The setup Human responses come from a true 8-item 2PL model (`a ∈ [0.8, 1.6]`, `d ∈ [-1.1, 1.1]`). The LLM is a shifted version with discriminations attenuated by 10% and intercepts shifted up by 0.25. This makes the response structure plausible but biased: ```r true_pars <- data.frame(item = paste0("Item", 1:8), a = seq(0.8, 1.6, length.out = 8), d = seq(-1.1, 1.1, length.out = 8)) llm <- true_pars llm$a <- 0.9 * true_pars$a # ~10% attenuated discriminations llm$d <- true_pars$d + 0.25 # +0.25 intercept shift theta <- rnorm(500) observed <- simulate_2pl(theta, true_pars) # n = 500 human predicted <- simulate_2pl(theta, llm) # paired LLM (same people) generated <- simulate_2pl(rnorm(100000), llm) # N = 100000 unlabeled LLM ``` ## Naive pooling inherits the bias The obvious move is to pool everything and fit one model: ```r naive <- fit_2pl(rbind(observed, generated)) # 500 human + 100000 LLM rows ``` The 500 humans are in the fit, but against 100,000 LLM rows their information is washed out, and the estimate is dragged onto the LLM's shifted parameters: ```{r naive-line, echo = FALSE, results = "asis"} cat(sprintf( "Averaged over the replications, the naive estimator's item-parameter bias is **%+.3f** in the slopes and **%+.3f** in the intercepts — essentially the LLM's shift (−0.1·a, +0.25). Because `N = 100000`, that wrong answer is estimated *very precisely* (a tiny standard error); more LLM data only sharpens the bias.", res$naive_bias[["a"]], res$naive_bias[["d"]])) ``` ## $\lambda$ moves efficiency, not bias The mixed-subjects estimator minimizes the loss $$L_o^{\mathrm{marg}}(\gamma) \;+\; \lambda\bigl[L_g^{\mathrm{marg}}(\gamma) - L_p^{\mathrm{marg}}(\gamma)\bigr].$$ At the true parameters the human loss is mean-zero and the paired correction `L_g − L_p` is also mean-zero, so the estimating equation is mean-zero for every $\lambda$. Unbiasedness comes from this structure, not from a specific value of $\lambda$. To see it directly, we fit the estimator across a grid of $\lambda$ values and track two things: the item-parameter bias (Monte Carlo mean of `estimate − truth`) and the model-based ability-score risk $\mathbb{E}\big[g'\Sigma_\gamma(\lambda) g\big]$ (the quantity the tuner actually minimizes). ```{r bias-curve, echo = FALSE, fig.width = 7, fig.height = 3.3, fig.alt = "Item-parameter bias of the mixed-subjects estimator is flat near zero across all lambda, far from the naive pooled bias shown as dashed reference lines."} if (requireNamespace("ggplot2", quietly = TRUE)) { library(ggplot2) cv <- res$curve bd <- rbind( data.frame(lambda = cv$lambda, bias = cv$bias_a, se = cv$bias_a_se, parameter = "discrimination (a)"), data.frame(lambda = cv$lambda, bias = cv$bias_d, se = cv$bias_d_se, parameter = "difficulty (d)") ) refs <- data.frame( parameter = c("discrimination (a)", "difficulty (d)"), naive = c(res$naive_bias[["a"]], res$naive_bias[["d"]]) ) ggplot(bd, aes(lambda, bias)) + geom_hline(yintercept = 0, colour = "grey70") + geom_hline(data = refs, aes(yintercept = naive), linetype = 2, colour = "firebrick") + geom_ribbon(aes(ymin = bias - 2 * se, ymax = bias + 2 * se), alpha = 0.15) + geom_line() + geom_point(size = 1.6) + facet_wrap(~ parameter) + labs(x = expression(lambda), y = "item-parameter bias", title = "Bias is flat at ~0 for every λ (dashed red = naive pooled bias)") + theme_minimal() } ``` The mixed-subjects bias sits on zero across the entire range of $\lambda$ (the shaded band is $\pm 2$ Monte Carlo SE); the dashed red lines mark the naive pooled bias. Tuning $\lambda$ changes efficiency: ```{r risk-curve, echo = FALSE, fig.width = 5.4, fig.height = 3.3, fig.alt = "Model-based ability-score risk as a function of lambda, with a shallow minimum near the optimized lambda."} if (requireNamespace("ggplot2", quietly = TRUE)) { cv <- res$curve vline <- cv$lambda[which.min(cv$risk)] # minimum of the (averaged) risk curve rug <- data.frame(lambda = res$per_rep_opt) ggplot(cv, aes(lambda, risk)) + geom_ribbon(aes(ymin = risk - 2 * risk_se, ymax = risk + 2 * risk_se), alpha = 0.15) + geom_line() + geom_point(size = 1.6) + geom_vline(xintercept = vline, linetype = 3, colour = "steelblue") + geom_rug(data = rug, aes(x = lambda), inherit.aes = FALSE, sides = "b", colour = "firebrick", alpha = 0.6) + annotate("text", x = vline, y = max(cv$risk), hjust = -0.1, label = sprintf("avg-curve min ~ λ = %.1f", vline), colour = "steelblue", size = 3.2) + labs(x = expression(lambda), y = "model-based ability-score risk", title = "Efficiency curve (shallow); red ticks = per-dataset optima") + theme_minimal() } ``` ```{r efficiency-line, echo = FALSE, results = "asis"} cv <- res$curve argmin <- cv$lambda[which.min(cv$risk)] gain <- (cv$risk[cv$lambda == 0] - min(cv$risk)) / cv$risk[cv$lambda == 0] cat(sprintf( paste0("For this weakly-informative LLM the averaged risk curve is shallow ", "and rises for larger λ: leaning on a poorly-correlated predictor *adds* ", "measurement error to latent ability. Its minimum sits near λ ≈ %.1f, only", "about %.0f%% below the λ = 0 (human-only) risk — almost no efficiency to be", "gained. Because the curve is so flat, each individual dataset's optimum ", "scatters around this value (the red ticks); see the next section. Every point", "on the curve is unbiased."), argmin, 100 * gain)) ``` ## Choosing $\lambda$ The curve above was sampled on a grid only to draw the surface using `tune_lambda_ability_risk(..., method = "grid")`. To choose an operating $\lambda$ you do not need a grid at all. By default, `tune_lambda_ability_risk()` selects $\lambda$ by direct optimization of the risk over `[0, 1]` (`stats::optimize()`): ```r # Direct optimization is the default (method = "optimize"). tuned <- tune_lambda_ability_risk( observed = observed, predicted = predicted, generated = generated, target_resp = observed, initial_pars = human_start$pars, fit_fn = fit_mixed_subjects_mml, n_quad = 11 ) tuned$best_lambda # continuous lambda # Pass method = "grid" (and a lambda_grid) to scan instead -- how the curve # above was drawn. lambda_grid otherwise just bounds the optimizer's search. ``` ```{r tune-line, echo = FALSE, results = "asis"} pr <- res$per_rep_opt cat(sprintf( paste0("The optimizer returns the minimizer of this dataset's risk surface. ", "Here, λ = %.2f. Every dataset has its own (noisy) risk surface, so its ", "optimal λ varies. Across the %d replications the per-dataset optimum ", "averaged %.2f and ranged [%.1f, %.1f], scattering around the minimum of the ", "*averaged* curve (≈ %.1f). (These are not the same point — the minimum of the ", "average risk is not the average of the per-dataset minima.) The scatter is wide ", "here because the surface is shallow; informative predictions sharpens it."), res$opt_lambda, res$n_reps, mean(pr), min(pr), max(pr), res$grid_argmin)) ``` (The 2-fold cross-fitted tuner, `tune_lambda_ability_risk_crossfit()`, lands at the same place: at $N \gg n$ the cross-fit $\lambda$-inflation $N/(N + n/2)$ vs $N/(N + n)$ is negligible, so cross-fitting does not change the selected $\lambda$.) ## Takeaways 1. The mixed-subjects estimator is unbiased for the true human parameters at every $\lambda$; pooling lets a large biased LLM sample outvote the human anchor and inherits its bias. 2. $\lambda$ tuning is performed directly and efficiently. `tune_lambda_ability_risk()` selects $\lambda$ by direct 1-D optimization by default; a grid (`method = "grid"`) is just a convenient way to visualize the whole risk surface. ## Reproducing `data-raw/precompute_largeN.R` runs the Monte Carlo over the λ grid and the direct optimization, and writes the cached results (`Rscript data-raw/precompute_largeN.R [n_reps] [cores] [N]`). At `N = 100000` each fit takes several seconds, so it is run once offline rather than during vignette knitting; pass a larger `N` to confirm the picture is unchanged.