--- title: "Mixed-Subjects IRT Calibration" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Mixed-Subjects IRT Calibration} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4 ) ``` This vignette shows the recommended mixed-subjects workflow for a unidimensional 2PL model using the marginal maximum-likelihood (MML) estimator (`fit_mixed_subjects_mml`). The package expects three item response matrices with no respondent IDs and the same ordering of item columns: - `observed`: $n$ rows of binary human item responses - `predicted`: $n$ rows of binary LLM-predicted item responses for the human respondents; rows must be ordered to correspond to respondents in `observed` - `generated`: $N$ rows of additional binary LLM-generated item responses (typically $N \gg n$), generated from the same model and procedure as `predicted` All three matrices must contain binary responses $(Y_{ij} \in \{0,1\})$. Probability (fractional) predictions are **not** accepted, as they are not a valid likelihood input for the MML IRT objective and break the correction. If your prediction model outputs probabilities, sample binary responses from them first (e.g. `rbinom`). The fitted objective is ```{r, echo = FALSE, results = "asis"} cat( "$$L_o^{\\mathrm{marg}}(\\gamma) + \\lambda\\bigl[", "L_g^{\\mathrm{marg}}(\\gamma) - L_p^{\\mathrm{marg}}(\\gamma)\\bigr]$$" ) ``` where $\gamma$ is a vector of item parameters and each $L^\mathrm{marg}$ is the true IRT marginal negative log-likelihood, with posteriors recomputed from the current candidate $\gamma$ at every gradient step. Setting $\lambda = 0$ recovers the human-only MML calibration. See the [Choosing Lambda](lambda-tuning.html) vignette for the specific background on why the MML objective is preferred over expected-count based estimators. ## Simulate example data ```{r simulate} library(mixedsubjectsirt) library(ggplot2) set.seed(242424) n_human <- 400 n_generated <- 1200 n_items <- 8 true_pars <- data.frame( item = paste0("Item", seq_len(n_items)), a = seq(0.8, 1.6, length.out = n_items), d = seq(-1.1, 1.1, length.out = n_items) ) true_pars$b <- -true_pars$d / true_pars$a theta_human <- rnorm(n_human) observed <- simulate_2pl(theta_human, true_pars) # Strongly informative predictions. On the n labeled subjects, predictions # match human responses (the F = Y benchmark from the simulation study), and # N additional unlabeled responses are drawn from the same 2PL. This is # the "good predictor" regime, where the method should lean on unlabeled data. # (Further down we show the complementary case of an uninformative predictor, # where the criterion instead drives lambda to 0.) predicted <- observed generated <- simulate_2pl(rnorm(n_generated), true_pars) ``` ## Step 1: Fit the human baseline Baseline models are estimated using `mirt`.[^1] ```{r human-baseline} human_start <- fit_2pl(observed, technical = list(NCYCLES = 500)) ``` ## Step 2: Fit the MML mixed-subjects model `fit_mixed_subjects_mml()` uses a MML-based EM procedure for iterative estimation at a given $\lambda$ value. If you have a value of $\lambda$ already (say from a pilot study or previous ability tuning), the model can be estimated directly here. ```{r mml-fit} mixed_mml <- fit_mixed_subjects_mml( observed = observed, predicted = predicted, generated = generated, lambda = 0.5, initial_pars = human_start$pars ) mixed_mml ``` ```{r mml-pars} mixed_mml$item_pars ``` ## Step 3: Select $\lambda$ by ability-score risk `tune_lambda_ability_risk()` with `fit_fn = fit_mixed_subjects_mml` selects the $\lambda$ that minimizes propagated ability-score risk $\mathbb{E}\big[g'\Sigma_\gamma g\big]$ (where $\Sigma_\gamma$ is the Louis-corrected marginal sandwich covariance).[^2] By default this is done by direct 1-D optimization over `[0, 1]`. The final fit from this optimal $\lambda$ is also returned as the `best_fit` object within the output list, so the user is not required to call `fit_mixed_subjects_mml` again. ```{r tune-lambda} ability_tuned <- tune_lambda_ability_risk( observed = observed, predicted = predicted, generated = generated, target_resp = observed, initial_pars = human_start$pars, fit_fn = fit_mixed_subjects_mml, control = list(maxit = 200) ) ability_tuned$best_lambda ``` Because the predictor is highly informative, the approach selects a $\lambda$ near the theoretical maximum $N/(n+N) = 1200/1600 = 0.75$. As such, the method leans heavily on the unlabeled response data. (The returned `summary` records each $\lambda$ the optimizer evaluated, with `selection_risk = Inf` for any that failed to converge, so selection is protected against numerical failures. When the predictor is uninformative the same criterion drives $\lambda$ to 0 instead; see below.) ## Step 3b (recommended workflow): cross-fit $\lambda$ tuning `tune_lambda_ability_risk()` above estimated an appropriate $\lambda$ and fits the final model on the same data used for this estimation. Previous analysis of prediction-powered inference in finite samples shows that estimating $\lambda$ on the data you also estimate model parameters with is optimistic. Item parameters estimated this are biased in finite samples and may undercover true parameter values.[^3] `tune_lambda_ability_risk_crossfit()` removes this bias by tuning $\lambda$ on held-out data: each fold's $\lambda$ is chosen using only out of fold item responses, and the final fit combines them. Passing `fit_fn = fit_mixed_subjects_mml` for the per-fold tuning and `final_fit_fn = fit_mixed_subjects_mml` for the final fit produces a single scalar-$\lambda$ mixed subjects calibration fit (the fold-specific $\lambda$'s are averaged, weighted by fold size). ```{r crossfit-tune} cf_tuned <- tune_lambda_ability_risk_crossfit( observed = observed, predicted = predicted, generated = generated, initial_pars = human_start$pars, fit_fn = fit_mixed_subjects_mml, final_fit_fn = fit_mixed_subjects_mml, n_splits = 2, # the standard PPI sample split (also the default) control = list(maxit = 200) ) cf_tuned$lambda_by_split # one tuned lambda per held-out fold cf_tuned$lambda_final # fold-size-weighted scalar used for the final fit ``` The object `cf_tuned$final_fit` provides the final model fit from calibration, and `vcov(cf_tuned$final_fit)` gives its Louis-corrected covariance. We suggest use of the default `n_splits = 2`, the standard PPI sample split (one half tunes, the other estimates, then swap). You may notice the cross-fit $\lambda$ (here $\approx 0.85$) sits *above* the same-data value ($\approx 0.7$) and above the perfect-predictor ceiling $N/(n+N) = 0.75$. That is expected and is **not** evidence of a better $\lambda$: with two folds each $\lambda$ is tuned on only half the labeled subjects ($n_\text{train} = n/2 = 200$), and the PPI-optimal $\lambda$ grows as labeled data shrinks (it tracks $N/(N + n_\text{train})$, here $1200/1400 \approx 0.86$). The fold $\lambda$'s are tuned for $n_\text{train}$ but the final fit applies their average to the full sample, so they run a little higher than $\lambda$ estimated in the same sample; in operational settings where $N \gg n$, this difference shrinks to zero. Importantly, the reason to cross-fit is not the $\lambda$ value, but to remove finite-sample item parameter bias. The cheaper same-data tuner in Step 3 is fine for exploration; prefer the cross-fit estimate for operational calibrations or further research. ## Step 4: Inspect the covariance `vcov()` on a scalar-lambda MML fit automatically uses `vcov_mixed_subjects_mml()`, which applies Louis' observed-information correction.[^4] Here, the bread is $H_\mathrm{comp} - I_\mathrm{miss}$ rather than the EM complete-data Hessian alone. ```{r covariance} Sigma <- vcov(cf_tuned$final_fit) dim(Sigma) # 2J × 2J ``` ## Compare calibrations ```{r compare} human_only <- fit_mixed_subjects_mml( observed = observed, predicted = predicted, generated = generated, lambda = 0, initial_pars = human_start$pars ) estimates <- rbind( data.frame(estimator = "human only", human_only$item_pars), data.frame(estimator = "MML lambda = 0.5", mixed_mml$item_pars), data.frame(estimator = "MML ability-risk", ability_tuned$best_fit$item_pars) ) estimates$true_b <- true_pars$b[match(estimates$item, true_pars$item)] estimates$true_a <- true_pars$a[match(estimates$item, true_pars$item)] ggplot(estimates, aes(true_b, b, colour = estimator)) + geom_abline(slope = 1, intercept = 0, linewidth = 0.4) + geom_point(size = 2) + labs(x = "True difficulty", y = "Estimated difficulty", colour = NULL) + theme_minimal() ``` ## When the LLM is uninformative While the method is able to efficiently exploit information derived from a good predictor, it is also capable of rejecting useless information derived from a bad one. Here we simulate responses that are essentially unrelated to ability (drawn from scrambled item parameters), so the paired correction between `observed` and `predicted` carries no usable signal: ```{r uninformative-predictor} set.seed(242424) bad_pars <- true_pars bad_pars$a <- pmax(0.05, abs(rnorm(n_items, 0, 0.1))) # near-zero slopes bad_pars$d <- rnorm(n_items, 0, 2) # difficulties unrelated to truth predicted_bad <- simulate_2pl(theta_human, bad_pars) generated_bad <- simulate_2pl(rnorm(n_generated), bad_pars) tuned_bad <- tune_lambda_ability_risk( observed = observed, predicted = predicted_bad, generated = generated_bad, target_resp = observed, initial_pars = human_start$pars, fit_fn = fit_mixed_subjects_mml, control = list(maxit = 200) ) tuned_bad$best_lambda # expect ~0: the useless LLM is correctly ignored ``` The criterion drives $\lambda \to 0$, recovering the human-only item parameters. As such, the same $\lambda$ tuning procedure both embraces informative predictions (as in the main example, $\lambda \to N/(n+N)$) and rejects an uninformative predictions ($\lambda \approx 0$). ## Validation A full simulation study confirms the recommended workflow behaves as intended: - $\mathbf{\lambda}$-selection tracks predictor quality. A perfect paired predictor $(F = Y)$ selects $\lambda \approx N/(n+N)$; a useless predictor is down-weighted to $\lambda \approx 0$. - The Louis-corrected standard errors provide correct coverage. Across all simulation conditions, including a useless or biased predictor, `vcov()` on a scalar MML fit attains nominal Wald-interval coverage (~0.91/0.96) for both discriminations and intercepts, whereas the uncorrected EM-Hessian covariance under-covers (~0.71/0.79). - No average harm done to ability scoring. The tuned ability-score RMSE is no worse than the human-only calibration on average (every regime's mean $\Delta \text{RMSE} \leq 0$) when the predictor is uninformative. See the [Simulation Validation](simulation-validation.html) vignette for the full results, and `simulations/` in the source tree for the reproduction code. [^1]: [Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment._Journal of Statistical Software_, 48, 1-29.](https://doi.org/10.18637/jss.v048.i06) [^2]: [Liu, C. W., & Chalmers, R. P. (2021). A note on computing Louis’ observed information matrix identity for IRT and cognitive diagnostic models. _British Journal of Mathematical and Statistical Psychology_, 74(1), 118-138.](https://doi.org/10.1111/bmsp.12207) [^3]: [Mani, P., Xu, P., Lipton, Z. C., & Oberst, M. (2025). No free lunch: Non-asymptotic analysis of prediction-powered inference. _arXiv preprint arXiv:2505.20178_.](https://arxiv.org/abs/2505.20178) [^4]: [Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. _Journal of the Royal Statistical Society Series B: Statistical Methodology_, 44(2), 226-233.](https://doi.org/10.1111/j.2517-6161.1982.tb01203.x)