This vignette presents a
simulation study validating the scalar 2PL marginal maximum-likelihood
(MML) PPI++ estimator (fit_mixed_subjects_mml) and its
ability-risk tuning (tune_lambda_ability_risk). The full
reproducible harness lives in the package’s simulations/
directory; the tables below load the most recent run from
simulations/results/ when available and otherwise show
cached values. Rep counts: \(\lambda\)-selection 100, coverage 200,
downstream 100, cross-fit 100.
The study answers five questions:
Human responses come from a true 8-item 2PL model
(a ∈ [0.8, 1.6], d ∈ [-1.1, 1.1]) with
n = 400 human subjects and N = 1200 generated
subjects, abilities drawn from a standard normal. Four predictor regimes
vary the quality of the paired LLM predictions, F:
| Regime | Predicted F | Role |
|---|---|---|
| R1: perfect prediction (F=Y) | predicted = observed (exact) | perfect predictor |
| R2: same-DGP draw | fresh binary draw from the true model | modest real signal |
| R3: independent noise | binary draw from scrambled item parameters | uninformative LLM |
| R4: LLM shift | binary draw from attenuated/shifted parameters | biased but informative LLM |
All responses (observed and predicted) are binary \(Y_{ij}, F_{ij} \in \{0,1\}\). Note that the
package does not currently accept probability predictions for
predicted/generated (see the note at the end
of Scenario 1) or polytomous responses. The control parameter \(\lambda \in [0,1]\) sets how strongly the
generated LLM sample is used after subtracting the paired LLM
correction; \(\lambda = 0\) is a
human-only calibration.
For each regime we tune \(\lambda\) by ability-score risk and record the selected value; we also report the theoretical PPI++-score \(\lambda\) for reference.
Here \(\lambda\) is selected by
tune_lambda_ability_risk(), which optimizes the risk
directly. We do not bother using the cross-fit estimator, as we only are
concerned with \(\lambda\)
magnitude.
| label | mean_risk | median_risk | prop_zero | mean_ppi |
|---|---|---|---|---|
| R1: perfect (F=Y) | 0.750 | 0.750 | 0.00 | 0.750 |
| R2: same-DGP draw | 0.119 | 0.108 | 0.07 | 0.004 |
| R3: independent noise | 0.063 | 0.040 | 0.35 | 0.000 |
| R4: LLM shift | 0.105 | 0.104 | 0.16 | 0.002 |
Takeaways:
N / (n + N) = 1200 / 1600, the
theoretical maximum for a perfect paired predictor at these values of
\(n\) and \(N\). The PPI++ estimated \(\lambda\) from minimizing \(\text{Tr}\big [ \Sigma_\gamma \big]\)
agrees.Note that predictions must be sampled responses, not
probabilities. We require
predicted/generated responses and reject
probability inputs. This is not a convenience restriction. The response
vector enters inside a log-sum over quadrature points, so feeding
probabilities breaks the identity \(\mathbb{E}\big[\nabla L_{gen}\big] =
\mathbb{E}\big[\nabla L_{pred}\big]\) that makes the PPI loss
correction mean-zero. The estimator then becomes biased at \(\lambda > 0\), and at moderate \(\lambda\) the objective is unbounded for
discrimination parameters, causing the fit to diverge. Practically: if
you have model-derived probabilities, sample responses from them
(e.g. rbinom) before calibrating.
We fix \(\lambda\) = 0.5 and compare
empirical coverage of Wald intervals built from two covariance estimates
of the same fit: the Louis-corrected marginal sandwich
(vcov() dispatch) and the EM complete-data Hessian bread
(vcov_mixed_subjects()).
Note that the PPI correction $\lambda$(L_gen − L_pred)
is mean-zero whenever the paired and generated pseudo-responses are
drawn from the same distribution with the same ability spread, so the
human term anchors the estimand to the true parameters. This holds for
every simulation condition: R1, R2, R3 (useless LLM) and R4 (biased
LLM). As such, the estimator stays consistent for the truth even when
the LLM is uninformative or biased.
| label | louis_cov_90 | louis_cov_95 | em_cov_90 | em_cov_95 | mean_se_ratio |
|---|---|---|---|---|---|
| R1 perfect (F=Y) | 0.909 | 0.955 | 0.713 | 0.787 | 1.626 |
| R2 same-DGP draw | 0.916 | 0.957 | 0.727 | 0.797 | 1.616 |
| R3 independent noise | 0.914 | 0.961 | 0.721 | 0.795 | 1.651 |
| R4 LLM shift | 0.905 | 0.954 | 0.726 | 0.792 | 1.622 |
See that the Louis correction restores nominal coverage while the EM bread in the sandwich covariance under-covers. Across all four conditions, including the useless LLM (R3) and the biased LLM (R4), Louis intervals attain ~0.91/0.96 against nominal 0.90/0.95, while the EM bread covers only ~0.71/0.79; it understates uncertainty because it ignores the missing information about each subject’s latent ability.
We score a held-out sample of 1000 subjects with known ability and
compare ability-score RMSE for the human-only calibration (\(\lambda\) = 0) and the tuned MML
calibration. mean_delta = RMSE(tuned) − RMSE(human);
negative is an improvement. We report the paired 95% CI of
mean_delta and prop_improve (the fraction of
replications where tuned beats human-only) so a small stable effect can
be told apart from noise.
| label | mean_lambda | rmse_human | rmse_tuned | mean_delta | delta_lo | delta_hi | prop_improve | bias_a |
|---|---|---|---|---|---|---|---|---|
| R1: perfect (F=Y) | 0.750 | 1.4961 | 1.4923 | -0.0038 | -0.0073 | -0.0003 | 0.48 | 0.010 |
| R2: same-DGP draw | 0.119 | 1.5023 | 1.5005 | -0.0019 | -0.0027 | -0.0011 | 0.59 | 0.025 |
| R3: independent noise | 0.063 | 1.5037 | 1.5030 | -0.0006 | -0.0009 | -0.0003 | 0.50 | 0.035 |
| R4: LLM shift | 0.105 | 1.5064 | 1.5046 | -0.0018 | -0.0030 | -0.0007 | 0.50 | 0.008 |
Takeaways.
mean_delta is negative and its paired 95% CI excludes
zero.Cross-fitting tunes \(\lambda\) on
held-out folds so a fold’s own labels are not used to tune the \(\lambda\) applied to its paired correction.
We compare the non-cross-fitted tuner against
tune_lambda_ability_risk_crossfit (MML per-fold tuning,
scalar-mean MML final fit) on selected \(\lambda\), item-parameter bias, coverage,
and held-out RMSE.
| label | lambda_nocf | lambda_cf | bias_a_nocf | bias_a_cf | rmse_nocf | rmse_cf | cover_nocf | cover_cf |
|---|---|---|---|---|---|---|---|---|
| R1 perfect (F=Y) | 0.750 | 0.858 | 0.0104 | 0.0100 | 1.4923 | 1.4927 | 0.954 | 0.949 |
| R2 same-DGP draw | 0.119 | 0.135 | 0.0250 | 0.0258 | 1.5005 | 1.5005 | 0.954 | 0.956 |
| R4 LLM shift | 0.100 | 0.115 | 0.0374 | 0.0372 | 1.5027 | 1.5029 | 0.953 | 0.956 |
Cross-fitting does not change observed behavior in these simulations, but guards against finite-sample bias. Item-parameter bias, held-out RMSE, and 95% Wald coverage are essentially identical between the non-cross-fitted and cross-fitted tuners (differences in the fourth decimal). Because the downstream quantities are unchanged, the cheaper non-cross-fitted tuner carries value for intermediate testing and exploration, but prefer cross-fitting for final model fitting.
Scenario 2 validated the covariance formula at a fixed \(\lambda\), but the \(\lambda\) we use is estimated from data,
and vcov() treats \(\lambda\) as fixed. As such, it does not
propagate the uncertainty in \(\hat{\lambda}\) into \(\hat{\Sigma}_\gamma\). Additionally,
choosing \(\lambda\) on the same data
used to estimate the item parameters induces finite-sample bias in those
estimates (and, in principle, anti-conservative intervals). Split-sample
(cross-fit) \(\lambda\) estimation
removes that bias.
This scenario measures the size of the effect by comparing coverage
of the true item parameters at three operating points: fixed \(\lambda\) = 0.5, the same-data tuned \(\hat{\lambda}\) with
vcov(best_fit), and the cross-fit tuned \(\hat{\lambda}\) with
vcov(final_fit). It reuses the Scenario 2 seeds, so the
fixed-\(\lambda\) column reproduces
Scenario 2 exactly.
| label | lambda_sd | lambda_xf | fixed_95 | samedata_95 | crossfit_95 |
|---|---|---|---|---|---|
| R1 perfect (F=Y) | 0.751 | 0.859 | 0.955 | 0.955 | 0.953 |
| R2 same-DGP draw | 0.116 | 0.135 | 0.957 | 0.956 | 0.956 |
| R3 independent noise | 0.053 | 0.074 | 0.961 | 0.958 | 0.958 |
| R4 LLM shift | 0.103 | 0.130 | 0.954 | 0.958 | 0.958 |
And the matching item-parameter (discrimination) bias:
| label | bias_a_fixed | bias_a_samedata | bias_a_crossfit |
|---|---|---|---|
| R1 perfect (F=Y) | 0.0097 | 0.0063 | 0.0051 |
| R2 same-DGP draw | 0.0280 | 0.0224 | 0.0222 |
| R3 independent noise | 0.0365 | 0.0286 | 0.0289 |
| R4 LLM shift | 0.0207 | 0.0159 | 0.0166 |
Takeaways.
Cross-fitting is the principled default, despite the same-data tuning performing well. It is guaranteed free of the finite-sample tuning bias.
| Question | Result |
|---|---|
| Does \(\lambda\)-selection track predictor quality? | Yes: better predictors select higher \(\lambda\) |
| Do standard errors achieve appropriate coverage? | Yes: nominal coverage, including under a biased predictor (R3/R4) |
| Does the method improve downstream scoring? | No average harm demonstrated and per-rep improvement in about half of reps |
| What is the role of cross-fitting? | Cross-fitting removes finite sample bias, despite reporting similar bias/RMSE/coverage |
| Is coverage valid at the tuned \(\lambda\)? | Yes |
From the package root (second argument is the number of cores):
# Rscript simulations/run_lambda_selection.R 100 8
# Rscript simulations/run_coverage.R 200 8
# Rscript simulations/run_downstream.R 100 8
# Rscript simulations/run_crossfit.R 50 8
# Rscript simulations/figures.RSee simulations/README.md for the full design, the
deterministic per-task seeding (results are identical serially or in
parallel), and interpretation notes.