Package: mixedsubjectsirt 1.0.0

Klint Kanopka

mixedsubjectsirt: Item Response Theory Calibration with a Mixed Subjects Design

Integrates large language model generated item responses into psychometric calibration studies through a mixed-subjects design for unidimensional two-parameter and one-parameter logistic item response theory models. Human pilot responses are augmented with model-generated responses using a prediction-powered inference estimator (Angelopoulos, Bates, Fannjiang, Jordan and Zrnic (2023) <doi:10.1126/science.adi6000>; Angelopoulos, Duchi and Zrnic (2023) <doi:10.48550/arXiv.2311.01453>) adapted to marginal maximum-likelihood estimation, following the mixed-subjects design of Broska, Howes and van Loon (2025) <doi:10.1177/00491241251326865>. The estimator is anchored to the human responses and is asymptotically unbiased for the human item parameters at any tuning weight; the weight on the synthetic responses is chosen to minimize propagated ability-score risk, down-weighting uninformative or biased generated responses. Louis-corrected sandwich standard errors, ability scoring, cross-fitted tuning, and scale linking are also provided.
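
As a rough illustration of the "anchored to the human responses" idea, the following base-R sketch applies a PPI++-style correction to a simple mean rather than to IRT item parameters. It is not the package's estimator: the function `ppi_mean`, the sample sizes, and the response probabilities are all hypothetical, and the plug-in weight is the mean-estimation formula from the PPI++ paper cited above.

```r
# Toy PPI++-style estimate of a mean, base R only.
# Every name below is illustrative and is NOT part of mixedsubjectsirt.
set.seed(1)
n <- 200    # human (labeled) sample size
N <- 5000   # additional synthetic-only sample size

# Human outcomes with true mean 0.6
y <- rbinom(n, 1, 0.6)

# Paired synthetic predictions: correlated with y but biased upward
# (marginal success rate 0.6 * 0.9 + 0.4 * 0.4 = 0.70)
f_paired <- rbinom(n, 1, ifelse(y == 1, 0.9, 0.4))

# Extra synthetic responses drawn with the same marginal rate
f_extra <- rbinom(N, 1, 0.70)

# Human mean plus a lambda-weighted correction. The correction has mean zero
# because f_extra and f_paired share a distribution, so the estimate stays
# centered on the human mean for any lambda.
ppi_mean <- function(y, f_paired, f_extra, lambda) {
  mean(y) + lambda * (mean(f_extra) - mean(f_paired))
}

# Plug-in weight for mean estimation, clipped to [0, 1]
lambda_hat <- cov(y, f_paired) / ((1 + n / N) * var(f_paired))
lambda_hat <- min(max(lambda_hat, 0), 1)

est_ppi    <- ppi_mean(y, f_paired, f_extra, lambda_hat)
est_pooled <- mean(c(y, f_extra))   # naive pooling inherits the synthetic bias

print(c(lambda = lambda_hat, ppi = est_ppi, pooled = est_pooled))
```

Setting `lambda = 0` recovers the human-only mean, while naive pooling is pulled toward the synthetic rate of 0.70. The weight only changes how much variance the synthetic responses remove.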

Authors: Klint Kanopka [aut, cre]

mixedsubjectsirt_1.0.0.tar.gz
mixedsubjectsirt_1.0.0.tar.gz (r-4.7-any)
mixedsubjectsirt_1.0.0.tar.gz (r-4.6-any)
mixedsubjectsirt_1.0.0.tgz (r-4.6-emscripten)
manual.pdf | manual.html
DESCRIPTION | NEWS
card.svg | card.png
mixedsubjectsirt/json (API)

# Install 'mixedsubjectsirt' in R:
install.packages('mixedsubjectsirt', repos = c('https://cran.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/klintkanopka/mixedsubjectsirt/issues

Pkgdown/docs site: https://klintkanopka.com

3.60 score | 32 exports | 75 dependencies

Last updated from: 2e77546e86. Checks: 4 OK. Indexed: yes.

Target | Result | Time
linux-devel-x86_64 | OK | 214
source / vignettes | OK | 314
linux-release-x86_64 | OK | 210
wasm-release | OK | 157

Exports: ability_gradient, ability_gradient_1pl, ability_risk, ability_risk_1pl, diagnose_lambda_grid, fit_1pl, fit_2pl, fit_mixed_subjects, fit_mixed_subjects_1pl, fit_mixed_subjects_from_quadrature, fit_mixed_subjects_iterative, fit_mixed_subjects_mml, fit_mixed_subjects_mml_1pl, fit_mixed_subjects_split, link_item_parameters, make_quadrature, mixed_subjects_loss, mixed_subjects_quadrature, posterior_weights_2pl, score_theta, simulate_2pl, summarize_expected_counts, tune_lambda_ability_risk, tune_lambda_ability_risk_1pl, tune_lambda_ability_risk_crossfit, tune_lambda_ability_risk_item, tune_lambda_ppi_score, tune_lambda_ppi_score_1pl, tune_lambda_ppi_score_item, vcov_mixed_subjects, vcov_mixed_subjects_1pl, vcov_mixed_subjects_mml

Dependencies: audio, beepr, brio, callr, class, cli, clipr, cluster, codetools, crayon, dcurver, Deriv, desc, diffobj, digest, dplyr, e1071, evaluate, fs, future, future.apply, generics, globals, glue, GPArotation, gridExtra, gtable, jsonlite, lattice, lifecycle, listenv, magrittr, MASS, Matrix, mgcv, mirai, mirt, nanonext, nlme, otel, parallelly, pbapply, permute, pillar, pkgbuild, pkgconfig, pkgload, praise, processx, progressr, proxy, ps, qs2, R.methodsS3, R.oo, R.utils, R6, Rcpp, RcppArmadillo, RcppParallel, rlang, rmutil, rprojroot, sessioninfo, SimDesign, splines2, stringfish, testthat, tibble, tidyselect, utf8, vctrs, vegan, waldo, withr

Calibrating with a Weakly-Informative, Biased LLM
The setup | Naive pooling inherits the bias | $\lambda$ moves efficiency, not bias | Choosing $\lambda$ | Takeaways | Reproducing

Last update: 2026-06-25
Started: 2026-06-25

Choosing Lambda in Mixed-Subjects IRT
Two objectives, two estimators | Example data | Ability-risk tuning: Minimizing $\mathbb{E}[g'\Sigma_\gamma g]$ | Cross-fit $\lambda$ tuning (recommended workflow) | Frozen expected-count estimator (fast approximation) | Minimizing $\text{Tr}\big[\Sigma_\gamma\big]$ (diagnostic only) | Choosing a procedure

Last update: 2026-06-25
Started: 2026-06-25

IRT Linking and Gradient Asymmetry: Diagnostic Guide
Background | Background (frozen expected-count estimator) | Linking implementations | Simulation | Fitting human and LLM models | Applying the three methods | Parameter alignment after linking | TCC alignment | Gradient asymmetry: what linking fixes and what it does not | Lambda sweep: how $\lambda$ interacts with linking quality | The role of power tuning | Validation: what does $\lambda^*$ measure? | Test A — Perfect paired surrogate ($F = Y$) | Test B — Partially overlapping predictions | Test C — Stochastic LLM predictions (practical baseline) | Summary: PPI++ score vs. ability risk | Summary of findings | Recommendation | The marginal-MML fix

Last update: 2026-06-25
Started: 2026-06-25

Mixed-Subjects 1PL Calibration
Simulate a 1PL test | Step 1: Fit the 1PL baseline | Step 2: Fit mixed-subjects MML (1PL) | Step 3: Correct covariance — $(J+1) \times (J+1)$ sandwich | Step 4: Ability-score risk and lambda tuning | Step 5: Verify — F = Y gives lambda > 0 | Compare 1PL and 2PL | Ability-score risk: 1PL vs 2PL parameterization

Last update: 2026-06-25
Started: 2026-06-25

Mixed-Subjects IRT Calibration
Simulate example data | Step 1: Fit the human baseline | Step 2: Fit the MML mixed-subjects model | Step 3: Select $\lambda$ by ability-score risk | Step 3b (recommended workflow): cross-fit $\lambda$ tuning | Step 4: Inspect the covariance | Compare calibrations | When the LLM is uninformative | Validation

Last update: 2026-06-25
Started: 2026-06-25

Per-Item Lambda (Experimental)
Why per-item lambda? | Simulate a heterogeneous test | Step 1: Fit 2PL baseline and get global scalar lambda | Step 2: PPI++ score per item (fast diagnostic) | Step 3: Per-item ability-risk tuning | Step 4: Compare scalar vs. per-item parameter recovery | Important note on initialization | Approximation caveat

Last update: 2026-06-25
Started: 2026-06-25

Simulation Validation of the Mixed-Subjects MML Estimator
Design | Does $\lambda$-selection track predictor quality? | Do standard errors achieve appropriate coverage? | Does the method improve downstream scoring? | What is the role of cross-fitting? | Is coverage valid at the tuned $\lambda$? | Summary | Reproducing these results

Last update: 2026-06-25
Started: 2026-06-25

Understanding Ability-Risk Tuning
Why this vignette exists | Key Intuition | The three response matrices | 1. Observed human responses: $O$ | 2. Paired LLM-predicted human responses: $P$ | 3. Additional LLM-generated responses: $G$ | The mixed-subjects IRT objective | What lambda is learning | Ability-risk tuning | Why row alignment matters | Case A: perfect paired prediction | Case B: row-shuffled perfect predictions | Case C: same DGP, fresh Bernoulli draw | What kind of LLM data produces higher lambda? | One approach to row alignment: leave-one-item-out prediction | Another approach: covariate-based prediction | Something that probably won't work: item-text-only generation | How to generate $G$ | Summary | Technical Explanation | Overview: four objects, one objective | 1. The estimator and its estimating equation | 2. The sandwich covariance of $\hat\gamma$ | 3. Ability scoring and the implicit gradient | 4. Delta-method propagation and the risk | 5. Why this differs from the PPI++ trace objective

Last update: 2026-06-25
Started: 2026-06-25
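
The three cases named in this vignette's section list (perfect paired prediction, row-shuffled predictions, and a fresh draw from the same data-generating process) can be mimicked with a small base-R simulation. This is a sketch under stated assumptions, not the vignette's code: the 2PL parameterization `plogis(a * (theta - b))`, the helper names `draw` and `mean_slope`, and the per-item regression slope Cov(O, P) / Var(P) used as a stand-in for how informative the paired predictions are, are all hypothetical choices. The package's ability-risk criterion may weigh things differently.

```r
# Row-alignment toy: how strongly do paired predictions P track observed O?
# Names and parameterization are illustrative, not taken from mixedsubjectsirt.
set.seed(2)
n_person <- 500
n_item   <- 10
theta <- rnorm(n_person)            # person abilities
a     <- runif(n_item, 0.8, 2.0)    # item discriminations
b     <- rnorm(n_item)              # item difficulties

# 2PL probabilities: entry (i, j) is plogis(a_j * (theta_i - b_j))
prob <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))

# Draw a 0/1 response matrix with the same shape as the probability matrix
draw <- function(p) matrix(rbinom(length(p), 1, p), nrow = nrow(p))

O   <- draw(prob)                  # observed human responses
P_a <- O                           # Case A: perfect paired prediction
P_b <- O[sample(n_person), ]       # Case B: perfect predictions, rows shuffled
P_c <- draw(prob)                  # Case C: same DGP, fresh Bernoulli draw

# Average per-item slope Cov(O_j, P_j) / Var(P_j)
mean_slope <- function(O, P) {
  mean(sapply(seq_len(ncol(O)), function(j) cov(O[, j], P[, j]) / var(P[, j])))
}

slopes <- c(A = mean_slope(O, P_a),
            B = mean_slope(O, P_b),
            C = mean_slope(O, P_c))
print(round(slopes, 3))
```

Shuffling rows leaves every item marginal unchanged but destroys the person-level pairing, so the slope drops to roughly zero. A fresh draw keeps only the covariance induced by shared ability, which lands between the two extremes.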

Readme and manuals

Help Manual

Help page | Topics
Gradient of ML ability scores with respect to item parameters | ability_gradient
Gradient of ML ability scores w.r.t. 1PL item parameters | ability_gradient_1pl
Propagated ability risk from item-parameter uncertainty | ability_risk
Propagated ability risk for a 1PL fit | ability_risk_1pl
Diagnose lambda values over a grid | diagnose_lambda_grid
Fit a 1PL (one-parameter logistic) model | fit_1pl
Fit a unidimensional 2PL IRT model | fit_2pl
Fit a mixed-subjects 2PL calibration | fit_mixed_subjects
Fit a mixed-subjects 1PL calibration (frozen expected-count) | fit_mixed_subjects_1pl
Fit from precomputed quadrature summaries | fit_mixed_subjects_from_quadrature
Fit a mixed-subjects 2PL calibration with iterative EM | fit_mixed_subjects_iterative
Fit a mixed-subjects 2PL calibration via marginal maximum likelihood | fit_mixed_subjects_mml
Fit a mixed-subjects 1PL calibration via marginal maximum likelihood | fit_mixed_subjects_mml_1pl
Fit a split-sample mixed-subjects 2PL calibration | fit_mixed_subjects_split
Link item parameters onto a target scale | link_item_parameters
Create a standard-normal Gauss-Hermite quadrature grid | make_quadrature
Mixed-subjects objective function | mixed_subjects_loss
Convert responses to quadrature form | mixed_subjects_quadrature
Compute posterior quadrature weights for a 2PL model | posterior_weights_2pl
Estimate ability scores from a 2PL calibration | score_theta
Simulate 2PL item responses | simulate_2pl
Summarize response data as expected quadrature counts | summarize_expected_counts
Tune lambda by downstream ability-score risk | tune_lambda_ability_risk
Tune lambda by downstream ability-score risk for a 1PL model | tune_lambda_ability_risk_1pl
Cross-fit ability-score-risk lambda tuning | tune_lambda_ability_risk_crossfit
Per-item ability-risk lambda tuning via coordinate descent | tune_lambda_ability_risk_item
Plug-in PPI++ optimal tuning parameter | tune_lambda_ppi_score
Plug-in PPI++ optimal tuning parameter for a 1PL model | tune_lambda_ppi_score_1pl
Per-item PPI++ optimal tuning parameters | tune_lambda_ppi_score_item
Sandwich covariance for a mixed-subjects fit | vcov_mixed_subjects
Sandwich covariance for a 1PL mixed-subjects fit | vcov_mixed_subjects_1pl
Marginal-MML sandwich covariance for a mixed-subjects fit | vcov_mixed_subjects_mml
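
The help index mentions a standard-normal Gauss-Hermite quadrature grid, which is what marginal maximum likelihood uses to integrate over ability. For readers unfamiliar with the idea, here is a minimal base-R sketch using the Golub-Welsch eigenvalue construction. It is only an illustration: `gh_standard_normal` is a hypothetical helper, not the package's `make_quadrature`, whose arguments and return value are not shown on this page, and the item parameters `a` and `b` are made up.

```r
# Gauss-Hermite nodes and weights for a standard-normal ability distribution,
# built with the Golub-Welsch method. Hypothetical helper, base R only.
gh_standard_normal <- function(n_nodes) {
  stopifnot(n_nodes >= 2)
  k <- seq_len(n_nodes - 1)
  # Symmetric Jacobi matrix for probabilists' Hermite polynomials:
  # zero diagonal, off-diagonal entries sqrt(1), sqrt(2), ..., sqrt(n_nodes - 1)
  jacobi <- matrix(0, n_nodes, n_nodes)
  jacobi[cbind(k, k + 1)] <- sqrt(k)
  jacobi[cbind(k + 1, k)] <- sqrt(k)
  eig <- eigen(jacobi, symmetric = TRUE)
  ord <- order(eig$values)
  # Nodes are the eigenvalues; weights are the squared first components of the
  # unit eigenvectors, which sum to 1 for the standard-normal density.
  list(nodes = eig$values[ord], weights = eig$vectors[1, ord]^2)
}

quad <- gh_standard_normal(21)

# Example: marginal probability of a correct response to a single 2PL item,
# integrating plogis(a * (theta - b)) over theta ~ N(0, 1)
a <- 1.2
b <- 0.5
marginal_p <- sum(quad$weights * plogis(a * (quad$nodes - b)))
print(marginal_p)
```

With 21 nodes the rule reproduces low-order normal moments exactly up to rounding (for example, the weighted sum of squared nodes is 1), which gives a quick sanity check on any such grid.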