The mixedsubjects package provides tools for analyzing
randomized experiments that combine traditional human subjects data with
predictions from large language models (LLMs) or other machine learning
algorithms. This approach, which we call a Mixed-Subjects Design
(MSD), can substantially increase the precision of treatment
effect estimates while maintaining valid statistical inference.
Running experiments with human subjects is expensive. A typical survey response might cost several dollars, and detecting small treatment effects often requires thousands of participants. Meanwhile, LLMs can generate predictions about how humans might respond to experimental treatments at a fraction of the cost—often less than a cent per prediction.
The key insight behind mixed-subjects designs is that LLM predictions, even if imperfect, contain information about human behavior. By combining a smaller sample of actual human responses (the “gold standard”) with a larger pool of LLM predictions, we can:
Importantly, this approach does not assume LLM predictions are accurate. Even if predictions are biased, the methods in this package automatically adjust for this bias using the human subjects data.
The mixedsubjects package implements seven estimators
for the Average Treatment Effect (ATE) in mixed-subjects designs:
| Estimator | Description | Predictions Needed |
|---|---|---|
| DiM | Difference-in-Means | None |
| GREG | Calibration estimator | 1 per unit |
| PPI++ | Power-tuned, single \(\lambda\) | 1 per unit |
| D-T | Doubly-tuned, arm-specific \(\lambda\) | 1 per unit |
| DiP | Difference-in-Predictions | 2 per unit |
| DiP++ | Power-tuned DiP | 2 per unit |
| D-T DiP | Doubly-tuned DiP | 2 per unit |
The package also helps you design optimal experiments by determining how to allocate your budget between human subjects and LLM predictions.
Let’s start with a simple example to see the package in action.
Imagine you’re running a survey experiment to test whether a persuasive message changes attitudes. You have:
set.seed(123)
# True treatment effect
true_ate <- 0.3
# Human subjects data (observed)
n_observed <- 200
observed_df <- data.frame(
# Treatment assignment (balanced)
D = rep(c(1, 0), each = n_observed / 2)
)
# Generate outcomes: Y(0) ~ N(0, 1), Y(1) ~ N(0.3, 1)
observed_df$Y <- ifelse(
observed_df$D == 1,
rnorm(n_observed / 2, mean = true_ate, sd = 1),
rnorm(n_observed / 2, mean = 0, sd = 1)
)
# LLM predictions (correlated with true outcomes, but with some error)
# S^(1) predicts Y(1), S^(0) predicts Y(0)
observed_df$S1 <- 0.6 * observed_df$Y + rnorm(n_observed, 0, 0.5)
observed_df$S0 <- 0.5 * observed_df$Y + rnorm(n_observed, 0, 0.6)
# Unobserved units (only have predictions, no actual Y)
# Predictions must come from the same pipeline as observed (Assumption C: Random Labeling)
n_unobserved <- 1000
D_unobs <- rep(c(1, 0), each = n_unobserved / 2)
latent_Y <- ifelse(D_unobs == 1,
rnorm(n_unobserved / 2, mean = true_ate, sd = 1),
rnorm(n_unobserved / 2, mean = 0, sd = 1))
unobserved_df <- data.frame(
D = D_unobs,
S1 = 0.6 * latent_Y + rnorm(n_unobserved, 0, 0.5),
S0 = 0.5 * latent_Y + rnorm(n_unobserved, 0, 0.6)
)First, we combine our data into an msd_data object:
msd <- msd_data(observed = observed_df, unobserved = unobserved_df)
print(msd)
#>
#> Mixed-Subjects Design Data
#> ==========================
#>
#> Sample Sizes:
#> Observed (labeled): 200
#> - Treated (D=1): 100
#> - Control (D=0): 100
#> Unobserved (unlabeled): 1000
#>
#> Predictions Available:
#> S0 (control arm): Yes
#> S1 (treatment arm): Yes
#>
#> Column Mapping (original names):
#> Observed: Y=Y, D=D, S0=S0, S1=S1
#> Unobserved: D=D, S0=S0, S1=S1
#>
#> Available Estimators:
#> - DiM: Yes (no predictions needed)
#> - GREG: Yes
#> - PPI++: Yes
#> - D-T: Yes
#> - DiP: Yes (both S0 and S1 available)
#> - DiP++: Yes
#> - D-T DiP: YesNow let’s estimate the ATE using different estimators:
# Classical difference-in-means (ignores predictions)
result_dim <- msd_dim(msd)
# D-T DiP (uses both predictions with arm-specific tuning)
result_dt_dip <- msd_dt_dip(msd)
# Compare
cat("Difference-in-Means:\n")
#> Difference-in-Means:
cat(" Estimate:", round(result_dim$estimate, 3), "\n")
#> Estimate: 0.498
cat(" SE:", round(result_dim$se, 3), "\n")
#> SE: 0.133
cat(" 95% CI: [", round(result_dim$ci_lower, 3), ", ",
round(result_dim$ci_upper, 3), "]\n\n")
#> 95% CI: [ 0.237 , 0.759 ]
cat("D-T DiP:\n")
#> D-T DiP:
cat(" Estimate:", round(result_dt_dip$estimate, 3), "\n")
#> Estimate: 0.177
cat(" SE:", round(result_dt_dip$se, 3), "\n")
#> SE: 0.096
cat(" 95% CI: [", round(result_dt_dip$ci_lower, 3), ", ",
round(result_dt_dip$ci_upper, 3), "]\n")
#> 95% CI: [ -0.012 , 0.366 ]Notice that the D-T DiP estimator has a smaller standard error than difference-in-means, even though both use the same human subjects data. This variance reduction comes from leveraging the LLM predictions.
For a mixed-subjects design, you need:
Y: The outcome variableD: Treatment indicator (1 = treated, 0 = control)S1 and/or S0: LLM predictions (optional
for DiM)D: Treatment indicatorS1 and/or S0: LLM predictionsThe package distinguishes between two prediction setups:
Setup 1: Arm-specific predictions (1 prediction per unit)
Each unit only has the prediction for their assigned treatment condition: - Treated units have \(S^{(1)}\) (prediction of their outcome under treatment) - Control units have \(S^{(0)}\) (prediction of their outcome under control)
This is typical when you ask the LLM to predict each unit’s response to their actual assigned condition.
Setup 2: Both predictions (2 predictions per unit)
Each unit has predictions for both conditions: - \(S^{(1)}\): Prediction of outcome if treated - \(S^{(0)}\): Prediction of outcome if in control
This requires asking the LLM to predict each unit’s response under both conditions, which doubles the prediction cost but enables the DiP family of estimators.
The msd_data() function accepts data in two formats:
Option 1: Separate dataframes
msd <- msd_data(
observed = observed_df, # Must have Y, D, and optionally S0, S1
unobserved = unobserved_df # Must have D and S0, S1 (no Y)
)Option 2: Combined dataframe
If your data is in a single dataframe where unobserved units have
Y = NA:
# Combine into single dataframe
combined_df <- rbind(
observed_df,
data.frame(Y = NA, D = unobserved_df$D,
S0 = unobserved_df$S0, S1 = unobserved_df$S1)
)
# Create msd_data object
msd_combined <- msd_data(data = combined_df)
print(msd_combined)
#>
#> Mixed-Subjects Design Data
#> ==========================
#>
#> Sample Sizes:
#> Observed (labeled): 200
#> - Treated (D=1): 100
#> - Control (D=0): 100
#> Unobserved (unlabeled): 1000
#>
#> Predictions Available:
#> S0 (control arm): Yes
#> S1 (treatment arm): Yes
#>
#> Column Mapping (original names):
#> Observed: Y=Y, D=D, S0=S0, S1=S1
#> Unobserved: D=D, S0=S0, S1=S1
#>
#> Available Estimators:
#> - DiM: Yes (no predictions needed)
#> - GREG: Yes
#> - PPI++: Yes
#> - D-T: Yes
#> - DiP: Yes (both S0 and S1 available)
#> - DiP++: Yes
#> - D-T DiP: YesBy default, the package expects columns named Y
(outcome), D (treatment), S0 (control
prediction), and S1 (treatment prediction). If your columns
use different names, specify them explicitly:
# Custom column names
my_data <- data.frame(
response_var = rnorm(100),
is_treated = rep(c(1, 0), each = 50),
claude_pred_ctrl = rnorm(100),
claude_pred_trt = rnorm(100)
)
msd <- msd_data(
observed = my_data,
outcome = "response_var",
treatment = "is_treated",
pred_control = "claude_pred_ctrl",
pred_treated = "claude_pred_trt"
)You can even use different column names for observed and unobserved data:
For even more flexibility, you can use a formula interface directly
in the estimator functions. This is similar to the syntax used by
ivreg and other regression packages:
# Formula: outcome ~ treatment | predictions
# Using custom column names directly in the estimator
# For GREG-type estimators (single prediction per arm)
result <- msd_greg(response ~ treatment | pred_1 + pred_0,
observed = obs_df, unobserved = unobs_df)
# For DiP-type estimators (both predictions)
result <- msd_dt_dip(Y ~ D | S1 + S0,
observed = obs_df, unobserved = unobs_df)The formula has three parts: 1. Left of
~: The outcome variable name 2. Between
~ and |: The treatment indicator
variable name 3. After |: The prediction
variable name(s)
For DiM (which doesn’t use predictions), you can omit the prediction part:
The classical estimator that ignores LLM predictions entirely:
\[\hat{\tau}^{DiM} = \bar{Y}_1 - \bar{Y}_0\]
Use this as your baseline. It’s unbiased and requires no predictions.
GREG uses predictions to “calibrate” the estimate:
\[\hat{\tau}^{GREG} = \left[\bar{S}^{(1)}_U + (\bar{Y}_1 - \bar{S}^{(1)}_{O_1})\right] - \left[\bar{S}^{(0)}_U + (\bar{Y}_0 - \bar{S}^{(0)}_{O_0})\right]\]
This adjusts the prediction mean by the observed “rectifier” (the average prediction error in the labeled data).
result <- msd_greg(msd)
print(result)
#>
#> Mixed-Subjects Design Estimation
#> =================================
#> Estimator: GREG (lambda = 1)
#>
#> Point Estimate: 0.3307
#> Standard Error: 0.1073
#> 95% CI: [0.1203, 0.5410]
#>
#> Tuning Parameters:
#> lambda: 1.0000
#>
#> Sample Sizes:
#> Observed: n_1=100, n_0=100
#> Unobserved: |U|=1000PPI++ improves on GREG by estimating an optimal tuning parameter \(\lambda\):
\[\hat{\mu}_d^{PPI}(\lambda) = \bar{Y}_d + \lambda(\bar{S}^{(d)}_U - \bar{S}^{(d)}_O)\]
When \(\lambda = 1\), this equals GREG. PPI++ estimates \(\lambda\) to minimize variance using cross-fitting.
result <- msd_ppi(msd, n_folds = 2)
print(result)
#>
#> Mixed-Subjects Design Estimation
#> =================================
#> Estimator: PPI++ (cross-fit, K=2)
#>
#> Point Estimate: 0.3617
#> Standard Error: 0.1033
#> 95% CI: [0.1593, 0.5642]
#>
#> Tuning Parameters:
#> lambda: 0.8222
#>
#> Sample Sizes:
#> Observed: n_1=100, n_0=100
#> Unobserved: |U|=1000D-T uses separate tuning parameters for each arm (\(\lambda_1\) for treatment, \(\lambda_0\) for control):
\[\hat{\tau}^{D-T} = \hat{\mu}_1^{PPI}(\lambda_1) - \hat{\mu}_0^{PPI}(\lambda_0)\]
This is useful when prediction quality differs between treatment and control conditions.
result <- msd_dt(msd, n_folds = 2)
print(result)
#>
#> Mixed-Subjects Design Estimation
#> =================================
#> Estimator: D-T (Doubly-Tuned, cross-fit, K=2)
#>
#> Point Estimate: 0.3457
#> Standard Error: 0.1037
#> 95% CI: [0.1425, 0.5489]
#>
#> Tuning Parameters:
#> lambda_1 (treatment): 0.9039
#> lambda_0 (control): 0.7806
#>
#> Sample Sizes:
#> Observed: n_1=100, n_0=100
#> Unobserved: |U|=1000DiP uses the contrast \(S^{(1)} - S^{(0)}\) for each unlabeled unit:
\[\hat{\tau}^{DiP} = \overline{S^{(1)} - S^{(0)}}_U + (\bar{Y}_1 - \bar{S}^{(1)}_{O_1}) - (\bar{Y}_0 - \bar{S}^{(0)}_{O_0})\]
Key insight: When \(S^{(1)}\) and \(S^{(0)}\) are positively correlated (which they often are—units that score high under treatment also tend to score high under control), the variance of their difference is smaller than the sum of their individual variances.
result <- msd_dip(msd)
print(result)
#>
#> Mixed-Subjects Design Estimation
#> =================================
#> Estimator: DiP (Difference-in-Predictions, lambda = 1)
#>
#> Point Estimate: 0.1454
#> Standard Error: 0.0976
#> 95% CI: [-0.0459, 0.3367]
#>
#> Tuning Parameters:
#> lambda: 1.0000
#>
#> Sample Sizes:
#> Observed: n_1=100, n_0=100
#> Unobserved: |U|=1000DiP++ applies power-tuning to the DiP estimator with a single \(\lambda\):
\[\hat{\tau}^{DiP++}(\lambda) = \lambda \cdot \overline{S^{(1)} - S^{(0)}}_U + (\bar{Y}_1 - \lambda\bar{S}^{(1)}_{O_1}) - (\bar{Y}_0 - \lambda\bar{S}^{(0)}_{O_0})\]
result <- msd_dip_pp(msd, n_folds = 2)
print(result)
#>
#> Mixed-Subjects Design Estimation
#> =================================
#> Estimator: DiP++ (cross-fit, K=2)
#>
#> Point Estimate: 0.1891
#> Standard Error: 0.0965
#> 95% CI: [0.0000, 0.3782]
#>
#> Tuning Parameters:
#> lambda: 0.8804
#>
#> Sample Sizes:
#> Observed: n_1=100, n_0=100
#> Unobserved: |U|=1000The most flexible estimator combines DiP with arm-specific tuning:
\[\hat{\tau}^{D-T\ DiP} = \overline{\lambda_1 S^{(1)} - \lambda_0 S^{(0)}}_U + (\bar{Y}_1 - \lambda_1\bar{S}^{(1)}_{O_1}) - (\bar{Y}_0 - \lambda_0\bar{S}^{(0)}_{O_0})\]
result <- msd_dt_dip(msd, n_folds = 2)
print(result)
#>
#> Mixed-Subjects Design Estimation
#> =================================
#> Estimator: D-T DiP (cross-fit, K=2)
#>
#> Point Estimate: 0.1699
#> Standard Error: 0.0965
#> 95% CI: [-0.0192, 0.3589]
#>
#> Tuning Parameters:
#> lambda_1 (treatment): 0.9518
#> lambda_0 (control): 0.8377
#>
#> Sample Sizes:
#> Observed: n_1=100, n_0=100
#> Unobserved: |U|=1000Use estimate_all() to run all applicable estimators at
once:
all_results <- estimate_all(msd)
print(all_results)
#>
#> Mixed-Subjects Design: All Estimators
#> ======================================
#>
#> Estimator Estimate SE 95% CI Lower
#> Difference-in-Means (DiM) 0.4980 0.1330 0.2373
#> GREG (lambda = 1) 0.3307 0.1073 0.1203
#> PPI++ (cross-fit, K=2) 0.3664 0.1032 0.1643
#> D-T (Doubly-Tuned, cross-fit, K=2) 0.3585 0.1034 0.1557
#> DiP (Difference-in-Predictions, lambda = 1) 0.1454 0.0976 -0.0459
#> DiP++ (cross-fit, K=2) 0.1893 0.0965 0.0002
#> D-T DiP (cross-fit, K=2) 0.1870 0.0964 -0.0020
#> 95% CI Upper
#> 0.7586
#> 0.5410
#> 0.5686
#> 0.5612
#> 0.3367
#> 0.3785
#> 0.3760Here’s a practical guide for choosing an estimator:
Do you have LLM predictions?
├── No → Use DiM
└── Yes → Do you have BOTH S^(1) and S^(0) for each unit?
├── No (only arm-specific) →
│ └── Is prediction quality similar across arms?
│ ├── Yes → Use PPI++
│ └── No → Use D-T
└── Yes (both predictions) →
└── Are S^(1) and S^(0) positively correlated?
├── Yes → Use D-T DiP (recommended)
└── No/Unsure → Use D-T
The DiP family (DiP, DiP++, D-T DiP) is most beneficial when:
You can check the correlation in your pilot data:
A positive correlation (which is common) means DiP estimators will likely outperform GREG-type estimators.
All estimators use the delta-method for variance estimation by default. This provides fast, analytical variance estimates based on the formulas derived in the accompanying paper.
For robustness, you can also use the fold-respecting bootstrap:
# Bootstrap variance for D-T DiP
boot_result <- bootstrap_variance(
msd,
estimator = "dt_dip",
n_bootstrap = 500,
seed = 42
)
cat("Delta-method SE:", round(result_dt_dip$se, 4), "\n")
#> Delta-method SE: 0.0965
cat("Bootstrap SE:", round(boot_result$se, 4), "\n")
#> Bootstrap SE: 0.0975The bootstrap is computationally more expensive but provides a useful check on the delta-method variance.
Before running your experiment, you need to decide:
These decisions depend on: - Your total budget - The cost per human subject - The cost per LLM prediction - The quality of predictions (estimated from pilot data)
optimal_design()The optimal_design() function finds the allocation that
minimizes expected variance given your budget constraints:
design <- optimal_design(
pilot_data = msd, # Pilot study data
budget = 5000, # Total budget in dollars
cost_human = 5, # $5 per human response
cost_prediction = 0.01, # $0.01 per LLM prediction
treatment_prob = 0.5 # Balanced treatment assignment
)
print(design)
#>
#> Optimal Mixed-Subjects Design
#> ==============================
#>
#> Recommended Design:
#> Estimator: DT_DIP
#> Observed units: 970
#> - Treated: 485
#> - Control: 485
#> Unobserved units: 7500
#>
#> Expected Performance:
#> Variance: 0.0019
#> SE: 0.0431
#>
#> Budget:
#> Total: $5,000
#> Used: $5,000
#> Human: $5/observation
#> Predict: $0.01/prediction
#>
#> Comparison Across Estimators:
#> Estimator n_obs n_unobs SE
#> DIM 1000 0 0.0595
#> GREG 951 24500 0.0443
#> PPI 960 20000 0.0439
#> DT 960 20000 0.0438
#> DIP 960 10000 0.0438
#> DIP_PP 970 7500 0.0434
#> DT_DIP 970 7500 0.0431The output tells you:
The output also shows how different estimators perform at their optimal allocations. This helps you understand the trade-offs:
Here’s a recommended workflow for using mixed-subjects designs in your research:
Run a small pilot study with both human subjects and LLM predictions:
# Collect pilot data
pilot_observed <- collect_human_responses(n = 100)
pilot_observed$S1 <- get_llm_predictions(pilot_observed, condition = "treatment")
pilot_observed$S0 <- get_llm_predictions(pilot_observed, condition = "control")
pilot_unobserved <- generate_synthetic_units(n = 200)
pilot_unobserved$S1 <- get_llm_predictions(pilot_unobserved, condition = "treatment")
pilot_unobserved$S0 <- get_llm_predictions(pilot_unobserved, condition = "control")
pilot_msd <- msd_data(observed = pilot_observed, unobserved = pilot_unobserved)Examine how well LLM predictions correlate with actual outcomes:
# Check correlations
pilot_results <- estimate_all(pilot_msd)
print(pilot_results)
# Compare to DiM baseline
dim_se <- pilot_results$SE[pilot_results$Estimator == "Difference-in-Means (DiM)"]
best_se <- min(pilot_results$SE)
variance_reduction <- 1 - (best_se / dim_se)^2
cat("Potential variance reduction:", round(variance_reduction * 100, 1), "%\n")Use the pilot data to design your main study:
Collect data according to your design:
# Collect human responses
main_observed <- collect_human_responses(n = design$optimal_n_obs)
# Generate LLM predictions
main_observed$S1 <- get_llm_predictions(main_observed, "treatment")
main_observed$S0 <- get_llm_predictions(main_observed, "control")
main_unobserved <- generate_synthetic_units(n = design$optimal_n_unobs)
main_unobserved$S1 <- get_llm_predictions(main_unobserved, "treatment")
main_unobserved$S0 <- get_llm_predictions(main_unobserved, "control")
main_msd <- msd_data(observed = main_observed, unobserved = main_unobserved)Estimate the treatment effect using the recommended estimator:
A: That’s fine! The MSD estimators are designed to be valid even when predictions are biased. The human subjects data is used to “rectify” any systematic bias in the predictions. You will still get unbiased estimates with valid confidence intervals.
A: Any positive correlation helps, but more is better. As a rough guide: - \(\rho > 0.3\): Modest variance reduction - \(\rho > 0.5\): Substantial variance reduction - \(\rho > 0.7\): Large variance reduction
You can estimate this correlation from your pilot data.
A: Yes, for the tuned estimators (PPI++, D-T, DiP++,
D-T DiP). Cross-fitting ensures that the tuning parameters are estimated
independently of the data used to compute the estimate, which prevents
bias. The default n_folds = 2 works well in most cases.
A: Use the D-T or D-T DiP estimators. These use arm-specific tuning parameters (\(\lambda_1\) and \(\lambda_0\)), which automatically adapt to different prediction quality across arms.
A: We recommend n_folds = 2 for most
applications. More folds use data more efficiently but increase
computational cost. With 2 folds, half the labeled data is used to
estimate tuning parameters and half for estimation in each fold.
A: The current package supports one set of predictions. If you have multiple LLMs, you could: 1. Use the single best-performing LLM 2. Average predictions across LLMs before using them 3. Use the LLM whose predictions have the highest correlation with outcomes
The MSD estimators require:
For reference, the variance formulas for key estimators:
DiM: \[Var(\hat{\tau}^{DiM}) = \frac{\sigma^2_{Y(1)}}{n_1} + \frac{\sigma^2_{Y(0)}}{n_0}\]
GREG: \[Var(\hat{\tau}^{GREG}) = \sum_{d \in \{0,1\}} \left[\frac{\sigma^2_{S^{(d)}}}{m_d} + \frac{Var(Y(d) - S^{(d)})}{n_d}\right]\]
DiP: \[Var(\hat{\tau}^{DiP}) = \frac{Var(S^{(1)} - S^{(0)})}{m} + \sum_{d \in \{0,1\}} \frac{Var(Y(d) - S^{(d)})}{n_d}\]
The tuned estimators (PPI++, D-T, DiP++, D-T DiP) have more complex variance formulas that account for the estimation of tuning parameters.
If you use this package in your research, please cite:
van Loon, A. and Kanopka, K. (2025). Causal Inference in Experiments with
Mixed-Subjects Designs. R package version 0.1.0.
sessionInfo()
#> R version 4.6.1 (2026-06-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 26.04 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] mixedsubjects_1.0.0 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.39 R6_2.6.1 fastmap_1.2.0 xfun_0.59
#> [5] maketools_1.3.2 cachem_1.1.0 knitr_1.51 htmltools_0.5.9
#> [9] buildtools_1.0.0 lifecycle_1.0.5 cli_3.6.6 sass_0.4.10
#> [13] jquerylib_0.1.4 compiler_4.6.1 sys_3.4.3 tools_4.6.1
#> [17] evaluate_1.0.5 bslib_0.11.0 yaml_2.3.12 otel_0.2.0
#> [21] jsonlite_2.0.0 rlang_1.2.0