| Title: | Causal Inference in Experiments with Mixed-Subjects Designs |
|---|---|
| Description: | Implements seven estimators for average treatment effect (ATE) estimation in mixed-subjects designs (MSDs), where human subjects data is augmented with predictions from large language models (LLMs). Includes Difference-in-Means, GREG, PPI++, Doubly-Tuned, Difference-in-Predictions (DiP), DiP++, and D-T DiP estimators. Provides point estimates, variance estimation via delta-method or bootstrap, and optimal design selection for budget allocation between human observations and LLM predictions. |
| Authors: | Austin van Loon [aut], Klint Kanopka [aut, cre], Yuan Huang [ctb] |
| Maintainer: | Klint Kanopka <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.0 |
| Built: | 2026-07-02 21:35:12 UTC |
| Source: | https://github.com/cran/mixedsubjects |
Variance estimation methods including fold-respecting bootstrap. Bootstrap Variance Estimation
Computes bootstrap variance estimates for MSD estimators using a fold-respecting resampling procedure.
bootstrap_variance( data, estimator = c("dim", "greg", "ppi", "dt", "dip", "dip_pp", "dt_dip"), n_bootstrap = 1000, n_folds = 2, conf_level = 0.95, seed = NULL )bootstrap_variance( data, estimator = c("dim", "greg", "ppi", "dt", "dip", "dip_pp", "dt_dip"), n_bootstrap = 1000, n_folds = 2, conf_level = 0.95, seed = NULL )
data |
An msd_data object created by |
estimator |
Character string specifying the estimator. One of: "dim", "greg", "ppi", "dt", "dip", "dip_pp", "dt_dip" |
n_bootstrap |
Number of bootstrap replications (default 1000) |
n_folds |
Number of folds for cross-fitting (default 2, used for tuned estimators) |
conf_level |
Confidence level for confidence intervals (default 0.95) |
seed |
Random seed for reproducibility (optional) |
The fold-respecting bootstrap resamples within each stratum:
Resample observed treatment units with replacement
Resample observed control units with replacement
Resample unlabeled units with replacement
Recompute the estimator on the bootstrap sample
For cross-fit estimators, the fold assignments are regenerated for each bootstrap replicate to properly account for the cross-fitting variance.
A list containing:
estimate |
Point estimate from the original data |
variance |
Bootstrap variance estimate |
se |
Bootstrap standard error |
ci_lower, ci_upper
|
Bootstrap percentile confidence interval |
bootstrap_estimates |
Vector of bootstrap estimates |
# Create sample data obs_df <- data.frame( Y = rnorm(100), S0 = rnorm(100), S1 = rnorm(100), D = rep(c(1, 0), each = 50) ) unobs_df <- data.frame( S0 = rnorm(200), S1 = rnorm(200), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) # Bootstrap variance for D-T DiP boot_result <- bootstrap_variance(msd, "dt_dip", n_bootstrap = 100, seed = 1) print(boot_result)# Create sample data obs_df <- data.frame( Y = rnorm(100), S0 = rnorm(100), S1 = rnorm(100), D = rep(c(1, 0), each = 50) ) unobs_df <- data.frame( S0 = rnorm(200), S1 = rnorm(200), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) # Bootstrap variance for D-T DiP boot_result <- bootstrap_variance(msd, "dt_dip", n_bootstrap = 100, seed = 1) print(boot_result)
Computes and compares variance estimates using both delta-method and bootstrap for a given estimator.
compare_variance_methods( data, estimator, n_bootstrap = 1000, n_folds = 2, seed = NULL )compare_variance_methods( data, estimator, n_bootstrap = 1000, n_folds = 2, seed = NULL )
data |
An msd_data object |
estimator |
Character string specifying the estimator |
n_bootstrap |
Number of bootstrap replications |
n_folds |
Number of folds for cross-fitting |
seed |
Random seed |
A data frame comparing variance estimates
obs_df <- data.frame( Y = rnorm(100), S0 = rnorm(100), S1 = rnorm(100), D = rep(c(1, 0), each = 50) ) unobs_df <- data.frame( S0 = rnorm(200), S1 = rnorm(200), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) comparison <- compare_variance_methods(msd, "dt_dip", n_bootstrap = 100, seed = 1) print(comparison)obs_df <- data.frame( Y = rnorm(100), S0 = rnorm(100), S1 = rnorm(100), D = rep(c(1, 0), each = 50) ) unobs_df <- data.frame( S0 = rnorm(200), S1 = rnorm(200), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) comparison <- compare_variance_methods(msd, "dt_dip", n_bootstrap = 100, seed = 1) print(comparison)
Runs all applicable estimators on the data and returns a summary table.
estimate_all(data, n_folds = 2, conf_level = 0.95)estimate_all(data, n_folds = 2, conf_level = 0.95)
data |
An msd_data object |
n_folds |
Number of folds for cross-fitting (default 2) |
conf_level |
Confidence level (default 0.95) |
A data frame with estimates from all applicable estimators
obs_df <- data.frame( Y = rnorm(100), S0 = rnorm(100), S1 = rnorm(100), D = rep(c(1, 0), each = 50) ) unobs_df <- data.frame( S0 = rnorm(200), S1 = rnorm(200), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) all_estimates <- estimate_all(msd) print(all_estimates)obs_df <- data.frame( Y = rnorm(100), S0 = rnorm(100), S1 = rnorm(100), D = rep(c(1, 0), each = 50) ) unobs_df <- data.frame( S0 = rnorm(200), S1 = rnorm(200), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) all_estimates <- estimate_all(msd) print(all_estimates)
Create and validate data objects for MSD estimation. Create an MSD data object
Creates a validated data object for mixed-subjects design estimation. Accepts either a single combined dataframe or two separate dataframes for observed and unobserved units. Column names default to "Y", "D", "S0", "S1" and can be overridden explicitly.
msd_data( data = NULL, observed = NULL, unobserved = NULL, outcome = "Y", treatment = "D", pred_control = "S0", pred_treated = "S1", obs_outcome = NULL, obs_treatment = NULL, obs_pred_control = NULL, obs_pred_treated = NULL, unobs_treatment = NULL, unobs_pred_control = NULL, unobs_pred_treated = NULL )msd_data( data = NULL, observed = NULL, unobserved = NULL, outcome = "Y", treatment = "D", pred_control = "S0", pred_treated = "S1", obs_outcome = NULL, obs_treatment = NULL, obs_pred_control = NULL, obs_pred_treated = NULL, unobs_treatment = NULL, unobs_pred_control = NULL, unobs_pred_treated = NULL )
data |
A combined dataframe containing both observed and unobserved units.
Observed units have non-missing Y values; unobserved units have Y = NA.
If provided, |
observed |
A dataframe of observed (labeled) units with columns for
outcome Y, predictions S0/S1, and treatment D. If provided, |
unobserved |
A dataframe of unobserved (unlabeled) units with columns for predictions S0/S1 and treatment D (no Y column needed). |
outcome |
Name of the outcome column. Default: "Y". |
treatment |
Name of the treatment column. Default: "D". |
pred_control |
Name of the control prediction column. Default: "S0". |
pred_treated |
Name of the treatment prediction column. Default: "S1". |
obs_outcome |
Outcome column name for observed data only (overrides |
obs_treatment |
Treatment column name for observed data only (overrides |
obs_pred_control |
Control prediction column for observed data only. |
obs_pred_treated |
Treatment prediction column for observed data only. |
unobs_treatment |
Treatment column name for unobserved data only. |
unobs_pred_control |
Control prediction column for unobserved data only. |
unobs_pred_treated |
Treatment prediction column for unobserved data only. |
The function supports flexible column name specification:
Default column names:
By default, the function expects columns named "Y" (outcome), "D" (treatment),
"S0" (control prediction), and "S1" (treatment prediction). Override these
using the outcome, treatment, pred_control, and
pred_treated arguments.
Global specification:
Use outcome, treatment, pred_control, pred_treated
to set column names that apply to both observed and unobserved dataframes.
Per-dataframe specification:
Use obs_* and unobs_* arguments when column names differ
between observed and unobserved dataframes. These override global settings.
Two input modes:
Mode 1: Single combined dataframe
Provide a single dataframe via the data argument. The function will
automatically split it into observed and unobserved based on whether Y is NA.
Mode 2: Separate dataframes
Provide two separate dataframes via observed and unobserved.
An S3 object of class "msd_data" containing:
observed |
Dataframe of observed units with standardized columns Y, S0, S1, D |
unobserved |
Dataframe of unobserved units with standardized columns S0, S1, D (or NULL) |
has_S0 |
Logical indicating if S0 predictions are available |
has_S1 |
Logical indicating if S1 predictions are available |
has_both_predictions |
Logical indicating if both S0 and S1 are available |
n1 |
Number of treated observed units |
n0 |
Number of control observed units |
m |
Number of unobserved units |
col_mapping |
List of original column names used |
# Default column names (Y, D, S0, S1) obs_df <- data.frame( Y = c(1.2, 0.8, 1.5, 0.9), S0 = c(1.0, 0.7, 1.3, 0.8), S1 = c(1.1, 0.9, 1.4, 1.0), D = c(1, 0, 1, 0) ) msd <- msd_data(observed = obs_df) # Custom column names (same in both dataframes) obs_df2 <- data.frame( response = c(1.2, 0.8, 1.5, 0.9), pred_ctrl = c(1.0, 0.7, 1.3, 0.8), pred_trt = c(1.1, 0.9, 1.4, 1.0), treated = c(1, 0, 1, 0) ) unobs_df2 <- data.frame( pred_ctrl = c(1.1, 0.9), pred_trt = c(1.2, 1.0), treated = c(1, 0) ) msd2 <- msd_data( observed = obs_df2, unobserved = unobs_df2, outcome = "response", treatment = "treated", pred_control = "pred_ctrl", pred_treated = "pred_trt" ) # Different column names in observed vs unobserved obs_df3 <- data.frame( outcome = c(1.2, 0.8), claude_pred_0 = c(1.0, 0.7), claude_pred_1 = c(1.1, 0.9), treatment = c(1, 0) ) unobs_df3 <- data.frame( s0_claude = c(1.1, 0.9), s1_claude = c(1.2, 1.0), D = c(1, 0) ) msd3 <- msd_data( observed = obs_df3, unobserved = unobs_df3, obs_outcome = "outcome", obs_treatment = "treatment", obs_pred_control = "claude_pred_0", obs_pred_treated = "claude_pred_1", unobs_treatment = "D", unobs_pred_control = "s0_claude", unobs_pred_treated = "s1_claude" )# Default column names (Y, D, S0, S1) obs_df <- data.frame( Y = c(1.2, 0.8, 1.5, 0.9), S0 = c(1.0, 0.7, 1.3, 0.8), S1 = c(1.1, 0.9, 1.4, 1.0), D = c(1, 0, 1, 0) ) msd <- msd_data(observed = obs_df) # Custom column names (same in both dataframes) obs_df2 <- data.frame( response = c(1.2, 0.8, 1.5, 0.9), pred_ctrl = c(1.0, 0.7, 1.3, 0.8), pred_trt = c(1.1, 0.9, 1.4, 1.0), treated = c(1, 0, 1, 0) ) unobs_df2 <- data.frame( pred_ctrl = c(1.1, 0.9), pred_trt = c(1.2, 1.0), treated = c(1, 0) ) msd2 <- msd_data( observed = obs_df2, unobserved = unobs_df2, outcome = "response", treatment = "treated", pred_control = "pred_ctrl", pred_treated = "pred_trt" ) # Different column names in observed vs unobserved obs_df3 <- data.frame( outcome = c(1.2, 0.8), claude_pred_0 = c(1.0, 0.7), claude_pred_1 = c(1.1, 0.9), treatment = c(1, 0) ) unobs_df3 <- data.frame( s0_claude = c(1.1, 0.9), s1_claude = c(1.2, 1.0), D = c(1, 0) ) msd3 <- msd_data( observed = obs_df3, unobserved = unobs_df3, obs_outcome = "outcome", obs_treatment = "treatment", obs_pred_control = "claude_pred_0", obs_pred_treated = "claude_pred_1", unobs_treatment = "D", unobs_pred_control = "s0_claude", unobs_pred_treated = "s1_claude" )
Classical difference-in-means estimator for ATE. Difference-in-Means Estimator
Computes the classical difference-in-means estimator for the average treatment effect (ATE). This estimator uses only observed (labeled) data and does not incorporate any predictions.
msd_dim( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, conf_level = 0.95 )msd_dim( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, conf_level = 0.95 )
formula_or_data |
Either an msd_data object created by |
data |
If |
observed |
If using formula with separate dataframes, the observed data. |
unobserved |
If using formula with separate dataframes, the unobserved data. |
conf_level |
Confidence level for the confidence interval (default 0.95) |
The difference-in-means estimator is:
where is the sample mean of outcomes in arm .
The variance is estimated as:
where is the sample variance of outcomes in arm .
An msd_result object containing:
estimate |
Point estimate of the ATE: mean(Y|D=1) - mean(Y|D=0) |
variance |
Estimated variance: var(Y|D=1)/n1 + var(Y|D=0)/n0 |
se |
Standard error |
ci_lower, ci_upper
|
Confidence interval bounds |
method |
Name of the estimation method |
# Using msd_data object (standard interface) obs_df <- data.frame( Y = c(1.2, 1.4, 0.8, 0.6, 1.1, 0.9, 1.3, 0.7), S0 = c(1.0, 1.2, 0.7, 0.5, 1.0, 0.8, 1.1, 0.6), S1 = c(1.1, 1.3, 0.9, 0.7, 1.1, 0.9, 1.2, 0.8), D = c(1, 1, 0, 0, 1, 0, 1, 0) ) msd <- msd_data(observed = obs_df) result <- msd_dim(msd) print(result) # Using formula interface with custom column names df <- data.frame( response = c(1.2, 1.4, 0.8, 0.6), treated = c(1, 1, 0, 0) ) result2 <- msd_dim(response ~ treated, observed = df)# Using msd_data object (standard interface) obs_df <- data.frame( Y = c(1.2, 1.4, 0.8, 0.6, 1.1, 0.9, 1.3, 0.7), S0 = c(1.0, 1.2, 0.7, 0.5, 1.0, 0.8, 1.1, 0.6), S1 = c(1.1, 1.3, 0.9, 0.7, 1.1, 0.9, 1.2, 0.8), D = c(1, 1, 0, 0, 1, 0, 1, 0) ) msd <- msd_data(observed = obs_df) result <- msd_dim(msd) print(result) # Using formula interface with custom column names df <- data.frame( response = c(1.2, 1.4, 0.8, 0.6), treated = c(1, 1, 0, 0) ) result2 <- msd_dim(response ~ treated, observed = df)
DiP estimator for ATE using paired predictions. DiP (Difference-in-Predictions) Estimator
Computes the Difference-in-Predictions (DiP) estimator for the average treatment effect (ATE). This estimator uses both treatment and control predictions for each unlabeled unit, computing the contrast S^(1) - S^(0) at the unit level.
msd_dip( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, conf_level = 0.95 )msd_dip( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, conf_level = 0.95 )
formula_or_data |
Either an msd_data object created by |
data |
If |
observed |
If using formula with separate dataframes, the observed data. |
unobserved |
If using formula with separate dataframes, the unobserved data. |
conf_level |
Confidence level for the confidence interval (default 0.95) |
The DiP estimator is:
An msd_result object containing:
estimate |
Point estimate of the ATE |
variance |
Estimated variance |
se |
Standard error |
ci_lower, ci_upper
|
Confidence interval bounds |
method |
Name of the estimation method |
lambda |
Tuning parameter (always 1 for DiP) |
DiP requires BOTH S0 and S1 predictions for ALL units. The key advantage of DiP over GREG is that when S^(1) and S^(0) are positively correlated, the variance of their difference is smaller.
# Using msd_data object obs_df <- data.frame( Y = c(1.2, 1.4, 0.8, 0.6), S0 = c(1.0, 1.2, 0.7, 0.5), S1 = c(1.1, 1.3, 0.9, 0.7), D = c(1, 1, 0, 0) ) unobs_df <- data.frame( S0 = c(1.1, 0.9, 1.0, 0.8), S1 = c(1.2, 1.0, 1.1, 0.9), D = c(1, 1, 0, 0) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_dip(msd) # Using formula interface result2 <- msd_dip(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)# Using msd_data object obs_df <- data.frame( Y = c(1.2, 1.4, 0.8, 0.6), S0 = c(1.0, 1.2, 0.7, 0.5), S1 = c(1.1, 1.3, 0.9, 0.7), D = c(1, 1, 0, 0) ) unobs_df <- data.frame( S0 = c(1.1, 0.9, 1.0, 0.8), S1 = c(1.2, 1.0, 1.1, 0.9), D = c(1, 1, 0, 0) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_dip(msd) # Using formula interface result2 <- msd_dip(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)
Power-tuned Difference-in-Predictions estimator for ATE. DiP++ Estimator
Computes the DiP++ (power-tuned difference-in-predictions) estimator for the average treatment effect (ATE). This estimator uses paired predictions S^(1) and S^(0) for each unlabeled unit, with a single tuning parameter lambda estimated via cross-fitting.
msd_dip_pp( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, n_folds = 2, conf_level = 0.95, seed = NULL )msd_dip_pp( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, n_folds = 2, conf_level = 0.95, seed = NULL )
formula_or_data |
Either an msd_data object created by |
data |
If |
observed |
If using formula with separate dataframes, the observed data. |
unobserved |
If using formula with separate dataframes, the unobserved data. |
n_folds |
Number of folds for cross-fitting (default 2) |
conf_level |
Confidence level for the confidence interval (default 0.95) |
seed |
Random seed for fold splitting (optional) |
The DiP++ estimator is:
An msd_result object containing:
estimate |
Point estimate of the ATE |
variance |
Estimated variance (delta-method) |
se |
Standard error |
ci_lower, ci_upper
|
Confidence interval bounds |
method |
Name of the estimation method |
lambda |
Estimated tuning parameter (single value) |
DiP++ requires BOTH S0 and S1 predictions for ALL units.
For arm-specific tuning, use msd_dt_dip.
# Using msd_data object set.seed(123) n <- 100 obs_df <- data.frame( Y = rnorm(n), D = rep(c(1, 0), each = n/2) ) obs_df$Y <- obs_df$Y + 0.3 * obs_df$D obs_df$S1 <- 0.5 * obs_df$Y + rnorm(n, 0, 0.5) obs_df$S0 <- 0.5 * obs_df$Y + rnorm(n, 0, 0.5) - 0.1 unobs_df <- data.frame( S0 = rnorm(200, 0, 0.5), S1 = rnorm(200, 0.2, 0.5), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_dip_pp(msd) # Using formula interface result2 <- msd_dip_pp(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)# Using msd_data object set.seed(123) n <- 100 obs_df <- data.frame( Y = rnorm(n), D = rep(c(1, 0), each = n/2) ) obs_df$Y <- obs_df$Y + 0.3 * obs_df$D obs_df$S1 <- 0.5 * obs_df$Y + rnorm(n, 0, 0.5) obs_df$S0 <- 0.5 * obs_df$Y + rnorm(n, 0, 0.5) - 0.1 unobs_df <- data.frame( S0 = rnorm(200, 0, 0.5), S1 = rnorm(200, 0.2, 0.5), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_dip_pp(msd) # Using formula interface result2 <- msd_dip_pp(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)
D-T estimator with arm-specific tuning parameters for ATE. D-T (Doubly-Tuned) Estimator
Computes the Doubly-Tuned (D-T) estimator for the average treatment effect (ATE). This estimator uses arm-specific tuning parameters (lambda_1 and lambda_0) estimated via cross-fitting.
msd_dt( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, n_folds = 2, conf_level = 0.95, seed = NULL )msd_dt( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, n_folds = 2, conf_level = 0.95, seed = NULL )
formula_or_data |
Either an msd_data object created by |
data |
If |
observed |
If using formula with separate dataframes, the observed data. |
unobserved |
If using formula with separate dataframes, the unobserved data. |
n_folds |
Number of folds for cross-fitting (default 2) |
conf_level |
Confidence level for the confidence interval (default 0.95) |
seed |
Random seed for fold splitting (optional) |
The D-T estimator uses arm-specific tuning parameters:
Each lambda_d is chosen to minimize the variance in arm d:
The tuning parameters are estimated via cross-fitting to avoid bias.
An msd_result object containing:
estimate |
Point estimate of the ATE |
variance |
Estimated variance (delta-method) |
se |
Standard error |
ci_lower, ci_upper
|
Confidence interval bounds |
method |
Name of the estimation method |
lambda |
Vector of arm-specific tuning parameters (lambda_1, lambda_0) |
D-T differs from PPI++ by using separate tuning parameters for each arm, which can improve efficiency when the prediction quality differs between treatment and control.
# Create sample data set.seed(123) n <- 100 obs_df <- data.frame( Y = rnorm(n), S0 = rnorm(n, 0, 0.5), S1 = rnorm(n, 0.2, 0.5), D = rep(c(1, 0), each = n/2) ) obs_df$Y <- obs_df$Y + 0.3 * obs_df$D obs_df$S1[obs_df$D == 1] <- obs_df$S1[obs_df$D == 1] + 0.5 * obs_df$Y[obs_df$D == 1] obs_df$S0[obs_df$D == 0] <- obs_df$S0[obs_df$D == 0] + 0.5 * obs_df$Y[obs_df$D == 0] unobs_df <- data.frame( S0 = rnorm(200, 0, 0.5), S1 = rnorm(200, 0.2, 0.5), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_dt(msd) # Using formula interface result2 <- msd_dt(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)# Create sample data set.seed(123) n <- 100 obs_df <- data.frame( Y = rnorm(n), S0 = rnorm(n, 0, 0.5), S1 = rnorm(n, 0.2, 0.5), D = rep(c(1, 0), each = n/2) ) obs_df$Y <- obs_df$Y + 0.3 * obs_df$D obs_df$S1[obs_df$D == 1] <- obs_df$S1[obs_df$D == 1] + 0.5 * obs_df$Y[obs_df$D == 1] obs_df$S0[obs_df$D == 0] <- obs_df$S0[obs_df$D == 0] + 0.5 * obs_df$Y[obs_df$D == 0] unobs_df <- data.frame( S0 = rnorm(200, 0, 0.5), S1 = rnorm(200, 0.2, 0.5), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_dt(msd) # Using formula interface result2 <- msd_dt(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)
D-T DiP estimator with arm-specific tuning for ATE. D-T DiP (Doubly-Tuned Difference-in-Predictions) Estimator
Computes the D-T DiP estimator for the average treatment effect (ATE). This estimator uses paired predictions S^(1) and S^(0) for each unlabeled unit, with arm-specific tuning parameters (lambda_1 and lambda_0) estimated via cross-fitting.
msd_dt_dip( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, n_folds = 2, conf_level = 0.95, seed = NULL )msd_dt_dip( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, n_folds = 2, conf_level = 0.95, seed = NULL )
formula_or_data |
Either an msd_data object created by |
data |
If |
observed |
If using formula with separate dataframes, the observed data. |
unobserved |
If using formula with separate dataframes, the unobserved data. |
n_folds |
Number of folds for cross-fitting (default 2) |
conf_level |
Confidence level for the confidence interval (default 0.95) |
seed |
Random seed for fold splitting (optional) |
The D-T DiP estimator is:
The tuning parameters are chosen jointly to
minimize the total variance. Because the two arms share the unobserved pool,
the variance is jointly quadratic in with a
cross-term from , so the optimum solves a coupled
2x2 linear system rather than two independent regressions.
The tuning parameters are estimated via cross-fitting:
Split labeled data into K folds
For each fold k, estimate (lambda_1, lambda_0) on opposite folds
Compute the fold-k estimate using estimated lambdas
Average across folds with equal weights
An msd_result object containing:
estimate |
Point estimate of the ATE |
variance |
Estimated variance (delta-method) |
se |
Standard error |
ci_lower, ci_upper
|
Confidence interval bounds |
method |
Name of the estimation method |
lambda |
Vector of arm-specific tuning parameters (lambda_1, lambda_0) |
D-T DiP requires BOTH S0 and S1 predictions for ALL units. This corresponds to 2 predictions per unlabeled unit.
D-T DiP combines the benefits of:
DiP: exploiting positive correlation between S^(1) and S^(0)
D-T: arm-specific tuning for heterogeneous prediction quality
# Create sample data with both predictions set.seed(123) n <- 100 obs_df <- data.frame( Y = rnorm(n), D = rep(c(1, 0), each = n/2) ) obs_df$Y <- obs_df$Y + 0.3 * obs_df$D obs_df$S1 <- 0.6 * obs_df$Y + rnorm(n, 0, 0.4) obs_df$S0 <- 0.4 * obs_df$Y + rnorm(n, 0, 0.5) unobs_df <- data.frame( S0 = rnorm(300, 0, 0.5), S1 = rnorm(300, 0.2, 0.4), D = rep(c(1, 0), 150) ) # Add correlation between S0 and S1 unobs_df$S1 <- unobs_df$S1 + 0.5 * unobs_df$S0 msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_dt_dip(msd) # Using formula interface result2 <- msd_dt_dip(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)# Create sample data with both predictions set.seed(123) n <- 100 obs_df <- data.frame( Y = rnorm(n), D = rep(c(1, 0), each = n/2) ) obs_df$Y <- obs_df$Y + 0.3 * obs_df$D obs_df$S1 <- 0.6 * obs_df$Y + rnorm(n, 0, 0.4) obs_df$S0 <- 0.4 * obs_df$Y + rnorm(n, 0, 0.5) unobs_df <- data.frame( S0 = rnorm(300, 0, 0.5), S1 = rnorm(300, 0.2, 0.4), D = rep(c(1, 0), 150) ) # Add correlation between S0 and S1 unobs_df$S1 <- unobs_df$S1 + 0.5 * unobs_df$S0 msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_dt_dip(msd) # Using formula interface result2 <- msd_dt_dip(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)
Generalized Regression (GREG) calibration estimator for ATE. GREG Estimator
Computes the Generalized Regression (GREG) calibration estimator for the
average treatment effect (ATE). This is the untuned calibration estimator
with tuning parameter . GREG, the original
Prediction-Powered Inference (PPI) estimator, and Design-based Supervised
Learning (DSL) are all the same untuned () estimator. It is
distinct from the power-tuned PPI++ estimator in msd_ppi, which
estimates via cross-fitting to minimize variance.
msd_greg( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, conf_level = 0.95 )msd_greg( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, conf_level = 0.95 )
formula_or_data |
Either an msd_data object created by |
data |
If |
observed |
If using formula with separate dataframes, the observed data. |
unobserved |
If using formula with separate dataframes, the unobserved data. |
conf_level |
Confidence level for the confidence interval (default 0.95) |
The GREG estimator for arm is:
The ATE estimate is:
The variance is:
An msd_result object containing:
estimate |
Point estimate of the ATE |
variance |
Estimated variance |
se |
Standard error |
ci_lower, ci_upper
|
Confidence interval bounds |
method |
Name of the estimation method |
lambda |
Tuning parameter (always 1 for GREG) |
GREG requires predictions for each unit's assigned arm:
Treatment arm units need S1
Control arm units need S0
# Using msd_data object obs_df <- data.frame( Y = c(1.2, 1.4, 0.8, 0.6), S0 = c(1.0, 1.2, 0.7, 0.5), S1 = c(1.1, 1.3, 0.9, 0.7), D = c(1, 1, 0, 0) ) unobs_df <- data.frame( S0 = c(1.1, 0.9, 1.0, 0.8), S1 = c(1.2, 1.0, 1.1, 0.9), D = c(1, 1, 0, 0) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_greg(msd) # Using formula interface result2 <- msd_greg(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)# Using msd_data object obs_df <- data.frame( Y = c(1.2, 1.4, 0.8, 0.6), S0 = c(1.0, 1.2, 0.7, 0.5), S1 = c(1.1, 1.3, 0.9, 0.7), D = c(1, 1, 0, 0) ) unobs_df <- data.frame( S0 = c(1.1, 0.9, 1.0, 0.8), S1 = c(1.2, 1.0, 1.1, 0.9), D = c(1, 1, 0, 0) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_greg(msd) # Using formula interface result2 <- msd_greg(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)
Power-tuned Prediction-Powered Inference estimator for ATE. PPI++ Estimator
Computes the PPI++ (power-tuned prediction-powered inference) estimator for the average treatment effect (ATE). This estimator uses a single tuning parameter lambda that is estimated via cross-fitting to minimize variance.
msd_ppi( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, n_folds = 2, conf_level = 0.95, seed = NULL )msd_ppi( formula_or_data, data = NULL, observed = NULL, unobserved = NULL, n_folds = 2, conf_level = 0.95, seed = NULL )
formula_or_data |
Either an msd_data object created by |
data |
If |
observed |
If using formula with separate dataframes, the observed data. |
unobserved |
If using formula with separate dataframes, the unobserved data. |
n_folds |
Number of folds for cross-fitting (default 2) |
conf_level |
Confidence level for the confidence interval (default 0.95) |
seed |
Random seed for fold splitting (optional) |
The PPI++ estimator uses a single tuning parameter lambda across both arms:
The tuning parameter is estimated via cross-fitting:
Split labeled data into K folds
For each fold k, estimate lambda on the opposite folds
Compute the fold-k estimate using the estimated lambda
Average across folds
Lambda is chosen to minimize the combined variance across arms.
An msd_result object containing:
estimate |
Point estimate of the ATE |
variance |
Estimated variance (delta-method) |
se |
Standard error |
ci_lower, ci_upper
|
Confidence interval bounds |
method |
Name of the estimation method |
lambda |
Estimated tuning parameter (single value) |
PPI++ uses a single lambda across arms, unlike D-T which uses arm-specific
tuning parameters. For arm-specific tuning, use msd_dt.
# Create sample data set.seed(123) n <- 100 obs_df <- data.frame( Y = rnorm(n), S0 = rnorm(n, 0, 0.5), S1 = rnorm(n, 0.2, 0.5), D = rep(c(1, 0), each = n/2) ) obs_df$Y <- obs_df$Y + 0.3 * obs_df$D obs_df$S1[obs_df$D == 1] <- obs_df$S1[obs_df$D == 1] + 0.5 * obs_df$Y[obs_df$D == 1] obs_df$S0[obs_df$D == 0] <- obs_df$S0[obs_df$D == 0] + 0.5 * obs_df$Y[obs_df$D == 0] unobs_df <- data.frame( S0 = rnorm(200, 0, 0.5), S1 = rnorm(200, 0.2, 0.5), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_ppi(msd) # Using formula interface result2 <- msd_ppi(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)# Create sample data set.seed(123) n <- 100 obs_df <- data.frame( Y = rnorm(n), S0 = rnorm(n, 0, 0.5), S1 = rnorm(n, 0.2, 0.5), D = rep(c(1, 0), each = n/2) ) obs_df$Y <- obs_df$Y + 0.3 * obs_df$D obs_df$S1[obs_df$D == 1] <- obs_df$S1[obs_df$D == 1] + 0.5 * obs_df$Y[obs_df$D == 1] obs_df$S0[obs_df$D == 0] <- obs_df$S0[obs_df$D == 0] + 0.5 * obs_df$Y[obs_df$D == 0] unobs_df <- data.frame( S0 = rnorm(200, 0, 0.5), S1 = rnorm(200, 0.2, 0.5), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_ppi(msd) # Using formula interface result2 <- msd_ppi(Y ~ D | S1 + S0, observed = obs_df, unobserved = unobs_df)
Find optimal budget allocation between human observations and predictions. Optimal Design Selection
Determines the optimal allocation of budget between human observations and LLM predictions, and recommends the best estimator for the given pilot data.
optimal_design( pilot_data, budget, cost_human, cost_prediction, treatment_prob = 0.5, estimators = "all", min_observed = 20, n_grid = 100 )optimal_design( pilot_data, budget, cost_human, cost_prediction, treatment_prob = 0.5, estimators = "all", min_observed = 20, n_grid = 100 )
pilot_data |
An msd_data object from a pilot study |
budget |
Total budget available (in dollars) |
cost_human |
Cost per human observation (in dollars) |
cost_prediction |
Cost per LLM prediction (in dollars) |
treatment_prob |
Probability of treatment assignment (default 0.5) |
estimators |
Which estimators to consider. Either "all" or a character vector with subset of: "dim", "greg", "ppi", "dt", "dip", "dip_pp", "dt_dip" |
min_observed |
Minimum number of observed units required (default 20) |
n_grid |
Number of grid points for optimization (default 100) |
The function uses grid search to find the optimal (n_O, n_U) allocation that minimizes expected variance given the budget constraint:
where k is the number of predictions per unit:
k = 0 for DiM (no predictions)
k = 1 for GREG, PPI++, D-T (one prediction per arm)
k = 2 for DiP, DiP++, D-T DiP (both S^(0) and S^(1))
The expected variance is computed using population moments estimated from the pilot data.
An S3 object of class "msd_design" containing:
optimal_n_obs |
Recommended number of observed (human) units |
optimal_n_unobs |
Recommended number of unobserved (prediction) units |
optimal_estimator |
Recommended estimator |
optimal_variance |
Expected variance at the optimum |
optimal_se |
Expected standard error at the optimum |
budget_used |
Total budget used |
all_results |
Full grid search results for all estimators |
# Pilot study data pilot_obs <- data.frame( Y = rnorm(50), S0 = rnorm(50), S1 = rnorm(50), D = rep(c(1, 0), each = 25) ) pilot_unobs <- data.frame( S0 = rnorm(100), S1 = rnorm(100), D = rep(c(1, 0), each = 50) ) pilot <- msd_data(observed = pilot_obs, unobserved = pilot_unobs) # Find optimal design with $10,000 budget design <- optimal_design( pilot_data = pilot, budget = 10000, cost_human = 10, # $10 per human observation cost_prediction = 0.01 # $0.01 per prediction ) print(design)# Pilot study data pilot_obs <- data.frame( Y = rnorm(50), S0 = rnorm(50), S1 = rnorm(50), D = rep(c(1, 0), each = 25) ) pilot_unobs <- data.frame( S0 = rnorm(100), S1 = rnorm(100), D = rep(c(1, 0), each = 50) ) pilot <- msd_data(observed = pilot_obs, unobserved = pilot_unobs) # Find optimal design with $10,000 budget design <- optimal_design( pilot_data = pilot, budget = 10000, cost_human = 10, # $10 per human observation cost_prediction = 0.01 # $0.01 per prediction ) print(design)
Print method for msd_data
## S3 method for class 'msd_data' print(x, ...)## S3 method for class 'msd_data' print(x, ...)
x |
An msd_data object |
... |
Additional arguments (ignored) |
Invisibly returns x
Print method for msd_design
## S3 method for class 'msd_design' print(x, digits = 4, ...)## S3 method for class 'msd_design' print(x, digits = 4, ...)
x |
An msd_design object. |
digits |
Number of digits to display. |
... |
Additional arguments (ignored). |
Invisibly returns x.
Print method for msd_result
## S3 method for class 'msd_result' print(x, digits = 4, ...)## S3 method for class 'msd_result' print(x, digits = 4, ...)
x |
An msd_result object |
digits |
Number of digits to display |
... |
Additional arguments (ignored) |
Invisibly returns x
Print method for msd_summary
## S3 method for class 'msd_summary' print(x, digits = 4, ...)## S3 method for class 'msd_summary' print(x, digits = 4, ...)
x |
An msd_summary object. |
digits |
Number of digits to display. |
... |
Additional arguments (ignored). |
Invisibly returns x.
Produces nicely formatted output similar to summary.lm, designed for applied social scientists.
## S3 method for class 'summary.msd_result' print(x, digits = 4, ...)## S3 method for class 'summary.msd_result' print(x, digits = 4, ...)
x |
A summary.msd_result object |
digits |
Number of significant digits to display (default 4) |
... |
Additional arguments (ignored) |
Invisibly returns x
Summary method for msd_design
## S3 method for class 'msd_design' summary(object, ...)## S3 method for class 'msd_design' summary(object, ...)
object |
An msd_design object. |
... |
Additional arguments passed to the print method. |
Invisibly returns object.
Produces a detailed summary of the MSD estimation results, including the coefficient table with z-statistics and p-values, sample size information, and interpretation guidance.
## S3 method for class 'msd_result' summary(object, ...)## S3 method for class 'msd_result' summary(object, ...)
object |
An msd_result object |
... |
Additional arguments (ignored) |
A summary.msd_result object (invisibly when printed)
obs_df <- data.frame( Y = rnorm(100), S0 = rnorm(100), S1 = rnorm(100), D = rep(c(1, 0), each = 50) ) unobs_df <- data.frame( S0 = rnorm(200), S1 = rnorm(200), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_dt_dip(msd, seed = 1) summary(result)obs_df <- data.frame( Y = rnorm(100), S0 = rnorm(100), S1 = rnorm(100), D = rep(c(1, 0), each = 50) ) unobs_df <- data.frame( S0 = rnorm(200), S1 = rnorm(200), D = rep(c(1, 0), each = 100) ) msd <- msd_data(observed = obs_df, unobserved = unobs_df) result <- msd_dt_dip(msd, seed = 1) summary(result)