---
title: "Introduction to Mixed-Subjects Designs with the mixedsubjects Package"
author: "Austin van Loon and Klint Kanopka"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Introduction to Mixed-Subjects Designs with the mixedsubjects Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Overview

The `mixedsubjects` package provides tools for analyzing randomized experiments that combine traditional human subjects data with predictions from large language models (LLMs) or other machine learning algorithms. This approach, which we call a **Mixed-Subjects Design (MSD)**, can substantially increase the precision of treatment effect estimates while maintaining valid statistical inference.

### Why Mixed-Subjects Designs?

Running experiments with human subjects is expensive. A typical survey response might cost several dollars, and detecting small treatment effects often requires thousands of participants. Meanwhile, LLMs can generate predictions about how humans might respond to experimental treatments at a fraction of the cost---often less than a cent per prediction.

The key insight behind mixed-subjects designs is that **LLM predictions, even if imperfect, contain information about human behavior**. By combining a smaller sample of actual human responses (the "gold standard") with a larger pool of LLM predictions, we can:

1. **Reduce variance** in our treatment effect estimates
2. **Maintain valid inference**---our confidence intervals will have correct coverage
3. **Lower costs** of experimentation

Importantly, this approach **does not assume LLM predictions are accurate**. Even if predictions are biased, the methods in this package automatically adjust for this bias using the human subjects data.

### What This Package Does

The `mixedsubjects` package implements seven estimators for the Average Treatment Effect (ATE) in mixed-subjects designs:

| Estimator | Description | Predictions Needed |
|-----------|-------------|-------------------|
| **DiM** | Difference-in-Means | None |
| **GREG** | Calibration estimator | 1 per unit |
| **PPI++** | Power-tuned, single $\lambda$ | 1 per unit |
| **D-T** | Doubly-tuned, arm-specific $\lambda$ | 1 per unit |
| **DiP** | Difference-in-Predictions | 2 per unit |
| **DiP++** | Power-tuned DiP | 2 per unit |
| **D-T DiP** | Doubly-tuned DiP | 2 per unit |

The package also helps you **design optimal experiments** by determining how to allocate your budget between human subjects and LLM predictions.

## Installation

```{r, eval=FALSE}
# Install from github
devtools::install_github('klintkanopka/mixedsubjects')
```

```{r load-package, echo=FALSE, message=FALSE}
# For the vignette, we source the files directly

library(mixedsubjects)

pkg_dir <- system.file(package = "mixedsubjects")
if (pkg_dir == "") {
  # Running from source
  pkg_dir <- "../R"
  for (f in list.files(pkg_dir, pattern = "\\.R$", full.names = TRUE)) {
    source(f)
  }
}
```

## A Quick Example

Let's start with a simple example to see the package in action.

### Simulating Experimental Data

Imagine you're running a survey experiment to test whether a persuasive message changes attitudes. You have:

- **200 human respondents** who completed the survey (100 treated, 100 control)
- **1,000 LLM predictions** of how additional respondents would have answered

```{r simulate-data}
set.seed(123)

# True treatment effect
true_ate <- 0.3

# Human subjects data (observed)
n_observed <- 200
observed_df <- data.frame(
  # Treatment assignment (balanced)
  D = rep(c(1, 0), each = n_observed / 2)
)

# Generate outcomes: Y(0) ~ N(0, 1), Y(1) ~ N(0.3, 1)
observed_df$Y <- ifelse(
  observed_df$D == 1,
  rnorm(n_observed / 2, mean = true_ate, sd = 1),
  rnorm(n_observed / 2, mean = 0, sd = 1)
)

# LLM predictions (correlated with true outcomes, but with some error)
# S^(1) predicts Y(1), S^(0) predicts Y(0)
observed_df$S1 <- 0.6 * observed_df$Y + rnorm(n_observed, 0, 0.5)
observed_df$S0 <- 0.5 * observed_df$Y + rnorm(n_observed, 0, 0.6)

# Unobserved units (only have predictions, no actual Y)
# Predictions must come from the same pipeline as observed (Assumption C: Random Labeling)
n_unobserved <- 1000
D_unobs <- rep(c(1, 0), each = n_unobserved / 2)
latent_Y <- ifelse(D_unobs == 1,
                   rnorm(n_unobserved / 2, mean = true_ate, sd = 1),
                   rnorm(n_unobserved / 2, mean = 0, sd = 1))
unobserved_df <- data.frame(
  D = D_unobs,
  S1 = 0.6 * latent_Y + rnorm(n_unobserved, 0, 0.5),
  S0 = 0.5 * latent_Y + rnorm(n_unobserved, 0, 0.6)
)
```

### Creating an MSD Data Object

First, we combine our data into an `msd_data` object:
```{r create-msd-data}
msd <- msd_data(observed = observed_df, unobserved = unobserved_df)
print(msd)
```

### Estimating the Treatment Effect

Now let's estimate the ATE using different estimators:

```{r estimate-ate}
# Classical difference-in-means (ignores predictions)
result_dim <- msd_dim(msd)

# D-T DiP (uses both predictions with arm-specific tuning)
result_dt_dip <- msd_dt_dip(msd)

# Compare
cat("Difference-in-Means:\n")
cat("  Estimate:", round(result_dim$estimate, 3), "\n")
cat("  SE:", round(result_dim$se, 3), "\n")
cat("  95% CI: [", round(result_dim$ci_lower, 3), ", ",
    round(result_dim$ci_upper, 3), "]\n\n")

cat("D-T DiP:\n")
cat("  Estimate:", round(result_dt_dip$estimate, 3), "\n")
cat("  SE:", round(result_dt_dip$se, 3), "\n")
cat("  95% CI: [", round(result_dt_dip$ci_lower, 3), ", ",
    round(result_dt_dip$ci_upper, 3), "]\n")
```

Notice that the D-T DiP estimator has a **smaller standard error** than difference-in-means, even though both use the same human subjects data. This variance reduction comes from leveraging the LLM predictions.

## Understanding the Data Structure

### What Data Do You Need?

For a mixed-subjects design, you need:

1. **Observed (labeled) data**: Human subjects who were randomized and provided actual responses
   - `Y`: The outcome variable
   - `D`: Treatment indicator (1 = treated, 0 = control)
   - `S1` and/or `S0`: LLM predictions (optional for DiM)

2. **Unobserved (unlabeled) data**: Units for which you only have LLM predictions
   - `D`: Treatment indicator
   - `S1` and/or `S0`: LLM predictions

### Two Types of Predictions

The package distinguishes between two prediction setups:

**Setup 1: Arm-specific predictions (1 prediction per unit)**

Each unit only has the prediction for their assigned treatment condition:
- Treated units have $S^{(1)}$ (prediction of their outcome under treatment)
- Control units have $S^{(0)}$ (prediction of their outcome under control)

This is typical when you ask the LLM to predict each unit's response to their actual assigned condition.

**Setup 2: Both predictions (2 predictions per unit)**

Each unit has predictions for *both* conditions:
- $S^{(1)}$: Prediction of outcome if treated
- $S^{(0)}$: Prediction of outcome if in control

This requires asking the LLM to predict each unit's response under both conditions, which doubles the prediction cost but enables the DiP family of estimators.

### Creating Your Data Object

The `msd_data()` function accepts data in two formats:

**Option 1: Separate dataframes**
```{r data-format-1, eval=FALSE}
msd <- msd_data(
  observed = observed_df,    # Must have Y, D, and optionally S0, S1
  unobserved = unobserved_df # Must have D and S0, S1 (no Y)
)
```

**Option 2: Combined dataframe**

If your data is in a single dataframe where unobserved units have `Y = NA`:

```{r data-format-2}
# Combine into single dataframe
combined_df <- rbind(
  observed_df,
  data.frame(Y = NA, D = unobserved_df$D,
             S0 = unobserved_df$S0, S1 = unobserved_df$S1)
)

# Create msd_data object
msd_combined <- msd_data(data = combined_df)
print(msd_combined)
```

### Flexible Column Names

By default, the package expects columns named `Y` (outcome), `D` (treatment), `S0` (control prediction), and `S1` (treatment prediction). If your columns use different names, specify them explicitly:

```{r custom-columns, eval=FALSE}
# Custom column names
my_data <- data.frame(
  response_var = rnorm(100),
  is_treated = rep(c(1, 0), each = 50),
  claude_pred_ctrl = rnorm(100),
  claude_pred_trt = rnorm(100)
)

msd <- msd_data(
  observed = my_data,
  outcome = "response_var",
  treatment = "is_treated",
  pred_control = "claude_pred_ctrl",
  pred_treated = "claude_pred_trt"
)
```

You can even use different column names for observed and unobserved data:

```{r different-columns, eval=FALSE}
msd <- msd_data(
  observed = obs_df,
  unobserved = unobs_df,
  obs_outcome = "Y",
  obs_treatment = "treated",
  obs_pred_control = "gpt_pred_0",
  obs_pred_treated = "gpt_pred_1",
  unobs_treatment = "D",
  unobs_pred_control = "S0",
  unobs_pred_treated = "S1"
)
```

### Formula Interface

For even more flexibility, you can use a formula interface directly in the estimator functions. This is similar to the syntax used by `ivreg` and other regression packages:

```{r formula-example, eval=FALSE}
# Formula: outcome ~ treatment | predictions
# Using custom column names directly in the estimator

# For GREG-type estimators (single prediction per arm)
result <- msd_greg(response ~ treatment | pred_1 + pred_0,
                   observed = obs_df, unobserved = unobs_df)

# For DiP-type estimators (both predictions)
result <- msd_dt_dip(Y ~ D | S1 + S0,
                     observed = obs_df, unobserved = unobs_df)
```

The formula has three parts:
1. **Left of `~`**: The outcome variable name
2. **Between `~` and `|`**: The treatment indicator variable name
3. **After `|`**: The prediction variable name(s)

For DiM (which doesn't use predictions), you can omit the prediction part:

```{r formula-dim, eval=FALSE}
result <- msd_dim(response ~ treated, observed = obs_df)
```

## The Seven Estimators

### 1. Difference-in-Means (DiM)

The classical estimator that ignores LLM predictions entirely:

$$\hat{\tau}^{DiM} = \bar{Y}_1 - \bar{Y}_0$$

Use this as your baseline. It's unbiased and requires no predictions.

```{r dim-example}
result <- msd_dim(msd)
print(result)
```

### 2. GREG (Calibration Estimator)

GREG uses predictions to "calibrate" the estimate:

$$\hat{\tau}^{GREG} = \left[\bar{S}^{(1)}_U + (\bar{Y}_1 - \bar{S}^{(1)}_{O_1})\right] - \left[\bar{S}^{(0)}_U + (\bar{Y}_0 - \bar{S}^{(0)}_{O_0})\right]$$

This adjusts the prediction mean by the observed "rectifier" (the average prediction error in the labeled data).

```{r greg-example}
result <- msd_greg(msd)
print(result)
```

### 3. PPI++ (Power-Tuned)

PPI++ improves on GREG by estimating an optimal tuning parameter $\lambda$:

$$\hat{\mu}_d^{PPI}(\lambda) = \bar{Y}_d + \lambda(\bar{S}^{(d)}_U - \bar{S}^{(d)}_O)$$

When $\lambda = 1$, this equals GREG. PPI++ estimates $\lambda$ to minimize variance using cross-fitting.

```{r ppi-example}
result <- msd_ppi(msd, n_folds = 2)
print(result)
```

### 4. D-T (Doubly-Tuned)

D-T uses separate tuning parameters for each arm ($\lambda_1$ for treatment, $\lambda_0$ for control):

$$\hat{\tau}^{D-T} = \hat{\mu}_1^{PPI}(\lambda_1) - \hat{\mu}_0^{PPI}(\lambda_0)$$

This is useful when prediction quality differs between treatment and control conditions.

```{r dt-example}
result <- msd_dt(msd, n_folds = 2)
print(result)
```

### 5. DiP (Difference-in-Predictions)

DiP uses the *contrast* $S^{(1)} - S^{(0)}$ for each unlabeled unit:

$$\hat{\tau}^{DiP} = \overline{S^{(1)} - S^{(0)}}_U + (\bar{Y}_1 - \bar{S}^{(1)}_{O_1}) - (\bar{Y}_0 - \bar{S}^{(0)}_{O_0})$$

**Key insight**: When $S^{(1)}$ and $S^{(0)}$ are positively correlated (which they often are---units that score high under treatment also tend to score high under control), the variance of their difference is smaller than the sum of their individual variances.

```{r dip-example}
result <- msd_dip(msd)
print(result)
```

### 6. DiP++ (Power-Tuned DiP)

DiP++ applies power-tuning to the DiP estimator with a single $\lambda$:

$$\hat{\tau}^{DiP++}(\lambda) = \lambda \cdot \overline{S^{(1)} - S^{(0)}}_U + (\bar{Y}_1 - \lambda\bar{S}^{(1)}_{O_1}) - (\bar{Y}_0 - \lambda\bar{S}^{(0)}_{O_0})$$

```{r dip-pp-example}
result <- msd_dip_pp(msd, n_folds = 2)
print(result)
```

### 7. D-T DiP (Doubly-Tuned DiP)

The most flexible estimator combines DiP with arm-specific tuning:

$$\hat{\tau}^{D-T\ DiP} = \overline{\lambda_1 S^{(1)} - \lambda_0 S^{(0)}}_U + (\bar{Y}_1 - \lambda_1\bar{S}^{(1)}_{O_1}) - (\bar{Y}_0 - \lambda_0\bar{S}^{(0)}_{O_0})$$

```{r dt-dip-example}
result <- msd_dt_dip(msd, n_folds = 2)
print(result)
```

### Comparing All Estimators

Use `estimate_all()` to run all applicable estimators at once:

```{r estimate-all}
all_results <- estimate_all(msd)
print(all_results)
```

## Choosing the Right Estimator

### Decision Tree

Here's a practical guide for choosing an estimator:

```
Do you have LLM predictions?
├── No → Use DiM
└── Yes → Do you have BOTH S^(1) and S^(0) for each unit?
    ├── No (only arm-specific) →
    │   └── Is prediction quality similar across arms?
    │       ├── Yes → Use PPI++
    │       └── No → Use D-T
    └── Yes (both predictions) →
        └── Are S^(1) and S^(0) positively correlated?
            ├── Yes → Use D-T DiP (recommended)
            └── No/Unsure → Use D-T
```

### When to Use DiP-Type Estimators

The DiP family (DiP, DiP++, D-T DiP) is most beneficial when:

1. **Predictions are positively correlated**: $Cov(S^{(1)}, S^{(0)}) > 0$
2. **You can afford 2 predictions per unit**: This doubles the prediction cost

You can check the correlation in your pilot data:

```{r check-correlation}
cor(unobserved_df$S1, unobserved_df$S0)
```

A positive correlation (which is common) means DiP estimators will likely outperform GREG-type estimators.

## Variance Estimation

### Delta-Method (Default)

All estimators use the delta-method for variance estimation by default. This provides fast, analytical variance estimates based on the formulas derived in the accompanying paper.

### Bootstrap (Optional)

For robustness, you can also use the fold-respecting bootstrap:

```{r bootstrap-example}
# Bootstrap variance for D-T DiP
boot_result <- bootstrap_variance(
  msd,
  estimator = "dt_dip",
  n_bootstrap = 500,
  seed = 42
)

cat("Delta-method SE:", round(result_dt_dip$se, 4), "\n")
cat("Bootstrap SE:", round(boot_result$se, 4), "\n")
```

The bootstrap is computationally more expensive but provides a useful check on the delta-method variance.

## Optimal Experimental Design

### The Design Problem

Before running your experiment, you need to decide:

1. **How many human subjects** to recruit ($n_O$)
2. **How many LLM predictions** to generate ($n_U$)
3. **Which estimator** to use

These decisions depend on:
- Your **total budget**
- The **cost per human subject**
- The **cost per LLM prediction**
- The **quality of predictions** (estimated from pilot data)

### Using `optimal_design()`

The `optimal_design()` function finds the allocation that minimizes expected variance given your budget constraints:

```{r optimal-design}
design <- optimal_design(
  pilot_data = msd,          # Pilot study data
  budget = 5000,             # Total budget in dollars
  cost_human = 5,            # $5 per human response
  cost_prediction = 0.01,    # $0.01 per LLM prediction
  treatment_prob = 0.5       # Balanced treatment assignment
)

print(design)
```

### Interpreting the Results

The output tells you:

1. **Recommended estimator**: Which of the 7 estimators to use
2. **Sample sizes**: How many observed and unobserved units
3. **Expected SE**: The anticipated standard error

### Comparing Designs

The output also shows how different estimators perform at their optimal allocations. This helps you understand the trade-offs:

- **DiM** uses all budget on human subjects (no predictions)
- **GREG/PPI++/D-T** use 1 prediction per unlabeled unit
- **DiP/DiP++/D-T DiP** use 2 predictions per unlabeled unit

## Practical Workflow

Here's a recommended workflow for using mixed-subjects designs in your research:

### Step 1: Pilot Study

Run a small pilot study with both human subjects and LLM predictions:

```{r workflow-pilot, eval=FALSE}
# Collect pilot data
pilot_observed <- collect_human_responses(n = 100)
pilot_observed$S1 <- get_llm_predictions(pilot_observed, condition = "treatment")
pilot_observed$S0 <- get_llm_predictions(pilot_observed, condition = "control")

pilot_unobserved <- generate_synthetic_units(n = 200)
pilot_unobserved$S1 <- get_llm_predictions(pilot_unobserved, condition = "treatment")
pilot_unobserved$S0 <- get_llm_predictions(pilot_unobserved, condition = "control")

pilot_msd <- msd_data(observed = pilot_observed, unobserved = pilot_unobserved)
```

### Step 2: Estimate Prediction Quality

Examine how well LLM predictions correlate with actual outcomes:

```{r workflow-quality, eval=FALSE}
# Check correlations
pilot_results <- estimate_all(pilot_msd)
print(pilot_results)

# Compare to DiM baseline
dim_se <- pilot_results$SE[pilot_results$Estimator == "Difference-in-Means (DiM)"]
best_se <- min(pilot_results$SE)
variance_reduction <- 1 - (best_se / dim_se)^2
cat("Potential variance reduction:", round(variance_reduction * 100, 1), "%\n")
```

### Step 3: Plan Your Main Study

Use the pilot data to design your main study:

```{r workflow-design, eval=FALSE}
design <- optimal_design(
  pilot_data = pilot_msd,
  budget = 10000,
  cost_human = 5,
  cost_prediction = 0.01
)
print(design)
```

### Step 4: Run the Main Study

Collect data according to your design:

```{r workflow-main, eval=FALSE}
# Collect human responses
main_observed <- collect_human_responses(n = design$optimal_n_obs)

# Generate LLM predictions
main_observed$S1 <- get_llm_predictions(main_observed, "treatment")
main_observed$S0 <- get_llm_predictions(main_observed, "control")

main_unobserved <- generate_synthetic_units(n = design$optimal_n_unobs)
main_unobserved$S1 <- get_llm_predictions(main_unobserved, "treatment")
main_unobserved$S0 <- get_llm_predictions(main_unobserved, "control")

main_msd <- msd_data(observed = main_observed, unobserved = main_unobserved)
```

### Step 5: Analyze and Report

Estimate the treatment effect using the recommended estimator:

```{r workflow-analyze, eval=FALSE}
# Run the recommended estimator
result <- msd_dt_dip(main_msd)  # Or whichever was recommended
print(result)

# Optionally verify with bootstrap
boot_result <- bootstrap_variance(main_msd, "dt_dip", n_bootstrap = 1000)
```

## Common Questions

### Q: What if my LLM predictions are biased?

**A**: That's fine! The MSD estimators are designed to be valid even when predictions are biased. The human subjects data is used to "rectify" any systematic bias in the predictions. You will still get unbiased estimates with valid confidence intervals.

### Q: How much do predictions need to correlate with outcomes?

**A**: Any positive correlation helps, but more is better. As a rough guide:
- $\rho > 0.3$: Modest variance reduction
- $\rho > 0.5$: Substantial variance reduction
- $\rho > 0.7$: Large variance reduction

You can estimate this correlation from your pilot data.

### Q: Should I use cross-fitting?

**A**: Yes, for the tuned estimators (PPI++, D-T, DiP++, D-T DiP). Cross-fitting ensures that the tuning parameters are estimated independently of the data used to compute the estimate, which prevents bias. The default `n_folds = 2` works well in most cases.

### Q: What if predictions are worse in one treatment arm?

**A**: Use the D-T or D-T DiP estimators. These use arm-specific tuning parameters ($\lambda_1$ and $\lambda_0$), which automatically adapt to different prediction quality across arms.

### Q: How many folds should I use for cross-fitting?

**A**: We recommend `n_folds = 2` for most applications. More folds use data more efficiently but increase computational cost. With 2 folds, half the labeled data is used to estimate tuning parameters and half for estimation in each fold.

### Q: Can I use predictions from multiple LLMs?

**A**: The current package supports one set of predictions. If you have multiple LLMs, you could:
1. Use the single best-performing LLM
2. Average predictions across LLMs before using them
3. Use the LLM whose predictions have the highest correlation with outcomes

## Technical Details

### Assumptions

The MSD estimators require:

1. **Random sampling**: Both observed and unobserved units are drawn from the same population
2. **Random treatment assignment**: Treatment is randomly assigned in both pools
3. **Independent predictions**: The LLM predictions don't use the observed outcomes
4. **Finite variances**: Standard regularity conditions

### Variance Formulas

For reference, the variance formulas for key estimators:

**DiM**:
$$Var(\hat{\tau}^{DiM}) = \frac{\sigma^2_{Y(1)}}{n_1} + \frac{\sigma^2_{Y(0)}}{n_0}$$

**GREG**:
$$Var(\hat{\tau}^{GREG}) = \sum_{d \in \{0,1\}} \left[\frac{\sigma^2_{S^{(d)}}}{m_d} + \frac{Var(Y(d) - S^{(d)})}{n_d}\right]$$

**DiP**:
$$Var(\hat{\tau}^{DiP}) = \frac{Var(S^{(1)} - S^{(0)})}{m} + \sum_{d \in \{0,1\}} \frac{Var(Y(d) - S^{(d)})}{n_d}$$

The tuned estimators (PPI++, D-T, DiP++, D-T DiP) have more complex variance formulas that account for the estimation of tuning parameters.

## Citation

If you use this package in your research, please cite:

```
van Loon, A. and Kanopka, K. (2025). Causal Inference in Experiments with
Mixed-Subjects Designs. R package version 0.1.0.
```

## Session Info

```{r session-info}
sessionInfo()
```