---
title: "A primer: doubly robust LATE estimation, from intuition to practice"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{A primer: doubly robust LATE estimation, from intuition to practice}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.width = 7, fig.height = 4.2, dpi = 110,
                      out.width = "100%")
set.seed(1)
```

This primer explains, with as little formalism as possible, what the
estimators in `drlate` do and when you would want them. It is a companion
to two papers by Słoczyński, Uysal, and Wooldridge: *"Doubly Robust
Estimation of Local Average Treatment Effects Using Inverse Probability
Weighted Regression Adjustment"* (2022, arXiv:2208.01300), whose
estimators the authors' Stata command `drlate` implements, and *"Abadie's
Kappa and Weighting Estimators of the Local Average Treatment Effect"*
(2025, *JBES*), whose estimator menu their command `kappalate` implements.
This package implements both papers behind one interface. Along the way
the primer demonstrates every tool in the package: estimation, the
weighting-estimator menu, diagnostics, weak-instrument-robust inference,
the bootstrap, the estimator comparison, and the doubly robust Hausman
test.

## 1. The problem: a treatment people choose

Suppose you want to know the effect of a treatment $D$ (a job training
program, military service, a 401(k) plan) on an outcome $Y$ (wages,
savings). People *choose* whether to take the treatment, and the ones who
take it differ from the ones who don't in ways you can't fully observe.
Comparing treated and untreated outcomes, even with regression controls,
confounds the effect of the treatment with the effect of being the kind of
person who takes it.

The classic escape is an **instrument** $Z$: a binary variable that shifts
the *likelihood* of treatment but has no direct channel to the outcome.
Draft lottery numbers, distance to a college, randomized *encouragement* to
participate — all push some people into treatment without otherwise
touching their outcomes.

Imbens and Angrist (1994) showed what an instrument can and cannot give
you. It cannot reveal the effect of treatment for everyone. It identifies
the effect for **compliers** — the people whose treatment status is
actually moved by the instrument. That quantity is the **local average
treatment effect**:

$$
\text{LATE} \;=\; \frac{\;\text{effect of } Z \text{ on } Y\;}
                      {\;\text{effect of } Z \text{ on } D\;}
$$

The numerator is the *intent-to-treat* effect; the denominator is the
*first stage* (the share of compliers). With no covariates this ratio is
the Wald estimator, and it is exactly what two-stage least squares
computes.

### Why covariates enter

Real instruments are rarely as clean as a lottery. Distance to college is
exogenous *given* family background and region; a draft lottery is clean,
but conditioning on covariates still buys precision. So in practice you
need the instrument to be as-good-as-random **conditional on covariates
$X$**, and then both the numerator and the denominator above become
covariate-adjusted contrasts. How exactly to adjust them is the entire
subject of the 2022 paper and this package.

## 2. The four core estimators in one picture

All estimators in `drlate` build the same ratio; they differ only in how
each piece is adjusted for $X$. This section covers the four
regression-and-weighting strategies from the 2022 paper; a second family,
the Abadie-kappa weighting menu from the 2025 paper, appears in
Section 6. Each piece is an "effect of $Z$ on something", a problem
formally identical to estimating an average treatment effect of $Z$ under
unconfoundedness, so the familiar treatment-effects toolbox applies:

| `method` | How it adjusts | Needs correct… |
|---|---|---|
| `"ra"` | Regression adjustment: model $Y$ (and $D$) given $X$ in each instrument arm; average the predictions. | outcome/treatment models |
| `"ipw"` | Inverse probability weighting: model the **instrument propensity score** $p(X) = \Pr(Z=1 \mid X)$; reweight raw means by $1/p$ and $1/(1-p)$. | propensity score |
| `"aipw"` | Augmented IPW: regression adjustment plus an IPW correction term. | **either one** |
| `"ipwra"` | IPW **regression adjustment**: fit the outcome/treatment regressions *weighted* by the inverse propensity score. | **either one** |

The last two are **doubly robust (DR)**: they are consistent if the
propensity score model *or* the regression models are right — you get two
chances instead of one.

### Why the paper (and the package default) prefers IPWRA

AIPW achieves double robustness by *adding* a correction term, and that
sum can wander outside the logical range of the outcome — a predicted
probability of 1.3, a negative count. IPWRA achieves it by *reweighting a
quasi-likelihood regression*, so as long as you choose a sensible family
(logit for binary outcomes, Poisson for counts), fitted values stay in
range by construction. Hence IPWRA's appealing small-sample behavior in
the paper's simulations, and its place as the package default.

A second practical lesson from the paper concerns **normalization**. Raw
IPW weights $Z_i/\hat p(X_i)$ need not average to one in a sample; if a few
observations have small estimated propensity scores, unnormalized
estimators can be erratic. Normalizing the weights to sum to one (the
Hájek construction) fixes this and costs nothing, so `normalized = TRUE`
is the default. It matters only for `"ipw"` and `"aipw"`; IPWRA is
normalized by construction.

## 3. A worked example

`drlate_sim` is a simulated dataset built so that the truth is known: a
binary instrument `rsncode`, a binary treatment `nvstat` with two-sided
noncompliance (60% compliers), and a continuous outcome `lwage`. The true
effect for compliers is **0.5**. The design plants two traps for
the naive approaches: the treatment is endogenous (people who
always take it have higher baseline outcomes), and the instrument is *not*
unconditionally random (the probability of `rsncode = 1` rises with `age`
and education, which also shift the outcome).

```{r}
library(drlate)
data(drlate_sim)
str(drlate_sim)
```

### The naive answers fail

Comparing treated to untreated (even though this is "just a regression"):

```{r}
coef(lm(lwage ~ nvstat + age + educ, drlate_sim))["nvstat"]
```

Biased upward — always-takers are different people. The raw Wald ratio
ignores that the instrument favors older, more educated workers:

```{r}
with(drlate_sim,
     (mean(lwage[rsncode == 1]) - mean(lwage[rsncode == 0])) /
     (mean(nvstat[rsncode == 1]) - mean(nvstat[rsncode == 0])))
```

Twice the truth.

### The drlate answer

Three formulas, one per model: the **outcome** model, the **treatment**
model, and the **instrument propensity score** model.

```{r}
fit <- drlate(outcome    = lwage   ~ age + educ,
              treatment  = nvstat  ~ age + educ,
              instrument = rsncode ~ age + educ,
              data = drlate_sim)
summary(fit)
```

Read the three rows bottom-up:

* **ATE: Z on D** — the first stage: the instrument moves the treatment
  probability by about 0.60 (the complier share).
* **ATE: Z on Y** — the intent-to-treat effect of the instrument on the
  outcome.
* **LATE: D on Y** — their ratio, the causal effect for compliers. The
  95% interval comfortably covers the true 0.5.

The final line reports **first-stage strength**: with a single binary
instrument, the squared z-statistic is the first-stage robust F. Here
F ≈ 1084, far above the conventional F = 10 danger threshold, so Wald
inference on the ratio is trustworthy. Section 7 shows what the package
does when it isn't.

The standard errors here are not an afterthought: drlate stacks every
estimation stage (the propensity score, both outcome regressions, both
treatment regressions, and the ratio itself) into one moment system and
computes a joint sandwich variance. The uncertainty from estimating the
propensity score propagates into the LATE's standard error, exactly as in
the Stata original.

### Seeing double robustness work

The DGP has the propensity score and the outcomes both depending on `age`
and `educ`. Give the estimator a *wrong* propensity score model
(intercept only) but correct regressions — then a *wrong* outcome model
but a correct propensity score:

```{r}
# Propensity score misspecified; regressions correct
coef(drlate(lwage ~ age + educ, nvstat ~ age + educ, rsncode ~ 1,
            data = drlate_sim))[1]

# Regressions misspecified (intercept only); propensity score correct
coef(drlate(lwage ~ 1, nvstat ~ 1, rsncode ~ age + educ,
            data = drlate_sim))[1]
```

Either way the estimate stays near 0.5: one correct model is enough.
(Misspecify *both* and no estimator can save you.)

## 4. Checking the design: diagnostics

An estimate deserves evidence that the design behind it is sound. `drlate`
provides the standard design diagnostics — propensity-score overlap, covariate
balance, and the implied weight distribution — together with a formal balance
test and a profile of the compliers. The last several mirror the postestimation
tools of Stata's `lateffects` command.

### Overlap

IPW-type estimators divide by $\hat p(X)$ and $1-\hat p(X)$, so the
estimated instrument propensity scores must stay away from 0 and 1.
Healthy overlap looks like two well-mixed distributions:

```{r overlap}
plot(fit, type = "overlap")
```

`drlate()` refuses to estimate if any score breaches
`[pstolerance, 1 - pstolerance]`; set `osample = TRUE` to get back an
indicator of the violating observations instead, so you can inspect them
before restricting the sample. Pass `geom = "density"` to draw the same
check as smoothed kernel densities.

### Covariate balance

The whole point of weighting by the inverse propensity score is to make
the instrument arms comparable on covariates. The love plot shows
standardized mean differences before and after weighting — weighted dots
inside the conventional ±0.1 band mean the propensity score model is
doing its job:

```{r balance}
plot(fit, type = "balance")
```

The underlying numbers come from `balance()`:

```{r}
balance(fit)
```

`age` moves from a standardized difference of 0.45 to nearly zero after
weighting. Add `detail = TRUE` for the weighted arm means and variance
ratios as well. To see distributions rather than summaries,
`plot(fit, type = "balance_density")` overlays each covariate's kernel
density across the instrument arms, raw versus weighted — a balanced
covariate has arm densities that coincide in the weighted panel:

```{r balance-density, fig.height = 3}
plot(fit, type = "balance_density", var = "age")
```

### Weight distributions

A few enormous weights mean a few observations dominate the estimate —
the classic symptom of thin overlap:

```{r weights}
plot(fit, type = "weights")
```

### A formal balance test

The love plot is an eyeball test; `balance_test()` is the formal one. It
implements the Imai and Ratkovic (2014) overidentification test of whether
the propensity-score model balances the covariates. A small p-value is
evidence that the score is misspecified:

```{r balance-test}
balance_test(fit)
```

### Profiling the compliers

Because the LATE is an effect for compliers, it helps to know how they
differ from the population. `complier_means()` compares each covariate's
overall mean with its mean among compliers, the latter weighted by Abadie's
$\kappa$:

```{r compliers}
complier_means(fit)
```

The weights themselves are available through `kappa_weights(fit)` for
building other complier summaries.

## 5. Choosing models and options

### Outcome and treatment families

The treatment and the outcome may each be continuous, binary, a count, or a
fraction; choose families so fitted values respect the response's range — the
heart of the IPWRA recommendation:

```{r}
# Binary outcome: logit keeps fitted probabilities in [0, 1]
coef(drlate(hijob ~ age + educ, nvstat ~ age + educ, rsncode ~ age + educ,
            data = drlate_sim, omodel = "logit"))

# Positive outcome: Poisson (quasi-likelihood; no distributional claim)
coef(drlate(kwage ~ age + educ, nvstat ~ age + educ, rsncode ~ age + educ,
            data = drlate_sim, omodel = "poisson"))
```

Binary outcomes also accept `omodel = "probit"`, and a response confined to
the unit interval — a proportion or rate — uses `omodel = "flogit"` or
`"fprobit"`. The fractional families share all estimation with `"logit"` /
`"probit"` and only relax the response to `[0, 1]`, matching the Stata
`lateffects` `omodel` options.

### Instrument propensity score flavors

Beyond logit MLE, two estimators tailor the propensity score to its job
in the weights:

* `ivmodel = "cbps"` — covariate balancing (Imai and Ratkovic 2014):
  chooses the logit coefficients so the inverse-probability weights
  exactly balance the covariates between instrument arms.
* `ivmodel = "ipt"` — inverse probability tilting (Graham, Pinto, and
  Egel 2012): fits a separate tilted score per arm; the resulting weights
  are exactly normalized by construction.
* `ivmodel = "probit"` — probit maximum likelihood,
  mirroring `kappalate`'s `zmodel(probit)`; available for the weighting
  estimators (`"ipw"` and the kappa methods of Section 6).

```{r}
coef(drlate(lwage ~ age + educ, nvstat ~ age + educ, rsncode ~ age + educ,
            data = drlate_sim, ivmodel = "ipt"))[1]

# probit propensity score with a weighting estimator (Section 6)
coef(drlate(lwage ~ 1, nvstat ~ 1, rsncode ~ age + educ,
            data = drlate_sim, method = "kappa10",
            ivmodel = "probit"))[1]
```

### LATT: the effect for treated compliers

`estimand = "latt"` reweights everything toward the treated population —
"among those who took the treatment, what did it do?":

```{r}
coef(drlate(lwage ~ age + educ, nvstat ~ age + educ, rsncode ~ age + educ,
            data = drlate_sim, estimand = "latt"))
```

Weights (`weights =`) and clustered standard errors (`cluster =`) are
available everywhere.

## 6. Abadie's kappa: the weighting-estimator menu

The companion paper Słoczyński, Uysal, and Wooldridge (2025) studies the
family of pure weighting estimators of the LATE motivated by Abadie's
(2003) kappa theorem. These use *only* the instrument propensity score —
no outcome or treatment regressions — and differ in which kappa weights
they use and whether the weights are normalized. `drlate` implements the
full menu of the authors' Stata command `kappalate`:

| `method` | kappalate name | Weighting |
|---|---|---|
| `"kappa"` | `tau_a` | unnormalized Abadie kappa |
| `"ipw"`, `normalized = FALSE` | `tau_a,1` | unnormalized, treated-arm |
| `"kappa0"` | `tau_a,0` | unnormalized, untreated-arm |
| `"kappa10"` | `tau_a,10` | normalized Abadie kappa |
| `"ipw"` (default normalized) | `tau_u` | normalized (Hájek) |

Because the kappa methods carry no outcome or treatment model, the first
two formulas must be intercept-only, and that makes the whole menu a
like-for-like comparison (every row adjusts for covariates only
through the instrument propensity score):

```{r kappa-menu}
cmp_w <- drlate_compare(lwage ~ 1, nvstat ~ 1, rsncode ~ age + educ,
                        data = drlate_sim,
                        methods = c("ipw", "kappa", "kappa0", "kappa10"))
cmp_w
```

All four sit on top of each other here — a healthy design. The paper's
main practical lesson is *which* of these to trust when they disagree:
the **normalized** estimators (`tau_u`, i.e. `method = "ipw"`, and
`tau_a,10`), because the unnormalized variants are sensitive to the
location and scale of the outcome. You can see that sensitivity directly.
Shift the outcome by a constant, which should not change a treatment
effect:

```{r kappa-shift}
d_shift <- transform(drlate_sim, lwage = lwage + 100)
rbind(
  kappa10 = c(original = coef(drlate(lwage ~ 1, nvstat ~ 1,
                                     rsncode ~ age + educ,
                                     data = drlate_sim,
                                     method = "kappa10"))[[1]],
              shifted = coef(drlate(lwage ~ 1, nvstat ~ 1,
                                    rsncode ~ age + educ,
                                    data = d_shift,
                                    method = "kappa10"))[[1]]),
  kappa   = c(original = coef(drlate(lwage ~ 1, nvstat ~ 1,
                                     rsncode ~ age + educ,
                                     data = drlate_sim,
                                     method = "kappa"))[[1]],
              shifted = coef(drlate(lwage ~ 1, nvstat ~ 1,
                                    rsncode ~ age + educ,
                                    data = d_shift,
                                    method = "kappa"))[[1]])
)
```

The normalized `kappa10` is exactly invariant; the unnormalized `kappa`
moves. (The package prints each estimator's `kappalate` name so the
mapping above is always visible in the output.)

Unlike the Stata command, all of drlate's inference machinery applies to
the kappa menu: cluster-robust standard errors, sampling weights, the
bootstrap, and Fieller confidence sets for `"kappa"` and `"kappa0"`
(`"kappa10"` is a difference of two ratios, for which no Fieller pivot
exists, so use the bootstrap there).

## 7. When the instrument is weak: Fieller confidence sets

The LATE is a ratio, and ratios with imprecise denominators behave badly:
the usual (delta-method) confidence interval can have far-from-nominal
coverage when the first stage is weak. The package watches for this:
whenever the first-stage F drops below 10, the printout flags it and
shows a **Fieller confidence set** alongside.

The Fieller set inverts the test of $H_0:\ \text{num} - t \cdot
\text{denom} = 0$ using the *joint* covariance of the numerator and the
denominator (which the stacked moment system provides). Unlike the Wald
interval, it keeps its promised asymptotic coverage no matter how weak
the instrument — at the honest price that it may be **unbounded** when
the data barely identify the ratio.

Watch it engage on a deliberately broken design (instrument shuffled at
random in a small subsample, so the true first stage is zero):

```{r}
set.seed(42)
d_weak <- drlate_sim[1:300, ]
d_weak$zweak <- sample(d_weak$rsncode)
fit_weak <- drlate(lwage ~ age, nvstat ~ age, zweak ~ age, data = d_weak)
print(fit_weak)
```

The Fieller set is available on demand for any fit:

```{r}
confint(fit, method = "fieller")     # strong instrument: ~ Wald interval
```

In the package's Monte Carlo validation, under a very weak first stage
(about 4% compliers) the Fieller set covered the true LATE 95.7% of the
time — reporting an unbounded set in roughly 40% of replications, the
statistically honest answer — while the Wald interval over-covered at
99.3%, conveying a precision it did not have.

## 8. Bootstrap inference

The paper notes that inference is straightforward "both analytically and
using the nonparametric bootstrap." The analytic sandwich is the default;
the bootstrap re-runs the entire estimation pipeline (including the
propensity score) on each resample:

```{r}
fit_boot <- drlate(lwage ~ age + educ, nvstat ~ age + educ,
                   rsncode ~ age + educ, data = drlate_sim,
                   vcov = "bootstrap", boot_reps = 199, boot_seed = 42)
summary(fit_boot)
```

Point estimates are unchanged; standard errors and the confidence
intervals (now percentile intervals from the bootstrap distribution) come
from the resamples. When `cluster` is supplied, whole clusters are
resampled. Draws that fail — degenerate resamples, overlap violations —
are dropped and counted; a non-trivial failure rate is itself a sign of
weak identification, in which case prefer the Fieller set.

## 9. How much does the estimator choice matter?

Referees routinely ask for this table. One call runs all four estimators:

```{r compare}
cmp <- drlate_compare(lwage ~ age + educ, nvstat ~ age + educ,
                      rsncode ~ age + educ, data = drlate_sim)
cmp
plot(cmp)
```

A caveat printed in the documentation bears repeating: IPW carries no
outcome/treatment regressions and RA no propensity score, so those rows
necessarily use reduced specifications. The IPWRA-vs-AIPW pair is the
like-for-like doubly robust comparison. Here all four agree closely,
which is what a healthy design looks like.

## 10. Do you even need the instrument? The DR Hausman test

The 2022 paper's Section 5 proposes a specification test that the Stata
package does not implement. The logic: under **one-sided noncompliance**
(nobody gets the treatment without the instrument), the LATT estimated
*through the instrument* and the ATT estimated *assuming the treatment is
unconfounded given $X$* target the same parameter — if unconfoundedness
holds. So a significant difference between the two doubly robust
estimates is evidence against unconfoundedness. Unlike the textbook
OLS-vs-IV Hausman test, this comparison survives heterogeneous treatment
effects.

The simulated data are built with confounded treatment (always-takers
have higher baseline wages), so after imposing one-sided noncompliance
the test should — and does — reject:

```{r}
d_os <- drlate_sim
d_os$nvstat[d_os$rsncode == 0] <- 0L   # impose one-sided noncompliance

dr_hausman(lwage ~ age + educ, nvstat ~ age + educ, rsncode ~ age + educ,
           data = d_os)
```

The unconfoundedness-based ATT (≈ 0.63) is biased upward by exactly the
selection the DGP builds in, while the instrument-based LATT (≈ 0.38 in
this one-sided subpopulation) is protected. The difference is almost six
standard errors from zero. In practice: a rejection says the controls
alone do not suffice — you needed the instrument; a non-rejection says
the controls may suffice, and the much more precise
unconfoundedness-based estimator becomes defensible.

The standard error of the difference is analytic: both estimators' moment
conditions are stacked into one system so their correlation is accounted
for — the construction the paper sketches and this package carries out.
(In Monte Carlo under a true null the test rejects at 4.7% for a nominal
5% level.)

## 11. Coming from Stata

| Stata | R |
|---|---|
| `drlate (lwage age_5) (nvstat age_5) (rsncode age_5)` | `drlate(lwage ~ age_5, nvstat ~ age_5, rsncode ~ age_5, data = d)` |
| `(y x, poisson)` | `omodel = "poisson"` |
| `(z x, ipt)` | `ivmodel = "ipt"` |
| `, latt` | `estimand = "latt"` |
| `, method(aipw) unnrm` | `method = "aipw", normalized = FALSE` |
| `[pw = w]` | `weights = "w"` |
| `, vce(cluster id)` | `cluster = "id"` |
| `i.educ` factor syntax | a `factor()` column in the formula |
| `kappalate y (d = z) x` (tau_a) | `method = "kappa"` |
| `kappalate ..., which(all)` (tau_a,0 / tau_a,10) | `method = "kappa0"` / `"kappa10"` |
| `kappalate ...` (tau_u / tau_a,1) | `method = "ipw"` / `"ipw", normalized = FALSE` |
| `kappalate ..., zmodel(cbps)` | `ivmodel = "cbps"` |
| `kappalate ..., zmodel(probit)` | `ivmodel = "probit"` |
| `lateoverlap` | `plot(type = "overlap")` (with `geom = "density"`) |
| `latebalance summarize` | `balance(detail = TRUE)` |
| `latebalance overid` | `balance_test()` |
| `latebalance density` | `plot(type = "balance_density")` |
| `estat compliers` | `complier_means()`, `kappa_weights()` |
| — | `plot(type = "weights")`, `confint(method = "fieller")`, `vcov = "bootstrap"`, `dr_hausman()`, `drlate_compare()` (R only) |

Point estimates and standard errors agree with the Stata `drlate` and
`kappalate` packages to numerical precision; the package's test suite
enforces this against fixtures generated by those commands on the Survey of
Income and Program Participation (SIPP) extract used in the original help
files. The postestimation diagnostics in the lower rows mirror Stata's
`lateffects` suite (StataNow) and are verified against their standard
references --- weighted `glm` fits, the Abadie-kappa identity, and the
bootstrap.

## References

* Słoczyński, T., S. D. Uysal, and J. M. Wooldridge (2022). "Doubly Robust
  Estimation of Local Average Treatment Effects Using Inverse Probability
  Weighted Regression Adjustment." arXiv:2208.01300.
* Słoczyński, T., S. D. Uysal, and J. M. Wooldridge (2025). "Abadie's
  Kappa and Weighting Estimators of the Local Average Treatment Effect."
  *Journal of Business & Economic Statistics* 43(1), 164–177.
* Imbens, G. W., and J. D. Angrist (1994). "Identification and Estimation
  of Local Average Treatment Effects." *Econometrica* 62(2), 467–475.
* Abadie, A. (2003). "Semiparametric Instrumental Variable Estimation of
  Treatment Response Models." *Journal of Econometrics* 113(2), 231–263.
* Donald, S. G., Y.-C. Hsu, and R. P. Lieli (2014). "Testing the
  Unconfoundedness Assumption via Inverse Probability Weighted Estimators
  of (L)ATT." *Journal of Business & Economic Statistics* 32(3), 395–415.
* Fieller, E. C. (1954). "Some Problems in Interval Estimation." *JRSS-B*
  16(2), 175–185.
* Graham, B. S., C. C. de Xavier Pinto, and D. Egel (2012). "Inverse
  Probability Tilting for Moment Condition Models with Missing Data."
  *Review of Economic Studies* 79(3), 1053–1079.
* Imai, K., and M. Ratkovic (2014). "Covariate Balancing Propensity
  Score." *JRSS-B* 76(1), 243–263.