Bandwidth Selection and Conley Spatial HAC Standard Errors

Overview

The SpatialInference package provides tools for statistical inference with spatially correlated data. Its main features are:

  1. Conley spatial HAC standard errors — a computationally efficient implementation based on C++ code by Darin Christensen (Christensen et al. 2021) of the estimator introduced by Conley (1999).
  2. Covariogram-based bandwidth selection — a data-driven method for choosing the kernel bandwidth, based on extracting the correlation range from the empirical covariogram of regression residuals, as proposed by Lehner (2026).
  3. Diagnostic visualizations — including the inverse-U relationship between bandwidth and Conley SE magnitude (Lehner 2026), and covariogram plots with annotated range estimates.

Setup

Load the package and data containing the centroids of all 3,108 counties of the contiguous United States:

library(SpatialInference)
library(sf); library(ggplot2)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
library(lfe)
#> Loading required package: Matrix
library(modelsummary)
data("US_counties_centroids")

ggplot(US_counties_centroids) + geom_sf(size = .1) + theme_bw()

Simulated spatially correlated data

The dataset contains two independently generated spatially correlated noise variables (noise1 and noise2), drawn using geostatistical simulation. Although drawn independently, both exhibit strong spatial clustering — nearby counties have similar values:

ggplot(US_counties_centroids) + geom_sf(aes(col = noise1), size = .1) + theme_bw()
ggplot(US_counties_centroids) + geom_sf(aes(col = noise2), size = .1) + theme_bw()

For comparison, the dist variable is an example of spatial non-stationarity (a spatial trend rather than spatial autocorrelation):

ggplot(US_counties_centroids) + geom_sf(aes(col = dist), size = .1) + theme_bw()

The spurious regression problem

Even though the two noise variables were drawn independently (so the true coefficient is zero), a naive regression finds a highly significant relationship:

spuriouslm <- fixest::feols(noise1 ~ noise2, data = US_counties_centroids, vcov = "HC1")
spuriouslm
#> OLS estimation, Dep. Var.: noise1
#> Observations: 3,108
#> Standard-errors: Heteroskedasticity-robust 
#>                 Estimate Std. Error      t value   Pr(>|t|)    
#> (Intercept) 6.890000e-16   0.017872 3.860000e-14 1.0000e+00    
#> noise2      8.738022e-02   0.015535 5.624836e+00 2.0216e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.996015   Adj. R2: 0.007316

This is because the OLS errors \(\varepsilon_i\) are not i.i.d. — nearby residuals are positively correlated, which inflates the \(t\)-statistic. A map of the regression residuals reveals the spatial clustering:

US_counties_centroids$resid <- spuriouslm$residuals
ggplot(US_counties_centroids) +
  geom_sf(aes(col = resid), size = .1) + theme_bw() + scale_color_viridis_c()

Estimating the correlation range

Before computing Conley standard errors, we need to choose a bandwidth: the distance within which residuals are allowed to be correlated. The covgm_range() function estimates this from the data by computing the empirical covariogram of the regression residuals and identifying the distance at which covariation first crosses zero (Lehner 2026; Pebesma 2004).

covgm_range(US_counties_centroids) + theme_bw()

The plot shows the empirical covariogram — each point represents the average cross-product of residual pairs within a distance bin. The red vertical line marks the estimated correlation range \(\hat{\varsigma}\) (here around 831 km), and the red dashed line marks zero covariance. Beyond the estimated range, the covariogram fluctuates around zero, indicating that residual correlation has effectively ceased.

This is the key input for the Conley standard error: the bandwidth should match the distance over which residuals are actually correlated.

To extract just the numeric range for programmatic use:

# Compute the covariogram manually
covgm <- gstat::variogram(
  spuriouslm$residuals ~ 1,
  data = sf::as_Spatial(US_counties_centroids),
  covariogram = TRUE,
  width = 2e4,
  cutoff = as.numeric(max(sf::st_distance(US_counties_centroids))) * (2/3)
)
estimated_range <- extract_corr_range(covgm)
estimated_range
#> [1] 820.0129

The inverse-U relationship

A critical insight documented by Lehner (2026) is that the relationship between the bandwidth and the magnitude of Conley standard errors follows an inverse-U shape. This means that both too narrow and too wide bandwidths lead to underestimated standard errors — contradicting the conventional wisdom that wider bandwidths are always more conservative.

The inverseu_plot_conleyrange() function visualises this:

inverseu_plot_conleyrange(US_counties_centroids, seq(1, 2501, by = 200)) + theme_bw()

The grey dashed line represents the HC1 standard error. As the cutoff distance increases from zero, the Conley SE first rises (as positively correlated residual pairs are included), reaches a peak near the true correlation range, and then declines (as uncorrelated pairs dilute the estimate). At very wide bandwidths, the Conley SE can fall below the HC1 level — making wide-bandwidth Conley errors less conservative than heteroskedasticity-robust errors.

Computing Conley standard errors

With the estimated range in hand, we compute the spatial HAC standard error using conley_SE():

spuriouslm_fe <- felm(noise1 ~ noise2 | unit + year | 0 | lat + lon,
                       data = US_counties_centroids, keepCX = TRUE)
regfe_conley <- conley_SE(reg = spuriouslm_fe, unit = "unit", time = "year",
                          lat = "lat", lon = "lon",
                          kernel = "epanechnikov", dist_fn = "Haversine",
                          dist_cutoff = 831)

conley <- sapply(regfe_conley, function(x) diag(sqrt(x)))[2] %>% round(5)
conley
#> Spatial.noise2 
#>          0.137

The Conley SE at the estimated bandwidth is substantially larger than the HC1 error. Constructing a corrected \(t\)-statistic:

spuriouslm_fe$coefficients[1] / as.numeric(conley)
#> [1] 0.6378118

The \(t\)-statistic is far below 1.96, so we correctly fail to reject the null hypothesis — as expected, given that the two variables are genuinely independent.

Convenience wrappers

The compute_conley_lfe() function provides a shortcut:

compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "epanechnikov")
#> [1] 0.137

For a complete workflow — regression, Moran’s I test, and Conley SEs in one call — use lm_sac():

lmsac.out <- lm_sac("noise1 ~ noise2 | unit + year | 0 | lat + lon",
                     US_counties_centroids,
                     conley_cutoff = 831, conley_kernel = "epanechnikov")

lmsac.out$conley_SE
#> [1] 0.0000000 0.1370043

gm.param <- list(
  list("raw" = "nobs", "clean" = "Observations", "fmt" = 0),
  list("raw" = "Moran_y", "clean" = "Moran's I [y]", "fmt" = 3),
  list("raw" = "Moran_resid", "clean" = "Moran's I [resid]", "fmt" = 3),
  list("raw" = "y_mean", "clean" = "Mean [y]", "fmt" = 3),
  list("raw" = "y_SD", "clean" = "Std.Dev. [y]", "fmt" = 3)
)

modelsummary(lmsac.out,
             estimate = c("{estimate}"),
             statistic = c("({std.error})", "([{conley}])"),
             gof_omit = "Model|Range_resid|Range_resp|Range_y",
             gof_map = gm.param)
#> Found litedown! Enabling r-universe template
(1)
(Intercept) 0.000
(0.018)
([0.137])
noise2 0.087
(0.018)
([0.137])
Observations 3108

Kernel choices

The spatial HAC estimate is sensitive not only to the bandwidth but also to the kernel function. The package includes six kernels with different distance-decay profiles. The uniform kernel weights all pairs equally within the cutoff; the others downweight more distant pairs with varying decay rates:

compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "uniform")
#> [1] 0.13964
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "epanechnikov")
#> [1] 0.137
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "bartlett")
#> [1] 0.12253
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "parzen")
#> [1] 0.11584
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "gaussian")
#> [1] 0.13799
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "biweight")
#> [1] 0.13087

Based on Monte Carlo evidence in Lehner (2026), the Bartlett and Epanechnikov kernels tend to deliver the best size control in practice.

References

Christensen, Darin, Alexandra C. Hartman, and Cyrus Samii. 2021. “Legibility and External Investment: An Institutional Natural Experiment in Liberia.” International Organization 75 (4): 1087–108. https://doi.org/10.1017/S0020818321000187.
Conley, Timothy G. 1999. GMM Estimation with Cross Sectional Dependence.” Journal of Econometrics 92 (1): 1–45. https://doi.org/10.1016/S0304-4076(98)00084-0.
Lehner, Alexander. 2026. Bandwidth Selection for Spatial HAC Standard Errors. arXiv:2603.03997. arXiv. https://doi.org/10.48550/arXiv.2603.03997.
Pebesma, Edzer J. 2004. “Multivariable Geostatistics in S: The Gstat Package.” Computers & Geosciences 30 (7): 683–91. https://doi.org/10.1016/j.cageo.2004.03.012.