| Title: | Bias Correction Methods for Models Using Synthetic Data |
|---|---|
| Description: | Implements three bias-correction techniques from Battaglia et al. (2025 <doi:10.48550/arXiv.2402.15585>) to improve inference in regression models with covariates generated by AI or machine learning. |
| Authors: | Konrad Kurczynski [aut, cre], Timothy Christensen [aut] |
| Maintainer: | Konrad Kurczynski |
| License: | MIT + file LICENSE |
| Version: | 0.2.2 |
| Built: | 2026-05-25 06:23:47 UTC |
| Source: | https://github.com/cran/MLBC |
Ordinary Least Squares regression with support for both formula and array-based interfaces. This function provides a unified interface for fitting linear models using either R formulas with data frames or raw matrices.
    ols(Y, X = NULL, data = parent.frame(), se = TRUE, intercept = FALSE, ...)

    ## Default S3 method:
    ols(Y, X, data = parent.frame(), se = TRUE, intercept = FALSE, ...)

    ## S3 method for class 'formula'
    ols(Y, X = NULL, data = parent.frame(), se = TRUE, intercept = TRUE, ...)
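The usage above follows the standard S3 pattern: one generic dispatches on the class of `Y` to either a formula method or a default (array) method. The toy generic below is a minimal sketch of that pattern in base R. The name `fit_demo` and its internals are hypothetical and are not the package's implementation.

```r
# Hypothetical generic: dispatches on the class of its first argument
fit_demo <- function(Y, ...) UseMethod("fit_demo")

# Formula method: build the response and design matrix, then reuse the default
fit_demo.formula <- function(Y, data = parent.frame(), ...) {
  mf <- model.frame(Y, data = data)
  y  <- model.response(mf)
  X  <- model.matrix(attr(mf, "terms"), data = mf)
  # Drop the intercept column and let the default method add it back
  fit_demo.default(y, X[, -1, drop = FALSE], intercept = TRUE)
}

# Default (array) method: plain least squares via the normal equations
fit_demo.default <- function(Y, X, intercept = FALSE, ...) {
  X <- as.matrix(X)
  if (intercept) X <- cbind(Intercept = 1, X)
  drop(solve(crossprod(X), crossprod(X, Y)))
}

# Both interfaces give the same estimates on simulated data
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 2 + 3 * d$x + rnorm(50)
est_formula <- fit_demo(y ~ x, data = d)
est_array   <- fit_demo(d$y, d$x, intercept = TRUE)
```

Note that in the usage above the `intercept` default differs between the two methods (`TRUE` for formulas, `FALSE` for arrays), so array callers who want an intercept must request it explicitly.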
| Argument | Description |
|---|---|
| Y | numeric response vector, or a two-sided formula |
| X | numeric design matrix (if Y is numeric) |
| data | data frame (if Y is a formula) |
| se | logical; return heteroskedasticity-robust standard errors? |
| intercept | logical; include an intercept term? |
| ... | unused |
An object of class mlbc_fit and mlbc_ols with:
coef: coefficient estimates
vcov: variance-covariance matrix
sXX: scaled cross-product X'X / n
Option 1: Formula Interface
Y: A two-sided formula (e.g., y ~ x1 + x2)
data: A data frame containing the variables referenced in the formula
Option 2: Array Interface
Y: Response variable vector
X: Design matrix of covariates
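To make the returned quantities concrete, the following base-R sketch computes OLS coefficients, the scaled cross-product X'X / n, and a heteroskedasticity-robust variance matrix on simulated data. It uses the HC0 sandwich form, which is one common robust estimator. Whether `ols()` uses exactly this variant is an assumption, and all variable names are illustrative.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + abs(x))   # heteroskedastic noise
X <- cbind(Intercept = 1, x = x)

# Scaled cross-product, analogous to the sXX component described above
sXX <- crossprod(X) / n

# OLS coefficients from the normal equations
beta <- drop(solve(crossprod(X), crossprod(X, y)))

# HC0 sandwich: sXX^{-1} * [(1/n) sum u_i^2 x_i x_i'] * sXX^{-1} / n
u        <- drop(y - X %*% beta)
meat     <- crossprod(X * u) / n
vcov_hc0 <- solve(sXX) %*% meat %*% solve(sXX) / n
se_hc0   <- sqrt(diag(vcov_hc0))
```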
    # Load the remote work dataset
    data(SD_data)

    # Formula interface
    fit1 <- ols(log(salary) ~ wfh_wham + soc_2021_2 + employment_type_name,
                data = SD_data)
    summary(fit1)

    # Array interface
    Y <- log(SD_data$salary)
    X <- model.matrix(~ wfh_wham + soc_2021_2, data = SD_data)
    fit2 <- ols(Y, X[, -1], intercept = TRUE)  # exclude intercept column
    summary(fit2)
Performs an additive bias correction to regressions that include a binary covariate generated by AI/ML. This method requires an external estimate of the false-positive rate. Standard errors are adjusted to account for uncertainty in the false-positive rate estimate.
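The simulation below is a minimal sketch of why a correction is needed at all. It does not call the package. A true binary regressor is replaced by a noisy label with false positives, and the naive OLS slope shrinks toward zero. The misclassification rate, sample size, and coefficients are made-up values for illustration.

```r
set.seed(123)
n <- 20000

# True (unobserved) binary covariate
x_true <- rbinom(n, 1, 0.3)

# Hypothetical classifier: 5% of true zeros are mislabelled as ones
false_pos <- rbinom(n, 1, 0.05)
x_hat <- ifelse(x_true == 0 & false_pos == 1, 1, x_true)

# Outcome depends on the true covariate only
y <- 1 + 0.5 * x_true + rnorm(n, sd = 0.5)

slope_true <- unname(coef(lm(y ~ x_true))[2])  # close to 0.5
slope_hat  <- unname(coef(lm(y ~ x_hat))[2])   # attenuated toward zero
```

In this setup the slope on `x_hat` is biased downward because some observations labelled 1 actually have a true value of 0.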
    ols_bca(Y, Xhat = NULL, fpr, m, data = parent.frame(),
            intercept = TRUE, gen_idx = 1, ...)

    ## Default S3 method:
    ols_bca(Y, Xhat, fpr, m, data = parent.frame(),
            intercept = TRUE, gen_idx = 1, ...)

    ## S3 method for class 'formula'
    ols_bca(Y, Xhat = NULL, fpr, m, data = parent.frame(),
            intercept = TRUE, gen_idx = 1, ...)
| Argument | Description |
|---|---|
| Y | numeric response vector, or a two-sided formula |
| Xhat | numeric matrix of regressors (if Y is numeric) |
| fpr | numeric; estimated false-positive rate of the ML regressor |
| m | integer; size of the external sample used to estimate the classifier's false-positive rate. Can be set to a large number when the false-positive rate is known exactly |
| data | data frame (if Y is a formula) |
| intercept | logical; if TRUE, includes an intercept term |
| gen_idx | integer; 1-based index of the ML-generated variable to apply bias correction to. If not specified, defaults to the first non-intercept variable |
| ... | unused |
An object of class mlbc_fit and mlbc_bca with:
coef: bias-corrected coefficient estimates (ML-slope first, other slopes, intercept last)
vcov: adjusted variance-covariance matrix for those coefficients
Option 1: Formula Interface
Y: A two-sided formula (e.g., y ~ x1 + x2)
data: Data frame containing the variables referenced in the formula
Option 2: Array Interface
Y: Response variable vector
Xhat: Design matrix of covariates
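The role of `m` can be illustrated with the usual binomial standard error of an estimated proportion. The sketch below shows how the sampling uncertainty in an estimated rate shrinks as the validation sample grows, which is why a very large `m` approximates the case of a known rate. This is a generic illustration, and the exact variance adjustment used by `ols_bca()` is not shown here and may differ.

```r
fpr_hat <- 0.009                     # estimated rate, as in the package examples
m_grid  <- c(100, 1000, 10000, 1e6)  # hypothetical validation sample sizes

# Binomial standard error of a proportion estimated from m observations
se_fpr <- sqrt(fpr_hat * (1 - fpr_hat) / m_grid)

data.frame(m = m_grid, se_fpr = signif(se_fpr, 3))
```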
    # Load the remote work dataset
    data(SD_data)

    # Formula interface
    fit_bca <- ols_bca(log(salary) ~ wfh_wham + soc_2021_2 + employment_type_name,
                       data = SD_data,
                       fpr = 0.009,  # estimated false positive rate
                       m = 1000)     # validation sample size
    summary(fit_bca)

    # Array interface
    Y <- log(SD_data$salary)
    Xhat <- model.matrix(~ wfh_wham + soc_2021_2, data = SD_data)[, -1]
    fit_bca2 <- ols_bca(Y, Xhat, fpr = 0.009, m = 1000, intercept = TRUE)
    summary(fit_bca2)
Bias-corrected additive estimator for topic model regression. This method applies additive bias correction to regressions that include topic proportions as covariates, accounting for estimation uncertainty in the topic model.
    ols_bca_topic(Y, Q = NULL, W, S, B, k, data = parent.frame(),
                  intercept = TRUE, ...)

    ## Default S3 method:
    ols_bca_topic(Y, Q = NULL, W, S, B, k, data = parent.frame(),
                  intercept = TRUE, ...)

    ## S3 method for class 'formula'
    ols_bca_topic(Y, Q = NULL, W, S, B, k, data = parent.frame(),
                  intercept = TRUE, ...)
| Argument | Description |
|---|---|
| Y | numeric response vector, or a two-sided formula |
| Q | numeric matrix of additional controls (if Y is numeric) |
| W | numeric matrix of document-term frequencies |
| S | numeric matrix of topic loadings |
| B | numeric matrix of topic-word distributions |
| k | numeric; bias correction parameter |
| data | data frame (if Y is a formula) |
| intercept | logical; if TRUE, includes an intercept term |
| ... | additional arguments |
An object of class mlbc_fit and mlbc_bca_topic with:
coef: bias-corrected coefficient estimates
vcov: adjusted variance-covariance matrix
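The package example for this function sets `k` to the mean of the reciprocals of a data column, scaled by the square root of the sample size. The toy calculation below reproduces only that arithmetic on a made-up vector. What the first column of `lda_data` measures is not stated here, so treating it as a positive per-document quantity is an assumption.

```r
# Hypothetical positive per-document values (stand-in for lda_data[, 1])
counts <- c(50, 100, 200, 400)

# Same expression as in the package example: mean(1 / x) * sqrt(n)
kappa_demo <- mean(1 / counts) * sqrt(length(counts))
kappa_demo
```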
    # Load topic model dataset
    data(topic_model_data)

    # Extract components
    Y <- topic_model_data$estimation_data$ly
    Z <- as.matrix(topic_model_data$covars)
    theta_full <- as.matrix(topic_model_data$theta_est_full)
    beta_full <- as.matrix(topic_model_data$beta_est_full)
    lda_data <- as.matrix(topic_model_data$lda_data)

    # Apply additive bias correction
    kappa <- mean(1.0 / lda_data[, 1]) * sqrt(nrow(lda_data))
    S <- matrix(c(1.0, 0.0), nrow = 1)
    fit <- ols_bca_topic(Y, Z, theta_full, S, beta_full, k = kappa)
    summary(fit)
Performs a multiplicative bias correction to regressions that include a binary covariate generated by AI/ML. This method requires an external estimate of the false-positive rate. Standard errors are adjusted to account for uncertainty in the false-positive rate estimate.
    ols_bcm(Y, Xhat = NULL, fpr, m, data = parent.frame(),
            intercept = TRUE, gen_idx = 1, ...)

    ## Default S3 method:
    ols_bcm(Y, Xhat, fpr, m, data = parent.frame(),
            intercept = TRUE, gen_idx = 1, ...)

    ## S3 method for class 'formula'
    ols_bcm(Y, Xhat = NULL, fpr, m, data = parent.frame(),
            intercept = TRUE, gen_idx = 1, ...)
| Argument | Description |
|---|---|
| Y | numeric response vector, or a two-sided formula |
| Xhat | numeric matrix of regressors (if Y is numeric) |
| fpr | numeric; estimated false-positive rate of the ML regressor |
| m | integer; size of the external sample used to estimate the classifier's false-positive rate. Can be set to a large number when the false-positive rate is known exactly |
| data | data frame (if Y is a formula) |
| intercept | logical; if TRUE, includes an intercept term |
| gen_idx | integer; 1-based index of the ML-generated variable to apply bias correction to. If not specified, defaults to the first non-intercept variable |
| ... | unused |
An object of class mlbc_fit and mlbc_bcm with:
coef: bias-corrected coefficient estimates (ML-slope first, other slopes, intercept last)
vcov: adjusted variance-covariance matrix for those coefficients
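Given a coefficient vector and its variance-covariance matrix, standard errors and normal-approximation confidence intervals follow directly. The sketch below uses a mock list with made-up numbers. It assumes only that the two components are named `coef` and `vcov` as listed above, and accessing them with `$` on a real fitted object is an assumption.

```r
# Mock object with made-up values (ML-slope first, intercept last)
toy_fit <- list(
  coef = c(ml_slope = 0.08, intercept = 10.5),
  vcov = matrix(c(0.0004, 0, 0, 0.01), nrow = 2)
)

# Standard errors are the square roots of the diagonal of vcov
se <- sqrt(diag(toy_fit$vcov))

# 95% confidence intervals under a normal approximation
z  <- qnorm(0.975)
ci <- cbind(lower = toy_fit$coef - z * se,
            upper = toy_fit$coef + z * se)
```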
Option 1: Formula Interface
Y: A two-sided formula (e.g., y ~ x1 + x2)
data: Data frame containing the variables referenced in the formula
Option 2: Array Interface
Y: Response variable vector
Xhat: Design matrix of covariates
    # Load the remote work dataset
    data(SD_data)

    # Formula interface
    fit_bcm <- ols_bcm(log(salary) ~ wfh_wham + soc_2021_2 + employment_type_name,
                       data = SD_data,
                       fpr = 0.009,  # estimated false positive rate
                       m = 1000)     # validation sample size
    summary(fit_bcm)

    # Compare with uncorrected OLS
    fit_ols <- ols(log(salary) ~ wfh_wham + soc_2021_2 + employment_type_name,
                   data = SD_data)

    # Display coefficient comparison
    data.frame(
      OLS = coef(fit_ols)[1:2],
      BCM = coef(fit_bcm)[1:2]
    )
Bias-corrected multiplicative estimator for topic model regression. This method applies multiplicative bias correction to regressions that include topic proportions as covariates, accounting for estimation uncertainty in the topic model.
    ols_bcm_topic(Y, Q = NULL, W, S, B, k, data = parent.frame(),
                  intercept = TRUE, ...)

    ## Default S3 method:
    ols_bcm_topic(Y, Q = NULL, W, S, B, k, data = parent.frame(),
                  intercept = TRUE, ...)

    ## S3 method for class 'formula'
    ols_bcm_topic(Y, Q = NULL, W, S, B, k, data = parent.frame(),
                  intercept = TRUE, ...)
| Argument | Description |
|---|---|
| Y | numeric response vector, or a two-sided formula |
| Q | numeric matrix of additional controls (if Y is numeric) |
| W | numeric matrix of document-term frequencies |
| S | numeric matrix of topic loadings |
| B | numeric matrix of topic-word distributions |
| k | numeric; bias correction parameter |
| data | data frame (if Y is a formula) |
| intercept | logical; if TRUE, includes an intercept term |
| ... | additional arguments |
An object of class mlbc_fit and mlbc_bcm_topic with:
coef: bias-corrected coefficient estimates
vcov: adjusted variance-covariance matrix
    # Load topic model dataset
    data(topic_model_data)

    # Extract components
    Y <- topic_model_data$estimation_data$ly
    Z <- as.matrix(topic_model_data$covars)
    theta_full <- as.matrix(topic_model_data$theta_est_full)
    beta_full <- as.matrix(topic_model_data$beta_est_full)
    lda_data <- as.matrix(topic_model_data$lda_data)

    # Apply multiplicative bias correction
    kappa <- mean(1.0 / lda_data[, 1]) * sqrt(nrow(lda_data))
    S <- matrix(c(1.0, 0.0), nrow = 1)
    fit <- ols_bcm_topic(Y, Z, theta_full, S, beta_full, k = kappa)
    summary(fit)
Maximum likelihood estimation of the regression model, treating the generated covariate as a noisy proxy for the true latent variable. This method is particularly useful when an estimate of the false positive rate is not available. The variance of the estimates is approximated via the inverse Hessian at the optimum.
    one_step(Y, Xhat = NULL, homoskedastic = FALSE,
             distribution = c("normal", "t", "laplace", "gamma", "beta"),
             nu = 4, gshape = 2, gscale = 1, ba = 2, bb = 2,
             intercept = TRUE, gen_idx = 1, data = parent.frame(), ...)

    ## Default S3 method:
    one_step(Y, Xhat, homoskedastic = FALSE,
             distribution = c("normal", "t", "laplace", "gamma", "beta"),
             nu = 4, gshape = 2, gscale = 1, ba = 2, bb = 2,
             intercept = TRUE, gen_idx = 1, ...)

    ## S3 method for class 'formula'
    one_step(Y, Xhat = NULL, homoskedastic = FALSE,
             distribution = c("normal", "t", "laplace", "gamma", "beta"),
             nu = 4, gshape = 2, gscale = 1, ba = 2, bb = 2,
             intercept = TRUE, gen_idx = 1, data = parent.frame(), ...)
| Argument | Description |
|---|---|
| Y | numeric response vector, or a two-sided formula |
| Xhat | numeric matrix of regressors (if Y is numeric) |
| homoskedastic | logical; if TRUE, errors are assumed homoskedastic |
| distribution | character; distribution for error terms. One of "normal", "t", "laplace", "gamma", or "beta" |
| nu | numeric; degrees of freedom (for Student-t distribution) |
| gshape | numeric; shape parameter (for Gamma distribution) |
| gscale | numeric; scale parameter (for Gamma distribution) |
| ba | numeric; alpha parameter (for Beta distribution) |
| bb | numeric; beta parameter (for Beta distribution) |
| intercept | logical; if TRUE, includes an intercept term |
| gen_idx | integer; index (1-based) of the binary ML-generated variable. If not specified, defaults to the first non-intercept variable |
| data | data frame (if Y is a formula) |
| ... | unused |
An object of class mlbc_fit and mlbc_onestep with:
coef: estimated regression coefficients
vcov: variance-covariance matrix
Option 1: Formula Interface
Y: A two-sided formula (e.g., y ~ x1 + x2)
data: Data frame containing the variables referenced in the formula
Option 2: Array Interface
Y: Response variable vector
Xhat: Design matrix of covariates
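The inverse-Hessian variance approximation mentioned above can be sketched with base R's `optim()`. For brevity the example maximizes a plain Gaussian regression likelihood on simulated data rather than the latent-variable likelihood that `one_step()` is described as using, so it illustrates only the variance step. All names and values are illustrative.

```r
set.seed(1)
n <- 500
x <- rbinom(n, 1, 0.3)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.8)

# Negative log-likelihood; sigma is parametrized on the log scale
negloglik <- function(par) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# Minimize and request the numerical Hessian at the optimum
opt <- optim(c(0, 0, 0), negloglik, method = "BFGS", hessian = TRUE)

# Inverse Hessian of the negative log-likelihood approximates the variance
vcov_mle <- solve(opt$hessian)
se_mle   <- sqrt(diag(vcov_mle))

est_mle <- opt$par[1:2]   # intercept and slope
```

For this simple model the point estimates coincide with OLS up to optimizer tolerance, which gives a quick sanity check on the optimizer.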
    # Load the remote work dataset
    data(SD_data)

    # Basic one-step estimation
    fit_onestep <- one_step(log(salary) ~ wfh_wham + soc_2021_2 + employment_type_name,
                            data = SD_data)
    summary(fit_onestep)

    # With different error distribution
    fit_t <- one_step(log(salary) ~ wfh_wham + soc_2021_2,
                      data = SD_data,
                      distribution = "t",
                      nu = 4)
    summary(fit_t)

    # Homoskedastic errors
    fit_homo <- one_step(log(salary) ~ wfh_wham + soc_2021_2,
                         data = SD_data,
                         homoskedastic = TRUE)
    summary(fit_homo)
A subset of data relating to job postings on the Lightcast platform for demonstrating bias correction methods with ML-generated variables.
    SD_data

A data frame with 16315 rows and 6 columns:
Character. City of the job posting
naics_2022_2: Character. Type of business (NAICS industry classification)
salary: Numeric. Salary offered (response variable)
wfh_wham: Numeric. Binary label generated via ML, indicating whether remote work is offered (subject to measurement error)
soc_2021_2: Character. Occupation code (SOC classification)
employment_type_name: Character. Employment type (part time/full time)
Proprietary data from the Lightcast job postings platform.
    ## Not run:
    data(SD_data)
    fit <- ols_bca(log(salary) ~ wfh_wham + soc_2021_2 + naics_2022_2,
                   data = SD_data, fpr = 0.009, m = 1000)
    ## End(Not run)
Dataset containing topic model outputs for demonstrating bias correction methods in topic model regressions using CEO diary data.
    topic_model_data
A list with 8 components:
covars: Data frame (916 x 11): Control variables
estimation_data: Data frame (916 x 672): Contains outcome ly and word frequencies
Data frame (2000 x 2): MCMC draws
theta_est_full: Data frame (916 x 2): Full sample topic proportions
Data frame (916 x 2): Subsample topic proportions
beta_est_full: Data frame (2 x 654): Full sample topic-word distributions
Data frame (2 x 654): Subsample topic-word distributions
lda_data: Data frame (916 x 2): LDA validation data
CEO diary data from Bandiera et al. (2020), Journal of Political Economy.
    data(topic_model_data)

    # Basic exploration
    Y <- topic_model_data$estimation_data$ly
    theta <- as.matrix(topic_model_data$theta_est_full)
    cat("Sample size:", length(Y), "\n")
    cat("Mean log employment:", round(mean(Y), 2), "\n")
    cat("Topic 1 mean:", round(mean(theta[, 1]), 3), "\n")