Title: | Estimation of Conditional Average Treatment Effects with High-Dimensional Data |
---|---|
Description: | A two-step double-robust method to estimate the conditional average treatment effects (CATE) with potentially high-dimensional covariate(s). In the first stage, the nuisance functions necessary for identifying CATE are estimated by machine learning methods, allowing the number of covariates to be comparable to or larger than the sample size. The second stage consists of a low-dimensional local linear regression, reducing CATE to a function of the covariate(s) of interest. The CATE estimator implemented in this package not only allows for high-dimensional data, but also has the “double robustness” property: either the model for the propensity score or the models for the conditional means of the potential outcomes are allowed to be misspecified (but not both). This package is based on the paper by Fan et al., "Estimation of Conditional Average Treatment Effects With High-Dimensional Data" (2022), Journal of Business & Economic Statistics <doi:10.1080/07350015.2020.1811102>. |
Authors: | Qingliang Fan [aut, cre], Hengzhao Hong [aut] |
Maintainer: | Qingliang Fan <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.0 |
Built: | 2024-11-06 06:40:01 UTC |
Source: | CRAN |
Use a two-step procedure to estimate the conditional average treatment effects (CATE) with potentially high-dimensional covariate(s).
Run browseVignettes('hdcate') to browse the user manual of this package.
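The two-step idea can be illustrated in base R. The sketch below is not the package's implementation: it assumes simulated data, uses parametric fits (`lm`, `glm`) as stand-ins for the machine-learning first stage, and all object names (`psi`, `mu1_hat`, `ps_hat`, and so on) are hypothetical. It builds the standard doubly robust (augmented inverse-probability-weighted) score, whose average estimates the average treatment effect; the second stage would then smooth this score against the covariate of interest.

```r
set.seed(1)
n  <- 1000
x1 <- runif(n, -1, 1)
x2 <- rnorm(n)
d  <- rbinom(n, 1, plogis(0.5 * x1))      # treatment indicator
y  <- 1 + x2 + d * (1 + x1) + rnorm(n)    # true CATE given x1 is 1 + x1, true ATE is 1
df <- data.frame(y, d, x1, x2)

# Stage 1: nuisance functions (parametric stand-ins for the ML step)
ps_fit  <- glm(d ~ x1 + x2, family = binomial(), data = df)
mu1_fit <- lm(y ~ x1 + x2, data = df[df$d == 1, ])
mu0_fit <- lm(y ~ x1 + x2, data = df[df$d == 0, ])

ps_hat  <- predict(ps_fit, newdata = df, type = "response")
mu1_hat <- predict(mu1_fit, newdata = df)
mu0_hat <- predict(mu0_fit, newdata = df)

# Doubly robust score: outcome-model difference plus
# inverse-probability-weighted residual corrections
psi <- mu1_hat - mu0_hat +
  df$d * (df$y - mu1_hat) / ps_hat -
  (1 - df$d) * (df$y - mu0_hat) / (1 - ps_hat)

# Averaging the score gives an estimate of the average treatment effect
ate_hat <- mean(psi)
print(ate_hat)
```

The score stays centered on the treatment effect if either the propensity model or the pair of outcome models is correct, which is the property the Description calls double robustness.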
HDCATE(data, y_name, d_name, x_formula)
data |
data frame of the observed data |
y_name |
variable name of the observed outcomes |
d_name |
variable name of the treatment indicators |
x_formula |
formula of the covariates |
An initialized HDCATE model (object), ready for estimation.
# get simulation data
n_obs <- 500  # Num of observations
n_var <- 100  # Num of observed variables
n_rel_var <- 4  # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)

# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# for example, and alternatively, the propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')

# Example 1: full-sample estimator
# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)

# estimate HDCATE function, inference, and plot
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
HDCATE.fit(model)
HDCATE.inference(model)
HDCATE.plot(model)

# Example 2: cross-fitting estimator
# change above estimator to cross-fitting mode, 5 folds, for example.
HDCATE.use_cross_fitting(model, k_fold=5)

# estimate HDCATE function, inference, and plot
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
HDCATE.fit(model)
HDCATE.inference(model)
HDCATE.plot(model)
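The `x_formula` argument in the example is a plain character string built with `paste`. As a small illustration with a reduced, hypothetical `n_var` of 5, the snippet below shows the string that results and how it combines with an outcome name into a regular R formula.

```r
n_var <- 5  # reduced from 100 so the result is easy to read

# Right-hand side of the covariate formula, as a single string
x_formula <- paste(paste0("X", 2:n_var), collapse = "+")
print(x_formula)
# [1] "X2+X3+X4+X5"

# Combined with an outcome name it becomes a formula object
f <- as.formula(paste0("Y ~ ", x_formula))
print(all.vars(f))
# [1] "Y"  "X2" "X3" "X4" "X5"
```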
Fit the HDCATE function
HDCATE.fit(HDCATE_model, verbose = TRUE)
HDCATE_model |
an object created via HDCATE |
verbose |
whether the verbose message is displayed, the default is TRUE |
None. The HDCATE_model is fitted.
# get simulation data
n_obs <- 500  # Num of observations
n_var <- 100  # Num of observed variables
n_rel_var <- 4  # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)

# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')

# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)

HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
HDCATE.fit(model)
Get simulation data
HDCATE.get_sim_data( n_obs = 500, n_var = 100, n_rel_var = 4, sig_strength_propensity = 0.5, sig_strength_outcome = 1, intercept = 10 )
n_obs |
Num of observations |
n_var |
Num of covariates |
n_rel_var |
Num of relevant variables; only the first n_rel_var variables are relevant |
sig_strength_propensity |
signal strength in propensity score functions |
sig_strength_outcome |
signal strength in outcome functions |
intercept |
value of intercept in outcome functions |
a data.frame, which is the simulated observed data.
HDCATE.get_sim_data()
HDCATE.get_sim_data(n_obs=50, n_var=4, n_rel_var=2)
Construct uniform confidence bands
HDCATE.inference( HDCATE_model, sig_level = 0.01, n_rep_boot = 1000, verbose = FALSE )
HDCATE_model |
an object created via HDCATE |
sig_level |
significance level(s), either a single value or a vector, such as 0.01 or c(0.01, 0.05, 0.10) |
n_rep_boot |
number of bootstrap replications, the default is 1000 |
verbose |
whether the verbose message is displayed, the default is FALSE |
None. The HDCATE confidence bands are constructed.
# get simulation data
n_obs <- 500  # Num of observations
n_var <- 100  # Num of observed variables
n_rel_var <- 4  # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)

# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')

# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)

HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
HDCATE.fit(model)
HDCATE.inference(model)
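To give some intuition for how a vector-valued `sig_level` and `n_rep_boot` interact in a uniform band, here is a generic base-R sketch. It is not the package's algorithm: the matrix `t_boot` is a hypothetical stand-in for bootstrap t-statistics (simulated here as independent standard normals), and the variable names are illustrative only. A uniform band takes its critical value from the distribution of the largest absolute statistic over the whole evaluation grid, one value per significance level.

```r
set.seed(42)
n_rep_boot <- 1000                  # bootstrap replications
n_grid     <- 50                    # number of evaluation points
sig_level  <- c(0.01, 0.05, 0.10)

# Hypothetical stand-in for bootstrap t-statistics
# (rows: bootstrap draws, columns: evaluation points)
t_boot <- matrix(rnorm(n_rep_boot * n_grid), nrow = n_rep_boot)

# Largest absolute statistic over the grid in each draw
sup_t <- apply(abs(t_boot), 1, max)

# One uniform critical value per significance level
crit_uniform <- quantile(sup_t, probs = 1 - sig_level, names = FALSE)

# Pointwise normal critical values, for comparison
crit_pointwise <- qnorm(1 - sig_level / 2)

print(rbind(sig_level, crit_uniform, crit_pointwise))
```

The uniform critical values exceed the pointwise ones because they must cover the whole curve at once, so a uniform band is wider than a collection of pointwise intervals at the same level.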
Plot HDCATE function and the uniform confidence bands
HDCATE.plot( HDCATE_model, output_pdf = FALSE, pdf_name = "hdcate_plot.pdf", include_band = TRUE, test_side = "both", y_axis_min = "auto", y_axis_max = "auto", display.hdcate = "HDCATEF", display.ate = "ATE", display.siglevel = "sig_level" )
HDCATE_model |
an object created via HDCATE |
output_pdf |
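if TRUE, the plot is saved as a PDF file |
pdf_name |
file name when output_pdf is TRUE, the default is "hdcate_plot.pdf" |
include_band |
if TRUE, the uniform confidence bands are included in the plot |
test_side |
|
y_axis_min |
minimum value of the Y axis to plot in the graph, the default is "auto" |
y_axis_max |
maximum value of the Y axis to plot in the graph, the default is "auto" |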
display.hdcate |
the name of HDCATE function in the legend, the default is 'HDCATEF' |
display.ate |
the name of average treatment effect in the legend, the default is 'ATE' |
display.siglevel |
the name of the significance level for confidence bands in the legend, the default is 'sig_level' |
None. A plot will be shown or saved as PDF.
# get simulation data
n_obs <- 500  # Num of observations
n_var <- 100  # Num of observed variables
n_rel_var <- 4  # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)

# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')

# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)

HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
HDCATE.fit(model)
HDCATE.inference(model)
HDCATE.plot(model)
Set user-defined bandwidth.
HDCATE.set_bw(model, bandwidth = "default")
model |
an object created via HDCATE |
bandwidth |
the value of bandwidth |
None.
# get simulation data
n_obs <- 500  # Num of observations
n_var <- 100  # Num of observed variables
n_rel_var <- 4  # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)

# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')

# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)

# Set user-defined bandwidth, e.g., 0.15.
HDCATE.set_bw(model, 0.15)
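The Description states that the second stage is a local linear regression, which is where a bandwidth matters. The following base-R sketch shows the role of the bandwidth in such a regression. It is an illustration under assumptions, not the package's code: the Gaussian kernel, the helper name `local_linear`, and the simulated data are all hypothetical choices.

```r
# Local linear estimate of E[y | x = x0].
# The Gaussian kernel is an assumption made for this sketch.
local_linear <- function(x, y, x0, bandwidth) {
  w   <- dnorm((x - x0) / bandwidth)      # kernel weights centered at x0
  fit <- lm(y ~ I(x - x0), weights = w)   # weighted least squares around x0
  unname(coef(fit)[1])                    # intercept is the fitted value at x0
}

set.seed(7)
x <- runif(300, -1, 1)
y <- sin(2 * x) + rnorm(300, sd = 0.2)
grid <- seq(-0.5, 0.5, by = 0.25)

# A small bandwidth follows the data closely; a large one smooths more
fit_narrow <- sapply(grid, function(x0) local_linear(x, y, x0, bandwidth = 0.15))
fit_wide   <- sapply(grid, function(x0) local_linear(x, y, x0, bandwidth = 1.00))

print(round(cbind(grid, fit_narrow, fit_wide), 3))
```

A smaller bandwidth lowers bias but raises variance, so a user-supplied value such as 0.15 trades one against the other.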
Set the conditional variable in CATE
HDCATE.set_condition_var( HDCATE_model, name = NA, min = NA, max = NA, step = NA )
HDCATE_model |
an object created via HDCATE |
name |
name of the conditional variable |
min |
minimum value of the conditional variable for evaluation |
max |
maximum value of the conditional variable for evaluation |
step |
minimum distance between two evaluation points |
None. The HDCATE_model is ready to fit.
# get simulation data
n_obs <- 500  # Num of observations
n_var <- 100  # Num of observed variables
n_rel_var <- 4  # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)

# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')

# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)

HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
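To see what `min`, `max`, and `step` imply for the number of evaluation points, the sketch below builds an evenly spaced grid with base R. That the package constructs its grid exactly this way is an assumption; the variable names are hypothetical.

```r
cond_min  <- -1
cond_max  <- 1
cond_step <- 0.01

# Evenly spaced evaluation points between min and max
eval_grid <- seq(cond_min, cond_max, by = cond_step)

print(length(eval_grid))
# [1] 201
print(range(eval_grid))
```

Halving `step` roughly doubles the number of points at which the CATE function is evaluated, so a finer grid costs more computation in fitting and inference.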
Set user-defined ML methods (such as random forests, elastic-net, boosting) to run the first-stage estimation.
HDCATE.set_first_stage( model, fit.treated, fit.untreated, fit.propensity, predict.treated, predict.untreated, predict.propensity )
model |
an object created via HDCATE |
fit.treated |
function that accepts a data.frame as the only argument, fits the treated expectation function, and returns a fitted object |
fit.untreated |
function that accepts a data.frame as the only argument, fits the untreated expectation function, and returns a fitted object |
fit.propensity |
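function that accepts a data.frame as the only argument, fits the propensity function, and returns a fitted object |
predict.treated |
function that accepts the returned object of fit.treated and a data.frame as arguments, and returns the predictions for that data.frame |
predict.untreated |
function that accepts the returned object of fit.untreated and a data.frame as arguments, and returns the predictions for that data.frame |
predict.propensity |
function that accepts the returned object of fit.propensity and a data.frame as arguments, and returns the predicted propensity scores for that data.frame |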
None.
# get simulation data
n_obs <- 500  # Num of observations
n_var <- 100  # Num of observed variables
n_rel_var <- 4  # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)

# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')

# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)

# manually define a lasso method
my_lasso_fit_exp <- function(df) {
  hdm::rlasso(as.formula(paste0('Y', "~", x_formula)), df)
}
my_lasso_predict_exp <- function(fitted_model, df) {
  predict(fitted_model, df)
}
my_lasso_fit_ps <- function(df) {
  hdm::rlassologit(as.formula(paste0('D', "~", x_formula)), df)
}
my_lasso_predict_ps <- function(fitted_model, df) {
  predict(fitted_model, df, type="response")
}

# Apply the "my-lasso" approach to the first stage
HDCATE.set_first_stage(
  model,
  my_lasso_fit_exp,
  my_lasso_fit_exp,
  my_lasso_fit_ps,
  my_lasso_predict_exp,
  my_lasso_predict_exp,
  my_lasso_predict_ps
)
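The fit/predict contract can be checked outside the package before handing functions to `HDCATE.set_first_stage`. The sketch below follows the same shape as the lasso example but uses base-R `lm` and `glm` so that it runs without extra packages. The function names, the three-covariate formula, and the toy data are hypothetical, and OLS or logit would not be a sensible choice when the number of covariates is close to the sample size.

```r
x_formula <- "X1+X2+X3"  # small hypothetical covariate set

# Fit functions: take a data.frame, return a fitted object
my_ols_fit_exp <- function(df) {
  lm(as.formula(paste0("Y ~ ", x_formula)), data = df)
}
my_logit_fit_ps <- function(df) {
  glm(as.formula(paste0("D ~ ", x_formula)), family = binomial(), data = df)
}

# Predict functions: take the fitted object and a data.frame,
# return one prediction per row
my_ols_predict_exp <- function(fitted_model, df) {
  unname(predict(fitted_model, newdata = df))
}
my_logit_predict_ps <- function(fitted_model, df) {
  unname(predict(fitted_model, newdata = df, type = "response"))
}

# Toy data to exercise the contract
set.seed(3)
toy <- data.frame(X1 = rnorm(200), X2 = rnorm(200), X3 = rnorm(200))
toy$D <- rbinom(200, 1, plogis(0.5 * toy$X1))
toy$Y <- 1 + toy$X1 + toy$D + rnorm(200)

mu_hat <- my_ols_predict_exp(my_ols_fit_exp(toy), toy)
ps_hat <- my_logit_predict_ps(my_logit_fit_ps(toy), toy)

print(length(mu_hat))
print(range(ps_hat))
```

Predicted propensity scores must lie strictly between 0 and 1, which is why the propensity predictor requests `type = "response"` rather than the default link scale.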
Inverse operation of HDCATE.set_first_stage: clears the user-defined first-stage methods and restores the built-in method.
HDCATE.unset_first_stage(model)
model |
an object created via HDCATE |
None.
# get simulation data
n_obs <- 500  # Num of observations
n_var <- 100  # Num of observed variables
n_rel_var <- 4  # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)

# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')

# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)

# ... manually set user-defined first-stage estimating methods via `HDCATE.set_first_stage`

# Clear those user-defined methods and use the built-in method
HDCATE.unset_first_stage(model)
Use k-fold cross-fitting estimator
HDCATE.use_cross_fitting(model, k_fold = 5, folds = NULL)
model |
an object created via HDCATE |
k_fold |
number of folds |
folds |
optional list of index vectors, one per fold, used to set the folds manually; the default is NULL |
None.
# get simulation data
n_obs <- 500  # Num of observations
n_var <- 100  # Num of observed variables
n_rel_var <- 4  # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)

# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')

# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)

# for example, use 5-fold cross-fitting estimator
HDCATE.use_cross_fitting(model, k_fold=5)

# alternatively, pass a list of index vector to the third argument to set the folds manually,
# in this case, the second argument k_fold is auto detected, you can pass any value to it.
HDCATE.use_cross_fitting(model, k_fold=2, folds=list(c(1:250), c(251:500)))
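The manual folds in the example split the sample by row order. If the rows are sorted in some meaningful way, random folds may be preferable. The base-R sketch below builds a randomized list of index vectors of the kind the `folds` argument describes; the variable names are hypothetical and the fold construction is one common approach, not something prescribed by the package.

```r
set.seed(123)
n_obs  <- 500
k_fold <- 5

# Randomly assign each row to one of k_fold folds of (near-)equal size
fold_id <- sample(rep(seq_len(k_fold), length.out = n_obs))

# List of index vectors, one per fold
folds <- unname(split(seq_len(n_obs), fold_id))

print(lengths(folds))
# [1] 100 100 100 100 100
```

Each observation appears in exactly one fold, so in cross-fitting its nuisance predictions come from models trained on the other folds only.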
Use the full-sample estimator. This is the default mode when creating a model via HDCATE.
HDCATE.use_full_sample(model)
model |
an object created via HDCATE |
None.
# get simulation data
n_obs <- 500  # Num of observations
n_var <- 100  # Num of observed variables
n_rel_var <- 4  # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)

# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')

# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)

HDCATE.use_full_sample(model)