Title: | Difference-in-Differences with Unpoolable Data |
---|---|
Description: | A framework for estimating difference-in-differences with unpoolable data, based on Karim, Webb, Austin, and Strumpf (2024) <doi:10.48550/arXiv.2403.15910>. Supports common or staggered adoption, multiple groups, and the inclusion of covariates. Also computes p-values for the aggregate average treatment effect on the treated via the randomization inference procedure described in MacKinnon and Webb (2020) <doi:10.1016/j.jeconom.2020.04.024>. |
Authors: | Eric Jamieson [aut, cre, cph] |
Maintainer: | Eric Jamieson <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.0 |
Built: | 2025-01-24 03:11:47 UTC |
Source: | CRAN |
empty_diff_df.csv
Creates the empty_diff_df.csv, which lists all of the differences that need to be calculated at each silo in order to compute the aggregate ATT. The empty_diff_df.csv is then sent to each silo to be filled out.
create_diff_df( init_filepath, date_format, freq, covariates = FALSE, freq_multiplier = FALSE, weights = "standard", filename = "empty_diff_df.csv", filepath = tempdir() )
init_filepath |
A character filepath to the |
date_format |
A character specifying the date format used in the
|
freq |
A character indicating the length of the time periods to be used
when computing the differences in mean outcomes between periods at each
silo. Options are: |
covariates |
A character vector specifying covariates to be considered
at each silo. If |
freq_multiplier |
A numeric value or |
weights |
A character indicating the weighting to use in the case of
common adoption. The |
filename |
A character filename for the created CSV file. Defaults to
|
filepath |
Filepath to save the CSV file. Defaults to |
Ensure that dates in the init.csv are entered consistently in the same date format. Call undid_date_formats() to see a list of valid date formats. Covariates specified when calling create_diff_df() will override any covariates specified in the init.csv.
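The consistency requirement above can be checked before an init.csv is written. The sketch below is a base R illustration only: `dates_match_format()` is a hypothetical helper, not part of undidR, and it uses R's own strptime-style format strings rather than the package's format names. Entries marked "control" would need to be excluded before such a check.

```r
# Hypothetical helper (not part of undidR): TRUE when every string
# parses as a date under the given strptime-style format.
dates_match_format <- function(x, format) {
  all(!is.na(as.Date(x, format = format)))
}

# Illustrative inputs: one consistent vector, one mixing two formats.
consistent <- c("1991-01-01", "1993-01-01", "1996-01-01")
inconsistent <- c("1991-01-01", "01/01/1993")

dates_match_format(consistent, "%Y-%m-%d")
dates_match_format(inconsistent, "%Y-%m-%d")
```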
A data frame detailing the silo and time combinations for which differences must be calculated in order to compute the aggregate ATT. A CSV copy is saved to the specified directory which is then to be sent out to each silo.
file_path <- system.file("extdata/staggered", "init.csv", package = "undidR")
create_diff_df(
  init_filepath = file_path,
  date_format = "yyyy",
  freq = "yearly"
)
unlink(file.path(tempdir(), "empty_diff_df.csv"))
init.csv
The create_init_csv() function generates a CSV file with information on each silo's start times, end times, and treatment times. If parameters are left empty, generates a blank CSV with only the headers.
create_init_csv( silo_names = character(), start_times = character(), end_times = character(), treatment_times = character(), covariates = character(), filename = "init.csv", filepath = tempdir() )
silo_names |
A character vector of silo names. |
start_times |
A character vector of start times. |
end_times |
A character vector of end times. |
treatment_times |
A character vector of treatment times. |
covariates |
A character vector of covariates, or, |
filename |
A character filename for the created initializing CSV file.
Defaults to |
filepath |
Filepath to save the CSV file. Defaults to |
Ensure dates are entered consistently in the same date format. Call undid_date_formats() to view valid date formats. Control silos should be marked as "control" in the treatment_times vector. If covariates is FALSE, no covariate column will be included in the CSV.
A data frame containing the contents written to the CSV file. The CSV file is saved in the specified directory (or in a temporary directory by default) with the default filename init.csv.
create_init_csv(
  silo_names = c("73", "46", "54", "23", "86", "32",
                 "71", "58", "64", "59", "85", "57"),
  start_times = "1989",
  end_times = "2000",
  treatment_times = c(rep("control", 6),
                      "1991", "1993", "1996", "1997", "1997", "1998"),
  covariates = c("asian", "black", "male")
)
unlink(file.path(tempdir(), "init.csv"))
The plot_parallel_trends() function combines the various trends data CSV files and plots parallel trends figures. All treatment and all control groups can be combined so that there is one control line and one treatment line by setting combine = TRUE.
plot_parallel_trends( dir_path, covariates = FALSE, save_csv = FALSE, combine = FALSE, pch = NA, pch_control = NA, pch_treated = NA, control_colour = c("darkgrey", "lightgrey"), control_color = NULL, treatment_colour = c("darkred", "lightcoral"), treatment_color = NULL, lwd = 2, xlab = NA, ylab = NA, title = NA, xticks = 4, date_format = "%Y-%m-%d", xdates = NULL, xaxlabsz = 0.8, save_png = FALSE, width = 800, height = 600, ylim = NULL, yaxlabsz = 0.8, ylabels = NULL, yticks = 4, ydecimal = 2, legend_location = "topright", simplify_legend = TRUE, legend_cex = 0.7, legend_on = TRUE, treatment_indicator_col = "grey", treatment_indicator_alpha = 0.5, treatment_indicator_lwd = 2, treatment_indicator_lty = 2, interpolate = FALSE, filepath = tempdir(), filenamecsv = "combined_trends_data.csv", filenamepng = "undid_plot.png" )
dir_path |
A character filepath to the folder containing all of the trends data CSV files. |
covariates |
A logical value (defaults to |
save_csv |
A logical value (defaults to |
combine |
A logical value (defaults to |
pch |
An integer (0 to 25) or vector of integers (from 0 to 25)
which determine the style of points used on the plot. Setting to |
pch_control |
An integer (from 0 to 25) or vector of integers
(from 0 to 25) which determine the style of points used on the plot
for control silos. Takes value of pch if set to |
pch_treated |
An integer (from 0 to 25) or vector of integers
(from 0 to 25) which determine the style of points used on the plot
for treated silos. Takes value of pch if set to |
control_colour |
A character vector of colours
(defaults to |
control_color |
Overrides |
treatment_colour |
A character vector of colours
(defaults to |
treatment_color |
Overrides |
lwd |
An integer (defaults to |
xlab |
A character value for the x-axis label (defaults to |
ylab |
A character value for the y-axis label (defaults to |
title |
A character value for the title of the plot (defaults to |
xticks |
An integer value denoting how many ticks to display
on the x-axis (defaults to |
date_format |
A string value denoting the format with which to display
the dates along the x-axis (defaults to |
xdates |
Takes in a vector of date objects to be used as the dates shown
along the x-axis (defaults to |
xaxlabsz |
A double indicating the x-axis label sizes in comparison
to a standardized default size (defaults to |
save_png |
A logical value indicating whether or not to save the plot
as a PNG file (defaults to |
width |
An integer denoting the width of the saved PNG file. |
height |
An integer denoting the height of the saved PNG file. |
ylim |
A vector of two doubles defining the min and max range of the values on the y-axis. Defaults to the min and max values of the values to be plotted. |
yaxlabsz |
A double for specifying the y-axis label sizes
(defaults to |
ylabels |
A vector of values that you would like to appear
on the y-axis (defaults to |
yticks |
An integer denoting how many values to display
along the y-axis (defaults to |
ydecimal |
An integer denoting the number of decimal places to which values along the y-axis are rounded. |
legend_location |
A character value for determining the location
of the legend (defaults to |
simplify_legend |
A logical value which if set to |
legend_cex |
A double for adjusting the size of the text in the legend
compared to a standard default size. Defaults to |
legend_on |
A logical value for turning the legend on or off
(defaults to |
treatment_indicator_col |
A character value for determining the colour
of the dashed vertical lines marking the treatment times
(defaults to |
treatment_indicator_alpha |
A double for determining the
transparency level of the dashed vertical lines showing the treatment
times (defaults to |
treatment_indicator_lwd |
A double for selecting the line width
of the treatment indicator lines (defaults to |
treatment_indicator_lty |
An integer for selecting the lty option,
i.e. the line style, for the treatment_indicator lines (defaults to |
interpolate |
A logical value (either |
filepath |
Filepath to save the CSV file. Defaults to |
filenamecsv |
A string filename for the combined trends data CSV file.
Defaults to |
filenamepng |
A string filename for the PNG file output.
Defaults to |
A data frame built from the trends data from all CSV files in the specified directory. If combine = FALSE, the data frame includes all silos joined by row. If combine = TRUE, the data frame merges treated silos into a single treatment group and control silos into a single control group.
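The two shapes of the returned data frame can be illustrated in base R. This is a minimal sketch under stated assumptions: the column names (silo, group, year, mean_outcome) and all values are hypothetical, and the simple mean across silos stands in for whatever merging rule the package actually applies. Only the trends_data_$silo_name.csv file naming comes from the package documentation.

```r
# Hypothetical per-silo trends data; column names and values are
# illustrative and are not the package's actual schema.
silo_trends <- list(
  data.frame(silo = "71", group = "treated", year = 1989:1991,
             mean_outcome = c(0.40, 0.42, 0.50)),
  data.frame(silo = "58", group = "treated", year = 1989:1991,
             mean_outcome = c(0.44, 0.46, 0.48)),
  data.frame(silo = "73", group = "control", year = 1989:1991,
             mean_outcome = c(0.38, 0.40, 0.42))
)

# Write one CSV per silo into a temporary folder.
trends_dir <- file.path(tempdir(), "trends_sketch")
dir.create(trends_dir, showWarnings = FALSE)
for (df in silo_trends) {
  out_file <- file.path(trends_dir, paste0("trends_data_", df$silo[1], ".csv"))
  write.csv(df, out_file, row.names = FALSE)
}

# combine = FALSE analogue: stack the rows of every file.
files <- list.files(trends_dir, pattern = "\\.csv$", full.names = TRUE)
all_trends <- do.call(rbind, lapply(files, read.csv))

# combine = TRUE analogue: one series per group, assumed here to be
# a simple mean across the silos in that group.
combined <- aggregate(mean_outcome ~ group + year, data = all_trends, FUN = mean)

# Clean up temporary files.
unlink(trends_dir, recursive = TRUE)
```

In this sketch `all_trends` keeps one row per silo and year, while `combined` keeps one row per group and year.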
# Get path to example data included with package
dir_path <- system.file("extdata/staggered", package = "undidR")

# Basic usage with default parameters
plot_parallel_trends(dir_path)

# Custom plot with modified parameters
plot_parallel_trends(dir_path, combine = TRUE, lwd = 4,
                     xdates = as.Date(c("1989-01-01", "1991-01-01",
                                        "1993-01-01", "1995-01-01",
                                        "1997-01-01", "1999-01-01")))
A dataset containing college enrollment and demographic data for analyzing the effects of merit programs in state 71.
silo71
A tibble with 569 rows and 7 variables:
Binary indicator for college enrollment (outcome variable)
Binary indicator for merit program (treatment variable)
Binary indicator for male students
Binary indicator for Black students
Binary indicator for Asian students
Year of observation
State identifier
https://economics.uwo.ca/people/conley_docs/code_to_download.html
The undid_date_formats() function returns a list of all valid date formats that can be used within the undidR package.
undid_date_formats()
The date formats returned by this function are used to ensure consistency in date processing within the undidR package.
A named list containing valid date formats:
General_Formats: General date formats compatible with the package.
R_Specific_Formats: Date formats specific to R.
Other_Formats: Formats seen sometimes in Stata.
undid_date_formats()
Takes in all of the filled diff df CSV files and uses them to compute group level ATTs as well as the aggregate ATT and its standard errors and p-values.
undid_stage_three( dir_path, agg = "silo", weights = TRUE, covariates = FALSE, interpolation = FALSE, save_csv = FALSE, filename = "UNDID_results.csv", filepath = tempdir(), nperm = 1001, verbose = TRUE )
dir_path |
A character specifying the filepath to the folder containing all of the filled diff df CSV files. |
agg |
A character which specifies the aggregation methodology for
computing the aggregate ATT in the case of staggered adoption.
Options are: |
weights |
A logical value (either |
covariates |
A logical value (either |
interpolation |
A logical value or a character which specifies which,
if any, method of interpolation/extrapolation for missing values of
|
save_csv |
A logical value, either |
filename |
A string filename for the created CSV file.
Defaults to |
filepath |
Filepath to save the CSV file. Defaults to |
nperm |
Number of random permutations of gvar & silo pairs to consider
when calculating the randomization inference p-value. Defaults to |
verbose |
A logical value (either |
The agg parameter specifies the aggregation method used in the case of staggered adoption. By default it is set to "silo", so that the ATTs are aggregated across silos with each silo having equal weight, but it can be set to "gt" or "g" instead. Aggregating across "g" calculates ATTs for groups defined by treatment time, with each "g" group having equal weight. Aggregating across "gt" calculates ATTs for groups defined by both the treatment time and the time for which the ATT is calculated. The agg parameter is ignored in the case of a common treatment time and only takes effect in the case of staggered adoption. For common adoption, refer to the weights parameter.
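The difference between equal weight per silo and equal weight per "g" group can be seen in a small base R calculation. This is a sketch under stated assumptions: the silo names and treatment times follow the create_init_csv() example, the ATT values are made up, and taking a group's ATT as the mean of its silos' ATTs is an assumption made here for illustration, not the package's estimation procedure.

```r
# Treated silos and treatment times from the create_init_csv() example;
# the ATT values are hypothetical and purely illustrative.
silo_atts <- data.frame(
  silo = c("71", "58", "64", "59", "85", "57"),
  g = c("1991", "1993", "1996", "1997", "1997", "1998"),
  att = c(0.10, 0.06, 0.08, 0.02, 0.04, 0.12)
)

# agg = "silo" analogue: every treated silo has equal weight.
agg_silo <- mean(silo_atts$att)

# agg = "g" analogue: assume a group's ATT is the mean of its silos'
# ATTs, then give every treatment-time group equal weight.
g_atts <- tapply(silo_atts$att, silo_atts$g, mean)
agg_g <- mean(g_atts)

agg_silo
agg_g
```

The two aggregates differ here because silos 59 and 85 share the 1997 treatment time: they count as two units under "silo" but as one group under "g".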
A data frame containing the aggregate ATT and its standard errors and p-values from two-sided tests of agg_ATT == 0. Also returns group (silo, g, or gt) level ATTs for staggered adoption.
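The idea behind a randomization inference p-value can be sketched in base R. This is a generic permutation scheme with hypothetical data, offered only as intuition: it shuffles treatment labels across silos, whereas the package permutes gvar and silo pairs following MacKinnon and Webb (2020), and the details of that procedure are not reproduced here.

```r
set.seed(1)

# Hypothetical silo-level changes in mean outcome; the first three
# silos are treated (illustrative numbers only).
change <- c(0.30, 0.25, 0.35, 0.02, -0.01, 0.03, 0.00, 0.01)
treated <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Test statistic: mean change among treated minus mean change among controls.
did_stat <- function(is_treated) {
  mean(change[is_treated]) - mean(change[!is_treated])
}
observed <- did_stat(treated)

# Randomly reassign the treatment labels and recompute the statistic.
nperm <- 501
permuted <- replicate(nperm, did_stat(sample(treated)))

# Two-sided p-value: share of permuted statistics at least as extreme
# (in absolute value) as the observed one.
p_value <- mean(abs(permuted) >= abs(observed))
p_value
```

A larger nperm gives a finer grid of attainable p-values at the cost of more computation.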
# Execute `undid_stage_three()`
dir <- system.file("extdata/staggered", package = "undidR")
undid_stage_three(dir, agg = "g", nperm = 501, verbose = FALSE)
Based on the information given in the received empty_diff_df.csv, computes the appropriate differences in mean outcomes at the local silo and saves them as filled_diff_df_$silo_name.csv. Also stores trends data as trends_data_$silo_name.csv.
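A single difference in mean outcomes between two periods can be computed by hand as follows. This is a minimal base R sketch with hypothetical data: the column names year and coll mirror the silo71 example, but the values are made up, and the sketch ignores covariates and the bookkeeping that undid_stage_two() performs for every row of the empty_diff_df.csv.

```r
# Hypothetical silo data: a binary outcome observed in two years.
silo_df <- data.frame(
  year = c(1990, 1990, 1990, 1990, 1991, 1991, 1991, 1991),
  coll = c(0, 1, 0, 1, 1, 1, 0, 1)
)

# Mean outcome in the earlier and in the later period.
mean_pre <- mean(silo_df$coll[silo_df$year == 1990])
mean_post <- mean(silo_df$coll[silo_df$year == 1991])

# Difference in mean outcomes between the two periods.
diff_estimate <- mean_post - mean_pre
diff_estimate
```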
undid_stage_two( empty_diff_filepath, silo_name, silo_df, time_column, outcome_column, silo_date_format, consider_covariates = TRUE, filepath = tempdir() )
empty_diff_filepath |
A character filepath to the |
silo_name |
A character indicating the name of the local silo. Ensure
spelling is the same as it is written in the |
silo_df |
A data frame of the local silo's data. Ensure any covariates
are spelled the same in this data frame as they are in the
|
time_column |
A character which indicates the name of the column in
the |
outcome_column |
A character which indicates the name of the column in
the |
silo_date_format |
A character which indicates the date format which
the date strings in the |
consider_covariates |
An optional logical parameter which if set to
|
filepath |
Character value indicating the filepath to
save the CSV files. Defaults to |
Covariates at the local silo should be renamed to match the spelling used in the empty_diff_df.csv.
A list of two data frames: the filled differences data frame, accessed with $diff_df, and the trends data frame, accessed with $trends_data.
# Load data
silo_data <- silo71
empty_diff_path <- system.file("extdata/staggered", "empty_diff_df.csv",
                               package = "undidR")

# Run `undid_stage_two()`
results <- undid_stage_two(
  empty_diff_filepath = empty_diff_path,
  silo_name = "71",
  silo_df = silo_data,
  time_column = "year",
  outcome_column = "coll",
  silo_date_format = "yyyy"
)

# View results
head(results$diff_df)
head(results$trends_data)

# Clean up temporary files
unlink(file.path(tempdir(), c("filled_diff_df_71.csv", "trends_data_71.csv")))