Title: | Example Data Sets for Causal Inference Textbooks |
---|---|
Description: | Example data sets to run the example problems from causal inference textbooks. Currently contains data sets for Huntington-Klein, Nick (2021 and 2025) "The Effect" <https://theeffectbook.net>, first and second edition, Cunningham, Scott (2021 and 2025, ISBN-13: 978-0-300-25168-5) "Causal Inference: The Mixtape", and Hernán, Miguel and James Robins (2020) "Causal Inference: What If" <https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/>. |
Authors: | Nick Huntington-Klein [aut, cre], Malcolm Barrett [aut] |
Maintainer: | Nick Huntington-Klein <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.4 |
Built: | 2024-10-25 05:27:28 UTC |
Source: | CRAN |
This data looks at the effect of abortion legalization on the incidence of gonorrhea among 15-19 year olds, as a measure of risky behavior. Treatment is whether abortion is legalized at the time that the eventual 15-19 year olds are born.
abortion
abortion
A data frame with 19584 rows and 22 variables
State FIPS code
Age in years
Race - 1 = white, 2 = black
Year
Year but counted on a different scale
Sex: 1 = male, 2 = female
Total population
Incarcerated Males per 100,000
Crack index
Alcohol consumption per capita
Real income per capita
State unemployment rate
Poverty rate
In a state with an early repeal of abortion prohibition
AIDS mortality per 100,000 cumulative in t, t-1, t-2, t-3
White Indicator
Male Indicator
Logged gonorrhea cases per 100,000 in 15-19 year olds
From the younger group
State-younger interaction
Parental involvement law in effect
Is a black female in the 15-19 age group
This data is used in the Difference-in-Differences chapter of Causal Inference: The Mixtape by Cunningham.
Cunningham, Scott, and Christopher Cornwell. 2013. “The Long-Run Effect of Abortion on Sexually Transmitted Infections.” American Law and Economics Review 15 (1): 381–407.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
This data comes from a survey of 700 internet-mediated sex workers in 2008 and 2009, asking the same sex workers standard labor market information over several time periods.
adult_services
adult_services
A data frame with 1787 rows and 31 variables
Provider identifier
Client session identifier
Age of provider
Age of Client
Client Attractiveness (Scale of 1 to 10)
Body Mass Index
Imputed Years of Schooling
Age of Client Squared
Second Provider Involved
Asian Client
Black Client
Hispanic Client
Other Ethnicity Client
Client was a Regular
Met Client in Hotel
Gave Client a Massage
Log of Hourly Wage
Ln(Length)
Unprotected sex with client of any kind
race==1. Asian
race==2. Black
race==3. Hispanic
race==4. Other
race==5. White
Age of provider squared
ms==Cohabitating (living with a partner) but unmarried
ms==Currently married and living with your spouse
ms==Divorced and not remarried
ms==Married but not currently living with your spouse
ms==Single and never married
ms==Widowed and not remarried
This data is used in the Panel Data chapter of Causal Inference: The Mixtape by Cunningham.
Cunningham, Scott, and Todd D. Kendall. 2011. “Prostitution 2.0: The Changing Face of Sex Work.” Journal of Urban Economics 69: 273–87.
Cunningham, Scott, and Todd D. Kendall. 2014. “Examining the Role of Client Reviews and Reputation Within Online Prostitution.” In Handbook on the Economics of Prostitution, edited by Scott Cunningham and Manisha Shah. Oxford University Press.
Cunningham, Scott, and Todd D. Kendall. 2016. “Prostitution Labor Supply and Education.” Review of Economics of the Household. Forthcoming.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
This data, which comes standard in Stata, originally came from the April 1979 issue of Consumer Reports and from the United States Government EPA statistics on fuel consumption; they were compiled and published by Chambers et al. (1983).
auto
auto
A data frame with 74 rows and 12 variables
Make and Model
Price
Mileage (mpg)
Repair Record 1978
Headroom (in.)
Trunk space (cu. ft.)
Weight (lbs.)
Length (in.)
Turn Circle (ft.)
Displacement (cu. in.)
Gear Ratio
Car type; 0 = Domestic, 1 = Foreign
This data is used in the Probability and Regression Review chapter of Causal Inference: The Mixtape.
Chambers, J. M., W. S. Cleveland, B. Kleiner, and P. A. Tukey. 1983. Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
This data set includes information on the average price and total amount of avocados sold across 169 weeks from 2015 to 2018. This data covers only sales of 'conventional' avocados that take place in California.
avocado
avocado
A data frame with 169 rows and 3 variables:
Date of observation
Average avocado price
Total volume of avocados sold
This data was used in the Identification chapter of The Effect by Huntington-Klein.
Kiggins, Justin. 2018. https://www.kaggle.com/neuromusic/avocado-prices/
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
The black_politicians data contains data from Broockman (2013) on a field experiment where the author sent fictional emails purportedly sent by Black people to legislators in the United States. The experiment sought to determine whether an email being from "out-of-district" (someone who can't vote for you and so provides no extrinsic motivation to reply) would have a smaller effect on response rates for Black legislators than for non-Black ones, which would provide evidence of additional intrinsic motivation on the part of Black legislators to help Black people.
black_politicians
black_politicians
A data frame with 5593 rows and 14 variables
Legislator receiving email is Black
Email is from out-of-district
Legislator responded to email
District population
District median household income
District median household income among Black people
District median household income among White people
Percentage of district that is Black
State's Squire index
Legislator receiving email is neither Black nor White
Percentage of district that is urban
Legislator receiving email is a senator
Legislator receiving email is in the Democratic party
Legislator receiving email is in the Southern United States
This data is used in the Matching chapter of The Effect.
Broockman, D.E., 2013. Black politicians are more intrinsically motivated to advance blacks’ interests: A field experiment manipulating political incentives. American Journal of Political Science, 57(3), pp.521-536.
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
This data looks at the impact of castle-doctrine statutes on violent crime. Data from the FBI Uniform Crime Reports Summary files are combined with information on castle-doctrine/stand-your-ground law implementation in different states.
castle
castle
A data frame with 19584 rows and 22 variables
Year
After-treatment
state id
Region-quarter fixed effects
justifiable homicide by private citizen count
justifiable homicide by police count
homicide count per 100,000 state population
Region-quarter fixed effects
aggravated assault count per 100,000 state population
burglary count per 100,000 state population
larceny count per 100,000 state population
motor vehicle theft count per 100,000 state population
murder count per 100,000 state population
unemployment rate
% of black male aged 15-24
% of white male aged 15-24
% of black male aged 25-44
% of white male aged 25-44
poverty rate
Logged crime rate
Logged crime rate
Logged crime rate
Logged police presence
Logged income
Logged number of prisoners
Lagged log prisoners
Logged subsidy spending
Logged public welfare spending
Indicators of how many time periods until/since treatment
Population weight
Region-quarter fixed effects
State linear time trends
This data is used in the Difference-in-Differences chapter of Causal Inference: The Mixtape by Cunningham.
Cheng, Cheng, and Mark Hoekstra. 2013. “Does Strengthening Self-Defense Law Deter Crime or Escalate Violence? Evidence from Expansions to Castle Doctrine.” Journal of Human Resources 48 (3): 821–54.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
The ccdrug data contains data on drug arrests from the Crown Court Sentencing Survey between 2012 and 2015 in England and Wales, allowing for a look at differential sentencing rates for men and women, with a set of controls for features that should impact sentencing.
ccdrug
ccdrug
A data frame with 16973 rows and 45 variables
Taken into custody
Is a male
This is the first offense
Age in ten-year bins
Offense type
Previous convictions, in bins of None, 1-3, 4-9, or 10+
Type of drug
Level of culpability for crime
A set of indicators that should increase or reduce the likelihood of being taken into custody. See variable labels for specific definitions.
This data set is used in the Partial Identification chapter of The Effect.
Pina Sanchez, J., & Harris, L., 2020. Sentencing gender? Investigating the presence of gender disparities in Crown Court sentences. Criminal Law Review, 2020(1), pp. 3-28.
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
Data from the National Longitudinal Survey Young Men Cohort. This data is used to estimate the effect of college education on earnings, using the presence of a nearby (in-county) college as an instrument for college attendance.
close_college
close_college
A data frame with 3010 rows and 8 variables
Log wages
Years of education
Years of work experience
Race: Black
In the southern United States
Is married
In a Standard Metropolitan Statistical Area (urban)
There is a four-year college in the county
This data is used in the Instrumental Variables chapter of Causal Inference: The Mixtape by Cunningham.
Card, David. 1995. “Aspects of Labour Economics: Essays in Honour of John Vanderkamp.” In. University of Toronto Press.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
This data comes from a close-elections regression discontinuity study from Lee, Moretti, and Butler (2004). The design is intended to test convergence and divergence in policy. Major effects of electing someone from a particular party on policy outcomes *in a close race* indicate that the victor does what they want. Small or null effects indicate that the electee moderates their position towards their nearly-split electorate.
close_elections_lmb
close_elections_lmb
A data frame with 13588 rows and 9 variables
ICPSR state code
district code
Election ID
ADA voting score (higher = more liberal)
Year of election
Democratic share of the vote
Democratic victory
Lagged Democratic victory
Lagged democratic share of the vote
This data is used in the Regression Discontinuity chapter of Causal Inference: The Mixtape by Cunningham.
Lee, David S., Enrico Moretti, and Matthew J. Butler. 2004. “Do Voters Affect or Elect Policies: Evidence from the U.S. House.” Quarterly Journal of Economics 119 (3): 807–59.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Data from the Current Population Survey on participation in the National Supported Work Demonstration (NSW) job-training program experiment. This is used as an observational comparison to the NSW experimental data from the nsw_mixtape data.
cps_mixtape
cps_mixtape
A data frame with 15992 rows and 11 variables
Individual ID
In the National Supported Work Demonstration Job Training Program
Age in years
Years of education
Race: Black
Ethnicity: Hispanic
Married
Has no degree
Real earnings 1974
Real earnings 1975
Real earnings 1978
This data is used in the Matching and Subclassification chapter of Causal Inference: The Mixtape by Cunningham.
Dehejia, Rajeev H., and Sadek Wahba. 1999. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” Journal of the American Statistical Association 94 (448): 1053–62.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Data from the UCI Machine Learning Repository on Taiwanese credit card holders, the amount of their credit card bill, and whether their payment was late.
credit_cards
credit_cards
A data frame with 30000 rows and 4 variables
Credit card payment is late in Sept 2005
Credit card payment is late in April 2005
Total bill in April 2005 in thousands of New Taiwan Dollars
Age of card-holder
This data is used in the Matching chapter of The Effect by Huntington-Klein.
Lichman, Moshe. 2013. UCI Machine Learning Repository.
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
The gapminder data contains data on life expectancy and GDP per capita by country and year.
gapminder
gapminder
A data frame with 1704 rows and 6 variables
The country
The continent the country is in
The year data was collected. Ranges from 1952 to 2007 in increments of 5 years
Life expectancy at birth, in years
Population
GDP per capita (US$, inflation-adjusted)
This data set is the same one found in the gapminder package in R as of 2020. This data set is used in the Fixed Effects chapter of The Effect.
https://www.gapminder.org/data/
Jennifer Bryan (2017). gapminder: Data from Gapminder. R package version 0.3.0. https://CRAN.R-project.org/package=gapminder
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
The google_stock data contains data on daily stock returns for Google and the S&P 500 for May through August 2015, centering around the August 10, 2015 announcement that Google would reorganize under parent company Alphabet.
google_stock
google_stock
A data frame with 84 rows and 3 variables
The date
Daily GOOG Stock Return (1 = 100 percent daily return)
Daily S&P 500 Index Return (1 = 100 percent daily return)
This data was downloaded using the tidyquant package, and is used in the Event Studies chapter of The Effect.
Matt Dancho and Davis Vaughan (2021). tidyquant: Tidy Quantitative Financial Analysis. R package version 1.0.3. https://CRAN.R-project.org/package=tidyquant
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
The gov_transfers data contains data from Manacorda, Miguel, and Vigorito (2011) on a government transfer program that was administered based on an income cutoff. Data is pre-limited to households that were just around the income cutoff.
gov_transfers
gov_transfers
A data frame with 1948 rows and 5 variables
Income measure, centered around program cutoff (negative value = eligible)
Household average years of education among those 16+
Household average age
Participation in transfers
Measure of support for the government
This data is used in the Regression Discontinuity chapter of The Effect.
Manacorda, M., Miguel, E. and Vigorito, A., 2011. Government transfers and political support. American Economic Journal: Applied Economics, 3(3), pp.1-28.
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
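Because the income measure is centered on the cutoff, eligibility is simply the sign of that variable. The sketch below shows one way a linear regression discontinuity fit could be set up. It runs on a small invented data frame rather than on gov_transfers itself, and the column names (income_centered, eligible, support) are placeholders chosen for illustration, not names confirmed from the package.

```r
# Toy data only: values and column names are invented for illustration.
toy <- data.frame(income_centered = c(-4:-1, 1:4) * 0.005)

# Negative centered income means the household is eligible
toy$eligible <- as.integer(toy$income_centered < 0)

# Outcome constructed with a known jump of 0.1 at the cutoff and no noise
toy$support <- 0.5 + 0.1 * toy$eligible + 2 * toy$income_centered

# Linear fit that allows a different slope on each side of the cutoff
rd_fit <- lm(support ~ eligible * income_centered, data = toy)

# The coefficient on eligible is the estimated jump at the cutoff
rd_jump <- unname(coef(rd_fit)["eligible"])
print(rd_jump)
```

With real data one would typically also choose a bandwidth and inspect the fit near the cutoff; this sketch only illustrates how the centered running variable is used.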
The gov_transfers_density data contains data from Manacorda, Miguel, and Vigorito (2011) on a government transfer program that was administered based on an income cutoff. As opposed to the gov_transfers data set, this data set only contains income information, but has a wider range of it, for use with density discontinuity tests.
gov_transfers_density
gov_transfers_density
A data frame with 52549 rows and 1 variable:
Income measure, centered around program cutoff (negative value = eligible)
This data is used in the Regression Discontinuity chapter of The Effect.
Manacorda, M., Miguel, E. and Vigorito, A., 2011. Government transfers and political support. American Economic Journal: Applied Economics, 3(3), pp.1-28.
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
greek_data is a fictional data set from Table 2.2 in Chapter 2 of Causal Inference. From the book: "Table 2.2 shows the data from our heart transplant randomized study. Besides data on treatment A (1 if the individual received a transplant, 0 otherwise) and outcome Y (1 if the individual died, 0 otherwise), Table 2.2 also contains data on the prognostic factor L (1 if the individual was in critical condition, 0 otherwise), which we measured before treatment was assigned."
greek_data
greek_data
A data frame with 20 rows and 4 variables:
The name of a Greek god
A prognostic factor
The treatment, a heart transplant
The outcome, death
Hernán and Robins. Causal Inference. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
The mortgages data contains data from Fetter (2013) on home ownership rates by men, focusing on whether they were born at the right time to be eligible for mortgage subsidies based on their military service.
mortgages
mortgages
A data frame with 214144 rows and 6 variables
Birth State
Quarter of birth
White/nonwhite race indicator. 1 = Nonwhite
Veteran of either the Korean war or World War II
Owns a home
Quarter of birth centered on eligibility for mortgage subsidy (0+ = eligible)
This data is used in the Regression Discontinuity chapter of The Effect.
Fetter, D.K., 2013. How do mortgage subsidies affect home ownership? Evidence from the mid-century GI bills. American Economic Journal: Economic Policy, 5(2), pp.111-47.
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
The Mroz data frame has 753 rows and 8 columns. The observations, from the Panel Study of Income Dynamics (PSID), are married women.
Mroz
Mroz
A data frame with 753 rows and 8 variables
Labor-force participation
Number of children 5 years old or younger
Number of children 6 to 17 years old
Age in years
Wife attended college
Husband attended college
Log expected wage rate. For women in the labor force, the actual wage rate; for women not in the labor force, an imputed value based on the regression of lwg on the other variables.
Family income exclusive of wife's income
This data set is a lightly edited version of the one found in the carData package in R. It is used in the Describing Relationships chapter of The Effect.
Mroz, T. A. (1987) The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. *Econometrica* 55, 765–799.
John Fox, Sanford Weisberg and Brad Price (2020). carData: Companion to Applied Regression Data Sets. R package version 3.0-4. https://CRAN.R-project.org/package=carData
Fox, J. (2016) *Applied Regression Analysis and Generalized Linear Models,* Third Edition. Sage.
Fox, J. (2000) *Multiple and Generalized Nonparametric Regression.* Sage.
Fox, J. and Weisberg, S. (2019) *An R Companion to Applied Regression.* Third Edition, Sage.
Long. J. S. (1997) *Regression Models for Categorical and Limited Dependent Variables.* Sage.
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
nhefs is a cleaned version of the data used in Causal Inference by Hernán and Robins. It contains data from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study (NHEFS). The NHEFS was jointly initiated by the National Center for Health Statistics and the National Institute on Aging in collaboration with other agencies of the United States Public Health Service. A detailed description of the NHEFS, together with publicly available data sets and documentation, can be found at https://wwwn.cdc.gov/nchs/nhanes/nhefs/.
nhefs
nhefs
A data frame with 1629 rows and 67 variables. The codebook is available as nhefs_codebook.
https://wwwn.cdc.gov/nchs/nhanes/nhefs/
Hernán and Robins. Causal Inference. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
nhefs_codebook is the codebook for nhefs and nhefs_complete.
nhefs_codebook
nhefs_codebook
A data frame with 64 rows and 2 variables.
The variable being described
The variable description
https://wwwn.cdc.gov/nchs/nhanes/nhefs/
Hernán and Robins. Causal Inference. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
nhefs_complete is the same as nhefs, but only participants with complete data are included. The variables that need to be complete to be included are: qsmk, sex, race, age, school, smokeintensity, smokeyrs, exercise, active, wt71, wt82, and wt82_71.
nhefs_complete
nhefs_complete
A data frame with 1556 rows and 67 variables. The codebook is available as nhefs_codebook.
https://wwwn.cdc.gov/nchs/nhanes/nhefs/
Hernán and Robins. Causal Inference. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
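The complete-data restriction described for nhefs_complete can be illustrated with base R's complete.cases(). The sketch below uses an invented stand-in data frame with only three of the twelve listed variables. Whether the package authors built nhefs_complete in exactly this way is an assumption.

```r
# Toy stand-in for nhefs: three of the twelve variables, invented values.
toy_nhefs <- data.frame(
  qsmk = c(0, 1, 0, 1),
  wt71 = c(70.1, NA, 82.4, 65.0),
  wt82 = c(72.3, 80.0, NA, 66.2)
)

# Variables that must be non-missing for a row to be kept
required <- c("qsmk", "wt71", "wt82")

# Keep only rows with no missing values among the required variables
toy_complete <- toy_nhefs[complete.cases(toy_nhefs[, required]), ]
print(nrow(toy_complete))
```

With the real data, the required vector would hold all twelve variable names listed in the description.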
Data from the National Supported Work Demonstration (NSW) job-training program experiment, where those treated were guaranteed a job for 9-18 months.
nsw_mixtape
nsw_mixtape
A data frame with 445 rows and 11 variables
Individual ID
In the National Supported Work Demonstration Job Training Program
Age in years
Years of education
Race: Black
Ethnicity: Hispanic
Married
Has no degree
Real earnings 1974
Real earnings 1975
Real earnings 1978
This data is used in the Matching and Subclassification chapter of Causal Inference: The Mixtape by Cunningham.
Lalonde, Robert. 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” American Economic Review 76 (4): 604–20.
Dehejia, Rajeev H., and Sadek Wahba. 1999. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” Journal of the American Statistical Association 94 (448): 1053–62.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
The organ_donations data contains data from Kessler and Roth (2014) on organ donation rates by state and quarter. The state of California enacted an active-choice phrasing for their organ donation sign-up question in Q3 2011. The only states included in the data are California and those that can serve as valid controls; see Kessler and Roth (2014).
organ_donations
organ_donations
A data frame with 162 rows and 4 variables
The state, where California is the Treated group
Quarter of observation, in "Q"QYYYY format
Organ donation rate
Quarter of observation in numerical format. 1 = Quarter 4, 2010
This data is used in the Difference-in-Differences chapter of The Effect.
Kessler, J.B. and Roth, A.E., 2014. Don't take 'no' for an answer: An experiment with actual organ donor registrations. National Bureau of Economic Research working paper No. 20378. https://www.nber.org/papers/w20378
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
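The relationship between the "Q"QYYYY quarter strings and the numerical quarter variable (where 1 = Quarter 4, 2010) can be sketched as follows. The helper quarter_to_num is hypothetical and written only for illustration; it is not a function from the package.

```r
# Hypothetical helper: convert "Q"QYYYY strings to a running quarter index,
# anchored so that Q4 2010 maps to 1, as the variable description states.
quarter_to_num <- function(q) {
  qtr  <- as.integer(substr(q, 2, 2))   # single digit after the "Q"
  year <- as.integer(substr(q, 3, 6))   # four-digit year
  (year - 2010L) * 4L + qtr - 3L
}

quarter_index <- quarter_to_num(c("Q42010", "Q12011", "Q32011"))
print(quarter_index)
```

A numerical index like this makes it straightforward to define a post-treatment indicator by comparing against the index of the treatment quarter.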
The restaurant_inspections data contains data on restaurant health inspections performed in Anchorage, Alaska.
restaurant_inspections
restaurant_inspections
A data frame with 27178 rows and 5 variables
Name of restaurant/chain
Health Inspection Score
Year of inspection
Number of locations in restaurant chain
Was the inspection performed on a weekend?
This data set is used in the Regression chapter of The Effect.
Camus, Louis-Ashley. 2020. https://www.kaggle.com/loulouashley/inspection-score-restaurant-inspection
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
This simulated data allows for a quick and easy calculation of a p-value using randomization inference.
ri
ri
A data frame with 8 rows and 5 variables
Fictional Name
Treatment
Outcome
Outcome if untreated
Outcome if treated
This data is used in the Potential Outcomes Causal Model chapter of Causal Inference: The Mixtape by Cunningham.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
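As a rough illustration of how a randomization-inference p-value can be computed on a data set this small, the sketch below enumerates every possible assignment of treatment and compares the observed difference in means against that full distribution. The data frame and its column names (d for treatment, y for outcome) are invented and do not reproduce the values in ri.

```r
# Toy data only: 8 units, 4 treated; outcomes are invented.
toy_ri <- data.frame(
  d = c(1, 1, 1, 1, 0, 0, 0, 0),
  y = c(9, 7, 8, 10, 5, 6, 4, 7)
)

# Test statistic: difference in mean outcome, treated minus untreated
diff_in_means <- function(y, treated_idx) {
  mean(y[treated_idx]) - mean(y[-treated_idx])
}
observed <- diff_in_means(toy_ri$y, which(toy_ri$d == 1))

# Enumerate every way of assigning 4 of the 8 units to treatment
assignments <- combn(nrow(toy_ri), sum(toy_ri$d))

# Under the sharp null of no effect, outcomes are fixed across assignments
perm_stats <- apply(assignments, 2, function(idx) diff_in_means(toy_ri$y, idx))

# Two-sided p-value: share of assignments at least as extreme as observed
p_value <- mean(abs(perm_stats) >= abs(observed) - 1e-12)
print(p_value)
```

Full enumeration is feasible here because 8 units with 4 treated gives only 70 assignments; larger samples would usually call for random draws of assignments instead.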
From the College Scorecard, this data set contains by-college-by-year data on how students who attended those colleges are doing.
scorecard
scorecard
A data frame with 48,445 rows and 8 variables:
College identifiers
Name of the college or university
Two-letter abbreviation for the state the college is in
Predominant degree awarded. 1 = less-than-two-year, 2 = two-year, 3 = four-year+
Year in which outcomes are measured
Median earnings among students (a) who received federal financial aid, (b) who began as undergraduates at the institution ten years prior, (c) with positive yearly earnings
Number of students who are (a) not working (not necessarily unemployed), (b) received federal financial aid, and (c) who began as undergraduates at the institution ten years prior
Number of students who are (a) working, (b) who received federal financial aid, and (c) who began as undergraduates at the institution ten years prior
This data is not just limited to four-year colleges and includes a very wide variety of institutions.
Note that the labor market (earnings, working) and repayment rate data do not refer to the same cohort of students, but rather are matched on the year in which outcomes are recorded. Labor market data refers to cohorts beginning college as undergraduates ten years prior; repayment rate data refers to cohorts entering repayment seven years prior.
Data was downloaded using the Urban Institute's educationdata package.
This data was used in the Describing Variables chapter of The Effect by Huntington-Klein.
Education Data Portal (Version 0.4.0 - Beta), Urban Institute, Center on Education Data and Policy, accessed June 28, 2019. https://educationdata.urban.org/documentation/, Scorecard.
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
A subset of the aggregated death rate data from Snow's legendary study of the source of the London Cholera outbreak.
snow
snow
A data frame with 4 rows and 4 variables
Year
Water pump supplier
Status of water pump
Deaths per 10k 1851 population
This data is used in the Difference-in-Differences chapter of The Effect by Huntington-Klein.
Snow, John. 1855. 'On the Mode of Communication of Cholera'. John Churchill.
Coleman, Thomas. 2019. 'Causality in the time of cholera: John Snow as a prototype for causal inference.' SSRN 3262234.
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
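A four-row table of two groups observed in two years is exactly the input for a two-by-two difference-in-differences calculation. The sketch below shows the arithmetic on invented placeholder numbers. They are not Snow's figures, and the column and group labels are assumptions made for illustration.

```r
# Toy 2x2 table: all numbers are invented placeholders, not Snow's figures.
toy_did <- data.frame(
  year      = c(1849, 1849, 1854, 1854),
  group     = c("treated", "control", "treated", "control"),
  deathrate = c(90, 130, 20, 140)
)

# Look up the death rate for one group in one year
rate <- function(g, y) {
  toy_did$deathrate[toy_did$group == g & toy_did$year == y]
}

# (treated after - treated before) - (control after - control before)
did <- (rate("treated", 1854) - rate("treated", 1849)) -
  (rate("control", 1854) - rate("control", 1849))
print(did)
```

The same quantity could be obtained as the interaction coefficient in a regression of the outcome on group, period, and their product.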
This data looks at the massive expansion in prison capacity in Texas that occurred in 1993 under Governor Ann Richards, and the effect of that expansion on the number of Black men in prison.
texas
texas
A data frame with 816 rows and 12 variables
State FIPS code
Year
Number of Black men in prison
Number of White men in prison
Alcohol consumption per capita
Median income
Unemployment rate
Poverty rate
Percentage of the population that is Black
Percentage of the population that is age 15-19
AIDS mortality per 100,000 in t
State name
This data is used in the Synthetic Control chapter of Causal Inference: The Mixtape by Cunningham.
Cunningham and Kang. 2019. “Studying the Effect of Incarceration Shocks to Drug Markets.” Unpublished manuscript. http://www.scunning.com/files/mass_incarceration_and_drug_abuse.pdf
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
thornton_hiv comes from an experiment in Malawi looking at whether cash incentives could encourage people to learn the results of their HIV tests.
thornton_hiv
thornton_hiv
A data frame with 4820 rows and 7 variables
Village ID
Got HIV results
Distance in kilometers
Total incentive
Received any incentive
Age
HIV results
This data is used in the Potential Outcomes Causal Model chapter of Causal Inference: The Mixtape by Cunningham.
Thornton, Rebecca L. 2008. 'The Demand for, and Impact of, Learning HIV Status.' American Economic Review 98 (5): 1829–63.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
titanic comes from the sinking of the Titanic, and can be used to look at survival by different demographic characteristics.
titanic
titanic
A data frame with 2201 rows and 4 variables
class (ticket)
Age (Child vs. Adult)
Gender
Survived
This data is used in the Matching and Subclassification chapter of Causal Inference: The Mixtape by Cunningham.
British Board of Trade (1990), Report on the Loss of the ‘Titanic’ (S.S.). British Board of Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
This simulated data is used to demonstrate the bias-reduction method in matching as per Abadie and Imbens (2011).
training_bias_reduction
training_bias_reduction
A data frame with 8 rows and 4 variables
Unit ID
Outcome
Treatment
Matching variable
This data is used in the Matching and Subclassification chapter of Causal Inference: The Mixtape by Cunningham.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
This simulated data, which is presented in the form of a full results table, is used to demonstrate a matching procedure.
training_example
training_example
A data frame with 25 rows and 9 variables
Unit ID for treated observations
age for treated observations
earnings for treated observations
Unit ID for control observations
age for control observations
earnings for control observations
Unit ID for matched controls
age for matched controls
earnings for matched controls
This data is used in the Matching and Subclassification chapter of Causal Inference: The Mixtape by Cunningham.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
yule allows for a look at the correlation between poverty relief and poverty rates in England in the 19th century.
yule
yule
A data frame with 32 rows and 5 variables
Location in England
Pauperism Growth
Poverty Relief Growth
Annual growth in aged population
Annual growth in population
This data is used in the Potential Outcomes Causal Model chapter of Causal Inference: The Mixtape by Cunningham.
Yule, G. Udny. 1899. 'An Investigation into the Causes of Changes in Pauperism in England, Chiefly During the Last Two Intercensal Decades.' Journal of the Royal Statistical Society 62: 249–95.
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Data from "Social Networks and the Decision to Insure"
Description
The social_insure data contains data from Cai, De Janvry, and Sadoulet (2015) on a two-round social network-based experiment on getting farmers to get insurance. See the paper for more details.
Usage
Format
A data frame with 1410 rows and 13 variables
Natural village
Administrative village
Whether farmer ended up purchasing insurance. (1 = yes)
Household Characteristics - Age
Household Characteristics - Household Size
Area of Rice Production
Perceived Probability of Disasters Next Year
Household Characteristics: Gender of Household Head (1 = male)
"Default option" in experimental format assigned to. (1 = default is to buy, 0 = default is to not buy)
Whether or not was assigned to "intensive" experimental session (1 = yes)
Risk aversion measurement
1 = literate, 0 = illiterate
Takeup rate prior to experiment
Details
This data is used in the Instrumental Variables chapter of The Effect.
Source
Cai, J., De Janvry, A. and Sadoulet, E., 2015. Social networks and the decision to insure. American Economic Journal: Applied Economics, 7(2), pp.81-108.
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.