Title: | Matched Samples that are Balanced and Representative by Design |
---|---|
Description: | Includes functions for the construction of matched samples that are balanced and representative by design. Among others, these functions can be used for matching in observational studies with treated and control units, with cases and controls, in related settings with instrumental variables, and in discontinuity designs. Also, they can be used for the design of randomized experiments, for example, for matching before randomization. By default, 'designmatch' uses the 'highs' optimization solver, but its performance is greatly enhanced by the 'Gurobi' optimization solver and its associated R interface. For their installation, please follow the instructions at <https://www.gurobi.com/documentation/quickstart.html> and <https://www.gurobi.com/documentation/7.0/refman/r_api_overview.html>. We have also included directions in the gurobi_installation file in the inst folder. |
Authors: | Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>, Juan P. Vielma <[email protected]>, Eric R. Cohn <[email protected]> |
Maintainer: | Jose R. Zubizarreta <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 0.5.4 |
Built: | 2024-10-27 06:26:57 UTC |
Source: | CRAN |
designmatch
includes two functions for the construction of matched samples that are balanced and representative by design. These two functions are bmatch
and nmatch
for bipartite and nonbipartite matching, respectively. Both functions include options for directly balancing means, higher order moments, and distributions of the observed covariates. In both bmatch
and nmatch
, an integer programming (IP) problem is solved. This IP problem either minimizes the total sum of covariate distances between matched units, maximizes the total number of matched units, or optimizes a combination of the two, subject to matching and covariate balancing constraints. In order to solve these problems, four different optimization solvers can be used: CPLEX, GLPK, Gurobi, HiGHS, and Symphony. By default, both bmatch
and nmatch
solve a relaxation of these integer programs using HiGHS, which runs quickly but may violate to some extent some of the balancing constraints. If the user wants to solve for an exact solution of the program, we strongly recommend using either CPLEX or Gurobi, which are much faster but require a license (free for academic users) and special installation (see the installation instructions). Between the two, Gurobi is considerably easier to install. Among others, designmatch
can be used for matching in treatment-control as well as case-control observational studies; observational studies with instrumental variables and discontinuity designs; and for the design of randomized experiments, for example for matching before randomization. The package also includes functions for assessing covariate balance in the matched samples.
Package: | designmatch |
Type: | Package |
Version: | 0.5.4 |
Date: | 2023-08-29 |
License: | GPL-2 | GPL-3 |
Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>.
Maintainer: Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>.
Greevy, R., Lu, B., Silber, J. H., and Rosenbaum, P. R. (2004), "Optimal Multivariate Matching Before Randomization," Biostatistics, 5, 263-275.
Hsu. J., Zubizarreta, J. R., Small, D. S., and Rosenbaum, P. R. (2015), "Strong Control of the Family-Wise Error Rate in Observational Studies that Discover Effect Modification by Exploratory Methods," Biometrika, 102, 767-782.
Keele, L., Titiunik, R., and Zubizarreta, J. R., (2015), "Enhancing a Geographic Regression Discontinuity Design Through Matching to Estimate the Effect of Ballot Initiatives on Voter Turnout," Journal of the Royal Statistical Society: Series A, 178, 223-239.
Kilcioglu, C., and Zubizarreta, J. R., (2016), "Maximizing the Information Content of a Balanced Matched Sample in a Study of the Economic Performance of Green Buildings," working paper.
Lu, B., Greevy, R., Xu, X., and Beck C. (2011), "Optimal Nonbipartite Matching and its Statistical Applications," The American Statistician, 65, 21-30.
Rosenbaum, P. R. (2010), Design of Observational Studies, Springer.
Rosenbaum, P. R. (2012), "Optimal Matching of an Optimally Chosen Subset in Observa- tional studies," Journal of Computational and Graphical Statistics, 21, 57-71.
Yang, D., Small, D., Silber, J. H., and Rosenbaum, P. R. (2012), "Optimal Matching With Minimal Deviation From Fine Balance in a Study of Obesity and Surgical Outcomes," Biometrics, 68, 628-636.
Yang. F., Zubizarreta, J. R., Small, D. S., Lorch, S. A., and Rosenbaum, P. R. (2014), "Dissonant Conclusions When Testing the Validity of an Instrumental Variable," The American Statistician, 68, 253-263.
Zou, J., and Zubizarreta, J. R., (2015) "Covariate Balanced Restricted Randomization: Optimal Designs, Exact Tests, and Asymptotic Results," working paper.
Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), "Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery," The American Statistician, 65, 229-238.
Zubizarreta, J. R. (2012), "Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure after Surgery," Journal of the American Statistical Association, 107, 1360-1371.
Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014), "Matching for Balance, Pairing for Heterogeneity in an Observational Study of the Effectiveness of For-profit and Not-for-profit High Schools in Chile," Annals of Applied Statistics, 8, 204-231.
Function for calculating absolute differences in means between the covariates in the treatment and control groups in terms of the original units of the covariates. Here, the absolute differences in means are normalized by the simple average of the treated and control standard deviations before matching (see Rosenbaum and Rubin 1985 for details).
absstddif(X_mat, t_ind, std_dif)
absstddif(X_mat, t_ind, std_dif)
X_mat |
matrix of covariates: a matrix of covariates used to build the rank-based Mahalanobis distance matrix. |
t_ind |
treatment indicator: a vector of zeros and ones indicating treatment (1 = treated; 0 = control). |
std_dif |
standardized differences: a scalar determining the number of absolute standardized differences. |
A vector that can be used with the mom
, near
and far
options of bmatch
and nmatch
.
Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>.
Rosenbaum, P. R., and Rubin, D. B. (1985), "Constructing a Control Group by Multivariate Matched Sampling Methods that Incorporate the Propensity Score," The American Statistician, 39, 33-38.
# Load and attach data data(lalonde) attach(lalonde) # Treatment indicator t_ind = treatment # Constrain differences in means to be at most .05 standard deviations apart mom_covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) mom_tols = absstddif(mom_covs, t_ind, .05)
# Load and attach data data(lalonde) attach(lalonde) # Treatment indicator t_ind = treatment # Constrain differences in means to be at most .05 standard deviations apart mom_covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) mom_tols = absstddif(mom_covs, t_ind, .05)
Function for optimal bipartite matching in observational studies that directly balances the observed covariates. bmatch
allows the user to enforce different forms of covariate balance in the matched samples such as moment balance (e.g., of means, variances and correlations), distributional balance (e.g., fine balance, near-fine balance, strength-k balancing) and exact matching. In observational studies where an instrumental variable is available, bmatch
can be used to handle weak instruments and strengthen them by means of the far
options (see Yang et al. 2014 for an example). bmatch
can also be used in discontinuity designs by matching units in a neighborhood of the discontinuity (see Keele et al. 2015 for details). In any of these settings, bmatch
either minimizes the total sum of covariate distances between matched units, maximizes the total number of matched units, or optimizes a combination of the two, subject to matching, covariate balancing, and representativeness constraints (see the examples below).
bmatch(t_ind, dist_mat = NULL, subset_weight = NULL, n_controls = 1, total_groups = NULL, mom = NULL, ks = NULL, exact = NULL, near_exact = NULL, fine = NULL, near_fine = NULL, near = NULL, far = NULL, solver = NULL)
bmatch(t_ind, dist_mat = NULL, subset_weight = NULL, n_controls = 1, total_groups = NULL, mom = NULL, ks = NULL, exact = NULL, near_exact = NULL, fine = NULL, near_fine = NULL, near = NULL, far = NULL, solver = NULL)
t_ind |
treatment indicator: a vector of zeros and ones indicating treatment (1 = treated; 0 = control). Please note that the data needs to be sorted in decreasing order according to this treatment indicator. |
dist_mat |
distance matrix: a matrix of positive distances between treated units (rows) and controls (columns). If |
subset_weight |
subset matching weight: a scalar that regulates the trade-off between the total sum of distances between matched pairs and the total number of matched pairs. The larger If If |
n_controls |
number of controls: a scalar defining the number of controls to be matched with a fixed rate to each treated unit. The default is |
total_groups |
total number of matched pairs: a scalar specifying the number of matched pairs to be obtained. If |
mom |
moment balance parameters: a list with three arguments,
|
ks |
Kolmogorov-Smirnov balance parameters: a list with three objects,
|
exact |
Exact matching parameters: a list with one argument,
where |
near_exact |
Near-exact matching parameters: a list with two objects,
|
fine |
Fine balance parameters: a list with one argument,
where |
near_fine |
Near-fine balance parameters: a list with two objects,
|
near |
Near matching parameters: a list with three objects,
|
far |
Far matching parameters: a list with three objects,
|
solver |
Optimization solver parameters: a list with four objects,
|
A list containing the optimal solution, with the following objects:
obj_total |
value of the objective function at the optimum; |
obj_dist_mat |
value of the total sum of distances term of the objective function at the optimum; |
t_id |
indexes of the matched treated units at the optimum; |
c_id |
indexes of the matched controls at the optimum; |
group_id |
matched pairs or groups at the optimum; |
time |
time elapsed to find the optimal solution. |
Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>.
Keele, L., Titiunik, R., and Zubizarreta, J. R., (2015), "Enhancing a Geographic Regression Discontinuity Design Through Matching to Estimate the Effect of Ballot Initiatives on Voter Turnout," Journal of the Royal Statistical Society: Series A, 178, 223-239.
Rosenbaum, P. R. (2010), Design of Observational Studies, Springer.
Rosenbaum, P. R. (2012), "Optimal Matching of an Optimally Chosen Subset in Observa- tional studies," Journal of Computational and Graphical Statistics, 21, 57-71.
Yang, D., Small, D., Silber, J. H., and Rosenbaum, P. R. (2012), "Optimal Matching With Minimal Deviation From Fine Balance in a Study of Obesity and Surgical Outcomes," Biometrics, 68, 628-36.
Yang. F., Zubizarreta, J. R., Small, D. S., Lorch, S. A., and Rosenbaum, P. R. (2014), "Dissonant Conclusions When Testing the Validity of an Instrumental Variable," The American Statistician, 68, 253-263.
Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), "Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery," The American Statistician, 65, 229-238.
Zubizarreta, J. R. (2012), "Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure after Surgery," Journal of the American Statistical Association, 107, 1360-1371.
Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014), "Matching for Balance, Pairing for Heterogeneity in an Observational Study of the Effectiveness of For-profit and Not-for-profit High Schools in Chile," Annals of Applied Statistics, 8, 204-231.
sensitivitymv, sensitivitymw.
## Uncomment the following examples ## Load, sort, and attach data #data(lalonde) #lalonde = lalonde[order(lalonde$treatment, decreasing = TRUE), ] #attach(lalonde) ################################# ## Example 1: cardinality matching ################################# ## Cardinality matching finds the largest matched sample of pairs that meets balance ## requirements. Here the balance requirements are mean balance, fine balance and ## exact matching for different covariates. The solver used is glpk with the ## approximate option. ## Treatment indicator; note that the data needs to be sorted in decreasing order ## according to this treatment indicator #t_ind = treatment #t_ind ## Distance matrix #dist_mat = NULL ## Subset matching weight #subset_weight = 1 ## Moment balance: constrain differences in means to be at most .05 standard deviations apart #mom_covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) #mom_tols = round(absstddif(mom_covs, t_ind, .05), 2) #mom = list(covs = mom_covs, tols = mom_tols) ## Fine balance #fine_covs = cbind(black, hispanic, married, nodegree) #fine = list(covs = fine_covs) ## Exact matching #exact_covs = cbind(black) #exact = list(covs = exact_covs) ## Solver options #t_max = 60*5 #solver = "glpk" #approximate = 1 #solver = list(name = solver, t_max = t_max, approximate = approximate, #round_cplex = 0, trace = 0) ## Match #out = bmatch(t_ind = t_ind, dist_mat = dist_mat, subset_weight = subset_weight, #mom = mom, fine = fine, exact = exact, solver = solver) ## Indices of the treated units and matched controls #t_id = out$t_id #c_id = out$c_id ## Time #out$time/60 ## Matched group identifier (who is matched to whom) #out$group_id ## Assess mean balance #meantab(mom_covs, t_ind, t_id, c_id) ## Assess fine balance (note here we are getting an approximate solution) #for (i in 1:ncol(fine_covs)) { # print(finetab(fine_covs[, i], t_id, c_id)) #} ## Assess exact matching balance #table(exact_covs[t_id]==exact_covs[c_id]) ################################## ## Example 2: minimum distance matching ################################## ## The goal here is to minimize the total of distances between matched pairs. In ## this example there are no covariate balance requirements. Again, the solver ## used is glpk with the approximate option ## Treatment indicator #t_ind = treatment ## Matrix of covariates #X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75) ## Distance matrix #dist_mat = distmat(t_ind, X_mat) ## Subset matching weight #subset_weight = NULL ## Total pairs to be matched #total_groups = sum(t_ind) ## Solver options #t_max = 60*5 #solver = "glpk" #approximate = 1 #solver = list(name = solver, t_max = t_max, approximate = approximate, #round_cplex = 0, trace_cplex = 0) ## Match #out = bmatch(t_ind = t_ind, dist_mat = dist_mat, total_groups = total_groups, #solver = solver) ## Indices of the treated units and matched controls #t_id = out$t_id #c_id = out$c_id ## Total of distances between matched pairs #out$obj_total ## Assess mean balance #meantab(X_mat, t_ind, t_id, c_id) ################################## ## Example 3: optimal subset matching ################################## ## Optimal subset matching pursues two competing goals at ## the same time: to minimize the total sum of covariate distances ## while matching as many observations as possible. The trade-off ## between these two goals is regulated by the parameter subset_weight ## (see Rosenbaum 2012 and Zubizarreta et al. 2013 for a discussion). ## Here the balance requirements are mean balance, near-fine balance ## and near-exact matching for different covariates. ## Again, the solver used is glpk with the approximate option. ## Treatment indicator #t_ind = treatment ## Matrix of covariates #X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75) ## Distance matrix #dist_mat = distmat(t_ind, X_mat) ## Subset matching weight #subset_weight = median(dist_mat) ## Moment balance: constrain differences in means to be at most .05 standard deviations apart #mom_covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) #mom_tols = round(absstddif(mom_covs, t_ind, .05), 2) #mom = list(covs = mom_covs, tols = mom_tols) ## Near-fine balance #near_fine_covs = cbind(married, nodegree) #near_fine_devs = rep(5, 2) #near_fine = list(covs = near_fine_covs, devs = near_fine_devs) ## Near-exact matching #near_exact_covs = cbind(black, hispanic) #near_exact_devs = rep(5, 2) #near_exact = list(covs = near_exact_covs, devs = near_exact_devs) ## Solver options #t_max = 60*5 #solver = "glpk" #approximate = 1 #solver = list(name = solver, t_max = t_max, approximate = approximate, #round_cplex = 0, trace_cplex = 0) ## Match #out = bmatch(t_ind = t_ind, dist_mat = dist_mat, subset_weight = subset_weight, #mom = mom, near_fine = near_fine, near_exact = near_exact, solver = solver) ## Indices of the treated units and matched controls #t_id = out$t_id #c_id = out$c_id ## Time #out$time/60 ## Matched group identifier (who is matched to whom) #out$group_id ## Assess mean balance (note here we are getting an approximate solution) #meantab(X_mat, t_ind, t_id, c_id) ## Assess fine balance #for (i in 1:ncol(near_fine_covs)) { # print(finetab(near_fine_covs[, i], t_id, c_id)) #} ## Assess exact matching balance #for (i in 1:ncol(near_exact_covs)) { # print(table(near_exact_covs[t_id, i]==near_exact_covs[c_id, i])) #}
## Uncomment the following examples ## Load, sort, and attach data #data(lalonde) #lalonde = lalonde[order(lalonde$treatment, decreasing = TRUE), ] #attach(lalonde) ################################# ## Example 1: cardinality matching ################################# ## Cardinality matching finds the largest matched sample of pairs that meets balance ## requirements. Here the balance requirements are mean balance, fine balance and ## exact matching for different covariates. The solver used is glpk with the ## approximate option. ## Treatment indicator; note that the data needs to be sorted in decreasing order ## according to this treatment indicator #t_ind = treatment #t_ind ## Distance matrix #dist_mat = NULL ## Subset matching weight #subset_weight = 1 ## Moment balance: constrain differences in means to be at most .05 standard deviations apart #mom_covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) #mom_tols = round(absstddif(mom_covs, t_ind, .05), 2) #mom = list(covs = mom_covs, tols = mom_tols) ## Fine balance #fine_covs = cbind(black, hispanic, married, nodegree) #fine = list(covs = fine_covs) ## Exact matching #exact_covs = cbind(black) #exact = list(covs = exact_covs) ## Solver options #t_max = 60*5 #solver = "glpk" #approximate = 1 #solver = list(name = solver, t_max = t_max, approximate = approximate, #round_cplex = 0, trace = 0) ## Match #out = bmatch(t_ind = t_ind, dist_mat = dist_mat, subset_weight = subset_weight, #mom = mom, fine = fine, exact = exact, solver = solver) ## Indices of the treated units and matched controls #t_id = out$t_id #c_id = out$c_id ## Time #out$time/60 ## Matched group identifier (who is matched to whom) #out$group_id ## Assess mean balance #meantab(mom_covs, t_ind, t_id, c_id) ## Assess fine balance (note here we are getting an approximate solution) #for (i in 1:ncol(fine_covs)) { # print(finetab(fine_covs[, i], t_id, c_id)) #} ## Assess exact matching balance #table(exact_covs[t_id]==exact_covs[c_id]) ################################## ## Example 2: minimum distance matching ################################## ## The goal here is to minimize the total of distances between matched pairs. In ## this example there are no covariate balance requirements. Again, the solver ## used is glpk with the approximate option ## Treatment indicator #t_ind = treatment ## Matrix of covariates #X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75) ## Distance matrix #dist_mat = distmat(t_ind, X_mat) ## Subset matching weight #subset_weight = NULL ## Total pairs to be matched #total_groups = sum(t_ind) ## Solver options #t_max = 60*5 #solver = "glpk" #approximate = 1 #solver = list(name = solver, t_max = t_max, approximate = approximate, #round_cplex = 0, trace_cplex = 0) ## Match #out = bmatch(t_ind = t_ind, dist_mat = dist_mat, total_groups = total_groups, #solver = solver) ## Indices of the treated units and matched controls #t_id = out$t_id #c_id = out$c_id ## Total of distances between matched pairs #out$obj_total ## Assess mean balance #meantab(X_mat, t_ind, t_id, c_id) ################################## ## Example 3: optimal subset matching ################################## ## Optimal subset matching pursues two competing goals at ## the same time: to minimize the total sum of covariate distances ## while matching as many observations as possible. The trade-off ## between these two goals is regulated by the parameter subset_weight ## (see Rosenbaum 2012 and Zubizarreta et al. 2013 for a discussion). ## Here the balance requirements are mean balance, near-fine balance ## and near-exact matching for different covariates. ## Again, the solver used is glpk with the approximate option. ## Treatment indicator #t_ind = treatment ## Matrix of covariates #X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75) ## Distance matrix #dist_mat = distmat(t_ind, X_mat) ## Subset matching weight #subset_weight = median(dist_mat) ## Moment balance: constrain differences in means to be at most .05 standard deviations apart #mom_covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) #mom_tols = round(absstddif(mom_covs, t_ind, .05), 2) #mom = list(covs = mom_covs, tols = mom_tols) ## Near-fine balance #near_fine_covs = cbind(married, nodegree) #near_fine_devs = rep(5, 2) #near_fine = list(covs = near_fine_covs, devs = near_fine_devs) ## Near-exact matching #near_exact_covs = cbind(black, hispanic) #near_exact_devs = rep(5, 2) #near_exact = list(covs = near_exact_covs, devs = near_exact_devs) ## Solver options #t_max = 60*5 #solver = "glpk" #approximate = 1 #solver = list(name = solver, t_max = t_max, approximate = approximate, #round_cplex = 0, trace_cplex = 0) ## Match #out = bmatch(t_ind = t_ind, dist_mat = dist_mat, subset_weight = subset_weight, #mom = mom, near_fine = near_fine, near_exact = near_exact, solver = solver) ## Indices of the treated units and matched controls #t_id = out$t_id #c_id = out$c_id ## Time #out$time/60 ## Matched group identifier (who is matched to whom) #out$group_id ## Assess mean balance (note here we are getting an approximate solution) #meantab(X_mat, t_ind, t_id, c_id) ## Assess fine balance #for (i in 1:ncol(near_fine_covs)) { # print(finetab(near_fine_covs[, i], t_id, c_id)) #} ## Assess exact matching balance #for (i in 1:ncol(near_exact_covs)) { # print(table(near_exact_covs[t_id, i]==near_exact_covs[c_id, i])) #}
Function for optimal cardinality matching in observational studies. cardmatch
finds the largest matched sample that is balanced and representative by design. The formulation of cardmatch
has been simplified to handle larger data than bmatch
or nmatch
. Similar to bmatch
or nmatch
, the performance of cardmatch
is greatly enhanced by using the solver
options cplex
or gurobi
.
cardmatch(t_ind, mom = NULL, fine = NULL, solver = NULL)
cardmatch(t_ind, mom = NULL, fine = NULL, solver = NULL)
t_ind |
treatment indicator: a vector of zeros and ones indicating treatment (1 = treated; 0 = control). Please note that the data needs to be sorted in decreasing order according to this treatment indicator. |
mom |
moment balance parameters: a list with three arguments,
|
fine |
Fine balance parameters: a list with one argument,
where |
solver |
Optimization solver parameters: a list with four objects,
|
A list containing the optimal solution, with the following objects:
obj_total |
value of the objective function at the optimum; |
obj_dist_mat |
value of the total sum of distances term of the objective function at the optimum; |
t_id |
indexes of the matched treated units at the optimum; |
c_id |
indexes of the matched controls at the optimum; |
group_id |
matched pairs or groups at the optimum; |
time |
time elapsed to find the optimal solution. |
Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>, Juan P. Vielma <[email protected]>.
Bennett, M., Vielma, J. P., and Zubizarreta, J. R. (2017), "Building a Representative Matched Sample with Treatment Doses," working paper.
Visconti, G., and Zubizarreta, J. R. (2017), "Finding the Largest Matched Sample that is Balanced by Design: A Case Study of the Effect of an Earthquake on Electoral Outcomes," working paper.
Zubizarreta, J. R. (2012), "Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure after Surgery," Journal of the American Statistical Association, 107, 1360-1371.
Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014), "Matching for Balance, Pairing for Heterogeneity in an Observational Study of the Effectiveness of For-profit and Not-for-profit High Schools in Chile," Annals of Applied Statistics, 8, 204-231.
sensitivitymv, sensitivitymw.
# Load, sort, and attach data data(lalonde) lalonde = lalonde[order(lalonde$treatment, decreasing = TRUE), ] attach(lalonde) ################################# # Step 1: use cardinality matching to find the largest sample of matched pairs for which # all the covariates are finely balanced. ################################# # Discretize covariates quantiles = function(covar, n_q) { p_q = seq(0, 1, 1/n_q) val_q = quantile(covar, probs = p_q, na.rm = TRUE) covar_out = rep(NA, length(covar)) for (i in 1:n_q) { if (i==1) {covar_out[covar<val_q[i+1]] = i} if (i>1 & i<n_q) {covar_out[covar>=val_q[i] & covar<val_q[i+1]] = i} if (i==n_q) {covar_out[covar>=val_q[i] & covar<=val_q[i+1]] = i}} covar_out } age_5 = quantiles(age, 5) education_5 = quantiles(education, 5) re74_5 = quantiles(re74, 5) re75_5 = quantiles(re75, 5) # Treatment indicator; note that the data needs to be sorted in decreasing order # according to this treatment indicator t_ind = treatment t_ind # Fine balance fine_covs = cbind(black, hispanic, married, nodegree, age_5, education_5, re74_5, re75_5) fine = list(covs = fine_covs) # Solver options t_max = 60*5 solver = "highs" approximate = 0 solver = list(name = solver, t_max = t_max, approximate = approximate, round_cplex = 0, trace = 0) # Match out_1 = cardmatch(t_ind, fine = fine, solver = solver) # Indices of the treated units and matched controls t_id_1 = out_1$t_id c_id_1 = out_1$c_id # Mean balance covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) meantab(covs, t_ind, t_id_1, c_id_1) # Fine balance (note here we are getting an approximate solution) for (i in 1:ncol(fine_covs)) { print(finetab(fine_covs[, i], t_id_1, c_id_1)) } ################################# # Step 2: use optimal matching (minimum distance matching) to find the (re)pairing of # treated and control that minimizes the total sum of covariate distances between matched # pairs. For this, use the function 'distmatch' which is a wrapper for 'bmatch'. ################################# # New treatment indicator t_ind_2 = t_ind[c(t_id_1, c_id_1)] table(t_ind_2) # To build the distance matrix, the idea is to use strong predictors of the outcome dist_mat_2 = abs(outer(re74[t_id_1], re74[c_id_1], "-")) dim(dist_mat_2) # Match out_2 = distmatch(t_ind_2, dist_mat_2, solver) # Indices of the treated units and matched controls t_id_2 = t_id_1[out_2$t_id] c_id_2 = c_id_1[out_2$c_id-length(out_2$c_id)] # Covariate balance is preserved... meantab(covs, t_ind, t_id_2, c_id_2) for (i in 1:ncol(fine_covs)) { print(finetab(fine_covs[, i], t_id_2, c_id_2)) } # ... but covariate distances are reduced distances_step_1 = sum(diag(dist_mat_2)) distances_step_2 = sum(diag(dist_mat_2[out_2$t_id, out_2$c_id-length(out_2$c_id)])) distances_step_1 distances_step_2 # The mean difference in outcomes is the same... mean(re78[t_id_1]-re78[c_id_1]) mean(re78[t_id_2]-re78[c_id_2]) # ... but their standard deviation is reduced sd(re78[t_id_1]-re78[c_id_1]) sd(re78[t_id_2]-re78[c_id_2])
# Load, sort, and attach data data(lalonde) lalonde = lalonde[order(lalonde$treatment, decreasing = TRUE), ] attach(lalonde) ################################# # Step 1: use cardinality matching to find the largest sample of matched pairs for which # all the covariates are finely balanced. ################################# # Discretize covariates quantiles = function(covar, n_q) { p_q = seq(0, 1, 1/n_q) val_q = quantile(covar, probs = p_q, na.rm = TRUE) covar_out = rep(NA, length(covar)) for (i in 1:n_q) { if (i==1) {covar_out[covar<val_q[i+1]] = i} if (i>1 & i<n_q) {covar_out[covar>=val_q[i] & covar<val_q[i+1]] = i} if (i==n_q) {covar_out[covar>=val_q[i] & covar<=val_q[i+1]] = i}} covar_out } age_5 = quantiles(age, 5) education_5 = quantiles(education, 5) re74_5 = quantiles(re74, 5) re75_5 = quantiles(re75, 5) # Treatment indicator; note that the data needs to be sorted in decreasing order # according to this treatment indicator t_ind = treatment t_ind # Fine balance fine_covs = cbind(black, hispanic, married, nodegree, age_5, education_5, re74_5, re75_5) fine = list(covs = fine_covs) # Solver options t_max = 60*5 solver = "highs" approximate = 0 solver = list(name = solver, t_max = t_max, approximate = approximate, round_cplex = 0, trace = 0) # Match out_1 = cardmatch(t_ind, fine = fine, solver = solver) # Indices of the treated units and matched controls t_id_1 = out_1$t_id c_id_1 = out_1$c_id # Mean balance covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) meantab(covs, t_ind, t_id_1, c_id_1) # Fine balance (note here we are getting an approximate solution) for (i in 1:ncol(fine_covs)) { print(finetab(fine_covs[, i], t_id_1, c_id_1)) } ################################# # Step 2: use optimal matching (minimum distance matching) to find the (re)pairing of # treated and control that minimizes the total sum of covariate distances between matched # pairs. For this, use the function 'distmatch' which is a wrapper for 'bmatch'. ################################# # New treatment indicator t_ind_2 = t_ind[c(t_id_1, c_id_1)] table(t_ind_2) # To build the distance matrix, the idea is to use strong predictors of the outcome dist_mat_2 = abs(outer(re74[t_id_1], re74[c_id_1], "-")) dim(dist_mat_2) # Match out_2 = distmatch(t_ind_2, dist_mat_2, solver) # Indices of the treated units and matched controls t_id_2 = t_id_1[out_2$t_id] c_id_2 = c_id_1[out_2$c_id-length(out_2$c_id)] # Covariate balance is preserved... meantab(covs, t_ind, t_id_2, c_id_2) for (i in 1:ncol(fine_covs)) { print(finetab(fine_covs[, i], t_id_2, c_id_2)) } # ... but covariate distances are reduced distances_step_1 = sum(diag(dist_mat_2)) distances_step_2 = sum(diag(dist_mat_2[out_2$t_id, out_2$c_id-length(out_2$c_id)])) distances_step_1 distances_step_2 # The mean difference in outcomes is the same... mean(re78[t_id_1]-re78[c_id_1]) mean(re78[t_id_2]-re78[c_id_2]) # ... but their standard deviation is reduced sd(re78[t_id_1]-re78[c_id_1]) sd(re78[t_id_2]-re78[c_id_2])
Function for building a normalized rank-based Mahalanobis distance matrix with a penalty for caliper violation.
distmat(t_ind, X_mat, calip_cov = NULL, calip_size = NULL, calip_penalty = NULL, near_exact_covs = NULL, near_exact_penalties = NULL, digits = 1)
distmat(t_ind, X_mat, calip_cov = NULL, calip_size = NULL, calip_penalty = NULL, near_exact_covs = NULL, near_exact_penalties = NULL, digits = 1)
t_ind |
treatment indicator: a vector of zeros and ones indicating treatment (1 = treated; 0 = control). |
X_mat |
matrix of covariates: a matrix of covariates used to build the based Mahalanobis distance matrix. |
calip_cov |
caliper covariate: a covariate vector used to define the caliper. In most applications this is the propensity score, but a covariate can be used as well. |
calip_size |
caliper size: a scalar that determines the size of the caliper for which there will be no penalty. Most applications use |
calip_penalty |
a scalar used to multiply the magnitude of the violation of the caliper. |
near_exact_covs |
a matrix of covariates used for near-exact matching. |
near_exact_penalties |
a vector of scalars used for near-exact matching. The length of |
digits |
a scalar indicating the number of digits used to produce each entry of the distance matrix. The default is 1 digit. |
distmat
is a function for building a normalized rank-based Mahalanobis distance matrix with a penalty for caliper violations on a covariate (say, the propensity score) and penalties for near-exact matching.
As explained in Rosenbaum (2010), the use of a rank-based Mahalanobis distance prevents an outlier from inflating the variance for a variable, and it thus decreases its importance in the matching. In the calculation of the matrix the variances are constrained to not decrease as ties become more common, so that it is not more important to match on a rare binary variable than on a common binary one. The penalty for caliper violations ensures good balance on the propensity score or the covariate used. In this way the rank-based Mahalanobis distance with a penalty for caliper violations in the propensity score constitutes a robust distance for matching.
As explained in Zubizarreta et al. (2011), the distance matrix can also be modified for near-exact matching. Penalties are added to the distance matrix every time that a treated and a control unit have a different value for the corresponding near-exact matching covariate.
A matrix that can be used for optimal matching with the bmatch
functions in the designmatch
package.
Rosenbaum, P. R. (2010), Design of Observational Studies, Springer.
Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), "Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery," The American Statistician, 65, 229-238.
# Load data data(germancities) attach(germancities) # Treatment indicator t_ind = treat # Matrix of covariates X_mat = cbind(log2pop, popgrowth1939, popgrowth3339, emprate, indrate, rubble, rubblemiss, flats, flatsmiss, refugees) # Distance matrix dist_mat = distmat(t_ind, X_mat)
# Load data data(germancities) attach(germancities) # Treatment indicator t_ind = treat # Matrix of covariates X_mat = cbind(log2pop, popgrowth1939, popgrowth3339, emprate, indrate, rubble, rubblemiss, flats, flatsmiss, refugees) # Distance matrix dist_mat = distmat(t_ind, X_mat)
Function for optimal distance matching in observational studies. distmatch
minimizes the total sum of covariate distances between matches. distmatch
is a wrapper to bmatch
.
distmatch(t_ind, dist_mat = NULL, solver = NULL)
distmatch(t_ind, dist_mat = NULL, solver = NULL)
t_ind |
treatment indicator: a vector of zeros and ones indicating treatment (1 = treated; 0 = control). Please note that the data needs to be sorted in decreasing order according to this treatment indicator. |
dist_mat |
distance matrix: a matrix of positive distances between treated units (rows) and controls (columns). If |
solver |
Optimization solver parameters: a list with four objects,
|
A list containing the optimal solution, with the following objects:
obj_total |
value of the objective function at the optimum; |
obj_dist_mat |
value of the total sum of distances term of the objective function at the optimum; |
t_id |
indexes of the matched treated units at the optimum; |
c_id |
indexes of the matched controls at the optimum; |
group_id |
matched pairs or groups at the optimum; |
time |
time elapsed to find the optimal solution. |
Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>.
Rosenbaum, P. R. (2010), Design of Observational Studies, Springer.
sensitivitymv, sensitivitymw.
# Load, sort, and attach data data(lalonde) lalonde = lalonde[order(lalonde$treatment, decreasing = TRUE), ] attach(lalonde) ################################# # Step 1: use cardinality matching to find the largest sample of matched pairs for which # all the covariates are finely balanced. ################################# # Discretize covariates quantiles = function(covar, n_q) { p_q = seq(0, 1, 1/n_q) val_q = quantile(covar, probs = p_q, na.rm = TRUE) covar_out = rep(NA, length(covar)) for (i in 1:n_q) { if (i==1) {covar_out[covar<val_q[i+1]] = i} if (i>1 & i<n_q) {covar_out[covar>=val_q[i] & covar<val_q[i+1]] = i} if (i==n_q) {covar_out[covar>=val_q[i] & covar<=val_q[i+1]] = i}} covar_out } age_5 = quantiles(age, 5) education_5 = quantiles(education, 5) re74_5 = quantiles(re74, 5) re75_5 = quantiles(re75, 5) # Treatment indicator; note that the data needs to be sorted in decreasing order # according to this treatment indicator t_ind = treatment t_ind # Fine balance fine_covs = cbind(black, hispanic, married, nodegree, age_5, education_5, re74_5, re75_5) fine = list(covs = fine_covs) # Solver options t_max = 60*5 solver = "highs" approximate = 0 solver = list(name = solver, t_max = t_max, approximate = approximate, round_cplex = 0, trace = 0) # Match out_1 = cardmatch(t_ind, fine = fine, solver = solver) # Indices of the treated units and matched controls t_id_1 = out_1$t_id c_id_1 = out_1$c_id # Mean balance covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) meantab(covs, t_ind, t_id_1, c_id_1) # Fine balance (note here we are getting an approximate solution) for (i in 1:ncol(fine_covs)) { print(finetab(fine_covs[, i], t_id_1, c_id_1)) } ################################# # Step 2: use optimal matching (minimum distance matching) to find the (re)pairing of # treated and control that minimizes the total sum of covariate distances between matched # pairs. For this, use the function 'distmatch' which is a wrapper for 'bmatch'. ################################# # New treatment indicator t_ind_2 = t_ind[c(t_id_1, c_id_1)] table(t_ind_2) # To build the distance matrix, the idea is to use strong predictors of the outcome dist_mat_2 = abs(outer(re74[t_id_1], re74[c_id_1], "-")) dim(dist_mat_2) # Match out_2 = distmatch(t_ind_2, dist_mat_2, solver) # Indices of the treated units and matched controls t_id_2 = t_id_1[out_2$t_id] c_id_2 = c_id_1[out_2$c_id-length(out_2$c_id)] # Covariate balance is preserved... meantab(covs, t_ind, t_id_2, c_id_2) for (i in 1:ncol(fine_covs)) { print(finetab(fine_covs[, i], t_id_2, c_id_2)) } # ... but covariate distances are reduced distances_step_1 = sum(diag(dist_mat_2)) distances_step_2 = sum(diag(dist_mat_2[out_2$t_id, out_2$c_id-length(out_2$c_id)])) distances_step_1 distances_step_2 # The mean difference in outcomes is the same... mean(re78[t_id_1]-re78[c_id_1]) mean(re78[t_id_2]-re78[c_id_2]) # ... but their standard deviation is reduced sd(re78[t_id_1]-re78[c_id_1]) sd(re78[t_id_2]-re78[c_id_2])
# Load, sort, and attach data data(lalonde) lalonde = lalonde[order(lalonde$treatment, decreasing = TRUE), ] attach(lalonde) ################################# # Step 1: use cardinality matching to find the largest sample of matched pairs for which # all the covariates are finely balanced. ################################# # Discretize covariates quantiles = function(covar, n_q) { p_q = seq(0, 1, 1/n_q) val_q = quantile(covar, probs = p_q, na.rm = TRUE) covar_out = rep(NA, length(covar)) for (i in 1:n_q) { if (i==1) {covar_out[covar<val_q[i+1]] = i} if (i>1 & i<n_q) {covar_out[covar>=val_q[i] & covar<val_q[i+1]] = i} if (i==n_q) {covar_out[covar>=val_q[i] & covar<=val_q[i+1]] = i}} covar_out } age_5 = quantiles(age, 5) education_5 = quantiles(education, 5) re74_5 = quantiles(re74, 5) re75_5 = quantiles(re75, 5) # Treatment indicator; note that the data needs to be sorted in decreasing order # according to this treatment indicator t_ind = treatment t_ind # Fine balance fine_covs = cbind(black, hispanic, married, nodegree, age_5, education_5, re74_5, re75_5) fine = list(covs = fine_covs) # Solver options t_max = 60*5 solver = "highs" approximate = 0 solver = list(name = solver, t_max = t_max, approximate = approximate, round_cplex = 0, trace = 0) # Match out_1 = cardmatch(t_ind, fine = fine, solver = solver) # Indices of the treated units and matched controls t_id_1 = out_1$t_id c_id_1 = out_1$c_id # Mean balance covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75) meantab(covs, t_ind, t_id_1, c_id_1) # Fine balance (note here we are getting an approximate solution) for (i in 1:ncol(fine_covs)) { print(finetab(fine_covs[, i], t_id_1, c_id_1)) } ################################# # Step 2: use optimal matching (minimum distance matching) to find the (re)pairing of # treated and control that minimizes the total sum of covariate distances between matched # pairs. For this, use the function 'distmatch' which is a wrapper for 'bmatch'. ################################# # New treatment indicator t_ind_2 = t_ind[c(t_id_1, c_id_1)] table(t_ind_2) # To build the distance matrix, the idea is to use strong predictors of the outcome dist_mat_2 = abs(outer(re74[t_id_1], re74[c_id_1], "-")) dim(dist_mat_2) # Match out_2 = distmatch(t_ind_2, dist_mat_2, solver) # Indices of the treated units and matched controls t_id_2 = t_id_1[out_2$t_id] c_id_2 = c_id_1[out_2$c_id-length(out_2$c_id)] # Covariate balance is preserved... meantab(covs, t_ind, t_id_2, c_id_2) for (i in 1:ncol(fine_covs)) { print(finetab(fine_covs[, i], t_id_2, c_id_2)) } # ... but covariate distances are reduced distances_step_1 = sum(diag(dist_mat_2)) distances_step_2 = sum(diag(dist_mat_2[out_2$t_id, out_2$c_id-length(out_2$c_id)])) distances_step_1 distances_step_2 # The mean difference in outcomes is the same... mean(re78[t_id_1]-re78[c_id_1]) mean(re78[t_id_2]-re78[c_id_2]) # ... but their standard deviation is reduced sd(re78[t_id_1]-re78[c_id_1]) sd(re78[t_id_2]-re78[c_id_2])
Function that plots the empirical cumulative distribution function of a given covariate for treated units and matched controls.
ecdfplot
can be used to visually inspect the balance of the entire empirical distribution function of the covariate in question.
ecdfplot(x, t_id, c_id, main_title = "", legend_position = "right")
ecdfplot(x, t_id, c_id, main_title = "", legend_position = "right")
x |
a covariate vector to be used to assess balance. |
t_id |
a vector of indexes of the treated units. |
c_id |
a vector of indexes of the matched controls. |
main_title |
a string defining the main title of the plot. |
legend_position |
a string specifying the position of the legend.
The default is |
Function that plots the empirical cumulative distribution function of a given covariate for treated units and matched controls.
ecdfplot
can be used to visually inspect the balance of the entire empirical distribution function of the covariate in question.
Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>.
# Load data data(germancities) # Sort and attach data germancities = germancities[order(germancities$treat, decreasing = TRUE), ] attach(germancities) # Treatment indicator t_ind = treat # Indexes of the treated units t_id = which(t_ind == 1) # Indexes of the controls before matching c_id_before = which(t_ind == 0) # Indixes of the matched controls (obtained using bmatch in designmatch) c_id_after = c(80, 82, 35, 59, 69, 68, 34, 62, 104, 61, 106, 120, 56, 119, 28, 113, 76, 118, 75, 71) # ecdfplot par(mfrow = c(2, 1)) ecdfplot(rubble, t_id, c_id_before, "Before matching") ecdfplot(rubble, t_id, c_id_after, "After matching")
# Load data data(germancities) # Sort and attach data germancities = germancities[order(germancities$treat, decreasing = TRUE), ] attach(germancities) # Treatment indicator t_ind = treat # Indexes of the treated units t_id = which(t_ind == 1) # Indexes of the controls before matching c_id_before = which(t_ind == 0) # Indixes of the matched controls (obtained using bmatch in designmatch) c_id_after = c(80, 82, 35, 59, 69, 68, 34, 62, 104, 61, 106, 120, 56, 119, 28, 113, 76, 118, 75, 71) # ecdfplot par(mfrow = c(2, 1)) ecdfplot(rubble, t_id, c_id_before, "Before matching") ecdfplot(rubble, t_id, c_id_after, "After matching")
Function for tabulating the marginal distributions of a nominal covariate for the treated units and matched controls.
finetab(nom_cov, t_id, c_id)
finetab(nom_cov, t_id, c_id)
nom_cov |
a nominal covariate vector used to assess balance. |
t_id |
a vector of indexes of the treated units. |
c_id |
a vector of indexes of the matched controls. |
finetab
is a function for tabulating the marginal distributions of a nominal covariate for the treated units and matched controls.
finetab
is useful for assessing covariate balance after matching with after exact, near-exact matching, fine and near-balance with the bmatch
or nmatch
functions in the designmatch
package.
A table with the counts for the treated units and matched controls for each category of a nominal covariate.
Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>.
# Load data data(germancities) # Sort and attach data germancities = germancities[order(germancities$treat, decreasing = TRUE), ] attach(germancities) # Treatment indicator t_ind = treat # Indexes of the treated units t_id = which(t_ind == 1) # Indixes of the matched controls (obtained using bmatch in designmatch) c_id = c(80, 82, 35, 59, 69, 68, 34, 62, 104, 61, 106, 120, 56, 119, 28, 113, 76, 118, 75, 71) # finetab finetab(publicat, t_id, c_id) finetab(busiservcat, t_id, c_id)
# Load data data(germancities) # Sort and attach data germancities = germancities[order(germancities$treat, decreasing = TRUE), ] attach(germancities) # Treatment indicator t_ind = treat # Indexes of the treated units t_id = which(t_ind == 1) # Indixes of the matched controls (obtained using bmatch in designmatch) c_id = c(80, 82, 35, 59, 69, 68, 34, 62, 104, 61, 106, 120, 56, 119, 28, 113, 76, 118, 75, 71) # finetab finetab(publicat, t_id, c_id) finetab(busiservcat, t_id, c_id)
This is part of the data used by Redding and Sturm (2008) to study the impact of market access on economic development in West German cities after the division of Germany after the Second World War. There are 119 rows corresponding to different cities and 21 columns that stand for different variables. These variables are: one treatment indicator, 15 baseline covariates, and five outcomes. Treated cities are those West German cities within 75 kilometers of the border between East and West Germany after the Second World War (see Redding and Sturm (2008) for details). The complete dataset is available at http://www.aeaweb.org/articles.php?doi=10.1257/aer.98.5.1766.
data(germancities)
data(germancities)
A data frame with 122 observations corresponding to 20 treated and 102 control cities. The treatment assignment indicator is the first column of the data frame: treat (1 = treated; 0 = control). The next 15 columns are the covariates:
log2pop, logarithm base 2 of the population in each city in 1939;
popgrowth1939, population growth in each city from 1919 to 1939;
popgrowth3339, population growth in each city from 1919 to 1939;
emprate, employment rates in each city in 1939;
indrate, industry rates in each city in 1939;
rubble, amount of rubble in cubic meters per capita in each city in 1939;
rubblemiss, missing data indicator for rubble; the missing values were imputed with the mean;
flats, number of destroyed dwellings in each city in 1939 as a percentage of the stock of dwelling;
flatsmiss, missing data indicator for flats; the missing values were imputed with the mean;
refugees, proportion of each city's population that identified themselves as refugees in 1939;
educat, categories for the employment rates in the educational sector in each city in 1939;
publicat, categories for the employment rates in the public administration sector in each city in 1939;
busiservcat, categories for the employment rates in the bussiness services sector in each city in 1939;
mineralcat, categories for the employment rates in the minerals sector in each city in 1939;
transcat, categories for the employment rates in the transport sector in each city in 1939.
The last five columns of the data frame are outcomes: pop50, pop60, pop70, pop80 and pop88, the populations in each city in 1950, 1960, 1970, 1980 and 1988, respectively.
http://www.aeaweb.org/articles.php?doi=10.1257/aer.98.5.1766
Redding, S. J., and Daniel M. S. (2008), "The Costs of Remoteness: Evidence from German Division and Reunification," American Economic Review, 98, 1766-1797.
This is one of the data sets from the National Supported Work Demonstration used by Dehejia and Wahba (1999) to evaluate propensity score matching methods. This and other related data sets are available at https://users.nber.org/~rdehejia/nswdata2.html.
data(lalonde)
data(lalonde)
A data frame with 445 observations, corresponding to 185 treated and 260 control subjects, and 10 variables.
The treatment assignment indicator is the first variable of the data frame: treatment
(1 = treated; 0 = control).
The next 7 columns are the covariates:
age
, measured in years;
education
, measured in years;
black
, indicating race (1 if black, 0 otherwise);
hispanic
, indicating race (1 if Hispanic, 0 otherwise);
married
, indicating marital status (1 if married, 0 otherwise);
nodegree
, indicating high school diploma (1 if no degree, 0 otherwise);
re74
, real earnings in 1974;
re75
, real earnings in 1975.
The last variable of the data frame is re78
, the real the earnings in 1978.
https://users.nber.org/~rdehejia/nswdata2.html
Dehejia, R., and Wahba, S. (1999), "Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs," Journal of the American Statistical Association, 94, 1053-1062.
Lalonde, R. (1986), "Evaluating the Econometric Evaluations of Training Programs," American Economic Review, 76, 604-620.
Function that creates a Love plot for assessing covariate balance after matching.
loveplot(X_mat, t_id, c_id, v_line, legend_position = "topright")
loveplot(X_mat, t_id, c_id, v_line, legend_position = "topright")
X_mat |
matrix of covariates: a matrix of covariates used to assess balance. |
t_id |
a vector of indexes of the treated units. |
c_id |
a vector of indexes of the matched controls. |
v_line |
a scalar defining the location of the vertical line that denotes a satisfactory balance. |
legend_position |
a string specifying the position of the legend.
The default is |
In the spirit of Love (2004), loveplot
draws a love plot for assessing covariate balance after matching.
Specifically, loveplot
plots the absolute standardized differences in means before and after matching for all the covariates specified in X_mat
.
Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>.
Love, T. (2004), "Graphical Display of Covariate Balance," http://chrp.org/love/JSM2004RoundTableHandout.pdf.
# Load data data(germancities) # Sort and attach data germancities = germancities[order(germancities$treat, decreasing = TRUE), ] attach(germancities) # Treatment indicator t_ind = treat # Indexes of the treated units t_id = which(t_ind == 1) # Matrix of covariates X_mat = cbind(log2pop, popgrowth1939, popgrowth3339, emprate, indrate, rubble, rubblemiss, flats, flatsmiss, refugees) # Indices of the matched controls (obtained using bmatch in designmatch) c_id = c(67, 75, 39, 104, 38, 93, 79, 59, 64, 55, 106, 99, 97, 61, 82, 57, 76, 47, 46, 49) # Vertical line for satisfactory balance vline = 0.15 # loveplot loveplot(X_mat, t_id, c_id, vline)
# Load data data(germancities) # Sort and attach data germancities = germancities[order(germancities$treat, decreasing = TRUE), ] attach(germancities) # Treatment indicator t_ind = treat # Indexes of the treated units t_id = which(t_ind == 1) # Matrix of covariates X_mat = cbind(log2pop, popgrowth1939, popgrowth3339, emprate, indrate, rubble, rubblemiss, flats, flatsmiss, refugees) # Indices of the matched controls (obtained using bmatch in designmatch) c_id = c(67, 75, 39, 104, 38, 93, 79, 59, 64, 55, 106, 99, 97, 61, 82, 57, 76, 47, 46, 49) # Vertical line for satisfactory balance vline = 0.15 # loveplot loveplot(X_mat, t_id, c_id, vline)
Function for tabulating the means and other basic statistics useful to assess covariate balance after matching.
meantab(X_mat, t_ind, t_id, c_id, exact = NULL, digits = 2)
meantab(X_mat, t_ind, t_id, c_id, exact = NULL, digits = 2)
X_mat |
matrix of covariates: a matrix of covariates used to assess balance. |
t_ind |
treatment indicator: a vector of zeros and ones indicating treatment (1 = treated; 0 = control). |
t_id |
a vector of indexes of the treated units. |
c_id |
a vector of indexes of the matched controls. |
exact |
a vector of characters equal to "f" or "w" indicating whether Fisher's exact test or Wilcoxon rank-sum test should be used for binary (or categorical) and continous covariates, respectively. Otherwise, if exact |
digits |
a scalar indicating the number of digits to display in the columns of the table. |
meantab
is a function for tabulating the means and other basic statistics useful to assess covariate balance after matching.
A table with the following columns:
Mis |
proportion of missing values for each covariate; |
Min |
minimum value for each covariate; |
Max |
maximum value for each covariate; |
Mean T |
mean of the treated units for each covariate; |
Mean C |
mean of the matched controls for each covariate; |
Std Dif |
standardized differences in means after matching for each covariate; |
P-val |
P-values for t-tests for differences in means between treated units and matched controls for each covariate. |
# Load data data(germancities) # Sort and attach data germancities = germancities[order(germancities$treat, decreasing = TRUE), ] attach(germancities) # Treatment indicator t_ind = treat # Indexes of the treated units t_id = which(t_ind == 1) # Matrix of covariates X_mat = cbind(log2pop, popgrowth1939, popgrowth3339, emprate, indrate, rubble, rubblemiss, flats, flatsmiss, refugees) # Indices of the matched controls (obtained using bmatch in designmatch) c_id = c(67, 75, 39, 104, 38, 93, 79, 59, 64, 55, 106, 99, 97, 61, 82, 57, 76, 47, 46, 49) # meantab meantab(X_mat, t_ind, t_id, c_id) # meantab meantab(X_mat, t_ind, t_id, c_id, exact = c(rep("w", 6), "f", "w", "f", "w"), digits = 3)
# Load data data(germancities) # Sort and attach data germancities = germancities[order(germancities$treat, decreasing = TRUE), ] attach(germancities) # Treatment indicator t_ind = treat # Indexes of the treated units t_id = which(t_ind == 1) # Matrix of covariates X_mat = cbind(log2pop, popgrowth1939, popgrowth3339, emprate, indrate, rubble, rubblemiss, flats, flatsmiss, refugees) # Indices of the matched controls (obtained using bmatch in designmatch) c_id = c(67, 75, 39, 104, 38, 93, 79, 59, 64, 55, 106, 99, 97, 61, 82, 57, 76, 47, 46, 49) # meantab meantab(X_mat, t_ind, t_id, c_id) # meantab meantab(X_mat, t_ind, t_id, c_id, exact = c(rep("w", 6), "f", "w", "f", "w"), digits = 3)
Function for optimal nonbipartite matching in randomized experiments and observational studies that directly balances the observed covariates. nmatch
allows the user to enforce different forms of covariate balance in the matched samples, such as moment balance (e.g., of means, variances, and correlations), distributional balance (e.g., fine balance, near-fine balance, strength-k balancing), and exact matching. Among others, nmatch
can be used in the design of randomized experiments for matching before randomization (Greevy et al. 2004, Zou and Zubizarreta 2016), and in observational studies for matching with doses and strengthening an instrumental variable (Baiocchi et al. 2010, Lu et al. 2011).
nmatch(dist_mat, subset_weight = NULL, total_pairs = NULL, mom = NULL, exact = NULL, near_exact = NULL, fine = NULL, near_fine = NULL, near = NULL, far = NULL, solver = NULL)
nmatch(dist_mat, subset_weight = NULL, total_pairs = NULL, mom = NULL, exact = NULL, near_exact = NULL, fine = NULL, near_fine = NULL, near = NULL, far = NULL, solver = NULL)
dist_mat |
distance matrix: a matrix of positive distances between units. |
subset_weight |
subset matching weight: a scalar that regulates the trade-off between the total sum of distances between matched pairs and the total number of matched pairs. The larger |
total_pairs |
total number of matched pairs: a scalar specifying the number of matched pairs to be obtained. If |
mom |
moment balance parameters: a list with three arguments,
|
exact |
Exact matching parameters: a list with one argument,
where |
near_exact |
Near-exact matching parameters: a list with two arguments,
|
fine |
Fine balance parameters: a list with one argument,
where |
near_fine |
Near-fine balance parameters: a list with two arguments,
|
near |
Near matching parameters: a list with three arguments,
|
far |
Far matching parameters: a list with three arguments,
|
solver |
Optimization solver parameters: a list with four objects,
|
A list containing the optimal solution, with the following objects:
obj_total |
value of the objective function at the optimum; |
obj_dist_mat |
value of the total sum of distances term of the objective function at the optimum; |
id_1 |
indexes of the matched units in group 1 at the optimum; |
id_2 |
indexes of the matched units in group 2 at the optimum; |
group_id |
matched pairs at the optimum; |
time |
time elapsed to find the optimal solution. |
Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>.
Baiocchi, M., Small, D., Lorch, S. and Rosenbaum, P. R. (2010), "Building a Stronger Instrument in an Observational Study of Perinatal Care for Premature Infants," Journal of the American Statistical Association, 105, 1285-1296.
Greevy, R., Lu, B., Silber, J. H., and Rosenbaum, P. R. (2004), "Optimal Multivariate Matching Before Randomization," Biostatistics, 5, 263-275.
Lu, B., Greevy, R., Xu, X., and Beck C. (2011), "Optimal Nonbipartite Matching and its Statistical Applications," The American Statistician, 65, 21-30.
Rosenbaum, P. R. (2010), Design of Observational Studies, Springer.
Rosenbaum, P. R. (2012), "Optimal Matching of an Optimally Chosen Subset in Observa- tional studies," Journal of Computational and Graphical Statistics, 21, 57-71.
Yang. F., Zubizarreta, J. R., Small, D. S., Lorch, S. A., and Rosenbaum, P. R. (2014), "Dissonant Conclusions When Testing the Validity of an Instrumental Variable," The American Statistician, 68, 253-263.
Zou, J., and Zubizarreta, J. R. (2016), "Covariate Balanced Restricted Randomization: Optimal Designs, Exact Tests, and Asymptotic Results," working paper.
Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), "Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery," The American Statistician, 65, 229-238.
Zubizarreta, J. R. (2012), "Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure after Surgery," Journal of the American Statistical Association, 107, 1360-1371.
## Uncomment the following example ## Load and attach data #data(lalonde) #attach(lalonde) ################################# ## Example: optimal subset matching ################################# ## Optimal subset matching pursues two competing goals at ## the same time: to minimize the total of distances while ## matching as many observations as possible. The trade-off ## between these two is regulated by the parameter subset_weight ## (see Rosenbaum 2012 and Zubizarreta et al. 2013 for a discussion). ## Here the balance requirements are mean and fine balance for ## different covariates. We require 50 pairs to be matched. ## Again, the solver used is HiGHS with the approximate option. ## Matrix of covariates #X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75) ## Distance matrix #dist_mat_covs = round(dist(X_mat, diag = TRUE, upper = TRUE), 1) #dist_mat = as.matrix(dist_mat_covs) ## Subset matching weight #subset_weight = 1 ## Total pairs to be matched #total_pairs = 50 ## Moment balance: constrain differences in means to be at most .1 standard deviations apart #mom_covs = cbind(age, education) #mom_tols = apply(mom_covs, 2, sd)*.1 #mom = list(covs = mom_covs, tols = mom_tols) ## Solver options #t_max = 60*5 #solver = "highs" #approximate = 1 #solver = list(name = solver, t_max = t_max, approximate = approximate, round_cplex = 0, #trace_cplex = 0) ## Match #out = nmatch(dist_mat = dist_mat, subset_weight = subset_weight, total_pairs = total_pairs, #mom = mom, solver = solver) ## Indices of the treated units and matched controls #id_1 = out$id_1 #id_2 = out$id_2 ## Assess mean balance #a = apply(mom_covs[id_1, ], 2, mean) #b = apply(mom_covs[id_2, ], 2, mean) #tab = round(cbind(a, b, a-b, mom_tols), 2) #colnames(tab) = c("Mean 1", "Mean 2", "Diffs", "Tols") #tab ## Assess fine balance (note here we are getting an approximate solution) #for (i in 1:ncol(fine_covs)) { # print(finetab(fine_covs[, i], id_1, id_2)) #}
## Uncomment the following example ## Load and attach data #data(lalonde) #attach(lalonde) ################################# ## Example: optimal subset matching ################################# ## Optimal subset matching pursues two competing goals at ## the same time: to minimize the total of distances while ## matching as many observations as possible. The trade-off ## between these two is regulated by the parameter subset_weight ## (see Rosenbaum 2012 and Zubizarreta et al. 2013 for a discussion). ## Here the balance requirements are mean and fine balance for ## different covariates. We require 50 pairs to be matched. ## Again, the solver used is HiGHS with the approximate option. ## Matrix of covariates #X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75) ## Distance matrix #dist_mat_covs = round(dist(X_mat, diag = TRUE, upper = TRUE), 1) #dist_mat = as.matrix(dist_mat_covs) ## Subset matching weight #subset_weight = 1 ## Total pairs to be matched #total_pairs = 50 ## Moment balance: constrain differences in means to be at most .1 standard deviations apart #mom_covs = cbind(age, education) #mom_tols = apply(mom_covs, 2, sd)*.1 #mom = list(covs = mom_covs, tols = mom_tols) ## Solver options #t_max = 60*5 #solver = "highs" #approximate = 1 #solver = list(name = solver, t_max = t_max, approximate = approximate, round_cplex = 0, #trace_cplex = 0) ## Match #out = nmatch(dist_mat = dist_mat, subset_weight = subset_weight, total_pairs = total_pairs, #mom = mom, solver = solver) ## Indices of the treated units and matched controls #id_1 = out$id_1 #id_2 = out$id_2 ## Assess mean balance #a = apply(mom_covs[id_1, ], 2, mean) #b = apply(mom_covs[id_2, ], 2, mean) #tab = round(cbind(a, b, a-b, mom_tols), 2) #colnames(tab) = c("Mean 1", "Mean 2", "Diffs", "Tols") #tab ## Assess fine balance (note here we are getting an approximate solution) #for (i in 1:ncol(fine_covs)) { # print(finetab(fine_covs[, i], id_1, id_2)) #}
Function for visualizing matched pairs in two dimensions.
pairsplot(cov1, cov2, t_id, c_id, xlab, ylab, main)
pairsplot(cov1, cov2, t_id, c_id, xlab, ylab, main)
cov1 |
a vector for the covariate to be plotted on the x axis. |
cov2 |
a vector for the covariate to be plotted on the y axis. |
t_id |
a vector of indexes of the treated units. |
c_id |
a vector of indexes of the matched controls. |
xlab |
a string specifying the label of the x axis. |
ylab |
a string specifying the label of the y axis. |
main |
a string specifying the main title of the plot. |
pairsplot
is a function for visualizing matched pairs in two dimensions, usually defined by two of the matching covariates.
Matched pairs are connected by line segments.
Horizontal and vertical lines show the means of treated units and matched controls for each of the covariates.
Among others, pairsplot
can be useful for visualizing near/far matches, e.g. when building a stronger instrumental variable (Baiocchi et al., 2010).
Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>.
Baiocchi, M., Small, D., Lorch, S. and Rosenbaum, P. R. (2010), "Building a Stronger Instrument in an Observational Study of Perinatal Care for Premature Infants," Journal of the American Statistical Association, 105, 1285-1296.
# Load data data(germancities) # Sort and attach data germancities = germancities[order(germancities$treat, decreasing = TRUE), ] attach(germancities) # Treatment indicator t_ind = treat # Indexes of the treated units t_id = which(t_ind == 1) # Indices of the matched controls (obtained using bmatch in designmatch) c_id = c(67, 75, 39, 104, 38, 93, 79, 59, 64, 55, 106, 99, 97, 61, 82, 57, 76, 47, 46, 49) # pairsplot pairsplot(rubble, flats, t_id, c_id, "Rubble", "Flats", "")
# Load data data(germancities) # Sort and attach data germancities = germancities[order(germancities$treat, decreasing = TRUE), ] attach(germancities) # Treatment indicator t_ind = treat # Indexes of the treated units t_id = which(t_ind == 1) # Indices of the matched controls (obtained using bmatch in designmatch) c_id = c(67, 75, 39, 104, 38, 93, 79, 59, 64, 55, 106, 99, 97, 61, 82, 57, 76, 47, 46, 49) # pairsplot pairsplot(rubble, flats, t_id, c_id, "Rubble", "Flats", "")
Function for optimal profile matching to construct matched samples that are balanced toward a user-specified covariate profile. This covariate profile can represent a specific population or a target individual, facilitating the generalization and personalization of causal inferences (Cohn and Zubizarreta 2022). For each treatment group reservoir, profmatch
finds the largest sample that is balanced relative to the profile. The formulation of profmatch
has been simplified to handle larger data than bmatch
or nmatch
. Similar to bmatch
or nmatch
, the performance of profmatch
is greatly enhanced by using the solver
options cplex
or gurobi
.
profmatch(t_ind, mom, solver = NULL)
profmatch(t_ind, mom, solver = NULL)
t_ind |
treatment indicator: a vector indicating treatment group of each observation. |
mom |
moment balance parameters: a list with three arguments,
|
solver |
Optimization solver parameters: a list with four objects,
|
A list containing the optimal solution, with the following objects:
obj_totals |
values of the objective functions at the optima (one value for each treatment group matching problem); |
ids |
indices of the matched units at the optima; |
times |
time elapsed to find the optimal solutions (one value for each treatment group matching problem). |
Eric R. Cohn <[email protected]>, Jose R. Zubizarreta <[email protected]>, Cinar Kilcioglu <[email protected]>, Juan P. Vielma <[email protected]>.
Zubizarreta, J. R. (2012), "Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure after Surgery," Journal of the American Statistical Association, 107, 1360-1371.
Cohn, E. R. and Zubizarreta, J. R. (2022) "Profile Matching for the Generalization and Personalization of Causal Inferences," Epidemiology
sensitivitymv, sensitivitymw.
### Load, sort, and attach data #data(lalonde) #lalonde = lalonde[order(lalonde$treatment, decreasing = TRUE), ] #attach(lalonde) ### Specify covariates #covs = c("age", "education", "black", "hispanic", "married", "nodegree", "re74", "re75") ### Vector of treatment group indicators #t_ind = lalonde$treatment ### Covariate matrix #mom_covs = as.matrix(lalonde[, covs]) ### Tolerances will be 0.05 * each covariate's standard deviation #mom_sds = apply(lalonde[, covs], 2, sd) #mom_tols = 0.05 * mom_sds ### Target moments will be the overall means in the sample #mom_targets = colMeans(lalonde[, covs]) ### Solver options #t_max = 60*30 #solver = "gurobi" #approximate = 0 #solver = list(name = solver, t_max = t_max, approximate = approximate, round_cplex = 0, trace = 0) #mom = list(covs = mom_covs, tols = mom_tols, targets = mom_targets) #pmatch_out = profmatch(t_ind, mom, solver) ### Selecting the matched units #lalonde.matched = lalonde[pmatch_out$id,] ### Comparing TASMDs before and after matching #TASMD.0.2 = abs(colMeans(lalonde.matched[which(lalonde.matched$treatment == 0), covs]) # - mom_targets) / mom_sds #TASMD.1.2 = abs(colMeans(lalonde.matched[which(lalonde.matched$treatment == 1), covs]) # - mom_targets) / mom_sds #TASMD.0.1 = abs(colMeans(lalonde[which(lalonde$treatment == 0), covs]) - mom_targets) / mom_sds #TASMD.1.1 = abs(colMeans(lalonde[which(lalonde$treatment == 1), covs]) - mom_targets) / mom_sds ### For each treatment group, ASAMDs are reduced after matching (i.e., balance is achieved) #cbind(TASMD.0.1, TASMD.0.2) #cbind(TASMD.1.1, TASMD.1.2)
### Load, sort, and attach data #data(lalonde) #lalonde = lalonde[order(lalonde$treatment, decreasing = TRUE), ] #attach(lalonde) ### Specify covariates #covs = c("age", "education", "black", "hispanic", "married", "nodegree", "re74", "re75") ### Vector of treatment group indicators #t_ind = lalonde$treatment ### Covariate matrix #mom_covs = as.matrix(lalonde[, covs]) ### Tolerances will be 0.05 * each covariate's standard deviation #mom_sds = apply(lalonde[, covs], 2, sd) #mom_tols = 0.05 * mom_sds ### Target moments will be the overall means in the sample #mom_targets = colMeans(lalonde[, covs]) ### Solver options #t_max = 60*30 #solver = "gurobi" #approximate = 0 #solver = list(name = solver, t_max = t_max, approximate = approximate, round_cplex = 0, trace = 0) #mom = list(covs = mom_covs, tols = mom_tols, targets = mom_targets) #pmatch_out = profmatch(t_ind, mom, solver) ### Selecting the matched units #lalonde.matched = lalonde[pmatch_out$id,] ### Comparing TASMDs before and after matching #TASMD.0.2 = abs(colMeans(lalonde.matched[which(lalonde.matched$treatment == 0), covs]) # - mom_targets) / mom_sds #TASMD.1.2 = abs(colMeans(lalonde.matched[which(lalonde.matched$treatment == 1), covs]) # - mom_targets) / mom_sds #TASMD.0.1 = abs(colMeans(lalonde[which(lalonde$treatment == 0), covs]) - mom_targets) / mom_sds #TASMD.1.1 = abs(colMeans(lalonde[which(lalonde$treatment == 1), covs]) - mom_targets) / mom_sds ### For each treatment group, ASAMDs are reduced after matching (i.e., balance is achieved) #cbind(TASMD.0.1, TASMD.0.2) #cbind(TASMD.1.1, TASMD.1.2)