Title: | Combining Matching and Linear Regression for Causal Inference |
---|---|
Description: | Core functions as well as diagnostic and calibration tools for combining matching and linear regression for causal inference in observational studies. |
Authors: | Alireza S. Mahani, Mansour T.A. Sharabiani |
Maintainer: | Alireza S. Mahani <alireza.s.mahani@gmail.com> |
License: | GPL (>= 2) |
Version: | 0.8.1 |
Built: | 2024-02-15 08:08:01 UTC |
Source: | CRAN |
One of the datasets used by Dehejia and Wahba in their paper "Causal Effects in Non-Experimental Studies: Reevaluating the Evaluation of Training Programs." Also used as an example dataset in the MatchIt package.
data("lalonde")
A data frame with 614 observations on the following 10 variables.
treat
treatment indicator; 1 if treated in the National Supported Work Demonstration, 0 if from the Current Population Survey
age
age, a numeric vector.
educ
years of education, a numeric vector between 0 and 18.
black
a binary vector, 1 if black
, 0 otherwise.
hispan
a binary vector, 1 if hispanic
, 0 otherwise.
married
a binary vector, 1 if married
, 0 otherwise.
nodegree
a binary vector, 1 if no degree, 0 otherwise.
re74
earnings in 1974, a numeric vector.
re75
earnings in 1975, a numeric vector.
re78
earnings in 1978, a numeric vector (outcome variable).
This data set has been taken from twang
package, with small changes to field descriptions.
http://www.columbia.edu/~rd247/nswdata.html http://cran.r-project.org/src/contrib/Descriptions/MatchIt.html
Lalonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review 76: 604-620.
Dehejia, R.H. and Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Re-Evaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94: 1053-1062.
These data are adapted from the lindner dataset in the USPS package. The description comes from that package, except for the variable sixMonthSurvive, which is a recode of lifepres
Data from an observational study of 996 patients receiving an initial Percutaneous Coronary Intervention (PCI) at Ohio Heart Health, Christ Hospital, Cincinnati in 1997 and followed for at least 6 months by the staff of the Lindner Center. The patients thought to be more severely diseased were assigned to treatment with abciximab (an expensive, high-molecular-weight IIb/IIIa cascade blocker); in fact, only 298 (29.9 percent) of patients received usual-care-alone with their initial PCI.
data("lindner")
A data frame of 10 variables collected on 996 patients; no NAs.
lifepres
Mean life years preserved due to survival for at least 6 months following PCI; numeric value of either 11.4 or 0.
cardbill
Cardiac related costs incurred within 6 months of patient's initial PCI; numeric value in 1998 dollars; costs were truncated by death for the 26 patients with lifepres == 0.
abcix
Numeric treatment selection indicator; 0 implies usual PCI care alone; 1 implies usual PCI care deliberately augmented by either planned or rescue treatment with abciximab.
stent
Coronary stent deployment; numeric, with 1 meaning YES and 0 meaning NO.
height
Height in centimeters; numeric integer from 108 to 196.
female
Female gender; numeric, with 1 meaning YES and 0 meaning NO.
diabetic
Diabetes mellitus diagnosis; numeric, with 1 meaning YES and 0 meaning NO.
acutemi
Acute myocardial infarction within the previous 7 days; numeric, with 1 meaning YES and 0 meaning NO.
ejecfrac
Left ejection fraction; numeric value from 0 percent to 90 percent.
ves1proc
Number of vessels involved in the patient's initial PCI procedure; numeric integer from 0 to 5.
sixMonthSurvive
Survival at six months - a recoded version of lifepres.
This data set and documentation is taken from twang
package.
Kereiakes DJ, Obenchain RL, Barber BL, et al. Abciximab provides cost effective survival advantage in high volume interventional practice. Am Heart J 2000; 140: 603-610.
Creating a series of matched data sets with different calibration parameters. The output of this function can be supplied to summary.mlr
and then plot.summary.mlr
methods to generate diagnostic and calibration plots.
mlr(tr, Z.i = NULL, Z.o = mlr.generate.Z.o(Z.i), psm = TRUE
, caliper.vec = c(0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 5, Inf)
, ...)
tr |
Binary treatment indicator vector (1=treatment, 0=control), whose coefficient in the linear regression model is TE. |
Z.i |
Matrix of adjustment covariates included in linear regression. We must have |
Z.o |
Matrix of adjustment covariates (present in generative model but) omitted from regression estimation. We must have |
psm |
Boolean flag, indicating whether propensity score matching should be used ( |
caliper.vec |
Vector of matching calipers used. |
... |
Other parameters passed to |
A list with the following fields:
tr |
Same as input. |
Z.i |
Same as input. |
Z.o |
Same as input. |
idx.list |
List of observation indexes for each matched data set. |
caliper.vec |
Same as input. |
Alireza S. Mahani, Mansour T.A. Sharabiani
Link to a draft paper, documenting the supporting mathematical framework, will be provided in the next release.
Calculating treatment effect bias due to misspecified regression, using coefficients of omitted covariates (if supplied) or a constrained bias estimation approach.
mlr.bias(tr, Z.i = NULL, Z.o, gamma.o = NULL
, idx = 1:length(tr))
tr |
Binary treatment indicator vector (1=treatment, 0=control), whose coefficient in the linear regression model is TE. |
Z.i |
Matrix of adjustment covariates included in linear regression. We must have |
Z.o |
Matrix of adjustment covariates (present in generative model but) omitted from regression estimation. We must have |
gamma.o |
Vector of coefficients for omitted adjustment covariates. |
idx |
Index of observations to be used, with possible duplication, e.g. as indexes of matched subset. |
For single
, subspace
and absolute
, biases are calculated using the constrained bias estimation framework, i.e. L2 norm of Z.o%*%gamma.o
is taken to be length(tr)
(mean squared of 1).
A list with the following elements is returned:
gamma.o |
If function argument |
single |
A list with elements: 1) |
subspace |
A list with elements: 1) |
absolute |
A list with elements: 1) |
Alireza S. Mahani, Mansour T.A. Sharabiani
Link to a draft paper, documenting the supporting mathematical framework, will be provided in the next release.
# number of included adjustment covariates
K <- 10
# number of observations in treatment group
Nt <- 100
# number of observations in control group
Nc <- 100
N <- Nt + Nc
# number of omitted covariates
Ko <- 3
# treatment indicator variable
tr <- c(rep(1, Nt), rep(0, Nc))
# matrix of included (adjustment) covariates
Z.i <- matrix(runif(K*N), ncol = K)
# matrix of omitted covariates
Z.o <- matrix(runif(Ko*N), ncol = Ko)
# coefficients of omitted covariates
gamma.o <- runif(Ko)
retobj <- mlr.bias(tr = tr, Z.i = Z.i, Z.o = Z.o, gamma.o = gamma.o)
# 1) using actual coefficients for computing bias
ret <- retobj$gamma.o
# comparing with brute-force approach
X.i <- cbind(tr, 1, Z.i)
ret2 <- (solve(t(X.i) %*% X.i, t(X.i) %*% Z.o %*% gamma.o))[1]
cat("check 1:", all.equal(ret2, ret), "\n")
# comparing with single method
Z.o.proj <- mlr.orthogonalize(X = cbind(1, Z.i), Z = Z.o, normalize = TRUE)
ret3 <- (solve(t(X.i) %*% X.i, t(X.i) %*% Z.o.proj))[1, ]
cat("check 2:", all.equal(ret3, retobj$single$bias.vec), "\n")
ret4 <- (solve(t(X.i) %*% X.i, t(X.i) %*% retobj$subspace$dir))[1, ]
cat("check 3:", all.equal(as.numeric(ret4), as.numeric(retobj$subspace$bias)), "\n")
ret4 <- (solve(t(X.i) %*% X.i, t(X.i) %*% retobj$absolute$dir))[1, ]
cat("check 4:", all.equal(as.numeric(ret4), as.numeric(retobj$absolute$bias)), "\n")
Generaring the vector that, multiplied by Z.o%*%gamma.o
(contribution of omitted covariates to outcome), produces the treatment effect bias - due to model misspecification in the form of covariate omission - when using linear regression for causal inference.
mlr.bias.constructor(tr, Z.i = NULL, details = FALSE, idx = 1:length(tr))
tr |
Binary treatment indicator vector (1=treatment, 0=control), whose coefficient in the linear regression model is TE. |
Z.i |
Matrix of adjustment covariates included in linear regression. We must have |
details |
Boolean flag, indicating whether intermediate objects used in generating the constrcutor vector must be returned or not. This only works if at least one adjustment covariate is included in the regression ( |
idx |
Index of observations to be used, with possible duplication, e.g. as indexes of matched subset. |
A vector of same length as tr
is returned. If details = TRUE
and Z.i
is not NULL
, then the following objects are attached as attributes:
p |
Vector of length |
q |
Vector of length |
u.i |
Vector of length |
A |
Weighted, within-group covariance matrix of included covariates. It is a square matrix of dimension |
iA |
Inverse of |
Alireza S. Mahani, Mansour T.A. Sharabiani
Link to a draft paper, documenting the supporting mathematical framework, will be provided in the next release.
# number of included adjustment covariates
K <- 10
# number of observations in treatment group
Nt <- 100
# number of observations in control group
Nc <- 100
N <- Nt + Nc
# treatment indicator variable
tr <- c(rep(1, Nt), rep(0, Nc))
# matrix of included (adjustment) covariates
Z.i <- matrix(runif(K*N), ncol = K)
ret <- mlr.bias.constructor(tr = tr, Z.i = Z.i)
# comparing with brute-force approach
X.i <- cbind(tr, 1, Z.i)
ret2 <- (solve(t(X.i) %*% X.i, t(X.i)))[1, ]
cat("check 1:", all.equal(ret2, ret), "\n")
# sampling with replacement
idx <- sample(1:N, size = round(0.75*N), replace = TRUE)
ret3 <- mlr.bias.constructor(tr = tr, Z.i = Z.i, idx = idx)
ret4 <- (solve(t(X.i[idx, ]) %*% X.i[idx, ], t(X.i[idx, ])))[1, ]
cat("check 2:", all.equal(ret3, ret4), "\n")
Combining normalized bias and variance over a range of values for omitted R-squared to produce normalized MSE.
mlr.combine.bias.variance(tr, bvmat, orsq.min = 0.001, orsq.max = 1, n.orsq = 100)
tr |
Binary treatment indicator vector (1=treatment, 0=control), whose coefficient in the linear regression model is TE. |
bvmat |
Matrix of bias and variances. First column must be bias, and second column must be variance. Each row corresponds to a different ‘calibration index’ or scenario, which we want to compare and find the best among them. |
orsq.min |
Minimum omitted R-squared used for combining bias and variance. |
orsq.max |
Maximum omitted R-squared. |
n.orsq |
Number of values for omitted R-squared generated in the vector. |
A list with the following elements:
orsq.vec |
Vector of omitted R-squared values used for combining bias and variance. |
errmat |
Matrix of MSE, with each row corresponding to an omitted R-squared value, and each column for a value of calibration index, i.e. one row if |
biassq.mat |
Matrix of squared biases, with a structure similar to |
which.min.vec |
Value of calibration index (row number for |
Alireza S. Mahani, Mansour T.A. Sharabiani
Link to a draft paper, documenting the supporting mathematical framework, will be provided in the next release.
Utility function for generating interaction terms and step functions from a set of base covariates, to be used as candidate omitted covariates.
mlr.generate.Z.o(X, interaction.order = 3, step.funcs = TRUE
, step.thresh = 20, step.ncuts = 3)
X |
Matrix of base covariates. |
interaction.order |
Order of interactions to generate. It must be at least 2. |
step.funcs |
Boolean flag, indicating whether (binary) step functions must be generated from continuous variables. |
step.thresh |
Minimum number of distinct values in a numeric vector to generate step functions from. |
step.ncuts |
How many cuts to apply for generating step functions. |
TBD
Alireza S. Mahani, Mansour T.A. Sharabiani
Link to a draft paper, documenting the supporting mathematical framework, will be provided in the next release.
Match
function from Matching
package
Performs propensity score or Mahalanobis matching and return indexes of treatment and control groups.
mlr.match(tr, X, psm = TRUE, replace = F, caliper = Inf
, verbose = TRUE)
tr |
Binary treatment indicator vector (1=treatment, 0=control), whose coefficient in the linear regression model is TE. |
X |
Covariates used in matching, either directly (Mahalanobis matching) or indirectly (propensity score). |
psm |
Boolean flag, indicating whether propensity score matching should be used ( |
replace |
Boolean flag, indicating whether matching must be done with or without replacement. |
caliper |
Size of caliper (standardized distance of two observations) used in matching. Treatment and control observations with standardized distance larger than |
verbose |
Boolean flag, indicating whether size of treatment and control groups before and after matching will be printed. |
For propensity score matching, linear predictors from logistic regression are used (rather than predicted probabilities).
A vector of matched indexes, containing both treatment and control groups. Also, the following attributes are attached: 1) nt
: size of treatment group, 2) nc
: size of control group, 3) psm.reg
: logistic regression object used in generating propensity scores (NA
if psm
is FALSE
), 4) match.obj
: matching object returned by Match
function.
Alireza S. Mahani, Mansour T.A. Sharabiani
data(lalonde)
tr <- lalonde$treat
Z.i <- as.matrix(lalonde[, c("age", "educ", "black"
, "hispan", "married", "nodegree", "re74", "re75")])
Z.o <- model.matrix(~ I(age^2) + I(educ^2) + I(re74^2) + I(re75^2) - 1, lalonde)
# propensity score matching on all covariates
idx <- mlr.match(tr = tr, X = cbind(Z.i, Z.o), caliper = 1.0, replace = FALSE)
# improvement in maximum single-covariate bias due to matching
bias.obj.before <- mlr.bias(tr = tr, Z.i = Z.i, Z.o = Z.o)
bias.before <- bias.obj.before$subspace$bias
dir <- bias.obj.before$subspace$dir
bias.after <- as.numeric(mlr.bias(tr = tr[idx]
, Z.i = Z.i[idx, ], Z.o = dir[idx], gamma.o = 1.0)$single$bias)
# percentage bias-squared rediction
cat("normalized bias - before:", bias.before, "\n")
cat("normalized bias - after:", bias.after, "\n")
cat("percentage squared-bias reduction:"
, (bias.before^2 - bias.after^2)/bias.before^2, "\n")
# matching with replacement
idx.wr <- mlr.match(tr = tr, X = cbind(Z.i, Z.o), caliper = 1.0
, replace = TRUE)
bias.after.wr <- as.numeric(mlr.bias(tr = tr
, Z.i = Z.i, Z.o = dir, gamma.o = 1.0, idx = idx.wr)$single$bias)
cat("normalized bias - after (with replacement):", bias.after.wr, "\n")
Decomposing a collection of vectors into parallel and orthogonal components with respect to the subspace spanned by columns of a reference matrix.
mlr.orthogonalize(X, Z, normalize = FALSE, tolerance = .Machine$double.eps^0.5)
X |
Matrix whose columns form the subspace, with respect to which we want to orthogonalize columns of |
Z |
Matrix whose columns we want to orthogonalize with respect to the subpsace spanned by columns of |
normalize |
Boolean flag, indicating whether the orthogonal component of |
tolerance |
If unnormalized projection of a column of |
Current implementation uses Singular Value Decomposition (svd
) of X
to form an orthonormal basis from columns of X
to facilitate the projection process.
A matrix of same dimensions as Z
is returned, with each column containing the orthogonal component of the corresponding column of Z
. Parallel components are attached as parallel
attribute.
Alireza S. Mahani, Mansour T.A. Sharabiani
Link to a draft paper, documenting the supporting mathematical framework, will be provided in the next release.
K <- 10
N <- 100
Ko <- 5
X <- matrix(runif(N*K), ncol = K)
Z <- matrix(runif(N*Ko), ncol = Ko)
ret <- mlr.orthogonalize(X = X, Z = Z, normalize = FALSE)
orthogonal <- ret
parallel <- attr(ret, "parallel")
Z.rec <- parallel + orthogonal
# check that parallel and orthogonal components add up to Z
cat("check 1:", all.equal(as.numeric(Z.rec), as.numeric(Z)), "\n")
# check that inner product of orthogonal columns and X columns are zero
cat("check 2:", all.equal(t(orthogonal) %*% X, matrix(0, nrow = Ko, ncol = K)), "\n")
Monte Carlo based calculation of study power for treatment effect estimation using linear regression on treatment indicator and adjustment covariates.
mlr.power(tr, Z.i = NULL, d, sig.level = 0.05, niter = 1000
, verbose = FALSE, idx = 1:length(tr), rnd = FALSE)
tr |
Binary treatment indicator vector (1=treatment, 0=control), whose coefficient in the linear regression model is TE. |
Z.i |
Matrix of adjustment covariates included in linear regression. We must have |
d |
Standardized effect size, equal to treatment effect divided by standard deviation of generative noise. |
sig.level |
Significance level for rejecting null hypothesis. |
niter |
Number of Monte Carlo simulations used for calculating power. |
verbose |
If |
idx |
Subset of observations to use for power calculation. |
rnd |
Boolean flag. If |
In each Monte Carlo iteration, response variable is generated from a normal distribution whose mean is equal to d * tr
(other coefficients are assumed to be zero since their value does not affect power calculation), and whose standard deviation is 1.0
. Then OLS-based regression is performed on data, and p-value for treatment effect is compared to sig.level
, based on which null hypothesis (no effect) is rejected or accepted. The fraction of iterations where null hypothesis is rejected is taken to be power. Standard error is calculated using a binomial-distribution assumption.
A numeric vector is returned. If rnd
is FALSE
, meand and standard error of calculated power is returned. If rnd
is TRUE
, mean and standard error of power calculated for random subsampling of observations is returned as well.
Alireza S. Mahani, Mansour T.A. Sharabiani
Calculate standardized mean difference for each column of a matrix, given a binary treatment indicator vector.
mlr.smd(tr, X)
tr |
Binary treatment indicator vector; 1 means treatment, 0 means control. |
X |
Matrix of covariates; each column is a covariate whose standardized mean difference we want to calculate. |
A vector of length ncol(X)
, containing standardized mean differences for each column of X
, given treatment variable tr
.
Alireza S. Mahani, Mansour T.A. Sharabiani
Calculating treatment effect variance, resulting from linear regression.
mlr.variance(tr, Z.i = NULL, sigsq = 1, details = FALSE
, idx =1:length(tr))
tr |
Binary treatment indicator vector (1=treatment, 0=control), whose coefficient in the linear regression model is TE. |
Z.i |
Matrix of adjustment covariates included in linear regression. We must have |
sigsq |
Variance of data generation noise. |
details |
Boolean flag, indicating whether intermediate objects used in generating the constrcutor vector must be returned or not (only when no repeated observations). |
idx |
Index of observations to be used, with possible duplication, e.g. as indexes of matched subset. |
A scalar value is returned for TE variance. If details = TRUE
and Z.i
is not NULL
, then the following objects are attached as attributes:
u.i |
Vector of length |
A |
Weighted, within-group covariance matrix of included covariates. It is a square matrix of dimension |
iA |
Inverse of |
Alireza S. Mahani, Mansour T.A. Sharabiani
Link to a draft paper, documenting the supporting mathematical framework, will be provided in the next release.
data(lalonde)
tr <- lalonde$treat
Z.i <- as.matrix(lalonde[, c("age", "educ", "black"
, "hispan", "married", "nodegree", "re74", "re75")])
ret <- mlr.variance(tr = tr, Z.i = Z.i)
# comparing with brute-force approach
X.i <- cbind(tr, 1, Z.i)
ret2 <- (solve(t(X.i) %*% X.i))[1, 1]
cat("check 1:", all.equal(ret2, ret), "\n")
# matching with/without replacement
idx <- mlr.match(tr = tr, X = Z.i, caliper = 1.0
, replace = FALSE)
idx.wr <- mlr.match(tr = tr, X = Z.i, caliper = 1.0
, replace = TRUE)
ret3 <- mlr.variance(tr = tr, Z.i = Z.i, idx = idx)
cat("variance - matching without replacement:"
, ret3, "\n")
ret4 <- mlr.variance(tr = tr, Z.i = Z.i, idx = idx.wr)
cat("variance - matching with replacement:"
, ret4, "\n")
summary.mlr
Diagnostic and calibration plots, inlcuding relative squared bias reduction, constrained bias estimation, bias-variance trade-off, and power analysis.
## S3 method for class 'summary.mlr'
plot(x, which = 1
, smd.index = 1:min(10, ncol(x$smd))
, bias.index = 1:min(10, ncol(x$bias.terms))
, orsq.plot = c(0.01, 0.05, 0.25)
, caption.vec = c("relative squared bias reduction", "normalized bias"
, "standardized mean difference", "maximum bias"
, "error components", "optimum choice", "power analysis")
, ...)
x |
An object of class |
which |
Selection of which plots to generate: |
smd.index |
Index of columns in |
bias.index |
Index of columns in |
orsq.plot |
Which values for omitted R-squared to generate plots for. |
caption.vec |
Character vector to be used as caption for plots. Values will be repeated if necessary if length is shorter than number of plots requested. |
... |
Parameters to be passed to/from other functions. |
Currently, 7 types of plots can be generated, as specified by the which
flag: 1) relative squared bias reduction, by candidate omitted term, comparing before and after matching, 2) normalized squared bias, by candidate omitted term, vs. calibration index, 3) standardized mean difference, for all included and (candidate) omitted terms, vs. calibration index, 4) aggregate bias (single-covariate maximum, covariate-subspace maximum, and absolute maximum) vs. calibration index, 5) bias/variance/MSE vs. calibration index, at user-supplied values for omitted R-squared, 6) optimal index vs. omitted R-squared, and 7) study power vs. calibration index.
Alireza S. Mahani, Mansour T.A. Sharabiani
Link to a draft paper, documenting the supporting mathematical framework, will be provided in the next release.
Applying a series of diagnostic and calibration functions to a series of matched data sets to determine impact of matching on TE bias, variance and total error, and to select the best matching parameters.
## S3 method for class 'mlr'
summary(object, power = FALSE
, power.control = list(rnd = TRUE, d = 0.5, sig.level = 0.05
, niter = 1000, rnd = TRUE)
, max.method = c("single-covariate", "covariate-subspace"
, "absolute")
, verbose = FALSE, ...
, orsq.min = 1e-03, orsq.max = 1e0, n.orsq = 100)
object |
An object of class |
power |
Boolean flag indicating whether Monte-Carlo based power analysis must be performed or not. |
power.control |
A list containing parameters to be passed to |
max.method |
Which constrained bias estimation method must be used in bias-variance trade-off and other analyses? |
verbose |
Whether progress message must be printed. |
... |
Parameters to be passed to/from other functions. |
orsq.min |
Minimum value of omitted R-squared used for combining normalized bias and variance. |
orsq.max |
Maximum value of omitted R-squared used for combining normalized bias and variance. |
n.orsq |
Number of values for omitted R-squared to generate in the specified range. |
An object of class summary.mlr
, with the following elements:
mlr.obj |
Same as input. |
bias |
Matrix of aggregate bias values, one row per calibration index, and three columns: 1) single-covariate maximum, 2) covariate-subspace maximum, and 3) absolute maximum, in that order. |
bias.terms |
Matrix of biases, one row per calibration index, and one column per candidate omitted term. |
variance |
Vector of normalized variances, one per each value of calibration index. |
power |
Matrix of power calculations, one row per calibration index. Each row is identical to output of |
smd |
Matrix of standardized mean differences, one row per calibration index, and one column for each included or omitted covariates. |
combine.obj |
Output of |
Alireza S. Mahani, Mansour T.A. Sharabiani
Link to a draft paper, documenting the supporting mathematical framework, will be provided in the next release.