Title: | Coarsened Exact Matching |
---|---|
Description: | Implementation of the Coarsened Exact Matching algorithm discussed along with its properties in Iacus, King, Porro (2011) <DOI:10.1198/jasa.2011.tm09599>; Iacus, King, Porro (2012) <DOI:10.1093/pan/mpr013> and Iacus, King, Porro (2019) <DOI:10.1017/pan.2018.29>. |
Authors: | Stefano M. Iacus [aut, cre], Gary King [aut], Giuseppe Porro [aut], Richard Nielsen [aut] |
Maintainer: | Stefano M. Iacus <[email protected]> |
License: | GPL-2 |
Version: | 1.1.31 |
Built: | 2024-11-15 06:44:04 UTC |
Source: | CRAN |
An example of ATT estimation from CEM output
att(obj, formula, data, model="linear", extrapolate=FALSE, ntree=2000) ## S3 method for class 'cem.att' plot(x, obj, data, vars=NULL, plot=TRUE, ecolors, ...) ## S3 method for class 'cem.att' summary(object, ...)
att(obj, formula, data, model="linear", extrapolate=FALSE, ntree=2000) ## S3 method for class 'cem.att' plot(x, obj, data, vars=NULL, plot=TRUE, ecolors, ...) ## S3 method for class 'cem.att' summary(object, ...)
obj |
a |
formula |
a model formula. See Details. |
data |
a single data.frame or a list of data.frame's in case of |
model |
one model. See Details. |
extrapolate |
extrapolate the CEM restriced estimate to the whole data. Default = |
ntree |
number of trees to generate in random forest model. Default = |
x |
the output from the |
vars |
a vector of variable names to be used in the parallel plots. By default all variables involved in data matching are used. |
object |
an object of class |
plot |
if |
ecolors |
a vector of three colors respectively for positive, zero and negative
treatment effect. Default |
... |
passed to the plot function or to |
Argument model
can be lm, linear
for linear regression
model; logit
for the the logistic model;
lme, linear-RE
for the linear model with random effects.
Also rf, forest
for the randomforest algorithm.
If the outcome is y
and the
treatment variable is T
, then a formula
like y ~ T
will produce the simplest estimate the ATT: with lm, it is just the
coefficient on T
, which is the same as the difference in means,
weighted by CEM stratum size. Users can add covariates to span any
remaining imbalance after the match, such as y ~ T + age + sex
,
to adjust for variables age
and sex
.
In the case of multiply imputed datasets, the model is applied to each single matched data and the ATT and is the standard error estimated using the standard formulas for combining results of multiply imputed data.
When extrapolate
= TRUE
, the estimate model is extrapolated
to the whole set of data.
There is a print
method for the output of att
. Specifying the
option TRUE
in a print
command gives complete output from the
estimated model when availalble.
A matrix of estimates with their standard error, or a list in
the case of cem.match.list
. For plot.att
a list of strata
estimated treatment effect and group ("positive", "negative", "zero") and
individual treatment and effect and group. The individual treatment effect
and group is given by the treatment effect of the strata. Similarly for
the group ("positive", "negative", "zero"). Also, colors associated to
estimated treatment effects are returned for easy subsequent plotting.
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
data(LL) # cem match: automatic bin choice mat <- cem(treatment="treated",data=LL, drop="re78", keep.all=TRUE) mat mat$k2k # ATT estimate homo1 <- att(mat, re78~treated, data=LL) rand1 <- att(mat, re78~treated, data=LL, model="linear-RE") rf1 <- att(mat, re78~treated, data=LL, model="rf") homo2 <- att(mat, re78~treated, data=LL, extra=TRUE) rand2 <- att(mat, re78~treated, data=LL, model="linear-RE", extra=TRUE) rf2 <- att(mat, re78~treated, data=LL, model="rf", extra=TRUE) homo1 summary(homo1) rand1 rf1 homo2 rand2 rf2 plot( homo1, mat, LL, vars=c("age","education","re74","re75")) plot( rand1, mat, LL, vars=c("age","education","re74","re75")) plot( rf1, mat, LL, vars=c("age","education","re74","re75")) plot( homo2, mat, LL, vars=c("age","education","re74","re75")) plot( rand2, mat, LL, vars=c("age","education","re74","re75")) plot( rf2, mat, LL, vars=c("age","education","re74","re75")) # reduce the match into k2k using euclidean distance within cem strata mat2 <- k2k(mat, LL, "euclidean", 1) mat2 mat2$k2k # ATT estimate after k2k att(mat2, re78~treated, data=LL) # example with missing data # using multiply imputated data # we use Amelia for multiple imputation if(require(Amelia)){ data(LL) n <- dim(LL)[1] k <- dim(LL)[2] # we generate missing values in 30 percent of the rows of LL data # randomly in one colum per row set.seed(123) LL1 <- LL idx <- sample(1:n, .3*n) for(i in idx){ LL1[i,sample(2:k,1)] <- NA } imputed <- amelia(LL1) imputed <- imputed$imputations[1:5] mat <- cem("treated", datalist=imputed, data=LL1, drop="re78") print(mat) att(mat, re78 ~ treated, data=imputed) }
data(LL) # cem match: automatic bin choice mat <- cem(treatment="treated",data=LL, drop="re78", keep.all=TRUE) mat mat$k2k # ATT estimate homo1 <- att(mat, re78~treated, data=LL) rand1 <- att(mat, re78~treated, data=LL, model="linear-RE") rf1 <- att(mat, re78~treated, data=LL, model="rf") homo2 <- att(mat, re78~treated, data=LL, extra=TRUE) rand2 <- att(mat, re78~treated, data=LL, model="linear-RE", extra=TRUE) rf2 <- att(mat, re78~treated, data=LL, model="rf", extra=TRUE) homo1 summary(homo1) rand1 rf1 homo2 rand2 rf2 plot( homo1, mat, LL, vars=c("age","education","re74","re75")) plot( rand1, mat, LL, vars=c("age","education","re74","re75")) plot( rf1, mat, LL, vars=c("age","education","re74","re75")) plot( homo2, mat, LL, vars=c("age","education","re74","re75")) plot( rand2, mat, LL, vars=c("age","education","re74","re75")) plot( rf2, mat, LL, vars=c("age","education","re74","re75")) # reduce the match into k2k using euclidean distance within cem strata mat2 <- k2k(mat, LL, "euclidean", 1) mat2 mat2$k2k # ATT estimate after k2k att(mat2, re78~treated, data=LL) # example with missing data # using multiply imputated data # we use Amelia for multiple imputation if(require(Amelia)){ data(LL) n <- dim(LL)[1] k <- dim(LL)[2] # we generate missing values in 30 percent of the rows of LL data # randomly in one colum per row set.seed(123) LL1 <- LL idx <- sample(1:n, .3*n) for(i in idx){ LL1[i,sample(2:k,1)] <- NA } imputed <- amelia(LL1) imputed <- imputed$imputations[1:5] mat <- cem("treated", datalist=imputed, data=LL1, drop="re78") print(mat) att(mat, re78 ~ treated, data=imputed) }
Implementation of Coarsened Exact Matching
cem(treatment=NULL, data = NULL, datalist=NULL, cutpoints = NULL, grouping = NULL, drop=NULL, eval.imbalance = FALSE, k2k=FALSE, method=NULL, mpower=2, L1.breaks = NULL, L1.grouping = NULL, verbose = 0, baseline.group="1",keep.all=FALSE)
cem(treatment=NULL, data = NULL, datalist=NULL, cutpoints = NULL, grouping = NULL, drop=NULL, eval.imbalance = FALSE, k2k=FALSE, method=NULL, mpower=2, L1.breaks = NULL, L1.grouping = NULL, verbose = 0, baseline.group="1",keep.all=FALSE)
treatment |
character, name of the treatment variable |
baseline.group |
character, name of the baseline level treatment. See Details. |
data |
a data.frame |
datalist |
a list of optional multiply imputed data.frame's |
cutpoints |
named list each describing the cutpoints for numerical variables (the names are variable names). Each list element is either a vector of cutpoints, a number of cutpoints, or a method for automatic bin contruction. See Details. |
grouping |
named list, each element of which is a list of groupings for a single categorical variable. See Details. |
drop |
a vector of variable names in the data frame to ignore during matching |
eval.imbalance |
Boolean. See Details. |
k2k |
boolean, restrict to k-to-k matching? Default = |
method |
distance method to use in |
mpower |
power of the Minkowski distance. See Details. |
L1.breaks |
list of cutpoints for the calculation of the L1 measure. |
L1.grouping |
as |
verbose |
controls level of verbosity. Default=0. |
keep.all |
if |
For multilevel (and a binary) treatment variables, the cem weights
are calulated with respect to the baseline
. Therefore,
matched units with treatment variable equal to the baseline level receive weight 1, the others the usual cem weights. Unless specified,
by default baseline
is set
to "1"
. If this level is not one of the possible values taken by
the treatment
variable, then the baseline is set to the first level of the treatment
variable.
When specifying cutpoints, several automatic methods may be chosen, including
“sturges
” (Sturges' rule, the default),
“fd
” (Freedman-Diaconis' rule), “scott
”
(Scott's rule) and “ss” (Shimazaki-Shinomoto's rule).
See references for a description of each rule.
The grouping
option is a list where each element is itself a
list. For example, suppose for variable quest1
you have the
following possible levels "no answer", NA, "negative", "neutral",
"positive"
and you want to collect ("no answer", NA, "neutral")
into a single group, then the grouping
argument should contain
list(quest1=list(c("no answer", NA, "neutral")))
. Or if you have
a discrete variable elements
with values 1:10
and you want
to collect it into groups “1:3,NA
”, “4
”,
“5:9
”, “10
” you specify in grouping
the
following list list(elements=list(c(1:3,NA), 5:9))
. Values not
defined in the grouping
are left as they are. If cutpoints
and groupings
are defined for the same variable, the
groupings
take precedence and the corresponding cutpoints are set
to NULL
.
verbose
: a number greater or equal to 0. The higher, the
more info are provided during the execution of the algorithm.
If eval.imbalance
= TRUE
,
cem$imbalance
contains the imbalance measure by absolute
difference in means for numerical variables and chi-square distance for
categorical variables. If FALSE
(the default) then cem$imbalance
is set
to NULL
. If data contains missing data, the imbalance measures
are not calculated.
If L1.breaks
is missing, the default rule to calculate cutpoints
is the Scott's rule.
If k2k
is set to TRUE
, the algorithm return strata with
the same number of treated and control units per stratum, otherwise all
the matched units are returned (default). When k2k
= TRUE
,
the user can choose a method
(between 'euclidean
',
'maximum
', 'manhattan
', 'canberra
', 'binary
'
and 'minkowski
') for nearest neighbor matching inside each
cem
strata. By default method
is set to 'NULL
',
which means random matching inside cem
strata. For the Minkowski
distance the power can be specified via the argument mpower
'.
For more information on method != NULL
, refer to
dist
help page.
If k2k
is set to TRUE
also keep.all
is set to TRUE
.
By default, cem
treats missing values as distinct categories and
matches observations with missing values in the same variable in the
same stratum provided that all the remaining (corasened) covariates
match.
If argument data
is non-NULL
and datalist
is
NULL
, CEM is applied to the single data set in data
.
Argument datalist
is a list of (multiply imputed) data frames
(i.e., with missing cell values imputed). If data
is
NULL
, the function cem
is applied independently to each
element of the list, resulting in separately matched data sets with
different numbers of treated and control units.
When data
and datalist
are both non-NULL
, each
multiply imputed observation is assigned to the stratum in which it has
been matched most frequently. In this case, the algorithm outputs the
same matching solution for each multiply imputed data set (i.e., an
observation, and the number of treated and control units matched, in one
data set has the same meaning in all, and is the same for all)
Returns an object of class cem.match
if only data
is not
NULL
or an object of class cem.match.list
, which is a list of
objects of class cem.match
plus a field called unique
which
is true only if data
and datalist
are not both NULL
.
A cem.match
object is a list
with the following slots:
call |
the call |
strata |
vector of stratum number in which each observation belongs, NA if the observation has not been matched |
n.strata |
number of strata generated |
vars |
report variables names used for the match |
drop |
variables removed from the match |
X |
the coarsened dataset or NULL if |
breaks |
named list of cutpoints, eventually NULL |
treatment |
name of the treatment variable |
groups |
factor, each observation belong to one group generated by the treatment variable |
n.groups |
number of groups identified by the treatment variable |
group.idx |
named list, index of observations belonging to each group |
group.len |
sizes of groups |
tab |
summary table of matched by group |
imbalance |
NULL or a vector of imbalances. See Details. |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
Shimazaki, Shinomoto (2007) doi:10.1162/neco.2007.19.6.1503
data(LL) todrop <- c("treated","re78") imbalance(LL$treated, LL, drop=todrop) # cem match: automatic bin choice mat <- cem(treatment="treated", data=LL, drop="re78") mat # cem match: user choiced coarsening re74cut <- hist(LL$re74, br=seq(0,max(LL$re74)+1000, by=1000),plot=FALSE)$breaks re75cut <- hist(LL$re75, br=seq(0,max(LL$re75)+1000, by=1000),plot=FALSE)$breaks agecut <- hist(LL$age, br=seq(15,55, length=14),plot=FALSE)$breaks mycp <- list(re75=re75cut, re74=re74cut, age=agecut) mat <- cem(treatment="treated",data=LL, drop="re78",cutpoints=mycp) mat # cem match: user choiced coarsening, k-to-k matching mat <- cem(treatment="treated",data=LL, drop="re78",cutpoints=mycp,k2k=TRUE) mat # mahalnobis matching: we use MatchIt if(require(MatchIt)){ mah <- matchit(treated~age+education+re74+re75+black+hispanic+nodegree+married+u74+u75, distance="mahalanobis", data=LL) mah #imbalance imbalance(LL$treated, LL, drop=todrop, weights=mah$weights) } # Multiply Imputed data # making use of Amelia for multiple imputation if(require(Amelia)){ data(LL) n <- dim(LL)[1] k <- dim(LL)[2] set.seed(123) LL1 <- LL idx <- sample(1:n, .3*n) for(i in idx){ LL1[i,sample(2:k,1)] <- NA } imputed <- amelia(LL1,noms=c("black","hispanic","treated","married", "nodegree","u74","u75")) imputed <- imputed$imputations[1:5] # without information on which observation has missing values mat1 <- cem("treated", datalist=imputed, drop="re78") mat1 # ATT estimation out <- att(mat1, re78 ~ treated, data=imputed) # with information about missingness mat2 <- cem("treated", datalist=imputed, drop="re78", data=LL1) mat2 # ATT estimation out <- att(mat2, re78 ~ treated, data=imputed) }
data(LL) todrop <- c("treated","re78") imbalance(LL$treated, LL, drop=todrop) # cem match: automatic bin choice mat <- cem(treatment="treated", data=LL, drop="re78") mat # cem match: user choiced coarsening re74cut <- hist(LL$re74, br=seq(0,max(LL$re74)+1000, by=1000),plot=FALSE)$breaks re75cut <- hist(LL$re75, br=seq(0,max(LL$re75)+1000, by=1000),plot=FALSE)$breaks agecut <- hist(LL$age, br=seq(15,55, length=14),plot=FALSE)$breaks mycp <- list(re75=re75cut, re74=re74cut, age=agecut) mat <- cem(treatment="treated",data=LL, drop="re78",cutpoints=mycp) mat # cem match: user choiced coarsening, k-to-k matching mat <- cem(treatment="treated",data=LL, drop="re78",cutpoints=mycp,k2k=TRUE) mat # mahalnobis matching: we use MatchIt if(require(MatchIt)){ mah <- matchit(treated~age+education+re74+re75+black+hispanic+nodegree+married+u74+u75, distance="mahalanobis", data=LL) mah #imbalance imbalance(LL$treated, LL, drop=todrop, weights=mah$weights) } # Multiply Imputed data # making use of Amelia for multiple imputation if(require(Amelia)){ data(LL) n <- dim(LL)[1] k <- dim(LL)[2] set.seed(123) LL1 <- LL idx <- sample(1:n, .3*n) for(i in idx){ LL1[i,sample(2:k,1)] <- NA } imputed <- amelia(LL1,noms=c("black","hispanic","treated","married", "nodegree","u74","u75")) imputed <- imputed$imputations[1:5] # without information on which observation has missing values mat1 <- cem("treated", datalist=imputed, drop="re78") mat1 # ATT estimation out <- att(mat1, re78 ~ treated, data=imputed) # with information about missingness mat2 <- cem("treated", datalist=imputed, drop="re78", data=LL1) mat2 # ATT estimation out <- att(mat2, re78 ~ treated, data=imputed) }
Exploration tool for CEM
cemspace(treatment=NULL, data = NULL, R=100, grouping = NULL, drop=NULL, L1.breaks = NULL, L1.grouping=NULL, plot = TRUE, fixed = NULL, minimal = 1, maximal = 5, M=250, raw.profile=NULL, keep.weights=FALSE, verbose=1)
cemspace(treatment=NULL, data = NULL, R=100, grouping = NULL, drop=NULL, L1.breaks = NULL, L1.grouping=NULL, plot = TRUE, fixed = NULL, minimal = 1, maximal = 5, M=250, raw.profile=NULL, keep.weights=FALSE, verbose=1)
treatment |
character, name of the treatment variable. |
data |
a data.frame. |
verbose |
integer, controls output. |
R |
number of possible random coarsening for the CEM. |
grouping |
named list, each element of which is a list of
groupings for a single categorical variable. For more details see |
drop |
a vector of variable names in the data frame to ignore during matching |
L1.breaks |
list of cutpoints for the calculation of the L1 measure. |
L1.grouping |
as |
plot |
plot the space of solutions? |
fixed |
vector of variable names which will not be relaxed. |
minimal |
the minimal number of intervals acceptable after relaxation. Should be a nameed list of positive integers or if a number, this is applied to all variables. |
maximal |
the maximal number of intervals acceptable after relaxation. Should be a nameed list of positive integers or if a number, this is applied to all variables. |
M |
number of possible random coarsening for the L1 measure |
raw.profile |
an object of class |
keep.weights |
if |
This is a tool to help the user to explore different cem solutions by choosing
random coarsenings. The algorithm tries R
random choiches
of coarsenings into
intervals between minimal
and maximal
for numerical, integer or
ordered factors. It drops or include dichotomous or boolean variables.
Calling directly plot
on the output of cemspace
has the same
effect of calling directly imbspace.plot
.
If you want to relax a given cem solution, use the function
imbspace
instead.
val |
an invisible object of class |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
data(LL) set.seed(123) tmp <- cemspace("treated", LL, drop="re78", M=50)
data(LL) set.seed(123) tmp <- cemspace("treated", LL, drop="re78", M=50)
Combine two spacegraph objects so that their results can be plotted together. Both spacegraphs must be from the same dataset using the same distance metric.
combine.spacegraphs(x,y)
combine.spacegraphs(x,y)
x |
a spacegraph object to be combined |
y |
a spacegraph object to be combined |
This allows users to combine two spacegraph objects rather than having to re-run the spacegraph command from the start.
Inputs must be created using spacegraph
.
val |
an object of class |
Richard Nielsen
data(LL) set.seed(123) sp1 <- spacegraph("treated", LL, drop="re78", M=5, R=list(cem=5,psm=5, mdm=0)) ## Note that we must use the same L1 measure from the first spacegraph! sp2 <- spacegraph("treated", LL, drop="re78", raw.profile=sp1$raw.profile, R=list(cem=0,psm=0, mdm=5)) sp3 <- combine.spacegraphs(sp1,sp2) plot(sp3)
data(LL) set.seed(123) sp1 <- spacegraph("treated", LL, drop="re78", M=5, R=list(cem=5,psm=5, mdm=0)) ## Note that we must use the same L1 measure from the first spacegraph! sp2 <- spacegraph("treated", LL, drop="re78", raw.profile=sp1$raw.profile, R=list(cem=0,psm=0, mdm=5)) sp3 <- combine.spacegraphs(sp1,sp2) plot(sp3)
A subset of the Lalonde dataset (see cited reference).
data(DW)
data(DW)
A data frame with 445 observations on the following 10 variables.
treated
treated variable indicator
age
age
education
years of education
black
race indicator variable
married
marital status indicator variable
nodegree
indicator variable of not possessing a degree
re74
real earnings in 1974
re75
real earnings in 1975
re78
real earnings in 1978 (post treatment outcome)
hispanic
ethnic indicator variable
u74
unemployment in 1974 indicator variable
u75
unemployment in 1975 indicator variable
see references
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
Dehejia, Wahba (1999) doi:10.1080/01621459.1999.10473858
Calculates several imbalance measures for the original and matched data sets
imbalance(group, data, drop=NULL, breaks = NULL, weights, grouping = NULL)
imbalance(group, data, drop=NULL, breaks = NULL, weights, grouping = NULL)
group |
the group variable |
data |
the data |
drop |
a vector of variable names in the data frame to ignore |
breaks |
a list of vectors of cutpoints used to calculate the L1 measure. See Details. |
weights |
weights |
grouping |
named list, each element of which is a list of groupings for a single categorical variable. See Details. |
This function calculates several imbalance measures.
For numeric variables, the difference in means (under the column
statistic
), the difference in quantiles and the L1 measure is
calculated. For categorical variables the L1 measure and the
Chi-squared distance (under column statistic
) is calculated.
Column type
reports either (diff)
or (Chi2)
to
indicate the type of statistic being calculated.
If breaks
is not specified, the Scott automated bin calculation
is used (which coarsens less than Sturges, which used in
cem
). Please refer to cem
help page. In
this case, breaks are used to calculate the L1 measure.
This function also calculate the global L1 imbalance measure.
If breaks
is missing, the default rule to calculate cutpoints
is the Scott's rule.
The grouping
option is a list where each element is itself a
list. For example, suppose for variable quest1
you have the
following possible levels "no answer", NA, "negative", "neutral",
"positive"
and you want to collect ("no answer", NA, "neutral")
into a single group, then the grouping
argument should contain
list(quest1=list(c("no answer", NA, "neutral")))
. Or if you have
a discrete variable elements
with values 1:10
and you want
to collect it into groups “1:3,NA
”, “4
”,
“5:9
”, “10
” you specify in grouping
the
following list list(elements=list(c(1:3,NA), 5:9))
. Values not
defined in the grouping
are left as they are. If cutpoints
and groupings
are defined for the same variable, the
groupings
take precedence and the corresponding cutpoints are set
to NULL
.
See L1.meas
help page for details.
An object of class imbalance
which is a list with the following
two elements
tab |
Table of imbalance measures |
L1 |
The global L1 measure of imbalance |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
data(LL) todrop <- c("treated","re78") imbalance(LL$treated, LL, drop=todrop) # cem match: automatic bin choice mat <- cem(treatment="treated", data=LL, drop="re78")
data(LL) todrop <- c("treated","re78") imbalance(LL$treated, LL, drop=todrop) # cem match: automatic bin choice mat <- cem(treatment="treated", data=LL, drop="re78")
Diagnostic tools for CEM
imbspace(obj, data, depth = 1, L1.breaks = NULL, plot = TRUE, fixed = NULL, minimal = 1, maximal = 6, M=250, raw.profile=NULL, verbose=1)
imbspace(obj, data, depth = 1, L1.breaks = NULL, plot = TRUE, fixed = NULL, minimal = 1, maximal = 6, M=250, raw.profile=NULL, verbose=1)
obj |
an object of class |
data |
the original data. |
depth |
if 1, relaxes up to dropping one var, if 2 relaxes (up to dropping) two vars, etc. |
L1.breaks |
list of cutpoints for the calculation of the L1 measure. |
plot |
plot the space of solutions? |
fixed |
vector of variable names which will not be relaxed. |
minimal |
the minimal number of intervals acceptable after relaxation. Should be a nameed list of positive integers or if a number, this is applied to all variables. |
maximal |
the maximal number of intervals acceptable after relaxation. Should be a nameed list of positive integers or if a number, this is applied to all variables. |
M |
number of possible random coarsening for the L1 measure |
raw.profile |
and object of class |
verbose |
integer, controls output. |
This is a diagnostic tool to help the user in the search of different choices
of coarsenings. The algorithm tries all possible combination of coarsenings into
intervals between minimal
and maximal
one variable at time, for
pairs, triplets, etc depending on the value of depth
.
Calling directly plot
on the output of imbspace
has the same
effect of calling directly imbspace.plot
.
val |
an invisible object of class |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
Plot of imbalance space diagnostic tool for CEM
imbspace.plot(obj, group="1", data, explore=TRUE, verbose=1)
imbspace.plot(obj, group="1", data, explore=TRUE, verbose=1)
obj |
an object of class |
group |
character string denoting group id. Defaults to |
data |
data for running additional matching solutions. |
explore |
if |
verbose |
integer, controls output. |
For an interactive device a two panels plot is given. On the left panel the user can select a CEM solution and the number of cutpoints used in that matching solution is plotted as a parallel plot on the right plot. On exit (right-click on the left panel), the function returns all the cem solutions highlighted in the last selection of the user.
For non-interactive devices, only the space of the solutions are plotted.
This plot shows the tradeoff in matching as a function of imbalance and sample size.
The imbalance of the raw data is represented as a red plot and the initial CEM solution as a green plot. All solutions below the green dot and left to it are better than the user choice in terms of imbalance and number of units matched.
tab |
an invisible object containing the selection of cem solutions and their coarsenings. |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
require(cem) data(LL) set.seed(123) mat <- cem("treated", LL, drop=c("re78","treated"), cut=list(age=4, edu=4, re74=3, re75=3), keep.all=TRUE) mat imb.raw <- L1.profile(LL$treated, LL[, mat$vars], M=250, plot=FALSE) imbsp <- imbspace(mat, LL,depth=2, raw.profile=imb.raw, maximal=6, minimal=2, fixed=c("hispanic", "black", "married", "nodegree","u74","u75"), plot=FALSE) tmp <- plot(imbsp,data=LL,explore=interactive()) tmp
require(cem) data(LL) set.seed(123) mat <- cem("treated", LL, drop=c("re78","treated"), cut=list(age=4, edu=4, re74=3, re75=3), keep.all=TRUE) mat imb.raw <- L1.profile(LL$treated, LL[, mat$vars], M=250, plot=FALSE) imbsp <- imbspace(mat, LL,depth=2, raw.profile=imb.raw, maximal=6, minimal=2, fixed=c("hispanic", "black", "married", "nodegree","u74","u75"), plot=FALSE) tmp <- plot(imbsp,data=LL,explore=interactive()) tmp
Reduces a CEM output to a k2k matching
k2k(obj, data, method=NULL, mpower=2, verbose=0)
k2k(obj, data, method=NULL, mpower=2, verbose=0)
obj |
an object as output from |
data |
the original data.frame used by |
method |
distance method to use in |
mpower |
power of the Minkowski distance. See Details. |
verbose |
controls level of verbosity. Default=0. |
This function transforms a typical cem
matching solution
to a k
-to-k
match, with k
variable along strata:
i.e., in each stratum generated by cem
, the match is reduce to have
the same number of treated and control units. (This option will delete
some data that matched well, and thus likely increase the variance, but
it means that subsequent analyses do not require weights.)
The user can choose a method
(between 'euclidean
',
'maximum
', 'manhattan
', 'canberra
', 'binary
'
and 'minkowski
') for nearest neighbor matching inside each
cem
strata. By default method
is set to 'NULL
',
which means random matching inside cem
strata. For the Minkowski
distance the power can be specified via the argument mpower
'.
For more information on method != NULL
, refer to
dist
help page.
After k2k
the weights of each matched observation are set to unity. Please notice that option keep.all=TRUE
must be used in cem
calls
otherwise k2k
will not work.
obj |
an object of class |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
data(LL) set.seed(123) # cem match: automatic bin choice mat <- cem(treatment="treated", data=LL, drop="re78", keep.all=TRUE) mat mat$k2k # ATT estimate att(mat, re78 ~ treated, data=LL) # transform the match into k2k mat2 <- k2k(mat, LL, "euclidean", 1) mat2 mat2$k2k # ATT estimate after k2k att(mat2, re78 ~ treated, data=LL)
data(LL) set.seed(123) # cem match: automatic bin choice mat <- cem(treatment="treated", data=LL, drop="re78", keep.all=TRUE) mat mat$k2k # ATT estimate att(mat, re78 ~ treated, data=LL) # transform the match into k2k mat2 <- k2k(mat, LL, "euclidean", 1) mat2 mat2$k2k # ATT estimate after k2k att(mat2, re78 ~ treated, data=LL)
Evaluates L1 distance between multidimensional histograms
L1.meas(group, data, drop=NULL, breaks = NULL, weights, grouping = NULL)
L1.meas(group, data, drop=NULL, breaks = NULL, weights, grouping = NULL)
group |
the group variable |
data |
the data |
drop |
a vector of variable names in the data frame to ignore |
breaks |
a list of vectors of cutpoints; if not specified, automatic choice will be made |
weights |
weights |
grouping |
named list, each element of which is a list of groupings for a single categorical variable. See Details. |
This function calculates the L1 distance on the k-dimensional histogram in order to measure the level of imbalance in a matching solution.
If breaks
is not specified, the Scott automated bin calculation
is used (which coarsens less than Sturges, which used in
cem
). Please refer to cem
help page. In
this case, breaks are used to calculate the L1 measure.
When choosing breaks
for L1, a very fine coarsening (many cut
points) produces values of L1 close to 1. A very mild coarsening (very
fex cutpoints), is not able to discriminate, i.e. L1 close to 0
(particularly true when the number of observations is small with respect
to the number of continuous variables).
The grouping
option is a list where each element is itself a
list. For example, suppose for variable quest1
you have the
following possible levels "no answer", NA, "negative", "neutral",
"positive"
and you want to collect ("no answer", NA, "neutral")
into a single group, then the grouping
argument should contain
list(quest1=list(c("no answer", NA, "neutral")))
. Or if you have
a discrete variable elements
with values 1:10
and you want
to collect it into groups “1:3,NA
”, “4
”,
“5:9
”, “10
” you specify in grouping
the
following list list(elements=list(c(1:3,NA), 5:9))
. Values not
defined in the grouping
are left as they are. If cutpoints
and groupings
are defined for the same variable, the
groupings
take precedence and the corresponding cutpoints are set
to NULL
.
The L1.profile
function
shows how to compare matching solutions for any level of (i.e., without
regard to) coarsening.
This code also calculate the Local Common Support (LCS) measure, which is the proportion of non empty k-dimensional cells of the histogram which contain at least one observation per group.
An object of class L1.meas
which is a list with the following fields
L1 |
The numerical value of the L1 measure |
breaks |
A list of cutpoints used to calculate the L1 measure |
LCS |
The numerical value of the Local Common Support proportion |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
data(LL) set.seed(123) L1.meas(LL$treated,LL, drop=c("treated","re78"))
data(LL) set.seed(123) L1.meas(LL$treated,LL, drop=c("treated","re78"))
Calculates L1 distance for different coarsenings
L1.profile(group, data, drop = NULL, min.cut = 2, max.cut = 12, weights, plot = TRUE, add = FALSE, col = "red", lty = 1, M=100, useCP=NULL, grouping=NULL, progress=TRUE, verbose=1)
L1.profile(group, data, drop = NULL, min.cut = 2, max.cut = 12, weights, plot = TRUE, add = FALSE, col = "red", lty = 1, M=100, useCP=NULL, grouping=NULL, progress=TRUE, verbose=1)
group |
the group variable |
data |
the data |
drop |
a vector of variable names in the data frame to ignore |
min.cut |
minimum number of cut points per variable |
max.cut |
maximum number of cut points per variable |
weights |
weights |
useCP |
a list which elements is a list of cutpoints, usually passed from
a previous instance of |
M |
number of random coarsenings |
plot |
plot a graph? |
add |
add graph to an existing plot? Makes sense only if |
col |
draw in specified color |
lty |
draw using specified lty |
grouping |
named list, each element of which is a list of groupings for a single categorical variable. See Details. |
progress |
if |
verbose |
integer, controls level of output. |
The L1 measure depends on the coarsening chosen to calculate it, and as such the comparison of different matching solutions may differ depending on this somewhat arbitrary choice. This function computes L1 for a random range of possible coarsenings. The point of this function is that if one matching solution has a lower L1 than another, then it dominates without regard to the choice of coarsening. A graphic display conveys the results succinctly. (The logic is similar to that for ROC curves used for classification algorithms.) (This degree of coarsening should remain fixed for different CEM runs.)
For each variables the function generates a random number of cutpoints
between min.cut
and max.cut
in which to cut the support of each
variable. This procedure is repeated M
times. The out is sorted in
increasing values of L1 just for graphical representation.
Non numeric variables are grouped randomly unless they appear specified in
the grouping
argument.
A plot
method exists for the returned object.
An invisible object of class L1profile
which contains a named list of coarsenings and
values of the L1 measure for each coarsening.
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
set.seed(123) data(LL) for(i in c(4:6,10:12)) LL[[i]] <- factor(LL[[i]]) imb0 <- L1.profile(LL$treated,LL, drop=c("treated","re78")) if(require(MatchIt)){ m2 <- matchit(treated ~ black + hispanic + married + nodegree + u74 + u75 + education + age + re74 + re75, data=LL, distance="logit") m3 <- try(matchit(treated ~ black + hispanic + married + nodegree + u74 + u75 + education + age + re74 + re75, data=LL, distance="mahalanobis"), TRUE) L1.profile(LL$treated,LL, drop=c("treated","re78"), weights=m2$w, add=TRUE, col="green", lty=2, useCP=imb0$CP) if(class(m3)[1]!="try-error"){ L1.profile(LL$treated,LL, drop=c("treated","re78"), weights=m3$w, add=TRUE, col="orange", lty=3, useCP=imb0$CP) } } m1 <- cem("treated", LL, drop="re78") L1.profile(LL$treated,LL, drop=c("treated","re78"), weights=m1$w>0, add=TRUE, col="blue", lty=4, useCP=imb0$CP) legend(5, 0.9, legend=c("raw data", "pscore", "mahalanobis", "cem"), lty=1:4, col=c("red", "green", "orange", "blue"))
set.seed(123) data(LL) for(i in c(4:6,10:12)) LL[[i]] <- factor(LL[[i]]) imb0 <- L1.profile(LL$treated,LL, drop=c("treated","re78")) if(require(MatchIt)){ m2 <- matchit(treated ~ black + hispanic + married + nodegree + u74 + u75 + education + age + re74 + re75, data=LL, distance="logit") m3 <- try(matchit(treated ~ black + hispanic + married + nodegree + u74 + u75 + education + age + re74 + re75, data=LL, distance="mahalanobis"), TRUE) L1.profile(LL$treated,LL, drop=c("treated","re78"), weights=m2$w, add=TRUE, col="green", lty=2, useCP=imb0$CP) if(class(m3)[1]!="try-error"){ L1.profile(LL$treated,LL, drop=c("treated","re78"), weights=m3$w, add=TRUE, col="orange", lty=3, useCP=imb0$CP) } } m1 <- cem("treated", LL, drop="re78") L1.profile(LL$treated,LL, drop=c("treated","re78"), weights=m1$w>0, add=TRUE, col="blue", lty=4, useCP=imb0$CP) legend(5, 0.9, legend=c("raw data", "pscore", "mahalanobis", "cem"), lty=1:4, col=c("red", "green", "orange", "blue"))
This is a modified version of the Lalonde experimental dataset used for explanatory reasons only.
data(LL)
data(LL)
A data frame with 722 observations on the following 11 variables.
treated
treatment variable indicator
age
age
education
years of education
black
race indicator variable
married
marital status indicator variable
nodegree
indicator variable for not possessing a degree
re74
real earnings in 1974
re75
real earnings in 1975
re78
real earnings in 1978 (post-treatment outcome)
hispanic
ethnic indicator variable
u74
unemployment in 1974 indicator variable
u75
unemployment in 1975 indicator variable
q1
answer to survey question n1
This data is a copy of the original Lalonde (1986) data set (see LL
)
with 10% of missing data and an additional variable q1
which is the
fictituous answer to the questionarie on “Agreement on this job training program”.
see references
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
Lalonde (1986) Evaluating the Econometric Evaluations of Training Programs with Experimental Data, The American Economic Review, 76(4), 604-620.
Lalonde experimental dataset (see cited reference).
data(LL)
data(LL)
A data frame with 722 observations on the following 10 variables.
treated
treatment variable indicator
age
age
education
years of education
black
race indicator variable
married
marital status indicator variable
nodegree
indicator variable for not possessing a degree
re74
real earnings in 1974
re75
real earnings in 1975
re78
real earnings in 1978 (post-treatment outcome)
hispanic
ethnic indicator variable
u74
unemployment in 1974 indicator variable
u75
unemployment in 1975 indicator variable
see references
Lalonde (1986) Evaluating the Econometric Evaluations of Training Programs with Experimental Data, The American Economic Review, 76(4), 604-620.
The Lalonde set of treated units versus PSID (Panel Study of Income Dynamics) control individuals
data(LLvsPSID)
data(LLvsPSID)
A data frame with 2787 observations on the following 10 variables.
treated
treated variable indicator
age
age
education
years of education
black
race indicator variable
married
marital status indicator variable
nodegree
indicator variable of not possessing a degree
re74
real earnings in 1974
re75
real earnings in 1975
re78
real earnings in 1978 (post treatment outcome)
hispanic
ethnic indicator variable
u74
unemployment in 1974 indicator variable
u75
unemployment in 1975 indicator variable
These two sets of treated and control units can be hardly matched.
see references
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
Lalonde (1986) Evaluating the Econometric Evaluations of Training Programs with Experimental Data, The American Economic Review, 76(4), 604-620.
Produces a paired sample out of a CEM match solution
pair(obj, data, method=NULL, mpower=2, verbose=1)
pair(obj, data, method=NULL, mpower=2, verbose=1)
obj |
an object as output from |
data |
the original data.frame used by |
method |
distance method to use in |
mpower |
power of the Minkowski distance. See Details. |
verbose |
controls level of verbosity. Default=0. |
This function returns a vector of paired matched units index.
The user can choose a method
(between 'euclidean
',
'maximum
', 'manhattan
', 'canberra
', 'binary
'
and 'minkowski
') for nearest neighbor matching inside each
cem
strata. By default method
is set to 'NULL
',
which means random matching inside cem
strata. For the Minkowski
distance the power can be specified via the argument mpower
'.
For more information on method != NULL
, refer to
dist
help page.
obj |
a list with the fields |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
data(LL) set.seed(123) # cem match: automatic bin choice mat <- cem(data=LL, drop="re78") # we want a set of paired units psample <- pair(mat, data=LL) table(psample$paired) psample$paired[1:100] table(psample$full.paired) psample$full.paired[1:10] # cem match: automatic bin choice, we drop one row from the data set mat1 <- cem(data=LL[-1,], drop="re78") # we want a set of paired units but we have an odd number of units in the data psample <- pair(mat1, data=LL[-1,]) table(psample$full.paired)
data(LL) set.seed(123) # cem match: automatic bin choice mat <- cem(data=LL, drop="re78") # we want a set of paired units psample <- pair(mat, data=LL) table(psample$paired) psample$paired[1:100] table(psample$full.paired) psample$full.paired[1:10] # cem match: automatic bin choice, we drop one row from the data set mat1 <- cem(data=LL[-1,], drop="re78") # we want a set of paired units but we have an odd number of units in the data psample <- pair(mat1, data=LL[-1,]) table(psample$full.paired)
Heuristic search of the best propensity score model specification
pscoreSelect(formula, data, C.L=2*(pnorm(-1,0,1)), C.Q=0.1,verbose=1)
pscoreSelect(formula, data, C.L=2*(pnorm(-1,0,1)), C.Q=0.1,verbose=1)
data |
the original data. |
formula |
formula type specification. |
C.L |
if at leat one likelihood ration test statistic is greater
than |
C.Q |
if the highest likelihood ratio statistic is greater than |
verbose |
integer, prints more info on screen. |
This is a tool to help the user to the search for the best
propensity score model specification along the lines
suggested by Imbens and Rubin (forthcoming). The output of the function is
a model formula to be passed to glm
or such, in order
to estimate the propensity score model and then perform
propensity score matching.
This tool is useful in combination with imbspace.plot
.
val |
an invisible object of class |
Richard Nielsen
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
Imbens, Rubin (2015) doi:10.1017/CBO9781139025751
set.seed(123) data(LL) mod <- pscoreSelect( treated ~ age + education + black+ married + nodegree + re74 + re75 + hispanic + u74 + u75, data=LL) print(mod)
set.seed(123) data(LL) mod <- pscoreSelect( treated ~ age + education + black+ married + nodegree + re74 + re75 + hispanic + u74 + u75, data=LL) print(mod)
Diagnostic tools for CEM
relax.cem(obj, data, depth=1, verbose = 1, L1.breaks=NULL, plot=TRUE, fixed=NULL, shifts=NULL, minimal=NULL, use.coarsened=TRUE, eval.imbalance=TRUE, use.weights=FALSE, ...) relax.plot(tab, group="1", max.terms=50, perc=.5, unique=FALSE, colors=TRUE)
relax.cem(obj, data, depth=1, verbose = 1, L1.breaks=NULL, plot=TRUE, fixed=NULL, shifts=NULL, minimal=NULL, use.coarsened=TRUE, eval.imbalance=TRUE, use.weights=FALSE, ...) relax.plot(tab, group="1", max.terms=50, perc=.5, unique=FALSE, colors=TRUE)
obj |
an object of class cem. |
data |
the original data. |
verbose |
controls the level of verbosity. |
L1.breaks |
list of cutpoints for the calculation of the L1 measure. |
plot |
plot the solutions? |
tab |
the output table from |
fixed |
vector of variable names which will not be relaxed. |
max.terms |
plot only the last best results of |
shifts |
a vector of proportions of shifts. |
minimal |
the minimal number of intervals acceptable after relaxation. Should be a nameed list of positive integers. |
group |
character string denoting group id. Defaults to |
perc |
only plot if percentage of matched units is greater than |
unique |
only plot different solutions (in terms of matched units). |
depth |
if 1, relaxes up to dropping one var, if 2 relaxes (up to dropping) two vars, etc. |
use.coarsened |
used coarsened values for continuous variables. |
colors |
If |
eval.imbalance |
If |
use.weights |
If |
... |
passed to the |
relax.cem
starts from a cem solution (as given by
cem
) which has to be run with argument keep.all=TRUE
.
relax.cem
tries several relaxed coarsenings on the
variables. Coarsenings corresponds to dividing the support of each
variable into a decreasing number of intervals of the same length (even
if in the starting solution intervals are of different lengths).
Because CEM is MIB, the number of matched units increases as the
number of intervals decrease. All variables are coarsened into
k
intervals along a sequence which starts from the original
number of intervals and decreases to 10 intervals by 2,
then continues from 10 down to 1 intervals by 1. If
minimal
is specified, variables are coarsened down to that
minimal value.
To observe MIB property of CEM use.coarsened
(default) should be
set to TRUE
; otherwise the coarsening of the continuous variable
will be recalculated at each iteration and there is no guarantee of
monotonicity.
relax.cem
outputs a list of tables. Each table is named
Ggroup
where group
is the id of the group. Each
Ggroup
table is ordered in increasing order of matched units of
group group
. Columns PercGgroup
and Ggroup
report
percentage and absolute number of matched units for each group
.
Column Relaxed
indicates which relaxation has been done, with
something like "V1(4), V3(5)"
, which means "variable V1
has been split in 4 intervals of the same length and variable V3
into five intervals". Thus, the number of intervals is reported in
parenthases and if equal to 1 means that the corresponding variable is
excluded from affecting the match (i.e. all observations are assigned to
the same interval).
If shifts
is not null, each coarsening is shifted accordingly
(see shift.cem
for additional details). In case of
shifting “S:
” appears in the labels.
The relax.plot
, plot all the different relaxation in increasing
order of number of treated units matched. For each coarsening it also reports
the value of the L1 measure. The table generated by relax.cem
may
contain many entries. By default, only a portion of best coarsenings
are plotted (option max.terms
). In addition, the user can specify
to plot the corasening for which at least a certain percentage of
treated units have been matched (option perc
, by default 50
In addition, of several different coarsenings which lead to the same number of treated
units matched, the user can specify to plot only one of them using the
option unique
= TRUE
(default).
If L1.breaks
are NULL
they are taken from the cem
object
if available or calculated atumatically as in cem
.
Calling directly plot
on the output of cem.relax
has the same
effect of calling directly relax.plot
.
tab |
an invisible object containing the tabs and the L1breaks used |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
data(LL) set.seed(123) mat <- cem(treatment="treated",data=LL, drop="re78", keep.all=TRUE) mat tab <- relax.cem(mat, LL, depth=1, plot=FALSE) relax.plot(tab, group="1") plot(tab, group="1") relax.plot(tab, group="1", unique=TRUE) relax.plot(tab, group="1", perc=0.6) relax.plot(tab, group="1", perc=0.6,unique=TRUE) tab1 <- relax.cem(mat, LL, depth=1, minimal=list(re74=6, age=3, education=3, re75=5)) tab2 <- relax.cem(mat, LL, depth=1, minimal=list(re74=6, age=3, education=3, re75=5), shifts=0.01) tab3 <- relax.cem(mat, LL, depth=1, minimal=list(age=3, education=3), fixed=c("re74","re75")) tab4 <- relax.cem(mat, LL, depth=2, minimal=list(age=4, education=3,re75=6),plot=FALSE, fixed="re74") relax.plot(tab4) relax.plot(tab4, unique=TRUE) relax.plot(tab4, perc=0.7)
data(LL) set.seed(123) mat <- cem(treatment="treated",data=LL, drop="re78", keep.all=TRUE) mat tab <- relax.cem(mat, LL, depth=1, plot=FALSE) relax.plot(tab, group="1") plot(tab, group="1") relax.plot(tab, group="1", unique=TRUE) relax.plot(tab, group="1", perc=0.6) relax.plot(tab, group="1", perc=0.6,unique=TRUE) tab1 <- relax.cem(mat, LL, depth=1, minimal=list(re74=6, age=3, education=3, re75=5)) tab2 <- relax.cem(mat, LL, depth=1, minimal=list(re74=6, age=3, education=3, re75=5), shifts=0.01) tab3 <- relax.cem(mat, LL, depth=1, minimal=list(age=3, education=3), fixed=c("re74","re75")) tab4 <- relax.cem(mat, LL, depth=2, minimal=list(age=4, education=3,re75=6),plot=FALSE, fixed="re74") relax.plot(tab4) relax.plot(tab4, unique=TRUE) relax.plot(tab4, perc=0.7)
Heuristic search of match solutions
search.match(data, treatment, vars, depth=3, min.vars =1, group=1, useCP,verbose=1, ...)
search.match(data, treatment, vars, depth=3, min.vars =1, group=1, useCP,verbose=1, ...)
data |
the original data. |
treatment |
name of the treatment variable. |
depth |
level of interaction and squares. See Details. |
vars |
vector of variables' names to match on. |
min.vars |
minimum number of variables to consider in the model. |
group |
the indentifier of the treated group, usually 1 or the level of the fact variable |
useCP |
the cutpoints for the calculation of the L1 measure |
verbose |
integer, prints more info on screen. |
... |
passed to |
This is a tool to help the user in the search of different choices models for matching.
For example, for the search of different propensity score models.
The tool tries all submodels of k
variables starting from one covariate up to the full model. Then adds
interactions to the full model trying all pairs, triplets etc according to the parameter depth
.
Then, for continuous variables only, adds squared terms to the full model.
This tool is useful in combination with imbspace.plot
.
val |
an invisible object of class |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
Diagnostic tools for CEM. Applies leftward and rightward shifts of the cutpoints.
shift.cem(obj, data, shifts=NULL, verbose=1, plot=TRUE)
shift.cem(obj, data, shifts=NULL, verbose=1, plot=TRUE)
obj |
and object of class cem |
data |
the original data |
shifts |
a vector of proportions of shifts |
verbose |
controls the level of verbosity |
plot |
whether to plot a graphic representation of the search |
For each variable, shift all the cutpoints left and right by
shifts
times the smallest epsilon of the
coarsening. Shifting to the right produces a new cell on the left;
shift to the left, adds a new cell to the coarsening on the right.
Only positive proportions should be used; the algorithm
will produce shifting on the left or on the right. The best shifting
of the original cem match is produced as output, where best is defined
in terms of the maximal total number of matched units mT+mC
(see below).
By default, the function returns minimal information about the execution
of the algorithm. By setting a value greater than 0 in option verbose
more feedback on the process is returned.
Option plot
= TRUE
plots the number of treated units
matched mT
, the number of control units
matched mC
, and the sum mT+mC
, as a function of the shifts.
tab |
an invisible object containing a new cem object |
Stefano Iacus, Gary King, and Giuseppe Porro
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
data(LL) set.seed(123) m74 <- max(LL$re74, na.rm=TRUE) s74 <- seq(0,m74,by=sd(LL$re74)) l74 <- length(s74) if(max(s74) < m74) s74 <- c(s74, m74) m75 <- max(LL$re75, na.rm=TRUE) s75 <- seq(0,m75,by=sd(LL$re75)) l75 <- length(s75) if(max(s75) < m75) s75 <- c(s75, m75) mybr = list(re74=s74, re75 = s75, age = hist(LL$age,plot=FALSE)$breaks, education = hist(LL$education,plot=FALSE)$breaks) mat <- cem(treatment="treated",data=LL, drop="re78",cut=mybr) mat shift.cem(mat, data=LL, shifts=seq(0.01, 0.5, length=10), verb=1)
data(LL) set.seed(123) m74 <- max(LL$re74, na.rm=TRUE) s74 <- seq(0,m74,by=sd(LL$re74)) l74 <- length(s74) if(max(s74) < m74) s74 <- c(s74, m74) m75 <- max(LL$re75, na.rm=TRUE) s75 <- seq(0,m75,by=sd(LL$re75)) l75 <- length(s75) if(max(s75) < m75) s75 <- c(s75, m75) mybr = list(re74=s74, re75 = s75, age = hist(LL$age,plot=FALSE)$breaks, education = hist(LL$education,plot=FALSE)$breaks) mat <- cem(treatment="treated",data=LL, drop="re78",cut=mybr) mat shift.cem(mat, data=LL, shifts=seq(0.01, 0.5, length=10), verb=1)
Randomly compute many different matching solutions
spacegraph(treatment=NULL, data = NULL, R=list("cem"=50,"psm"=0,"mdm"=0,"matchit"=0), grouping = NULL, drop=NULL, L1.breaks = NULL, L1.grouping=NULL, fixed = NULL, minimal = 1, maximal = 15, M=100, raw.profile=NULL, keep.weights=FALSE, progress=TRUE, rgrouping=FALSE, groups=NULL, psmpoly=1, mdmpoly=1, other.matches=NULL, heuristic=FALSE, linear.pscore=FALSE, verbose=1)
spacegraph(treatment=NULL, data = NULL, R=list("cem"=50,"psm"=0,"mdm"=0,"matchit"=0), grouping = NULL, drop=NULL, L1.breaks = NULL, L1.grouping=NULL, fixed = NULL, minimal = 1, maximal = 15, M=100, raw.profile=NULL, keep.weights=FALSE, progress=TRUE, rgrouping=FALSE, groups=NULL, psmpoly=1, mdmpoly=1, other.matches=NULL, heuristic=FALSE, linear.pscore=FALSE, verbose=1)
treatment |
character, name of the treatment variable |
data |
a data.frame |
drop |
a vector of variable names in the data frame to ignore during matching. |
R |
a named list that gives the number of of possible random solutions for each matching method. Allowed methods are |
grouping |
named list, each element of which is a list of
groupings for a single categorical variable. For more details see |
L1.breaks |
list of cutpoints for the calculation of the L1 measure. |
L1.grouping |
as |
fixed |
vector of variable names which will not be relaxed. |
minimal |
the minimal number of intervals acceptable after relaxation. Should be a named list of positive integers or if a number, this is applied to all variables. |
maximal |
the maximal number of intervals acceptable after relaxation. Should be a named list of positive integers or if a number, this is applied to all variables. |
M |
number of possible random coarsening for the L1 measure |
raw.profile |
an object of class |
keep.weights |
if |
rgrouping |
Boolean, specifies whether levels of categorical variables should be randomly grouped together by CEM. |
groups |
named list, each element of which is a list of allowable groupings for a single categorical variable. |
psmpoly |
numeric, specifying the order of polynomials to include in the propensity score models. At the moment, only |
mdmpoly |
numeric, specifying the order of polynomials to include in Mahalanobis matching. At the moment, only |
other.matches |
This argument allows plotting of user-specified matching solutions. The solutions must be supplied in a specific format: as a list of data frames, where each data frame contains the observation IDs, observations weights, and the method. These must be provided in three columns of the data frame, with names (exactly) "id", "weight", "method". See the example. |
heuristic |
Boolean, if TRUE spacegraph uses a heuristic method to randomly select covariates for MDM and PSM rather than sampling from all possible combinations of covariates. The heuristic tends to select most of the main effects and a small number of interactions. This method is automatically applied with large numbers of covariates. |
progress |
show progress bars. |
linear.pscore |
does |
verbose |
integer, controls the level of output. |
Spacegraph is a tool to help the user to the search for optimal matching solutions by generating many matching solutions from a variety of matching algorithms (currently CEM, Mahalanobis distance matching, and propensity score matching are supported). The resulting object can be plotted with plot()
to show where each solution falls along the bais-variance tradeoff.
The spacegraph
function currently calculates two measures of balance for each solution: the L1 metric (see L1.meas
) and the difference in means of the covariates. Typically, analyists look at the difference in means seperately for each variable, but this can't be plotted on a two-dimensional graph. We summarize the difference in means by calculating the average difference in means for all of the covariates. Specifically, we calculate the difference in means for each variable as mean(treated)-mean(control)/sd(treated) and then average across all covariates.
val |
an object of class |
Richard Nielsen, Stefano Iacus, Gary King, and Giuseppe Porro
data(LL) set.seed(123) sp <- spacegraph("treated", LL, drop="re78", M=5, R=list(cem=5,psm=5, mdm=5)) plot(sp) ## ABOUT THE PLOTTING TOOL: ## The circled solution is the current selection. ## Solutions that are strictly better are also circled. ## The gui provides the exact call to re-run the selected matching solution. ## The call can also be edited, re-run by clicking the button, and ## automatically added to the existing spacegraph. ## CEM solutions can also be adjusted variable by variable and re-run. ## Some plotting parameters can be changed plot(sp, main="Comparison of Matching Methods", ylab="L1", xlim=c(300,50), ylim=c(0,.7)) ## You can specify whether the x-axis shows treated units, ## control units, or all units using the argument N, which ## can take the values "treated", "control", or "all". ## Default is "treated". plot(sp, N="all") ## You can specify how the x-axis is scaled. Setting scale.var=T ## gives you the scaling as 1/sqrt(n). Setting scale.var=F gives ## scales it linearly. Default is scale.var=TRUE. plot(sp, scale.var=FALSE) ## You can also specify which measure of balance to use ## by specifying the argument "balance.metric" as ## "L1", "mdiff" (Avg. standardized difference in means), ## or "mdisc" (Average Malanobis Discrepancy). Default is "L1". plot(sp, balance.metric="mdiff") plot(sp, balance.metric="mdisc") ## Matching solutions from other methods can be included in ## a spacegraph by using the argument "other.matches". ## First, Run a matching method. Here, propensity scores from MatchIt. library(MatchIt) m.out <- matchit(formula=treated ~ education+age, data=LL, method = "nearest") ## Put the required information into a list of data frames. ## Note, there are many ways to do this. mymatches <- list(data.frame(names(m.out$w))) names(mymatches[[1]])[1] <- "id" mymatches[[1]]$weight <- m.out$w mymatches[[1]]$method <- "matchit psm" sp <- spacegraph("treated", LL, drop="re78", M=5, R=list(cem=5,psm=5, mdm=5), other.matches=mymatches) plot(sp)
data(LL) set.seed(123) sp <- spacegraph("treated", LL, drop="re78", M=5, R=list(cem=5,psm=5, mdm=5)) plot(sp) ## ABOUT THE PLOTTING TOOL: ## The circled solution is the current selection. ## Solutions that are strictly better are also circled. ## The gui provides the exact call to re-run the selected matching solution. ## The call can also be edited, re-run by clicking the button, and ## automatically added to the existing spacegraph. ## CEM solutions can also be adjusted variable by variable and re-run. ## Some plotting parameters can be changed plot(sp, main="Comparison of Matching Methods", ylab="L1", xlim=c(300,50), ylim=c(0,.7)) ## You can specify whether the x-axis shows treated units, ## control units, or all units using the argument N, which ## can take the values "treated", "control", or "all". ## Default is "treated". plot(sp, N="all") ## You can specify how the x-axis is scaled. Setting scale.var=T ## gives you the scaling as 1/sqrt(n). Setting scale.var=F gives ## scales it linearly. Default is scale.var=TRUE. plot(sp, scale.var=FALSE) ## You can also specify which measure of balance to use ## by specifying the argument "balance.metric" as ## "L1", "mdiff" (Avg. standardized difference in means), ## or "mdisc" (Average Malanobis Discrepancy). Default is "L1". plot(sp, balance.metric="mdiff") plot(sp, balance.metric="mdisc") ## Matching solutions from other methods can be included in ## a spacegraph by using the argument "other.matches". ## First, Run a matching method. Here, propensity scores from MatchIt. library(MatchIt) m.out <- matchit(formula=treated ~ education+age, data=LL, method = "nearest") ## Put the required information into a list of data frames. ## Note, there are many ways to do this. mymatches <- list(data.frame(names(m.out$w))) names(mymatches[[1]])[1] <- "id" mymatches[[1]]$weight <- m.out$w mymatches[[1]]$method <- "matchit psm" sp <- spacegraph("treated", LL, drop="re78", M=5, R=list(cem=5,psm=5, mdm=5), other.matches=mymatches) plot(sp)