Based on the basque
dataset and the corresponding
example of package Synth
, this vignette illustrates the
usage of package MSCMT
. Thereby, parts of Abadie and Gardeazabal (2003) are reproduced in
this example.
The MSCMT
package expects its input data to be a
list
of matrices, where each matrix corresponds to one
variable of interest. The column names of the matrices correspond to the
names of all units (treated unit and control units) and have to be
identical across all elements of the list. The row names correspond to
the instants of time where data is available, they may well be different
between variables.
Currently, all variables must either contain annual, quarterly,
monthly, weekly or daily data. The row names must therefore match one of
the following patterns (where d
corresponds to a single
digit):
dddd
for annual dates, e.g. 2016
,ddddQd
for quarterly dates, e.g. 2016Q1
for the first quarter of 2016,dddd-dd
for monthly dates, e.g. 2016-03
for March 2016.ddddWdd
for weekly dates, e.g. 2016W03
for
the third week of 2016.dddd-dd-dd
for daily dates,
e.g. 2016-03-17
.A list of matrices is a very useful, but quite unusual data format.
Therefore, a helper function called listFromLong
, which
generates a list of matrices from a more common data representation, the
so-called ‘long’ format, has been included in package
MSCMT
. With listFromLong
, migrating from the
Synth
package to MSCMT
is simple, because
package Synth
(more precisely, its dataprep
function) expects its input data to be in ‘long’ format.
The basque
dataset in package Synth
in
‘long’ format has the following structure:
## [1] "regionno" "regionname" "year"
## [4] "gdpcap" "sec.agriculture" "sec.energy"
## [7] "sec.industry" "sec.construction" "sec.services.venta"
## [10] "sec.services.nonventa" "school.illit" "school.prim"
## [13] "school.med" "school.high" "school.post.high"
## [16] "popdens" "invest"
## regionno regionname year gdpcap sec.agriculture
## 703 17 Basque Country (Pais Vasco) 1969 6.081405 5.53
## sec.energy sec.industry sec.construction sec.services.venta
## 703 3.75 45.39 6.45 34.09
## sec.services.nonventa school.illit school.prim school.med school.high
## 703 4.8 43.2716 1093.236 123.8557 30.35246
## school.post.high popdens invest
## 703 15.08178 246.89 26.37294
With listFromLong
from package MSCMT
, the
conversion to ‘list’ format is a simple task. The names of the
parameters correspond to the names of the parameters for function
dataprep
of package Synth
:
library(MSCMT)
Basque <- listFromLong(basque, unit.variable="regionno", time.variable="year", unit.names.variable="regionname")
names(Basque)
## [1] "gdpcap" "sec.agriculture" "sec.energy"
## [4] "sec.industry" "sec.construction" "sec.services.venta"
## [7] "sec.services.nonventa" "school.illit" "school.prim"
## [10] "school.med" "school.high" "school.post.high"
## [13] "popdens" "invest"
## Spain (Espana) Andalucia Aragon Principado De Asturias Baleares (Islas)
## 1955 2.354542 1.688732 2.288775 2.502928 3.143959
## 1956 2.480149 1.758498 2.445159 2.615538 3.347758
## 1957 2.603613 1.827621 2.603399 2.725793 3.549629
## 1958 2.637104 1.852756 2.639032 2.751857 3.642673
## 1959 2.669880 1.878035 2.677092 2.777421 3.734862
## 1960 2.869966 2.010140 2.881462 2.967295 4.058841
## Canarias Cantabria Castilla Y Leon Castilla-La Mancha Cataluna
## 1955 1.914382 2.559412 1.729149 1.327764 3.546630
## 1956 2.071837 2.693873 1.838332 1.415096 3.690446
## 1957 2.226078 2.820337 1.947658 1.503570 3.826835
## 1958 2.220866 2.879035 1.971365 1.531420 3.875678
## 1959 2.213439 2.943730 1.995144 1.559340 3.921737
## 1960 2.357684 3.137032 2.138817 1.667524 4.241788
## Comunidad Valenciana Extremadura Galicia Madrid (Comunidad De)
## 1955 2.575978 1.243430 1.634676 4.594473
## 1956 2.738503 1.332548 1.725578 4.786632
## 1957 2.899886 1.422451 1.816481 4.963439
## 1958 2.963510 1.440231 1.840903 4.906170
## 1959 3.026207 1.458083 1.865396 4.846401
## 1960 3.219294 1.535847 1.983290 5.161097
## Murcia (Region de) Navarra (Comunidad Foral De)
## 1955 1.679520 2.555127
## 1956 1.764282 2.698158
## 1957 1.850328 2.839831
## 1958 1.887389 2.881891
## 1959 1.924093 2.930877
## 1960 2.118609 3.163525
## Basque Country (Pais Vasco) Rioja (La)
## 1955 3.853185 2.390460
## 1956 3.945658 2.535204
## 1957 4.033562 2.680020
## 1958 4.023422 2.726435
## 1959 4.013782 2.772851
## 1960 4.285918 2.969866
The ‘list’ representation allows for simple, reproducible data preparations. To reproduce the analysis of Abadie and Gardeazabal (2003), the following preparations are sufficient:
# define the sum of all cases
school.sum <- with(Basque,colSums(school.illit + school.prim + school.med + school.high + school.post.high))
# combine school.high and school.post.high in a single class
Basque$school.higher <- Basque$school.high + Basque$school.post.high
# calculate ratios and multiply by number of observations to obtain percentages from totals
for (item in c("school.illit", "school.prim", "school.med", "school.higher"))
Basque[[item]] <- 6 * 100 * t(t(Basque[[item]]) / school.sum)
In package Synth
, data preparation and model formulation
are combined in function dataprep
, whereas the estimation
is separated from the model formulation. The approach of package
MSCMT
is different: in order to facilitate the estimation
of (potentially many) different models based on the same dataset, the
model formulation has been moved to function mscmt
, which
also does the model estimation.
To allow for the methodological extensions (Multivariate Synthetic Control Methods using Time-Series), the model syntax had to be made more flexible. A model formulation now consists of
treatment.identifier
),controls.identifier
),2 x L
-matrix (parameter
times.dep
), where
L
dependent variables,2 x K
-matrix (parameter
times.pred
), where
K
predictor variables,K
aggregation
functions (parameter agg.fns
) for the predictor variables.
If missing, all predictor variables are considered to be time series,
which corresponds to the function name "id"
. Whenever the
result of an aggregation function has length exceeding 1, the resulting
data is considered as a time series, too.Note that times.dep
and times.pred
may
contain duplicate column names for further increased flexibility of the
model specification.
The following model specification reproduces the model in Abadie and Gardeazabal (2003):
treatment.identifier <- "Basque Country (Pais Vasco)"
controls.identifier <- setdiff(colnames(Basque[[1]]),
c(treatment.identifier, "Spain (Espana)"))
times.dep <- cbind("gdpcap" = c(1960,1969))
times.pred <- cbind("school.illit" = c(1964,1969),
"school.prim" = c(1964,1969),
"school.med" = c(1964,1969),
"school.higher" = c(1964,1969),
"invest" = c(1964,1969),
"gdpcap" = c(1960,1969),
"sec.agriculture" = c(1961,1969),
"sec.energy" = c(1961,1969),
"sec.industry" = c(1961,1969),
"sec.construction" = c(1961,1969),
"sec.services.venta" = c(1961,1969),
"sec.services.nonventa" = c(1961,1969),
"popdens" = c(1969,1969))
agg.fns <- rep("mean", ncol(times.pred))
After preparing the data and formulating the model, the model
estimation is done with function mscmt
. Apart from the
function parameters concerning data and model specification, there are
further parameters which, e.g.,
A simple estimation (without a placebo study) produces the following
console output (if parameter verbose
is TRUE
,
which is the default):
res <- mscmt(Basque, treatment.identifier, controls.identifier, times.dep, times.pred, agg.fns, seed=1)
## 06:42:06: Number of 'sunny' donors: 16 out of 16
## 06:42:06: Unrestricted outer optimum (obtained by ignoring all predictors) with
## 06:42:06: RMSPE 0.064236669716524 and MSPE (loss v) 0.0041263497362698 is
## 06:42:06: INFEASIBLE when respecting the predictors.
## 06:42:06: Starting optimization via DEoptC, random seed 1.
## 06:42:11: Optimization finished (1122001 calls to inner optimizer), rmspe:
## 06:42:11: 0.0654680949292753, mspe: 0.00428607145366861.
## Final rmspe: 0.06546809, mspe (loss v): 0.004286071
## Optimal weights:
## Baleares (Islas) Cataluna Madrid (Comunidad De)
## 0.2192728 0.6327857 0.1479414
Package MSCMT
ships with an S3
method for
print
, which gives a nice human-readable summary of the
estimation results:
## Specification:
## --------------
##
## Model type: SCM
## Treated unit: Basque Country (Pais Vasco)
## Control units: Andalucia, Aragon, Principado De Asturias, Baleares (Islas),
## Canarias, Cantabria, Castilla Y Leon, Castilla-La Mancha,
## Cataluna, Comunidad Valenciana, Extremadura, Galicia,
## Madrid (Comunidad De), Murcia (Region de),
## Navarra (Comunidad Foral De), Rioja (La)
## Dependent(s): gdpcap with optimization period from 1960 to 1969
## Predictors: school.illit from 1964 to 1969, aggregated via 'mean',
## school.prim from 1964 to 1969, aggregated via 'mean',
## school.med from 1964 to 1969, aggregated via 'mean',
## school.higher from 1964 to 1969, aggregated via 'mean',
## invest from 1964 to 1969, aggregated via 'mean',
## gdpcap from 1960 to 1969, aggregated via 'mean',
## sec.agriculture from 1961 to 1969, aggregated via 'mean',
## sec.energy from 1961 to 1969, aggregated via 'mean',
## sec.industry from 1961 to 1969, aggregated via 'mean',
## sec.construction from 1961 to 1969, aggregated via 'mean',
## sec.services.venta from 1961 to 1969, aggregated via 'mean',
## sec.services.nonventa from 1961 to 1969, aggregated via 'mean',
## popdens from 1969 to 1969, aggregated via 'mean'
##
##
## Results:
## --------
##
## Result type: Ordinary solution, ie. no perfect preditor fit possible and the
## predictors impose some restrictions on the outer optimization.
## Optimal W: Baleares (Islas) : 21.92728%,
## Cataluna : 63.27857%,
## Madrid (Comunidad De): 14.79414%
## Dependent loss: MSPE ('loss V'): 0.004286071,
## RMSPE : 0.065468095
## (Optimal) V: Some optimal weight vectors V are:
## min.loss.w max.order
## school.illit.mean.1964.1969 9.998580e-09 1.578398e-05
## school.prim.mean.1964.1969 9.998580e-09 1.578398e-05
## school.med.mean.1964.1969 9.998580e-09 1.578398e-05
## school.higher.mean.1964.1969 9.998580e-09 2.903475e-04
## invest.mean.1964.1969 8.469442e-05 2.990163e-04
## gdpcap.mean.1960.1969 9.998580e-01 9.992528e-01
## sec.agriculture.mean.1961.1969 9.998580e-09 1.578398e-05
## sec.energy.mean.1961.1969 9.998580e-09 1.578398e-05
## sec.industry.mean.1961.1969 9.998580e-09 1.578398e-05
## sec.construction.mean.1961.1969 9.998580e-09 1.578398e-05
## sec.services.venta.mean.1961.1969 9.998580e-09 1.578398e-05
## sec.services.nonventa.mean.1961.1969 5.718465e-05 1.578398e-05
## popdens.mean.1969.1969 9.998580e-09 1.578398e-05
## ----------
## pred. loss 1.139851e-04 3.374961e-04
## (Predictor weights V are standardized by sum(V)=1)
##
While there is a basic S3
method for plot
,
it is strongly recommended to use package ggplot2
and the
corresponding S3
method for ggplot
contained
in package MSCMT
. With the results of a simple estimation,
two types of plots are available, the first being a comparison of
original and synthesized data for the treated unit. The variable to be
plotted can be selected with parameter what
, by default the
(first) dependent variable is being chosen:
The second type of plot is a plot of the gaps, ie. the differences between original and synthesized data:
It is possible to plot several variables by providing a vector for
argument what
. Pre-defined sets of variables named
"dependents"
, "predictors"
, and
"all"
can be selected with parameter
what.set
.
Placebo studies are performed by simply setting the function argument
placebo
to TRUE
. By default, the original
treated unit is not added to the donor pool of the (original) control
units, but this can be changed with parameter
placebo.with.treated
.
A remarkable speed-up (depending on the number of control units) can
be achieved for placebo studies by making use of a cluster, which can be
set up with the help of package parallel
. The simplest form
of a cluster is a local cluster, which makes the power of multi-core
cpus available for the (lengthy) computations involved with placebo
studies. Setting up a local cluster is very easy, see the example
below.
The argument cl
of function mscmt
can be
used to specify the cluster to be used for placebo studies. The only
drawback of using a cluster for placebo studies is losing the verbose
output from the individual (one for each unit) SCM estimations, which
includes a lack of progress information for the whole placebo study.
Nevertheless, the speed-up should compensate for this drawback in all
applications where a placebo study is meaningful.
Although clusters should be shut down automatically when the
corresponding (master) R
process is finished, function
stopCluster
can (and should) be used to shut down the
cluster manually.
In the following example, a (local) cluster with two nodes is used for the estimation.
library(parallel)
cl <- makeCluster(2)
resplacebo <- mscmt(Basque, treatment.identifier, controls.identifier, times.dep, times.pred, agg.fns, cl=cl, placebo=TRUE, seed=1)
## 06:42:12: Starting placebo study, excluding original treated unit, on the
## 06:42:12: cluster. Please hold the line.
## 06:42:54: Placebo study on cluster finished.
Object resplacebo
now contains single SCM estimations
for each unit as well as aggregated information concerning original
data, synthesized data, and gaps for all units. The individual SCM
estimations can be accessed separately (as list elements with names
corresponding to the units’ names). With the following plot, one can
inspect whether there is some effect for Catalonia:
Several functions in package MSCMT
are able to make use
of the results of the placebo study as a whole. One example are
so-called placebo plots, by setting the plot type to
"placebo.gaps"
(the default for results of a placebo
study):
For statistical inference based on the results of a placebo study, the literature has developed so-called ‘placebo tests’, which have similarities to permutation tests. Two of these are
Most often, not all control units can be synthesized with an acceptable fit in a placebo study, resulting in large pre-treatment gaps. Of course, large post-treatment gaps are expected for these units, but since these gaps are rather caused from lack of fit than from an existing treatment effect, excluding such units is strongly advisable while investigating the effect for the (original) treated unit.
Excluding control units with large pre-treatment errors is usually done by limiting the ratio of a control unit’s pre-treatment (r)mspe to the treated unit’s (r)mspe.
Control units with large pre-treatment errors can easily be excluded from placebo plots:
The p-values of per-period placebo tests can be calculated via
function pvalue
or plotted via ggplot
with
plot.type="p.values"
:
## Time Series:
## Start = 1970
## End = 1997
## Frequency = 1
## [1] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909 0.36363636
## [7] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909
## [13] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909
## [19] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909
## [25] 0.09090909 0.09090909 0.18181818 0.18181818
Calculating the aggregated treatment effect and testing its
significance can be done with function did
of package
MSCMT
:
## $effect.size
## [1] -0.7715269
##
## $average.pre
## [1] 0.0005933236
##
## $average.post
## [1] -0.7709335
##
## $p.value
## [1] 0.125
##
## $rank
## [1] 1
##
## $excluded
## [1] "Principado De Asturias" "Baleares (Islas)" "Castilla Y Leon"
## [4] "Castilla-La Mancha" "Comunidad Valenciana" "Extremadura"
## [7] "Madrid (Comunidad De)" "Murcia (Region de)" "Rioja (La)"
In this vignette, the basic workflow of preparing the data, defining
and estimating the model, and evaluating the results of a (simple) SCM
application with package MSCMT
has been illustrated.
Many features and options of package MSCMT
remained
untouched, because this would have led far beyond the scope of this
simple example. There are more specialized vignettes which
illustrate
synth
of package Synth
,In future releases of this package, additional vignettes will (probably) be added to illustrate
ggplot
method.Changing the treated unit for placebo studies is done automatically.↩︎