Working with package MSCMT

Introduction

Based on the basque dataset and the corresponding example of package Synth, this vignette illustrates the usage of package MSCMT. Thereby, parts of Abadie and Gardeazabal (2003) are reproduced in this example.

Preparing the Data

The MSCMT package expects its input data to be a list of matrices, where each matrix corresponds to one variable of interest. The column names of the matrices correspond to the names of all units (treated unit and control units) and have to be identical across all elements of the list. The row names correspond to the instants of time where data is available, they may well be different between variables.

Currently, all variables must either contain annual, quarterly, monthly, weekly or daily data. The row names must therefore match one of the following patterns (where d corresponds to a single digit):

  1. dddd for annual dates, e.g. 2016,
  2. ddddQd for quarterly dates, e.g. 2016Q1 for the first quarter of 2016,
  3. dddd-dd for monthly dates, e.g. 2016-03 for March 2016.
  4. ddddWdd for weekly dates, e.g. 2016W03 for the third week of 2016.
  5. dddd-dd-dd for daily dates, e.g. 2016-03-17.

A list of matrices is a very useful, but quite unusual data format. Therefore, a helper function called listFromLong, which generates a list of matrices from a more common data representation, the so-called ‘long’ format, has been included in package MSCMT. With listFromLong, migrating from the Synth package to MSCMT is simple, because package Synth (more precisely, its dataprep function) expects its input data to be in ‘long’ format.

Example

The basque dataset in package Synth in ‘long’ format has the following structure:

library(Synth)
data(basque)
colnames(basque)
##  [1] "regionno"              "regionname"            "year"                 
##  [4] "gdpcap"                "sec.agriculture"       "sec.energy"           
##  [7] "sec.industry"          "sec.construction"      "sec.services.venta"   
## [10] "sec.services.nonventa" "school.illit"          "school.prim"          
## [13] "school.med"            "school.high"           "school.post.high"     
## [16] "popdens"               "invest"
basque[703,]
##     regionno                  regionname year   gdpcap sec.agriculture
## 703       17 Basque Country (Pais Vasco) 1969 6.081405            5.53
##     sec.energy sec.industry sec.construction sec.services.venta
## 703       3.75        45.39             6.45              34.09
##     sec.services.nonventa school.illit school.prim school.med school.high
## 703                   4.8      43.2716    1093.236   123.8557    30.35246
##     school.post.high popdens   invest
## 703         15.08178  246.89 26.37294

With listFromLong from package MSCMT, the conversion to ‘list’ format is a simple task. The names of the parameters correspond to the names of the parameters for function dataprep of package Synth:

library(MSCMT)
Basque <- listFromLong(basque, unit.variable="regionno", time.variable="year", unit.names.variable="regionname")
names(Basque)
##  [1] "gdpcap"                "sec.agriculture"       "sec.energy"           
##  [4] "sec.industry"          "sec.construction"      "sec.services.venta"   
##  [7] "sec.services.nonventa" "school.illit"          "school.prim"          
## [10] "school.med"            "school.high"           "school.post.high"     
## [13] "popdens"               "invest"
head(Basque$gdpcap)
##      Spain (Espana) Andalucia   Aragon Principado De Asturias Baleares (Islas)
## 1955       2.354542  1.688732 2.288775               2.502928         3.143959
## 1956       2.480149  1.758498 2.445159               2.615538         3.347758
## 1957       2.603613  1.827621 2.603399               2.725793         3.549629
## 1958       2.637104  1.852756 2.639032               2.751857         3.642673
## 1959       2.669880  1.878035 2.677092               2.777421         3.734862
## 1960       2.869966  2.010140 2.881462               2.967295         4.058841
##      Canarias Cantabria Castilla Y Leon Castilla-La Mancha Cataluna
## 1955 1.914382  2.559412        1.729149           1.327764 3.546630
## 1956 2.071837  2.693873        1.838332           1.415096 3.690446
## 1957 2.226078  2.820337        1.947658           1.503570 3.826835
## 1958 2.220866  2.879035        1.971365           1.531420 3.875678
## 1959 2.213439  2.943730        1.995144           1.559340 3.921737
## 1960 2.357684  3.137032        2.138817           1.667524 4.241788
##      Comunidad Valenciana Extremadura  Galicia Madrid (Comunidad De)
## 1955             2.575978    1.243430 1.634676              4.594473
## 1956             2.738503    1.332548 1.725578              4.786632
## 1957             2.899886    1.422451 1.816481              4.963439
## 1958             2.963510    1.440231 1.840903              4.906170
## 1959             3.026207    1.458083 1.865396              4.846401
## 1960             3.219294    1.535847 1.983290              5.161097
##      Murcia (Region de) Navarra (Comunidad Foral De)
## 1955           1.679520                     2.555127
## 1956           1.764282                     2.698158
## 1957           1.850328                     2.839831
## 1958           1.887389                     2.881891
## 1959           1.924093                     2.930877
## 1960           2.118609                     3.163525
##      Basque Country (Pais Vasco) Rioja (La)
## 1955                    3.853185   2.390460
## 1956                    3.945658   2.535204
## 1957                    4.033562   2.680020
## 1958                    4.023422   2.726435
## 1959                    4.013782   2.772851
## 1960                    4.285918   2.969866

The ‘list’ representation allows for simple, reproducible data preparations. To reproduce the analysis of Abadie and Gardeazabal (2003), the following preparations are sufficient:

# define the sum of all cases
school.sum <- with(Basque,colSums(school.illit + school.prim + school.med + school.high  + school.post.high))
# combine school.high and school.post.high in a single class
Basque$school.higher <- Basque$school.high + Basque$school.post.high
# calculate ratios and multiply by number of observations to obtain percentages from totals 
for (item in c("school.illit", "school.prim", "school.med", "school.higher"))      
  Basque[[item]] <- 6 * 100 * t(t(Basque[[item]]) / school.sum)

Defining the Model

In package Synth, data preparation and model formulation are combined in function dataprep, whereas the estimation is separated from the model formulation. The approach of package MSCMT is different: in order to facilitate the estimation of (potentially many) different models based on the same dataset, the model formulation has been moved to function mscmt, which also does the model estimation.

To allow for the methodological extensions (Multivariate Synthetic Control Methods using Time-Series), the model syntax had to be made more flexible. A model formulation now consists of

  • defining the (single)1 treated unit (single character string, parameter treatment.identifier),
  • defining the (multiple) control units (vector of character strings, parameter controls.identifier),
  • defining one or more dependent variables and the corresponding optimization periods as a 2 x L-matrix (parameter times.dep), where
    • the column names consist of the names of the L dependent variables,
    • the first row contains the starting times of the optimization periods in the appropriate (annual, quarterly, or monthly) format,
    • the second row contains the end times of the optimization periods in the appropriate format,
  • defining several predictor variables and the corresponding optimization periods as a 2 x K-matrix (parameter times.pred), where
    • the column names consist of the names of the K predictor variables,
    • the first row contains the starting times of the optimization periods in the appropriate (annual, quarterly, or monthly) format,
    • the second row contains the end times of the optimization periods in the appropriate format,
  • an (optional) vector of the names of K aggregation functions (parameter agg.fns) for the predictor variables. If missing, all predictor variables are considered to be time series, which corresponds to the function name "id". Whenever the result of an aggregation function has length exceeding 1, the resulting data is considered as a time series, too.

Note that times.dep and times.pred may contain duplicate column names for further increased flexibility of the model specification.

Example

The following model specification reproduces the model in Abadie and Gardeazabal (2003):

treatment.identifier <- "Basque Country (Pais Vasco)"
controls.identifier  <- setdiff(colnames(Basque[[1]]),
                                c(treatment.identifier, "Spain (Espana)"))
times.dep  <- cbind("gdpcap"                = c(1960,1969))
times.pred <- cbind("school.illit"          = c(1964,1969),
                    "school.prim"           = c(1964,1969),
                    "school.med"            = c(1964,1969),
                    "school.higher"         = c(1964,1969),
                    "invest"                = c(1964,1969),
                    "gdpcap"                = c(1960,1969),
                    "sec.agriculture"       = c(1961,1969),
                    "sec.energy"            = c(1961,1969),
                    "sec.industry"          = c(1961,1969),
                    "sec.construction"      = c(1961,1969),
                    "sec.services.venta"    = c(1961,1969),
                    "sec.services.nonventa" = c(1961,1969),
                    "popdens"               = c(1969,1969))
agg.fns <- rep("mean", ncol(times.pred))                       

Single Estimation

After preparing the data and formulating the model, the model estimation is done with function mscmt. Apart from the function parameters concerning data and model specification, there are further parameters which, e.g.,

  • control for several aspects of the optimization procedure, including weights for the inner and outer target functions,
  • choose and tune the optimizers,
  • verbose output,
  • initiate placebo studies
  • allow to use the computational power of a multi-core cpu or cluster for placebo studies.

Example

A simple estimation (without a placebo study) produces the following console output (if parameter verbose is TRUE, which is the default):

res <- mscmt(Basque, treatment.identifier, controls.identifier, times.dep, times.pred, agg.fns, seed=1)
## 06:42:06: Number of 'sunny' donors: 16 out of 16
## 06:42:06: Unrestricted outer optimum (obtained by ignoring all predictors) with 
## 06:42:06: RMSPE 0.064236669716524 and MSPE (loss v) 0.0041263497362698 is 
## 06:42:06: INFEASIBLE when respecting the predictors.
## 06:42:06: Starting optimization via DEoptC, random seed 1.
## 06:42:11: Optimization finished (1122001 calls to inner optimizer), rmspe: 
## 06:42:11: 0.0654680949292753, mspe: 0.00428607145366861.
## Final rmspe: 0.06546809, mspe (loss v): 0.004286071
## Optimal weights:
##      Baleares (Islas)              Cataluna Madrid (Comunidad De) 
##             0.2192728             0.6327857             0.1479414

Package MSCMT ships with an S3 method for print, which gives a nice human-readable summary of the estimation results:

res
## Specification:
## --------------
## 
## Model type:     SCM
## Treated unit:   Basque Country (Pais Vasco)
## Control units:  Andalucia, Aragon, Principado De Asturias, Baleares (Islas), 
##                 Canarias, Cantabria, Castilla Y Leon, Castilla-La Mancha, 
##                 Cataluna, Comunidad Valenciana, Extremadura, Galicia, 
##                 Madrid (Comunidad De), Murcia (Region de), 
##                 Navarra (Comunidad Foral De), Rioja (La)
## Dependent(s):   gdpcap with optimization period from 1960 to 1969
## Predictors:     school.illit          from 1964 to 1969, aggregated via 'mean', 
##                 school.prim           from 1964 to 1969, aggregated via 'mean', 
##                 school.med            from 1964 to 1969, aggregated via 'mean', 
##                 school.higher         from 1964 to 1969, aggregated via 'mean', 
##                 invest                from 1964 to 1969, aggregated via 'mean', 
##                 gdpcap                from 1960 to 1969, aggregated via 'mean', 
##                 sec.agriculture       from 1961 to 1969, aggregated via 'mean', 
##                 sec.energy            from 1961 to 1969, aggregated via 'mean', 
##                 sec.industry          from 1961 to 1969, aggregated via 'mean', 
##                 sec.construction      from 1961 to 1969, aggregated via 'mean', 
##                 sec.services.venta    from 1961 to 1969, aggregated via 'mean', 
##                 sec.services.nonventa from 1961 to 1969, aggregated via 'mean', 
##                 popdens               from 1969 to 1969, aggregated via 'mean'
## 
## 
## Results:
## --------
## 
## Result type:    Ordinary solution, ie. no perfect preditor fit possible and the 
##                 predictors impose some restrictions on the outer optimization.
## Optimal W:      Baleares (Islas)     : 21.92728%, 
##                 Cataluna             : 63.27857%, 
##                 Madrid (Comunidad De): 14.79414%
## Dependent loss: MSPE ('loss V'): 0.004286071, 
##                 RMSPE          : 0.065468095
## (Optimal) V:    Some optimal weight vectors V are:
##                                                        min.loss.w    max.order
##                 school.illit.mean.1964.1969          9.998580e-09 1.578398e-05
##                 school.prim.mean.1964.1969           9.998580e-09 1.578398e-05
##                 school.med.mean.1964.1969            9.998580e-09 1.578398e-05
##                 school.higher.mean.1964.1969         9.998580e-09 2.903475e-04
##                 invest.mean.1964.1969                8.469442e-05 2.990163e-04
##                 gdpcap.mean.1960.1969                9.998580e-01 9.992528e-01
##                 sec.agriculture.mean.1961.1969       9.998580e-09 1.578398e-05
##                 sec.energy.mean.1961.1969            9.998580e-09 1.578398e-05
##                 sec.industry.mean.1961.1969          9.998580e-09 1.578398e-05
##                 sec.construction.mean.1961.1969      9.998580e-09 1.578398e-05
##                 sec.services.venta.mean.1961.1969    9.998580e-09 1.578398e-05
##                 sec.services.nonventa.mean.1961.1969 5.718465e-05 1.578398e-05
##                 popdens.mean.1969.1969               9.998580e-09 1.578398e-05
##                 ----------                                                    
##                 pred. loss                           1.139851e-04 3.374961e-04
##                 (Predictor weights V are standardized by sum(V)=1)
## 

While there is a basic S3 method for plot, it is strongly recommended to use package ggplot2 and the corresponding S3 method for ggplot contained in package MSCMT. With the results of a simple estimation, two types of plots are available, the first being a comparison of original and synthesized data for the treated unit. The variable to be plotted can be selected with parameter what, by default the (first) dependent variable is being chosen:

library(ggplot2)
ggplot(res, type="comparison")

The second type of plot is a plot of the gaps, ie. the differences between original and synthesized data:

ggplot(res, type="gaps")

It is possible to plot several variables by providing a vector for argument what. Pre-defined sets of variables named "dependents", "predictors", and "all" can be selected with parameter what.set.

ggplot(res, what=c("gdpcap","invest","school.higher","sec.energy"), type="comparison")

Placebo Study

Placebo studies are performed by simply setting the function argument placebo to TRUE. By default, the original treated unit is not added to the donor pool of the (original) control units, but this can be changed with parameter placebo.with.treated.

Making Use of a Cluster

A remarkable speed-up (depending on the number of control units) can be achieved for placebo studies by making use of a cluster, which can be set up with the help of package parallel. The simplest form of a cluster is a local cluster, which makes the power of multi-core cpus available for the (lengthy) computations involved with placebo studies. Setting up a local cluster is very easy, see the example below.

The argument cl of function mscmt can be used to specify the cluster to be used for placebo studies. The only drawback of using a cluster for placebo studies is losing the verbose output from the individual (one for each unit) SCM estimations, which includes a lack of progress information for the whole placebo study. Nevertheless, the speed-up should compensate for this drawback in all applications where a placebo study is meaningful.

Although clusters should be shut down automatically when the corresponding (master) R process is finished, function stopCluster can (and should) be used to shut down the cluster manually.

Example

In the following example, a (local) cluster with two nodes is used for the estimation.

library(parallel)
cl <- makeCluster(2)
resplacebo <- mscmt(Basque, treatment.identifier, controls.identifier, times.dep, times.pred, agg.fns, cl=cl, placebo=TRUE, seed=1)
## 06:42:12: Starting placebo study, excluding original treated unit, on the 
## 06:42:12: cluster. Please hold the line.
## 06:42:54: Placebo study on cluster finished.
stopCluster(cl)

Object resplacebo now contains single SCM estimations for each unit as well as aggregated information concerning original data, synthesized data, and gaps for all units. The individual SCM estimations can be accessed separately (as list elements with names corresponding to the units’ names). With the following plot, one can inspect whether there is some effect for Catalonia:

ggplot(resplacebo[["Cataluna"]], type="comparison")

Several functions in package MSCMT are able to make use of the results of the placebo study as a whole. One example are so-called placebo plots, by setting the plot type to "placebo.gaps" (the default for results of a placebo study):

ggplot(resplacebo)

Statistical Inference

For statistical inference based on the results of a placebo study, the literature has developed so-called ‘placebo tests’, which have similarities to permutation tests. Two of these are

  • tests for the (per-period) treatment effect based on the per-period gaps,
  • tests for the aggregated treatment effect based on a difference-in-difference approach for aggregated (average) gaps.

Most often, not all control units can be synthesized with an acceptable fit in a placebo study, resulting in large pre-treatment gaps. Of course, large post-treatment gaps are expected for these units, but since these gaps are rather caused from lack of fit than from an existing treatment effect, excluding such units is strongly advisable while investigating the effect for the (original) treated unit.

Excluding control units with large pre-treatment errors is usually done by limiting the ratio of a control unit’s pre-treatment (r)mspe to the treated unit’s (r)mspe.

Example

Control units with large pre-treatment errors can easily be excluded from placebo plots:

ggplot(resplacebo, exclude.ratio=5, ratio.type="mspe")

The p-values of per-period placebo tests can be calculated via function pvalue or plotted via ggplot with plot.type="p.values":

pvalue(resplacebo, exclude.ratio=5, ratio.type="mspe", alternative="less")
## Time Series:
## Start = 1970 
## End = 1997 
## Frequency = 1 
##  [1] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909 0.36363636
##  [7] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909
## [13] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909
## [19] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909
## [25] 0.09090909 0.09090909 0.18181818 0.18181818
ggplot(resplacebo, exclude.ratio=5, ratio.type="mspe", type="p.value", alternative="less")

Calculating the aggregated treatment effect and testing its significance can be done with function did of package MSCMT:

did(resplacebo, range.post=c(1970,1990), exclude.ratio=5, alternative="less")
## $effect.size
## [1] -0.7715269
## 
## $average.pre
## [1] 0.0005933236
## 
## $average.post
## [1] -0.7709335
## 
## $p.value
## [1] 0.125
## 
## $rank
## [1] 1
## 
## $excluded
## [1] "Principado De Asturias" "Baleares (Islas)"       "Castilla Y Leon"       
## [4] "Castilla-La Mancha"     "Comunidad Valenciana"   "Extremadura"           
## [7] "Madrid (Comunidad De)"  "Murcia (Region de)"     "Rioja (La)"

Summary

In this vignette, the basic workflow of preparing the data, defining and estimating the model, and evaluating the results of a (simple) SCM application with package MSCMT has been illustrated.

Many features and options of package MSCMT remained untouched, because this would have led far beyond the scope of this simple example. There are more specialized vignettes which illustrate

In future releases of this package, additional vignettes will (probably) be added to illustrate

  • applications of MSCM and MSCMT,
  • alternatives for the optimizers and its parameters,
  • more functionality of the ggplot method.
  • the use of cross-validation, see Becker, Klößner, and Pfeifer (2018).

References

Abadie, Alberto, and Javier Gardeazabal. 2003. “The Economic Costs of Conflict: A Case Study of the Basque Country.” The American Economic Review 93 (1): 113–32. http://dx.doi.org/10.1257/000282803321455188.
Becker, Martin, Stefan Klößner, and Gregor Pfeifer. 2018. “Cross-Validating Synthetic Controls.” Economics Bulletin 38: 603–9. http://www.accessecon.com/Pubs/EB/2018/Volume38/EB-18-V38-I1-P58.pdf.

  1. Changing the treated unit for placebo studies is done automatically.↩︎