This vignette illustrates the syntax of SCMT models.
For a more general introduction to package MSCMT
see its main vignette.
Although SCM models are usually based on time series data of
predictor variables, standard SCM estimation does not exploit this
particular characteristic. Instead, time series data of predictors are
either aggregated, mostly by calculating one or several means, or every
instant of time is treated as a separate input variable with its own
predictor weight. With package MSCMT, a time series of a predictor
variable can be treated as a single input variable without the need
for aggregation. This extension of SCM is called SCMT, see Klößner and
Pfeifer (2015).
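To make the three options concrete, the following sketch uses a made-up predictor series for a single unit. The values and the names `x`, `as_mean`, `as_separate` and `as_series` are purely illustrative and not part of MSCMT; the block only shows how many predictor weights each option implies.

```r
# Hypothetical yearly values of one predictor for a single unit (illustrative only)
x <- c("1964" = 10.2, "1965" = 10.8, "1966" = 11.1, "1967" = 11.9)

# (a) Aggregation: the series collapses to one number with one predictor weight
as_mean <- mean(x)

# (b) Every instant of time as a separate input variable:
#     four inputs, each with its own predictor weight
as_separate <- as.list(x)

# (c) SCMT: the whole series is a single input variable,
#     so all four values share one predictor weight
as_series <- list(x)

# Number of predictor weights implied by each option
n_weights <- c(mean     = length(as_mean),
               separate = length(as_separate),
               series   = length(as_series))
print(n_weights)
```

Option (c) keeps all four values as information for the fit while still attaching only one weight to the predictor.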
This vignette illustrates the syntax of SCMT models and how SCMT models may lead to more meaningful predictor weights without drawbacks concerning the model fit.
We use the basque dataset in package Synth as an example and replicate the data preparation from the main vignette of this package:
library(Synth)
data(basque)
library(MSCMT)
Basque <- listFromLong(basque, unit.variable="regionno", time.variable="year", unit.names.variable="regionname")
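# Column sums of all schooling categories, i.e. one total per region
school.sum <- with(Basque,colSums(school.illit + school.prim + school.med + school.high + school.post.high))
# Merge the two highest schooling categories into one
Basque$school.higher <- Basque$school.high + Basque$school.post.high
# Rescale every category by the regional total
for (item in c("school.illit", "school.prim", "school.med", "school.higher"))
  Basque[[item]] <- 6 * 100 * t(t(Basque[[item]]) / school.sum)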
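The double transpose in the loop divides every column of a matrix by a column-specific total. A minimal sketch with a made-up matrix, assuming rows are years and columns are units as in the elements of `Basque`; `X` and `s` are hypothetical stand-ins for a `Basque[[item]]` matrix and `school.sum`:

```r
# Made-up matrix: rows = years, columns = units (illustrative values only)
X <- matrix(c(1, 3,
              2, 6,
              5, 5), nrow = 2,
            dimnames = list(c("1964", "1965"), c("A", "B", "C")))
s <- colSums(X)   # one total per column: A = 4, B = 8, C = 10

# t(X) has units as rows, so dividing by s recycles one total per unit;
# transposing back restores the years-by-units layout
shares <- t(t(X) / s)

# sweep() expresses the same column-wise division more explicitly
shares_sweep <- sweep(X, 2, s, "/")

print(shares)
```

After the division every column of `shares` sums to 1; the factor `6 * 100` in the loop above then rescales these values.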
We also replicate the model specification of the main vignette, which reproduces the model in Abadie and Gardeazabal (2003):
treatment.identifier <- "Basque Country (Pais Vasco)"
controls.identifier <- setdiff(colnames(Basque[[1]]),
                               c(treatment.identifier, "Spain (Espana)"))
times.dep <- cbind("gdpcap" = c(1960,1969))
times.pred <- cbind("school.illit" = c(1964,1969),
                    "school.prim" = c(1964,1969),
                    "school.med" = c(1964,1969),
                    "school.higher" = c(1964,1969),
                    "invest" = c(1964,1969),
                    "gdpcap" = c(1960,1969),
                    "sec.agriculture" = c(1961,1969),
                    "sec.energy" = c(1961,1969),
                    "sec.industry" = c(1961,1969),
                    "sec.construction" = c(1961,1969),
                    "sec.services.venta" = c(1961,1969),
                    "sec.services.nonventa" = c(1961,1969),
                    "popdens" = c(1969,1969))
agg.fns <- rep("mean", ncol(times.pred))
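Each column of `times.pred` holds the first and the last period of one predictor's window. As a small sketch, using a shortened copy with the hypothetical name `tp` so that the block runs on its own, the number of periods per window can be read off directly:

```r
# Shortened copy of the specification above; 'tp' is an illustrative name
tp <- cbind("school.illit"    = c(1964, 1969),
            "gdpcap"          = c(1960, 1969),
            "sec.agriculture" = c(1961, 1969),
            "popdens"         = c(1969, 1969))

# Row 1 is the start, row 2 the end of each window (both inclusive)
window.length <- tp[2, ] - tp[1, ] + 1
print(window.length)

# One aggregation function per predictor, i.e. per column of 'tp'
agg <- rep("mean", ncol(tp))
print(agg)
```

With `"mean"` as aggregation function, each of these windows is collapsed to a single value per unit, regardless of its length.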
Estimation of the model gives:
res <- mscmt(Basque, treatment.identifier, controls.identifier, times.dep, times.pred, agg.fns, seed=1, single.v=TRUE, verbose=FALSE)
res
## Specification:
## --------------
##
## Model type: SCM
## Treated unit: Basque Country (Pais Vasco)
## Control units: Andalucia, Aragon, Principado De Asturias, Baleares (Islas),
## Canarias, Cantabria, Castilla Y Leon, Castilla-La Mancha,
## Cataluna, Comunidad Valenciana, Extremadura, Galicia,
## Madrid (Comunidad De), Murcia (Region de),
## Navarra (Comunidad Foral De), Rioja (La)
## Dependent(s): gdpcap with optimization period from 1960 to 1969
## Predictors: school.illit from 1964 to 1969, aggregated via 'mean',
## school.prim from 1964 to 1969, aggregated via 'mean',
## school.med from 1964 to 1969, aggregated via 'mean',
## school.higher from 1964 to 1969, aggregated via 'mean',
## invest from 1964 to 1969, aggregated via 'mean',
## gdpcap from 1960 to 1969, aggregated via 'mean',
## sec.agriculture from 1961 to 1969, aggregated via 'mean',
## sec.energy from 1961 to 1969, aggregated via 'mean',
## sec.industry from 1961 to 1969, aggregated via 'mean',
## sec.construction from 1961 to 1969, aggregated via 'mean',
## sec.services.venta from 1961 to 1969, aggregated via 'mean',
## sec.services.nonventa from 1961 to 1969, aggregated via 'mean',
## popdens from 1969 to 1969, aggregated via 'mean'
##
##
## Results:
## --------
##
## Result type: Ordinary solution, ie. no perfect preditor fit possible and the
## predictors impose some restrictions on the outer optimization.
## Optimal W: Baleares (Islas) : 21.92728%,
## Cataluna : 63.27857%,
## Madrid (Comunidad De): 14.79414%
## Dependent loss: MSPE ('loss V'): 0.004286071,
## RMSPE : 0.065468095
## (Optimal) V: Single predictor weights V requested. The optimal weight vector
## V is:
## max.order
## school.illit.mean.1964.1969 1.578398e-05
## school.prim.mean.1964.1969 1.578398e-05
## school.med.mean.1964.1969 1.578398e-05
## school.higher.mean.1964.1969 2.903475e-04
## invest.mean.1964.1969 2.990163e-04
## gdpcap.mean.1960.1969 9.992528e-01
## sec.agriculture.mean.1961.1969 1.578398e-05
## sec.energy.mean.1961.1969 1.578398e-05
## sec.industry.mean.1961.1969 1.578398e-05
## sec.construction.mean.1961.1969 1.578398e-05
## sec.services.venta.mean.1961.1969 1.578398e-05
## sec.services.nonventa.mean.1961.1969 1.578398e-05
## popdens.mean.1969.1969 1.578398e-05
## ----------
## pred. loss 3.374961e-04
## (Predictor weights V are standardized by sum(V)=1)
##
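The figures in this output can be cross-checked by hand. The sketch below copies the printed (rounded) values into plain vectors; the names `mspe`, `rmspe` and `v` are ad hoc and no MSCMT function is called. It checks that the RMSPE is the square root of the MSPE and that the standardized weights sum to one, up to rounding.

```r
# Values typed in from the printed output above
mspe  <- 0.004286071
rmspe <- 0.065468095
v <- c(rep(1.578398e-05, 3),   # school.illit, school.prim, school.med
       2.903475e-04,           # school.higher
       2.990163e-04,           # invest
       9.992528e-01,           # gdpcap
       rep(1.578398e-05, 7))   # six sector shares and popdens

print(sqrt(mspe))   # should agree with rmspe up to rounding
print(sum(v))       # should be 1 up to rounding
```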
It is remarkable that the mean of the lagged dependent variable,
gdpcap.mean.1960.1969, is by far the most important predictor with a
weight of 0.9992528; all other predictors are only marginally relevant
due to their tiny weights (at most 0.0002990163).[1]
Omitting the lagged dependent variable gdpcap.mean.1960.1969 from the
model definition, however, leads to a marked increase of the dependent
loss:
times.pred <- times.pred[,-6]   # drop column 6 ("gdpcap"), the lagged dependent variable
agg.fns <- rep("mean", ncol(times.pred))
res2 <- mscmt(Basque, treatment.identifier, controls.identifier, times.dep, times.pred, agg.fns, seed=1, single.v=TRUE, verbose=FALSE)
res2
## Specification:
## --------------
##
## Model type: SCM
## Treated unit: Basque Country (Pais Vasco)
## Control units: Andalucia, Aragon, Principado De Asturias, Baleares (Islas),
## Canarias, Cantabria, Castilla Y Leon, Castilla-La Mancha,
## Cataluna, Comunidad Valenciana, Extremadura, Galicia,
## Madrid (Comunidad De), Murcia (Region de),
## Navarra (Comunidad Foral De), Rioja (La)
## Dependent(s): gdpcap with optimization period from 1960 to 1969
## Predictors: school.illit from 1964 to 1969, aggregated via 'mean',
## school.prim from 1964 to 1969, aggregated via 'mean',
## school.med from 1964 to 1969, aggregated via 'mean',
## school.higher from 1964 to 1969, aggregated via 'mean',
## invest from 1964 to 1969, aggregated via 'mean',
## sec.agriculture from 1961 to 1969, aggregated via 'mean',
## sec.energy from 1961 to 1969, aggregated via 'mean',
## sec.industry from 1961 to 1969, aggregated via 'mean',
## sec.construction from 1961 to 1969, aggregated via 'mean',
## sec.services.venta from 1961 to 1969, aggregated via 'mean',
## sec.services.nonventa from 1961 to 1969, aggregated via 'mean',
## popdens from 1969 to 1969, aggregated via 'mean'
##
##
## Results:
## --------
##
## Result type: Ordinary solution, ie. no perfect preditor fit possible and the
## predictors impose some restrictions on the outer optimization.
## Optimal W: Cataluna : 85.0814%,
## Madrid (Comunidad De): 14.9186%
## Dependent loss: MSPE ('loss V'): 0.008864545,
## RMSPE : 0.094151712
## (Optimal) V: Single predictor weights V requested. The optimal weight vector
## V is:
## max.order
## school.illit.mean.1964.1969 0.02710923
## school.prim.mean.1964.1969 0.02710923
## school.med.mean.1964.1969 0.09108599
## school.higher.mean.1964.1969 0.23068005
## invest.mean.1964.1969 0.02710923
## sec.agriculture.mean.1961.1969 0.02710923
## sec.energy.mean.1961.1969 0.02710923
## sec.industry.mean.1961.1969 0.23068005
## sec.construction.mean.1961.1969 0.02710923
## sec.services.venta.mean.1961.1969 0.02710923
## sec.services.nonventa.mean.1961.1969 0.02710923
## popdens.mean.1969.1969 0.23068005
## ----------
## pred. loss 0.31473799
## (Predictor weights V are standardized by sum(V)=1)
##
The dependent loss (MSPE) more than doubled, from 0.0042861 to 0.0088645. Trying to give more meaning to the economic predictors in this way thus comes at the cost of a worse fit of the dependent variable.
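To put the two losses side by side, a short calculation with the values printed above (typed in by hand, not extracted from `res` and `res2`; the variable names are ad hoc):

```r
mspe.with    <- 0.004286071   # model including gdpcap.mean.1960.1969
mspe.without <- 0.008864545   # model without the lagged dependent variable

# Relative size of the losses; the RMSPE scales with the square root
ratio.mspe  <- mspe.without / mspe.with
ratio.rmspe <- sqrt(ratio.mspe)
print(round(c(MSPE = ratio.mspe, RMSPE = ratio.rmspe), 2))
```

The RMSPE ratio is the more intuitive of the two figures, since the RMSPE is measured in the units of the dependent variable.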
Leaving the lagged dependent variable gdpcap.mean.1960.1969 aside, but
considering all other predictor variables as time series instead of
aggregating their values, leads to the following results:
agg.fns <- rep("id", ncol(times.pred)) # Omitting agg.fns has the same effect (as "id" is the default)
res3 <- mscmt(Basque, treatment.identifier, controls.identifier, times.dep, times.pred, agg.fns, seed=1, single.v=TRUE, verbose=FALSE)
res3
## Specification:
## --------------
##
## Model type: SCMT
## Treated unit: Basque Country (Pais Vasco)
## Control units: Andalucia, Aragon, Principado De Asturias, Baleares (Islas),
## Canarias, Cantabria, Castilla Y Leon, Castilla-La Mancha,
## Cataluna, Comunidad Valenciana, Extremadura, Galicia,
## Madrid (Comunidad De), Murcia (Region de),
## Navarra (Comunidad Foral De), Rioja (La)
## Dependent(s): gdpcap with optimization period from 1960 to 1969
## Predictors: school.illit from 1964 to 1969,
## school.prim from 1964 to 1969,
## school.med from 1964 to 1969,
## school.higher from 1964 to 1969,
## invest from 1964 to 1969,
## sec.agriculture from 1961 to 1969,
## sec.energy from 1961 to 1969,
## sec.industry from 1961 to 1969,
## sec.construction from 1961 to 1969,
## sec.services.venta from 1961 to 1969,
## sec.services.nonventa from 1961 to 1969,
## popdens from 1969 to 1969
##
##
## Results:
## --------
##
## Result type: Ordinary solution, ie. no perfect preditor fit possible and the
## predictors impose some restrictions on the outer optimization.
## Optimal W: Baleares (Islas) : 30.616175425966911661%,
## Canarias : 0.000000000001394418%,
## Cataluna : 25.642267871421836389%,
## Madrid (Comunidad De) : 31.319195420744378566%,
## Navarra (Comunidad Foral De): 12.422361281865482496%
## Dependent loss: MSPE ('loss V'): 0.004212379,
## RMSPE : 0.064902846
## (Optimal) V: Single predictor weights V requested. The optimal weight vector
## V is:
## max.order
## school.illit 0.001191182
## school.prim 0.002720510
## school.med 0.986287184
## school.higher 0.000127276
## invest 0.000127276
## sec.agriculture 0.008782917
## sec.energy 0.000127276
## sec.industry 0.000127276
## sec.construction 0.000127276
## sec.services.venta 0.000127276
## sec.services.nonventa 0.000127276
## popdens 0.000127276
## ----------
## pred. loss 0.013824839
## (Predictor weights V are standardized by sum(V)=1)
##
Notice that this specification’s model type is ‘SCMT’, in contrast to
the previous models, which were ‘SCM’ models. With the ‘SCMT’ model,
the dependent loss (0.0042124) is even smaller than that of the original
model (0.0042861), which used the dependent variable’s mean as an extra
economic predictor. school.med has now become the most important
predictor with weight 0.9862872; no predictor weight is smaller than
0.000127276.
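The predictor weights of the SCMT model can be inspected in the same manual way. The vector below is typed in from the printed output of the SCMT model; `v3` is an ad hoc name and no MSCMT function is called.

```r
# Predictor weights as printed for the SCMT model
v3 <- c(school.illit = 0.001191182, school.prim = 0.002720510,
        school.med = 0.986287184, school.higher = 0.000127276,
        invest = 0.000127276, sec.agriculture = 0.008782917,
        sec.energy = 0.000127276, sec.industry = 0.000127276,
        sec.construction = 0.000127276, sec.services.venta = 0.000127276,
        sec.services.nonventa = 0.000127276, popdens = 0.000127276)

print(names(v3)[which.max(v3)])         # predictor with the largest weight
print(sum(v3 == min(v3)))               # predictors sharing the smallest weight
print(sort(v3, decreasing = TRUE)[1:3]) # the three largest weights
```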
This vignette illustrated that considering predictors as true time series (without intermediate aggregation) may have various benefits. In this example, by excluding the mean of the lagged dependent variable from the set of economic predictors and considering all other predictors as time series, more meaningful predictor weights could be obtained and the dependent variable’s fit could be slightly improved, too.
[1] Notice that the weight vector v is obtained by maximizing the order
statistics of v (while fixing the sum of v to 1). This choice of v
attributes weights as large as possible to even the least relevant
predictor(s).
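One way to read this criterion, offered only as an interpretation of the note and not as MSCMT's actual implementation: among otherwise equally good weight vectors, sort each vector in increasing order and prefer the one whose sorted values are lexicographically largest. The helper `prefer_by_order_stats()` and both candidate vectors are hypothetical.

```r
# Returns TRUE if 'a' is preferred to 'b' when the sorted weights are compared
# lexicographically: smallest weight first, then second smallest, and so on.
prefer_by_order_stats <- function(a, b, tol = 1e-12) {
  sa <- sort(a)
  sb <- sort(b)
  d  <- sa - sb
  first.diff <- which(abs(d) > tol)[1]
  if (is.na(first.diff)) return(FALSE)   # identical order statistics
  d[first.diff] > 0
}

# Two made-up candidates, both standardized to sum to 1
v.a <- c(0.90, 0.05, 0.05)
v.b <- c(0.90, 0.08, 0.02)

# v.a has the larger smallest weight (0.05 versus 0.02)
print(prefer_by_order_stats(v.a, v.b))
```

Under this reading, the comparison is decided by the least relevant predictor first, which matches the statement that even the smallest weights are made as large as possible.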