Title: | Penalized Poisson Pseudo Maximum Likelihood Regression |
---|---|
Description: | A set of tools that enables efficient estimation of penalized Poisson Pseudo Maximum Likelihood regressions, using lasso or ridge penalties, for models that feature one or more sets of high-dimensional fixed effects. The methodology is based on Breinlich, Corradi, Rocha, Ruta, Santos Silva, and Zylkin (2021) <http://hdl.handle.net/10986/35451> and takes advantage of the method of alternating projections of Gaure (2013) <doi:10.1016/j.csda.2013.03.024> for dealing with HDFE, as well as the coordinate descent algorithm of Friedman, Hastie and Tibshirani (2010) <doi:10.18637/jss.v033.i01> for fitting lasso regressions. The package is also able to carry out cross-validation and to implement the plugin lasso of Belloni, Chernozhukov, Hansen and Kozbur (2016) <doi:10.1080/07350015.2015.1102733>. |
Authors: | Diego Ferreras Garrucho [aut], Tom Zylkin [aut], Joao Cruz [cre, ctb], Nicolas Apfel [ctb] |
Maintainer: | Joao Cruz <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.4 |
Built: | 2025-03-10 06:47:06 UTC |
Source: | CRAN |
This function performs standard plugin lasso PPML estimation for bootreps
samples drawn again with
replacement and reports
those regressors selected in at least a certain fraction of the bootstrap repetitions.
bootstrap( data, dep, indep = NULL, cluster_id = NULL, fixed = NULL, selectobs = NULL, bootreps = 250, boot_threshold = 0.01, colcheck_x = FALSE, colcheck_x_fes = FALSE, post = FALSE, gamma_val = NULL, verbose = FALSE, tol = 1e-06, hdfetol = 0.01, penweights = NULL, maxiter = 1000, phipost = TRUE )
bootstrap( data, dep, indep = NULL, cluster_id = NULL, fixed = NULL, selectobs = NULL, bootreps = 250, boot_threshold = 0.01, colcheck_x = FALSE, colcheck_x_fes = FALSE, post = FALSE, gamma_val = NULL, verbose = FALSE, tol = 1e-06, hdfetol = 0.01, penweights = NULL, maxiter = 1000, phipost = TRUE )
data |
A data frame containing all relevant variables. |
dep |
A string with the names of the independent variables or their column numbers. |
indep |
A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
cluster_id |
A string denoting the cluster-id with which to perform cluster bootstrap. |
fixed |
A vector with the names or column numbers of factor variables identifying the fixed effects,
or a list with the desired interactions between variables in |
selectobs |
Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
bootreps |
Number of bootstrap repetitions. |
boot_threshold |
Minimal threshold. If a variable is selected in at least this fraction of times, it is reported at the end of the iterations. |
colcheck_x |
Logical. If |
colcheck_x_fes |
Logical. If |
post |
Logical. If |
gamma_val |
Numerical value that determines the regularization threshold as defined in Belloni, Chernozhukov, Hansen, and Kozbur (2016). NULL default sets parameter to 0.1/log(n). |
verbose |
Logical. If |
tol |
Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol |
Tolerance parameter for the within-transformation step,
passed on to |
penweights |
Optional: a vector of coefficient-specific penalties to use in plugin lasso when
|
maxiter |
Maximum number of iterations (a number). |
phipost |
Logical. If |
This function enables users to implement the "bootstrap" step in the procedure described in Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2020). To do this, Plugin Lasso is run B times. The function can also perform a post-selection estimation.
A matrix with coefficient estimates for all dependent variables.
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
## Not run: bs1 <- bootstrap(data=trade3, dep="export", cluster_id="clus", fixed=list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), indep=7:22, bootreps=10, colcheck_x = TRUE, colcheck_x_fes = TRUE, boot_threshold = 0.01, post=TRUE, gamma_val=0.01, verbose=FALSE) ## End(Not run)
## Not run: bs1 <- bootstrap(data=trade3, dep="export", cluster_id="clus", fixed=list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), indep=7:22, bootreps=10, colcheck_x = TRUE, colcheck_x_fes = TRUE, boot_threshold = 0.01, post=TRUE, gamma_val=0.01, verbose=FALSE) ## End(Not run)
cluster_matrix
is a helper for computation of cluster-robust standard errors.
cluster_matrix(e, cluster, x)
cluster_matrix(e, cluster, x)
e |
Vector of residuals. |
cluster |
Vector of clusters. |
x |
Regressor matrix. |
Gives the XeeX matrix.
collinearity_check
checks for perfect multicollinearity in a model with high-dimensional
fixed effects. It calls lfe::demeanlist
in order to partial out the fixed effects, and then
uses stats::lm.wfit
to discard linearly dependent variables.
collinearity_check( y, x = NULL, fes = NULL, hdfetol, colcheck_x_fes = TRUE, colcheck_x = TRUE )
collinearity_check( y, x = NULL, fes = NULL, hdfetol, colcheck_x_fes = TRUE, colcheck_x = TRUE )
y |
Dependent variable (a numeric vector). |
x |
Regressor matrix. |
fes |
List of fixed effects. |
hdfetol |
Tolerance for the centering, passed on to |
colcheck_x_fes |
Logical. If |
colcheck_x |
Logical. If |
A numeric vector containing the variables that pass the collinearity check.
This function is a helper for xvalidate
that computes FEs using PPML First Order Conditions
(FOCs).
compute_fes( y, fes, x, b, insample_obs = rep(1, n), onlymus = FALSE, tol = 1e-08, verbose = FALSE )
compute_fes( y, fes, x, b, insample_obs = rep(1, n), onlymus = FALSE, tol = 1e-08, verbose = FALSE )
y |
Dependent variable (a vector). |
fes |
List of fixed effects. |
x |
Regressor matrix. |
b |
A vector of coefficient estimates. |
insample_obs |
Vector of observations used to estimate the |
onlymus |
Logical. If |
tol |
A tolerance parameter. |
verbose |
Logical. If |
If onlymus = TRUE
, the vector of conditional means. Otherwise, a list with two
elements:
mu
: conditional means.
fe_values
: fixed effects.
An auxiliary data set with basic geographic information about country ISO 3166 codes included in the
trade
data set.
countries
countries
A data frame with 249 rows and 4 variables.
iso: Country ISO 3166 code.
name: Country name.
region: Continent.
subregion: sub-continental region.
The source of the data set is Luke Duncalfe's ISO-3166-Countries-with-Regional-Codes repository on GitHub (https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes#readme).
Faster matrix multiplication using C++.
eigenMatMult(A, B) eigenMapMatMult(A, B)
eigenMatMult(A, B) eigenMapMatMult(A, B)
A , B
|
Matrices. |
Finds Least Squares solutions using C++.
fastolsCpp(X, y)
fastolsCpp(X, y)
X |
Regressor matrix. |
y |
Dependent variable (a vector). |
The vector of parameter (beta) estimates.
A wrapper around fastridgeCpp
, for faster computation of the analytical solution for
ridge regression.
fastridge(x, y, weights = rep(1/n, n), lambda, standardize = TRUE)
fastridge(x, y, weights = rep(1/n, n), lambda, standardize = TRUE)
x |
Regressor matrix. |
y |
Dependent variable (a numeric vector). |
weights |
Vector of weights. |
lambda |
Penalty parameter. |
standardize |
Logical. If |
A vector of coefficient (beta) estimates.
Finds Ridge solutions using C++.
fastridgeCpp(X, y, lambda)
fastridgeCpp(X, y, lambda)
X |
Regressor matrix. |
y |
Dependent variable (a vector). |
lambda |
Penalty parameter (a number). |
The vector of parameter (beta) estimates.
Computes standard deviation using C++.
faststddev(X, w)
faststddev(X, w)
X |
Regressor matrix. |
w |
Weights. |
Vector of standard deviations of the parameter estimates.
Computes weighted mean using C++.
fastwmean(X, w)
fastwmean(X, w)
X |
Regressor matrix. |
w |
Weights. |
Weighted mean.
genfes
generates a list of fixed effects by creating interactions of paired factors.
genfes(data, inter)
genfes(data, inter)
data |
A data frame including the factors. |
inter |
A list: each element includes the variables to be interacted (both names and column |
A list containing the desired interactions of vars
, with the same length as inter
.
genmodel
transforms a data frame into the needed components for our main functions (a y vector,
a x matrix and a fes list).
genmodel( data, dep = NULL, indep = NULL, fixed = NULL, cluster = NULL, selectobs = NULL )
genmodel( data, dep = NULL, indep = NULL, fixed = NULL, cluster = NULL, selectobs = NULL )
data |
A data frame containing all relevant variables. |
dep |
A string with the name of the independent variable or a column number. |
indep |
A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
fixed |
A vector with the names or column numbers of factor variables identifying the fixed effects,
or a list with the desired interactions between variables in |
cluster |
Optional. A string with the name of the clustering variable or a column number. It's also possible to input a vector with several variables, in which case the interaction of all of them is taken as the clustering variable. |
selectobs |
Optional. A vector indicating which observations to use. |
A list with four elements:
y
: y vector.
x
: x matrix.
fes
: list of fixed effects.
cluster
: cluster vector.
hdfeppml
fits an (unpenalized) Poisson Pseudo Maximum Likelihood (PPML) model with
high-dimensional fixed effects (HDFE).
hdfeppml( data, dep = 1, indep = NULL, fixed = NULL, cluster = NULL, selectobs = NULL, ... )
hdfeppml( data, dep = 1, indep = NULL, fixed = NULL, cluster = NULL, selectobs = NULL, ... )
data |
A data frame containing all relevant variables. |
dep |
A string with the name of the independent variable or a column number. |
indep |
A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
fixed |
A vector with the names or column numbers of factor variables identifying the fixed effects,
or a list with the desired interactions between variables in |
cluster |
Optional. A string with the name of the clustering variable or a column number. It's also possible to input a vector with several variables, in which case the interaction of all of them is taken as the clustering variable. |
selectobs |
Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
... |
Further options. For a full list, see hdfeppml_int. |
This function is a thin wrapper around hdfeppml_int, providing a more convenient interface for
data frames. Whereas the internal function requires some preliminary handling of data sets (y
must be a vector, x
must be a matrix and fixed effects fes
must be provided in a list),
the wrapper takes a full data frame in the data
argument, and users can simply specify which
variables correspond to y, x and the fixed effects, using either variable names or column numbers.
More formally, hdfeppml_int
performs iteratively re-weighted least squares (IRLS) on a
transformed model, as described in Correia, Guimarães and Zylkin (2020) and similar to the
ppmlhdfe
package in Stata. In each iteration, the function calculates the transformed dependent
variable, partials out the fixed effects (calling collapse:fhdwithin
) and then solves a weighted
least squares problem (using fast C++ implementation).
A list with the following elements:
coefficients
: a 1 x ncol(x)
matrix with coefficient (beta) estimates.
residuals
: a 1 x length(y)
matrix with the residuals of the model.
mu
: a 1 x length(y)
matrix with the final values of the conditional mean .
deviance
:
bic
: Bayesian Information Criterion.
x_resid
: matrix of demeaned regressors.
z_resid
: vector of demeaned (transformed) dependent variable.
se
: standard errors of the coefficients.
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] test <- hdfeppml(data = trade[, -(5:6)], dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), selectobs = (trade$imp %in% americas) & (trade$exp %in% americas)) ## End(Not run)
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] test <- hdfeppml(data = trade[, -(5:6)], dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), selectobs = (trade$imp %in% americas) & (trade$exp %in% americas)) ## End(Not run)
hdfeppml_int
is the internal algorithm called by hdfeppml
to fit an (unpenalized)
Poisson Pseudo Maximum Likelihood (PPML) regression with high-dimensional fixed effects (HDFE). It
takes a vector with the dependent variable, a regressor matrix and a set of fixed effects (in list
form: each element in the list should be a separate HDFE).
hdfeppml_int( y, x = NULL, fes = NULL, tol = 1e-08, hdfetol = 1e-04, mu = NULL, saveX = TRUE, colcheck = TRUE, colcheck_x = colcheck, colcheck_x_fes = colcheck, init_z = NULL, verbose = FALSE, maxiter = 1000, cluster = NULL, vcv = TRUE )
hdfeppml_int( y, x = NULL, fes = NULL, tol = 1e-08, hdfetol = 1e-04, mu = NULL, saveX = TRUE, colcheck = TRUE, colcheck_x = colcheck, colcheck_x_fes = colcheck, init_z = NULL, verbose = FALSE, maxiter = 1000, cluster = NULL, vcv = TRUE )
y |
Dependent variable (a vector) |
x |
Regressor matrix. |
fes |
List of fixed effects. |
tol |
Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol |
Tolerance parameter for the within-transformation step,
passed on to |
mu |
A vector of initial values for mu that can be passed to the command. |
saveX |
Logical. If |
colcheck |
Logical. If |
colcheck_x |
Logical. If |
colcheck_x_fes |
Logical. If |
init_z |
Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm. |
verbose |
Logical. If |
maxiter |
Maximum number of iterations (a number). |
cluster |
Optional: a vector classifying observations into clusters (to use when calculating SEs). |
vcv |
Logical. If |
More formally, hdfeppml_int
performs iteratively re-weighted least squares (IRLS) on a
transformed model, as described in Correia, Guimarães and Zylkin (2020) and similar to the
ppmlhdfe
package in Stata. In each iteration, the function calculates the transformed dependent
variable, partials out the fixed effects (calling collapse::fhdwithin
, which uses the algorithm in
Gaure (2013)) and then solves a weighted least squares problem (using fast C++ implementation).
A list with the following elements:
coefficients
: a 1 x ncol(x)
matrix with coefficient (beta) estimates.
residuals
: a 1 x length(y)
matrix with the residuals of the model.
mu
: a 1 x length(y)
matrix with the final values of the conditional mean .
deviance
:
bic
: Bayesian Information Criterion.
x_resid
: matrix of demeaned regressors.
z_resid
: vector of demeaned (transformed) dependent variable.
se
: standard errors of the coefficients.
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ] # Now generate the needed x, y and fes objects: y <- trade$export x <- data.matrix(trade[, -1:-6]) fes <- list(exp_time = interaction(trade$exp, trade$time), imp_time = interaction(trade$imp, trade$time), pair = interaction(trade$exp, trade$imp)) # Finally, the call to hdfeppml_int: reg <- hdfeppml_int(y = y, x = x, fes = fes) ## End(Not run)
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ] # Now generate the needed x, y and fes objects: y <- trade$export x <- data.matrix(trade[, -1:-6]) fes <- list(exp_time = interaction(trade$exp, trade$time), imp_time = interaction(trade$imp, trade$time), pair = interaction(trade$exp, trade$imp)) # Finally, the call to hdfeppml_int: reg <- hdfeppml_int(y = y, x = x, fes = fes) ## End(Not run)
A function performs standard plugin lasso PPML estimation (without fixed effects) for several dependent variables in a single step. This is still IN DEVELOPMENT: at the current stage, only coefficient estimates are are provided and there is no support for clustered errors.
iceberg(data, dep, indep = NULL, selectobs = NULL, ...)
iceberg(data, dep, indep = NULL, selectobs = NULL, ...)
data |
A data frame containing all relevant variables. |
dep |
A string with the names of the independent variables or their column numbers. |
indep |
A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
selectobs |
Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
... |
Further arguments, including:
|
This functions enables users to implement the "iceberg" step in the two-step procedure described in
Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2020). To do this after using the plugin
method in mlfitppml
, just select all the variables with non-zero coefficients in
dep
and the remaining regressors in indep
. The function will then perform separate
lasso estimation on each of the selected dependent variables and report the coefficients.
A matrix with coefficient estimates for all dependent variables.
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
iceberg_results <- iceberg(data = trade[, -(1:6)], dep = c("ad_prov_14", "cp_prov_23", "tbt_prov_07", "tbt_prov_33", "tf_prov_41", "tf_prov_45"), selectobs = (trade$time == "2016"))
iceberg_results <- iceberg(data = trade[, -(1:6)], dep = c("ad_prov_14", "cp_prov_23", "tbt_prov_07", "tbt_prov_33", "tf_prov_41", "tf_prov_45"), selectobs = (trade$time == "2016"))
Compute a large number of outer products (useful for clustered SEs) using C++.
manyouter(A, B, c)
manyouter(A, B, c)
A , B
|
Numeric vectors. |
c |
Integer. |
mlfitppml
is a general-purpose wrapper function for penalized PPML estimation. This is a
flexible tool that allows users to select:
Penalty type: either lasso or ridge.
Penalty parameter: users can provide a single global value for lambda (a single regression is estimated), a vector of lambda values (the function estimates the regression using each of them, sequentially) or even coefficient-specific penalty weights.
Method: plugin lasso estimates can be obtained directly from this function too.
Cross-validation: if this option is enabled, the function uses IDs provided by the user to perform k-fold cross-validation and reports the resulting RMSE for all lambda values.
mlfitppml( data, dep = 1, indep = NULL, fixed = NULL, cluster = NULL, selectobs = NULL, ... )
mlfitppml( data, dep = 1, indep = NULL, fixed = NULL, cluster = NULL, selectobs = NULL, ... )
data |
A data frame containing all relevant variables. |
dep |
A string with the name of the independent variable or a column number. |
indep |
A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
fixed |
A vector with the names or column numbers of factor variables identifying the fixed effects,
or a list with the desired interactions between variables in |
cluster |
Optional. A string with the name of the clustering variable or a column number. It's also possible to input a vector with several variables, in which case the interaction of all of them is taken as the clustering variable. |
selectobs |
Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
... |
Further arguments, including:
For a full list of options, see mlfitppml_int. |
This function is a thin wrapper around mlfitppml_int
, providing a more convenient interface for
data frames. Whereas the internal function requires some preliminary handling of data sets (y
must be a vector, x
must be a matrix and fes
must be provided in a list), the wrapper
takes a full data frame in the data
argument, and users can simply specify which variables
correspond to y, x and the fixed effects, using either variable names or column numbers.
For technical details on the algorithms used, see hdfeppml (post-lasso regression), penhdfeppml (standard penalized regression), penhdfeppml_cluster (plugin lasso), and xvalidate (cross-validation).
A list with the following elements:
beta
: if post = FALSE
, a length(lambdas)
x ncol(x)
matrix with
coefficient (beta) estimates from the penalized regressions. If post = TRUE
, this is
the matrix of coefficients from the post-penalty regressions.
beta_pre
: if post = TRUE
, a length(lambdas)
x ncol(x)
matrix with
coefficient (beta) estimates from the penalized regressions.
bic
: Bayesian Information Criterion.
lambdas
: vector of penalty parameters.
ses
: standard errors of the coefficients of the post-penalty regression. Note that
these are only provided when post = TRUE
.
rmse
: if xval = TRUE
, a matrix with the root mean squared error (RMSE - column 2)
for each value of lambda (column 1), obtained by cross-validation.
phi
: coefficient-specific penalty weights (only if method == "plugin"
).
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] # Now we can use our main functions on the reduced trade data set: test <- mlfitppml(data = trade[, -(5:6)], dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), selectobs = (trade$imp %in% americas) & (trade$exp %in% americas), lambdas = c(0.01, 0.001), tol = 1e-6, hdfetol = 1e-2) ## End(Not run)
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] # Now we can use our main functions on the reduced trade data set: test <- mlfitppml(data = trade[, -(5:6)], dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), selectobs = (trade$imp %in% americas) & (trade$exp %in% americas), lambdas = c(0.01, 0.001), tol = 1e-6, hdfetol = 1e-2) ## End(Not run)
mlfitppml_int
is the internal wrapper called by mlfitppml
for penalized PPML estimation.
This in turn calls penhdfeppml_int
, penhdfeppml_cluster_int
and hdfeppml_int
as needed. It takes a vector with the dependent variable, a regressor matrix and a set of fixed
effects (in list form: each element in the list should be a separate HDFE). This is a flexible tool
that allows users to select:
Penalty type: either lasso or ridge.
Penalty parameter: users can provide a single global value for lambda (a single regression is estimated), a vector of lambda values (the function estimates the regression using each of them, sequentially) or even coefficient-specific penalty weights.
Method: plugin lasso estimates can be obtained directly from this function too.
Cross-validation: if this option is enabled, the function uses IDs provided by the user to perform k-fold cross-validation and reports the resulting RMSE for all lambda values.
mlfitppml_int( y, x, fes, lambdas, penalty = "lasso", tol = 1e-08, hdfetol = 1e-04, colcheck = TRUE, colcheck_x = colcheck, colcheck_x_fes = colcheck, post = TRUE, cluster = NULL, method = "bic", IDs = 1:n, verbose = FALSE, xval = FALSE, standardize = TRUE, vcv = TRUE, phipost = TRUE, penweights = NULL, K = 15, gamma_val = NULL, mu = NULL )
mlfitppml_int( y, x, fes, lambdas, penalty = "lasso", tol = 1e-08, hdfetol = 1e-04, colcheck = TRUE, colcheck_x = colcheck, colcheck_x_fes = colcheck, post = TRUE, cluster = NULL, method = "bic", IDs = 1:n, verbose = FALSE, xval = FALSE, standardize = TRUE, vcv = TRUE, phipost = TRUE, penweights = NULL, K = 15, gamma_val = NULL, mu = NULL )
y |
Dependent variable (a vector) |
x |
Regressor matrix. |
fes |
List of fixed effects. |
lambdas |
Vector of penalty parameters. |
penalty |
A string indicating the penalty type. Currently supported: "lasso" and "ridge". |
tol |
Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol |
Tolerance parameter for the within-transformation step,
passed on to |
colcheck |
Logical. If |
colcheck_x |
Logical. If |
colcheck_x_fes |
Logical. If |
post |
Logical. If |
cluster |
Optional: a vector classifying observations into clusters (to use when calculating SEs). |
method |
The user can set this equal to "plugin" to perform the plugin algorithm with coefficient-specific penalty weights (see details). Otherwise, a single global penalty is used. |
IDs |
A vector of fold IDs for k-fold cross validation. If left unspecified, each observation is assigned to a different fold (warning: this is likely to be very resource-intensive). |
verbose |
Logical. If |
xval |
Logical. If |
standardize |
Logical. If |
vcv |
Logical. If |
phipost |
Logical. If |
penweights |
Optional: a vector of coefficient-specific penalties to use in plugin lasso when
|
K |
Maximum number of iterations for the plugin algorithm to converge. |
gamma_val |
Numerical value that determines the regularization threshold as defined in Belloni, Chernozhukov, Hansen, and Kozbur (2016). NULL default sets parameter to 0.1/log(n). |
mu |
A vector of initial values for mu that can be passed to the command. |
For technical details on the algorithms used, see hdfeppml_int (post-lasso regression), penhdfeppml_int (standard penalized regression), penhdfeppml_cluster_int (plugin lasso), and xvalidate (cross-validation).
A list with the following elements:
beta
: if post = FALSE
, a length(lambdas)
x ncol(x)
matrix with
coefficient (beta) estimates from the penalized regressions. If post = TRUE
, this is
the matrix of coefficients from the post-penalty regressions.
beta_pre
: if post = TRUE
, a length(lambdas)
x ncol(x)
matrix with
coefficient (beta) estimates from the penalized regressions.
bic
: Bayesian Information Criterion.
lambdas
: vector of penalty parameters.
ses
: standard errors of the coefficients of the post-penalty regression. Note that
these are only provided when post = TRUE
.
rmse
: if xval = TRUE
, a matrix with the root mean squared error (RMSE - column 2)
for each value of lambda (column 1), obtained by cross-validation.
phi
: coefficient-specific penalty weights (only if method == "plugin"
).
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
## Not run: # First, we need to transform the data (this is what mlfitppml handles internally). Start by # filtering the data set to keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ] # Now generate the needed x, y and fes objects: y <- trade$export x <- data.matrix(trade[, -1:-6]) fes <- list(exp_time = interaction(trade$exp, trade$time), imp_time = interaction(trade$imp, trade$time), pair = interaction(trade$exp, trade$imp)) # Finally, we try mlfitppml_int with a lasso penalty (the default) and two lambda values: reg <- mlfitppml_int(y = y, x = x, fes = fes, lambdas = c(0.1, 0.01)) # We can also try plugin lasso: \donttest{reg <- mlfitppml_int(y = y, x = x, fes = fes, cluster = fes$pair, method = "plugin")} # For an example with cross-validation, please see the vignette. ## End(Not run)
## Not run: # First, we need to transform the data (this is what mlfitppml handles internally). Start by # filtering the data set to keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ] # Now generate the needed x, y and fes objects: y <- trade$export x <- data.matrix(trade[, -1:-6]) fes <- list(exp_time = interaction(trade$exp, trade$time), imp_time = interaction(trade$imp, trade$time), pair = interaction(trade$exp, trade$imp)) # Finally, we try mlfitppml_int with a lasso penalty (the default) and two lambda values: reg <- mlfitppml_int(y = y, x = x, fes = fes, lambdas = c(0.1, 0.01)) # We can also try plugin lasso: \donttest{reg <- mlfitppml_int(y = y, x = x, fes = fes, cluster = fes$pair, method = "plugin")} # For an example with cross-validation, please see the vignette. ## End(Not run)
penhdfeppml
fits a penalized PPML regression for a given type of penalty and a given
value of the penalty parameter. The penalty can be either lasso or ridge, and the plugin method
can be enabled via the method
argument.
penhdfeppml( data, dep = 1, indep = NULL, fixed = NULL, cluster = NULL, selectobs = NULL, ... )
penhdfeppml( data, dep = 1, indep = NULL, fixed = NULL, cluster = NULL, selectobs = NULL, ... )
data |
A data frame containing all relevant variables. |
dep |
A string with the name of the independent variable or a column number. |
indep |
A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
fixed |
A vector with the names or column numbers of factor variables identifying the fixed effects,
or a list with the desired interactions between variables in |
cluster |
Optional. A string with the name of the clustering variable or a column number. It's also possible to input a vector with several variables, in which case the interaction of all of them is taken as the clustering variable. |
selectobs |
Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
... |
Further options, including:
For a full list of options, see penhdfeppml_int. |
This function is a thin wrapper around penhdfeppml_int, providing a more convenient interface
for data frames. Whereas the internal function requires some preliminary handling of data sets (y
must be a vector, x
must be a matrix and fes
must be provided in a list), the wrapper
takes a full data frame in the data
argument, and users can simply specify which variables
correspond to y, x and the fixed effects, using either variable names or column numbers.
More formally, penhdfeppml_int
performs iteratively re-weighted least squares (IRLS) on a
transformed model, as described in Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2021).
In each iteration, the function calculates the transformed dependent variable, partials out the fixed
effects (calling lfe::fhdwithin
) and then and then calls glmnet::glmnet
if the selected
penalty is lasso (the default). If the user has selected ridge, the analytical solution is instead
computed directly using fast C++ implementation.
For information on how the plugin lasso method works, see penhdfeppml_cluster.
If method == "lasso"
(the default), an object of class elnet
with the elements
described in glmnet, as well as:
mu
: a 1 x length(y)
matrix with the final values of the conditional mean .
deviance
.
bic
: Bayesian Information Criterion.
phi
: coefficient-specific penalty weights (only if method == "plugin"
.
x_resid
: matrix of demeaned regressors.
z_resid
: vector of demeaned (transformed) dependent variable.
If method == "ridge"
, a list with the following elements:
beta
: a 1 x ncol(x)
matrix with coefficient (beta) estimates.
mu
: a 1 x length(y)
matrix with the final values of the conditional mean .
deviance
.
bic
: Bayesian Information Criterion.
x_resid
: matrix of demeaned regressors.
z_resid
: vector of demeaned (transformed) dependent variable.
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] test <- penhdfeppml(data = trade[, -(5:6)], dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), lambda = 0.05, selectobs = (trade$imp %in% americas) & (trade$exp %in% americas)) ## End(Not run)
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] test <- penhdfeppml(data = trade[, -(5:6)], dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), lambda = 0.05, selectobs = (trade$imp %in% americas) & (trade$exp %in% americas)) ## End(Not run)
Performs plugin lasso - PPML estimation with HDFE. This is an internal function, called by
mlfitppml
and penhdfeppml
when users select the method = "plugin"
option, but it's made available as a stand-alone option for advanced users who may prefer to avoid
some overhead imposed by the wrappers.
penhdfeppml_cluster( data, dep = 1, indep = NULL, fixed = NULL, cluster = NULL, selectobs = NULL, ... )
penhdfeppml_cluster( data, dep = 1, indep = NULL, fixed = NULL, cluster = NULL, selectobs = NULL, ... )
data |
A data frame containing all relevant variables. |
dep |
A string with the name of the independent variable or a column number. |
indep |
A vector with the names or column numbers of the regressors. If left unspecified, all remaining variables (excluding fixed effects) are included in the regressor matrix. |
fixed |
A vector with the names or column numbers of factor variables identifying the fixed effects,
or a list with the desired interactions between variables in |
cluster |
A string with the name of the clustering variable or a column number. It's also possible to input a vector with several variables, in which case the interaction of all of them is taken as the clustering variable. Note that this is NOT OPTIONAL in this case: our plugin algorithm requires clusters to be specified. |
selectobs |
Optional. A vector indicating which observations to use (either a logical vector or a numeric vector with row numbers, as usual when subsetting in R). |
... |
Further options. For a full list of options, see penhdfeppml_cluster_int. |
This function is a thin wrapper around penppml_cluster_int
, providing a more convenient interface
for data frames. Whereas the internal function requires some preliminary handling of data sets (y
must be a vector, x
must be a matrix and fes
must be provided in a list), the wrapper
takes a full data frame in the data
argument, and users can simply specify which variables
correspond to y, x and the fixed effects, using either variable names or column numbers.
The plugin method uses coefficient-specific penalty weights that account for heteroskedasticity. The penalty parameters are calculated automatically by the function using statistical theory - for a brief discussion of this, see Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2021), and for a more in-depth analysis, check Belloni, Chernozhukov, Hansen, and Kozbur (2016), which introduced the specific implementation used in this package. Heuristically, the penalty parameters are set at a level high enough so that the absolute value of the score for each regressor must be statistically large relative to its standard error in order for the regressors to be selected.
An object of class elnet
with the elements described in glmnet, as
well as the following:
mu
: a 1 x length(y)
matrix with the final values of the conditional mean .
deviance
.
bic
: Bayesian Information Criterion.
phi
: coefficient-specific penalty weights.
x_resid
: matrix of demeaned regressors.
z_resid
: vector of demeaned (transformed) dependent variable.
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] test <- penhdfeppml_cluster(data = trade[, -(5:6)], dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), cluster = c("exp", "imp"), selectobs = (trade$imp %in% americas) & (trade$exp %in% americas), tol = 1e-5, hdfetol = 1e-1) ## End(Not run)
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] test <- penhdfeppml_cluster(data = trade[, -(5:6)], dep = "export", fixed = list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), cluster = c("exp", "imp"), selectobs = (trade$imp %in% americas) & (trade$exp %in% americas), tol = 1e-5, hdfetol = 1e-1) ## End(Not run)
Performs plugin lasso - PPML estimation with HDFE. This is an internal function, called by mlfitppml_int
and
penhdfeppml_int
when users select the method = "plugin"
option, but it's made available
as a stand-alone option for advanced users who may prefer to avoid some overhead imposed by the
wrappers.
penhdfeppml_cluster_int( y, x, fes, cluster, tol = 1e-08, hdfetol = 1e-04, glmnettol = 1e-12, penalty = "lasso", penweights = NULL, saveX = TRUE, mu = NULL, colcheck = TRUE, colcheck_x = colcheck, colcheck_x_fes = colcheck, K = 15, init_z = NULL, post = FALSE, verbose = FALSE, lambda = NULL, phipost = TRUE, gamma_val = NULL )
penhdfeppml_cluster_int( y, x, fes, cluster, tol = 1e-08, hdfetol = 1e-04, glmnettol = 1e-12, penalty = "lasso", penweights = NULL, saveX = TRUE, mu = NULL, colcheck = TRUE, colcheck_x = colcheck, colcheck_x_fes = colcheck, K = 15, init_z = NULL, post = FALSE, verbose = FALSE, lambda = NULL, phipost = TRUE, gamma_val = NULL )
y |
Dependent variable (a vector) |
x |
Regressor matrix. |
fes |
List of fixed effects. |
cluster |
Optional: a vector classifying observations into clusters (to use when calculating SEs). |
tol |
Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol |
Tolerance parameter for the within-transformation step,
passed on to |
glmnettol |
Tolerance parameter to be passed on to |
penalty |
Only "lasso" is supported at the present stage. |
penweights |
Optional: a vector of coefficient-specific penalties to use in plugin lasso when
|
saveX |
Logical. If |
mu |
A vector of initial values for mu that can be passed to the command. |
colcheck |
Logical. If |
colcheck_x |
Logical. If |
colcheck_x_fes |
Logical. If |
K |
Maximum number of iterations. |
init_z |
Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm. |
post |
Logical. If |
verbose |
Logical. If |
lambda |
Penalty parameter (a number). |
phipost |
Logical. If |
gamma_val |
Numerical value that determines the regularization threshold as defined in Belloni, Chernozhukov, Hansen, and Kozbur (2016). NULL default sets parameter to 0.1/log(n). |
The plugin method uses coefficient-specific penalty weights that account for heteroskedasticity. The penalty parameters are calculated automatically by the function using statistical theory - for a brief discussion of this, see Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2021), and for a more in-depth analysis, check Belloni, Chernozhukov, Hansen, and Kozbur (2016), which introduced the specific implementation used in this package. Heuristically, the penalty parameters are set at a level high enough so that the absolute value of the score for each regressor must be statistically large relative to its standard error in order for the regressors to be selected.
An object of class elnet
with the elements described in glmnet, as
well as the following:
mu
: a 1 x length(y)
matrix with the final values of the conditional mean .
deviance
.
bic
: Bayesian Information Criterion.
phi
: coefficient-specific penalty weights.
x_resid
: matrix of demeaned regressors.
z_resid
: vector of demeaned (transformed) dependent variable.
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
## Not run: # To reduce run time, we keep only countries in Latin America and the Caribbean: LatAmericaCar <- countries$iso[countries$sub.region == "Latin America and the Caribbean"] trade <- trade[(trade$imp %in% LatAmericaCar) & (trade$exp %in% LatAmericaCar), ] # Now generate the needed x, y and fes objects: y <- trade$export x <- data.matrix(trade[, -1:-6]) fes <- list(exp_time = interaction(trade$exp, trade$time), imp_time = interaction(trade$imp, trade$time), pair = interaction(trade$exp, trade$imp)) # Finally, we try penhdfeppml_cluster_int: reg <- penhdfeppml_cluster_int(y = y, x = x, fes = fes, cluster = fes$pair) ## End(Not run)
## Not run: # To reduce run time, we keep only countries in Latin America and the Caribbean: LatAmericaCar <- countries$iso[countries$sub.region == "Latin America and the Caribbean"] trade <- trade[(trade$imp %in% LatAmericaCar) & (trade$exp %in% LatAmericaCar), ] # Now generate the needed x, y and fes objects: y <- trade$export x <- data.matrix(trade[, -1:-6]) fes <- list(exp_time = interaction(trade$exp, trade$time), imp_time = interaction(trade$imp, trade$time), pair = interaction(trade$exp, trade$imp)) # Finally, we try penhdfeppml_cluster_int: reg <- penhdfeppml_cluster_int(y = y, x = x, fes = fes, cluster = fes$pair) ## End(Not run)
penhdfeppml_int
is the internal algorithm called by penhdfeppml
to fit a penalized PPML
regression for a given type of penalty and a given value of the penalty parameter. It takes a vector
with the dependent variable, a regressor matrix and a set of fixed effects (in list form: each element
in the list should be a separate HDFE). The penalty can be either lasso or ridge, and the plugin
method can be enabled via the method
argument.
penhdfeppml_int( y, x, fes, lambda, tol = 1e-08, hdfetol = 1e-04, glmnettol = 1e-12, penalty = "lasso", penweights = NULL, saveX = TRUE, mu = NULL, colcheck = TRUE, colcheck_x = colcheck, colcheck_x_fes = colcheck, init_z = NULL, post = FALSE, verbose = FALSE, phipost = TRUE, standardize = TRUE, method = "placeholder", cluster = NULL, debug = FALSE, gamma_val = NULL )
penhdfeppml_int( y, x, fes, lambda, tol = 1e-08, hdfetol = 1e-04, glmnettol = 1e-12, penalty = "lasso", penweights = NULL, saveX = TRUE, mu = NULL, colcheck = TRUE, colcheck_x = colcheck, colcheck_x_fes = colcheck, init_z = NULL, post = FALSE, verbose = FALSE, phipost = TRUE, standardize = TRUE, method = "placeholder", cluster = NULL, debug = FALSE, gamma_val = NULL )
y |
Dependent variable (a vector) |
x |
Regressor matrix. |
fes |
List of fixed effects. |
lambda |
Penalty parameter (a number). |
tol |
Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol |
Tolerance parameter for the within-transformation step,
passed on to |
glmnettol |
Tolerance parameter to be passed on to |
penalty |
A string indicating the penalty type. Currently supported: "lasso" and "ridge". |
penweights |
Optional: a vector of coefficient-specific penalties to use in plugin lasso when
|
saveX |
Logical. If |
mu |
A vector of initial values for mu that can be passed to the command. |
colcheck |
Logical. If |
colcheck_x |
Logical. If |
colcheck_x_fes |
Logical. If |
init_z |
Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm. |
post |
Logical. If |
verbose |
Logical. If |
phipost |
Logical. If |
standardize |
Logical. If |
method |
The user can set this equal to "plugin" to perform the plugin algorithm with coefficient-specific penalty weights (see details). Otherwise, a single global penalty is used. |
cluster |
Optional: a vector classifying observations into clusters (to use when calculating SEs). |
debug |
Logical. If |
gamma_val |
Numerical value that determines the regularization threshold as defined in Belloni, Chernozhukov, Hansen, and Kozbur (2016). NULL default sets parameter to 0.1/log(n). |
More formally, penhdfeppml_int
performs iteratively re-weighted least squares (IRLS) on a
transformed model, as described in Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin (2020).
In each iteration, the function calculates the transformed dependent variable, partials out the fixed
effects (calling collapse::fhdwithin
) and then and then calls glmnet
if the selected
penalty is lasso (the default). If the user selects ridge, the analytical solution is instead
computed directly using fast C++ implementation.
For information on the plugin lasso method, see penhdfeppml_cluster_int.
If method == "lasso"
(the default), an object of class elnet
with the elements
described in glmnet, as well as:
mu
: a 1 x length(y)
matrix with the final values of the conditional mean .
deviance
.
bic
: Bayesian Information Criterion.
phi
: coefficient-specific penalty weights (only if method == "plugin"
.
x_resid
: matrix of demeaned regressors.
z_resid
: vector of demeaned (transformed) dependent variable.
If method == "ridge"
, a list with the following elements:
beta
: a 1 x ncol(x)
matrix with coefficient (beta) estimates.
mu
: a 1 x length(y)
matrix with the final values of the conditional mean .
deviance
.
bic
: Bayesian Information Criterion.
x_resid
: matrix of demeaned regressors.
z_resid
: vector of demeaned (transformed) dependent variable.
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ] # Now generate the needed x, y and fes objects: y <- trade$export x <- data.matrix(trade[, -1:-6]) fes <- list(exp_time = interaction(trade$exp, trade$time), imp_time = interaction(trade$imp, trade$time), pair = interaction(trade$exp, trade$imp)) # Finally, we try penhdfeppml_int with a lasso penalty (the default): reg <- penhdfeppml_int(y = y, x = x, fes = fes, lambda = 0.1) # We can also try ridge: \donttest{reg <- penhdfeppml_int(y = y, x = x, fes = fes, lambda = 0.1, penalty = "ridge")} ## End(Not run)
## Not run: # To reduce run time, we keep only countries in the Americas: americas <- countries$iso[countries$region == "Americas"] trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ] # Now generate the needed x, y and fes objects: y <- trade$export x <- data.matrix(trade[, -1:-6]) fes <- list(exp_time = interaction(trade$exp, trade$time), imp_time = interaction(trade$imp, trade$time), pair = interaction(trade$exp, trade$imp)) # Finally, we try penhdfeppml_int with a lasso penalty (the default): reg <- penhdfeppml_int(y = y, x = x, fes = fes, lambda = 0.1) # We can also try ridge: \donttest{reg <- penhdfeppml_int(y = y, x = x, fes = fes, lambda = 0.1, penalty = "ridge")} ## End(Not run)
This is the internal function upon which the iceberg
wrapper is built. It performs standard
plugin lasso PPML estimation without fixed effects, relying on glmnet::glmnet
. As the other
internals in the package, it needs a y vector and an x matrix.
plugin_lasso_int( y, x, tol = 1e-08, glmnettol = 1e-12, penweights = NULL, colcheck = FALSE, K = 50, verbose = FALSE, lambda = NULL, icepost = FALSE )
plugin_lasso_int( y, x, tol = 1e-08, glmnettol = 1e-12, penweights = NULL, colcheck = FALSE, K = 50, verbose = FALSE, lambda = NULL, icepost = FALSE )
y |
Dependent variable (a vector). |
x |
Regressor matrix. |
tol |
Tolerance parameter for convergence of the IRLS algorithm. |
glmnettol |
Tolerance parameter to be passed on to |
penweights |
Optional: a vector of coefficient-specific penalties to use in plugin lasso. |
colcheck |
Logical. If |
K |
Maximum number of iterations. |
verbose |
Logical. If |
lambda |
Penalty parameter (a number). |
icepost |
Logical. If |
A list with 14 elements, including beta
, which is the only one we use in the wrapper.
For a full list, see glmnet.
A helper function for xvalidate
that filters a list of fixed effects and returns the modified
list. Used to split the fixed effects for cross-validation.
select_fes(fe_list, select_obs, list = TRUE)
select_fes(fe_list, select_obs, list = TRUE)
fe_list |
A list of fixed effects. |
select_obs |
A vector of selected observations / rows. |
list |
Logical. If |
A modified list of fixed effects.
Performs weighted standardization of x variables. Used in fastridge
.
standardize_wt(x, weights = rep(1/n, n), intercept = TRUE, return.sd = FALSE)
standardize_wt(x, weights = rep(1/n, n), intercept = TRUE, return.sd = FALSE)
x |
Regressor matrix. |
weights |
Weights. |
intercept |
Logical. If |
return.sd |
Logical. If |
If return.sd == FALSE
, it gives the matrix of standardized regressors. If
return.sd == TRUE
, then it returns the vector of standard errors of the means of the
variables.
A panel data set containing bilateral trade flows between 210 exporters and 262 importers between 1964 and 2016. The data set also contains information about trade agreements in force between country pairs, as well as 16 dummies for specific provisions in those agreements (a small selection from a broader data set).
trade
trade
A data frame with 194,092 rows and 22 variables:
Exporter country (ISO 3166 code)
Importer country (ISO 3166 code).
Year.
Merchandise trade exports in USD.
Agreement ID code.
Agreement name.
Anti-dumping actions allowed and with specific provisions for material injury.
Does the agreement contain provisions that promote transparency?
Technical Regulations - Is the use of international standards promoted?
Does the agreement go beyond the TBT (Technical Barriers to Trade) Agreement?
Harmonization and common legal framework
Issuance of proof of origin
Does the agreement contain a standstill provision?
Does the agreement grant Fair and Equitable Treatment (FET)?
Prohibits export-related performance requirements, subject to exemptions.
Stipulates that GIs can be registered and protected through a TM system
Does the agreement require states to control ozone-depleting substances?
Incorporates/reaffirms all multilateral agreements to which both parties are a party (general obligation)
Does the transfer provision explicitly exclude “good faith and non-discriminatory application of its laws” related to bankruptcy, insolvency or creditor rights protection?
Does the agreement regulate subsidization to state enterprises?
Does the agreement include reference to internationally recognized labor standards?
Does the agreement regulate consumer protection?
Data on international trade flows was obtained from Comtrade. Provision data comes from: Mattoo, A., N. Rocha, M. Ruta (2020). Handbook of deep trade agreements. Washington, DC: World Bank.
Given matrix ee' and matrix X, compute X(k)'ee'X(k) for each regressor X.
xeex(X, e, S)
xeex(X, e, S)
X |
Regressor matrix. |
e |
Residuals. |
S |
Cluster sizes. |
The matrix product X(k)'ee'X(k).
This is the internal function called by mlfitppml_int
to perform cross-validation, if the
option is enabled. It is available also on a stand-alone basis in case it is needed, but generally
users will be better served by using the wrapper mlfitppml
.
xvalidate( y, x, fes, IDs, testID = NULL, tol = 1e-08, hdfetol = 1e-04, colcheck_x = TRUE, colcheck_x_fes = TRUE, init_mu = NULL, init_x = NULL, init_z = NULL, verbose = FALSE, cluster = NULL, penalty = "lasso", method = "placeholder", standardize = TRUE, penweights = rep(1, ncol(x_reg)), lambda = 0 )
xvalidate( y, x, fes, IDs, testID = NULL, tol = 1e-08, hdfetol = 1e-04, colcheck_x = TRUE, colcheck_x_fes = TRUE, init_mu = NULL, init_x = NULL, init_z = NULL, verbose = FALSE, cluster = NULL, penalty = "lasso", method = "placeholder", standardize = TRUE, penweights = rep(1, ncol(x_reg)), lambda = 0 )
y |
Dependent variable (a vector) |
x |
Regressor matrix. |
fes |
List of fixed effects. |
IDs |
A vector of fold IDs for k-fold cross validation. If left unspecified, each observation is assigned to a different fold (warning: this is likely to be very resource-intensive). |
testID |
Optional. A number indicating which ID to hold out during cross-validation. If left unspecified, the function cycles through all IDs and reports the average RMSE. |
tol |
Tolerance parameter for convergence of the IRLS algorithm. |
hdfetol |
Tolerance parameter for the within-transformation step,
passed on to |
colcheck_x |
Logical. If |
colcheck_x_fes |
Logical. If |
init_mu |
Optional: initial values of the conditional mean |
init_x |
Optional: initial values of the independent variables. |
init_z |
Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm. |
verbose |
Logical. If |
cluster |
Optional: a vector classifying observations into clusters (to use when calculating SEs). |
penalty |
A string indicating the penalty type. Currently supported: "lasso" and "ridge". |
method |
The user can set this equal to "plugin" to perform the plugin algorithm with coefficient-specific penalty weights (see details). Otherwise, a single global penalty is used. |
standardize |
Logical. If |
penweights |
Optional: a vector of coefficient-specific penalties to use in plugin lasso when
|
lambda |
Penalty parameter, to be passed on to penhdfeppml_int or penhdfeppml_cluster_int. |
xvalidate
carries out cross-validation with the user-provided IDs by holding out each one of
them, sequentially, as in the k-fold procedure (unless testID
is specified, in which case
it just uses this ID for validation). After filtering out the holdout sample, the function simply
calls penhdfeppml_int and penhdfeppml_cluster_int to estimate the coefficients, it
predicts the conditional means for the held-out observations and finally it calculates the root mean
squared error (RMSE).
A list with two elements:
rmse
: root mean squared error (RMSE).
mu
: conditional means.
Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.
Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", STATA Journal, 20, 90-115.
Gaure, S (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.
Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.
# First, we need to transform the data. Start by filtering the data set to keep only countries in # the Americas: americas <- countries$iso[countries$region == "Americas"] trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ] # Now generate the needed x, y and fes objects: y <- trade$export x <- data.matrix(trade[, -1:-6]) fes <- list(exp_time = interaction(trade$exp, trade$time), imp_time = interaction(trade$imp, trade$time), pair = interaction(trade$exp, trade$imp)) # We also need to create the IDs. We split the data set by agreement, not observation: id <- unique(trade[, 5]) nfolds <- 10 unique_ids <- data.frame(id = id, fold = sample(1:nfolds, size = length(id), replace = TRUE)) cross_ids <- merge(trade[, 5, drop = FALSE], unique_ids, by = "id", all.x = TRUE) # Finally, we try xvalidate with a lasso penalty (the default) and two lambda values: ## Not run: reg <- xvalidate(y = y, x = x, fes = fes, lambda = 0.001, IDs = cross_ids$fold, verbose = TRUE) ## End(Not run)
# First, we need to transform the data. Start by filtering the data set to keep only countries in # the Americas: americas <- countries$iso[countries$region == "Americas"] trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ] # Now generate the needed x, y and fes objects: y <- trade$export x <- data.matrix(trade[, -1:-6]) fes <- list(exp_time = interaction(trade$exp, trade$time), imp_time = interaction(trade$imp, trade$time), pair = interaction(trade$exp, trade$imp)) # We also need to create the IDs. We split the data set by agreement, not observation: id <- unique(trade[, 5]) nfolds <- 10 unique_ids <- data.frame(id = id, fold = sample(1:nfolds, size = length(id), replace = TRUE)) cross_ids <- merge(trade[, 5, drop = FALSE], unique_ids, by = "id", all.x = TRUE) # Finally, we try xvalidate with a lasso penalty (the default) and two lambda values: ## Not run: reg <- xvalidate(y = y, x = x, fes = fes, lambda = 0.001, IDs = cross_ids$fold, verbose = TRUE) ## End(Not run)