Title: | Residual Randomization Inference for Regression Models |
---|---|
Description: | Testing and inference for regression models using residual randomization methods. The basis of inference is an invariance assumption on the regression errors, e.g., clustered errors, or doubly-clustered errors. |
Authors: | Panos Toulis |
Maintainer: | Panos Toulis <[email protected]> |
License: | GPL-2 |
Version: | 1.1 |
Built: | 2024-12-01 08:03:29 UTC |
Source: | CRAN |
model
is valid.Checks whether the input model
is valid.
check_model(model)
check_model(model)
model |
A |
clustering
object. A clustering is a List
that
splits indexes 1..#num_datapoints to clusters. Each List
element corresponds to one cluster.
The clustering is not necessarily a partition but it usually is.An example clustering
object. A clustering is a List
that
splits indexes 1..#num_datapoints to clusters. Each List
element corresponds to one cluster.
The clustering is not necessarily a partition but it usually is.
example_clustering()
example_clustering()
A List
for the clustering of indexes 1..#num_datapoints.
Example regression model and H0.
example_model(n = 100)
example_model(n = 100)
n |
Number of datapoints. |
List of (y, X, lam, lam0) that corresponds to regression model and null hypothesis:
y = n-length vector of outcomes
X = n x p covariate matrix;
lam = p-vector of coefficients
lam0 = real number.
The null we are testing through this specification is
H0: lam' beta = lam[1] * beta[1] + ... + lam[p] * beta[p] = lam0,
where beta are the model parameters in the regression, y = X beta + e. By default this example sets p = 2-dim model, lam = (0, 1) and lam0 = 0. In this specification, H0: beta[2] = 0.
model = example_model() lm(model$y ~ model$X + 0)
model = example_model() lm(model$y ~ model$X + 0)
This functions fits the regression y ~ X using Armadillo solve
function.
fastLm(y, X)
fastLm(y, X)
y |
Vector of outcomes. |
X |
Matrix of covariates (first column should be 1's) |
List
of regression output with elements coef
, stderr
.
Given regression model
and clustering
, this function calculates the OLS residuals
under the linear null hypothesis, and assigns them to the specified clusters.
get_clustered_eps(model, clustering)
get_clustered_eps(model, clustering)
model |
A regression |
clustering |
A |
A List
of the restricted residuals clustered according to clustering
.
m = example_model(n=100) cl = list(1:50, 51:100) er = get_clustered_eps(m, cl) stopifnot(length(er) == length(cl)) stopifnot(length(er[[1]]) == 50)
m = example_model(n=100) cl = list(1:50, 51:100) er = get_clustered_eps(m, cl) stopifnot(length(er) == length(cl)) stopifnot(length(er[[1]]) == 50)
Fast OLS as in fastLm but returns only the fitted coefficients.
OLS_c(y, X)
OLS_c(y, X)
y |
Vector of outcomes. |
X |
Matrix of covariates (first column should be 1's) |
Vector of coefficients.
Decides to reject or not based on observed test statistic value tobs
and randomization values tvals
.
one_sided_test(tobs, tvals, alpha, tol = 1e-14)
one_sided_test(tobs, tvals, alpha, tol = 1e-14)
tobs |
The observed value of the test statistic (scalar). |
tvals |
Vector of randomization values of the test statistic (to compare with |
alpha |
Desired level of the test (between 0 to 1). |
tol |
Used to check whether |
The test may randomize to achieve the specified level alpha
when there are very few randomization values.
Test decision (binary).
Testing Statistical Hypotheses (Ch. 15, Lehman and Romano, 2006)
Depending on ret_pval
this function returns either a p-value for the test or the binary decision.
out_pval(rtest_out, ret_pval, alpha)
out_pval(rtest_out, ret_pval, alpha)
rtest_out |
A |
ret_pval |
A |
alpha |
Desired test level (from 0 to 1). |
Returns 1 if the test rejects, 0 otherwise.
Binary decision if ret_pval
is TRUE, or the p-value otherwise.
Implements the residual randomization test. The hypothesis tested is
r_test_c(y, X, lam, lam0, cluster_eps_r, use_perm, use_sign, num_R)
r_test_c(y, X, lam, lam0, cluster_eps_r, use_perm, use_sign, num_R)
y |
Vector of outcomes (n x 1). |
X |
Matrix of covariates (n x p). First column should be 1's. |
lam |
Vector of coefficients in linear H0 (p x 1). |
lam0 |
Scalar value for linear H0. |
cluster_eps_r |
A |
use_perm |
|
use_sign |
|
num_R |
Integer of how many randomization values to calculate. |
H0: lam' beta = lam[1] * beta[1] + ... + lam[p] * beta[p] = lam0.
A List
with the observed test statistic value (tobs
), and the randomization values (tvals
)
This functions fits the regression y ~ X under a linear constraint on the
model parameters. The constraint is Q
* beta = c
where beta
are the regression model parameters, and Q, c
are inputs.
restricted_OLS_c(y, X, bhat, Q, c)
restricted_OLS_c(y, X, bhat, Q, c)
y |
Vector of outcomes. |
X |
Matrix of covariates (first column should be 1's) |
bhat |
Unconstrained OLS-fitted coefficients. |
Q |
Matrix of linear constraints (k x p). |
c |
Vector of constraint values (k x 1). |
Vector of fitted OLS coefficients under linear constraint.
Advanced Econometrics (Section 1.4, Takeshi Amemiya, 1985)
This function is a wrapper over rrtest and gives confidence intervals for all parameters.
rrinf( y, X, g_invar, cover = 0.95, num_R = 999, control = list(num_se = 6, num_breaks = 60) )
rrinf( y, X, g_invar, cover = 0.95, num_R = 999, control = list(num_se = 6, num_breaks = 60) )
y |
Vector of outcomes (length n) |
X |
Covariate matrix (n x p). First column should be ones to include intercept. |
g_invar |
Function that transforms residuals. Accepts n-vector and returns n-vector. |
cover |
Number from [0, 1] that denotes the confidence interval coverage (e.g., 0.95 denotes 95%) |
num_R |
Number of test statistic values to calculate in the randomization test (similar to no. of bootstrap samples). |
control |
A |
This function has similar funtionality as standard confint. It generates confidence intervals by testing plausible values for each parameter. The plausible values are generated as follows. For some parameter beta_i we test successively
H0: beta_i = hat_beta_i - num_se
* se_i
...up to...
H0: beta_i = hat_beta_i + num_se
* se_i
broken in num_breaks
intervals. Here, hat_beta_i is the OLS estimate of beta_i and se_i is the standard error.
We then report the minimum and maximum values in this search space which we cannot reject
at level alpha
. This forms the desired confidence interval.
Matrix that includes the confidence interval endpoints, and the interval midpoint estimate.
If the confidence interval appears to be a point or is empty, then this means
that the nulls we consider are implausible.
We can try to improve the search through control.tinv
.
For example, we can both increase num_se
to increase the width of search,
and increase num_breaks
to make the search space finer.
Life after bootstrap: residual randomization inference in regression models (Toulis, 2019)
https://www.ptoulis.com/residual-randomization
set.seed(123) X = cbind(rep(1, 100), runif(100)) beta = c(-1, 1) y = X %*% beta + rnorm(100) g_invar = function(e) sample(e) # Assume exchangeable errors. M = rrinf(y, X, g_invar, control=list(num_se=4, num_breaks=20)) M # Intervals cover true values
set.seed(123) X = cbind(rep(1, 100), runif(100)) beta = c(-1, 1) y = X %*% beta + rnorm(100) g_invar = function(e) sample(e) # Assume exchangeable errors. M = rrinf(y, X, g_invar, control=list(num_se=4, num_breaks=20)) M # Intervals cover true values
This function is a wrapper over rrtest_clust and gives confidence intervals for all parameters assuming a particular cluster invariance on the errors.
rrinf_clust( y, X, type, clustering = NULL, cover = 0.95, num_R = 999, control = list(num_se = 6, num_breaks = 60) )
rrinf_clust( y, X, type, clustering = NULL, cover = 0.95, num_R = 999, control = list(num_se = 6, num_breaks = 60) )
y |
Vector of outcomes (length n) |
X |
Covariate matrix (n x p). First column should be ones to include intercept. |
type |
A string, either "perm", "sign" or "double". |
clustering |
A |
cover |
Number from [0, 1] that denotes the confidence interval coverage (e.g., 0.95 denotes 95%) |
num_R |
Number of test statistic values to calculate in the randomization test (similar to no. of bootstrap samples). |
control |
A |
This function has similar funtionality as standard confint. It generates confidence intervals by testing plausible values for each parameter. The plausible values are generated as follows. For some parameter beta_i we test successively
H0: beta_i = hat_beta_i - num_se
* se_i
...up to...
H0: beta_i = hat_beta_i + num_se
* se_i
broken in num_breaks
intervals. Here, hat_beta_i is the OLS estimate of beta_i and se_i is the standard error.
We then report the minimum and maximum values in this search space which we cannot reject
at level alpha
. This forms the desired confidence interval.
Matrix that includes the OLS estimate, and confidence interval endpoints.
If the confidence interval appears to be a point or is empty, then this means
that the nulls we consider are implausible.
We can try to improve the search through control.tinv
.
For example, we can both increase num_se
to increase the width of search,
and increase num_breaks
to make the search space finer.
See rrtest_clust for a description of type
and clustering
.
Life after bootstrap: residual randomization inference in regression models (Toulis, 2019)
https://www.ptoulis.com/residual-randomization
# Heterogeneous example set.seed(123) n = 200 X = cbind(rep(1, n), 1:n/n) beta = c(-1, 0.2) ind = c(rep(0, 0.9*n), rep(1, .1*n)) # cluster indicator y = X %*% beta + rnorm(n, sd= (1-ind) * 0.1 + ind * 5) # heteroskedastic confint(lm(y ~ X + 0)) # normal OLS CI is imprecise cl = list(which(ind==0), which(ind==1)) # define the clustering rrinf_clust(y, X, "perm", cl) # improved CI through clustered errors
# Heterogeneous example set.seed(123) n = 200 X = cbind(rep(1, n), 1:n/n) beta = c(-1, 0.2) ind = c(rep(0, 0.9*n), rep(1, .1*n)) # cluster indicator y = X %*% beta + rnorm(n, sd= (1-ind) * 0.1 + ind * 5) # heteroskedastic confint(lm(y ~ X + 0)) # normal OLS CI is imprecise cl = list(which(ind==0), which(ind==1)) # define the clustering rrinf_clust(y, X, "perm", cl) # improved CI through clustered errors
Generic residual randomization inference This function provides the basis for all other rrinf* functions.
rrinfBase(y, X, g_or_clust, cover, num_R, control.tinv)
rrinfBase(y, X, g_or_clust, cover, num_R, control.tinv)
y |
Vector of outcomes (length n) |
X |
Covariate matrix (n x p). First column should be ones to include intercept. |
g_or_clust |
Either |
cover |
Number from [0, 1] that denotes the confidence interval coverage (e.g., 0.95 denotes 95%) |
num_R |
Number of test statistic values to calculate in the randomization test (similar to no. of bootstrap samples). |
control.tinv |
A |
This function has similar funtionality as standard confint. It does so by testing plausible values for each parameter. The plausible values can be controlled as follows. For some parameter beta_i we will test successively
H0: beta_i = hat_beta_i - num_se
* se_i
...up to...
H0: beta_i = hat_beta_i + num_se
* se_i
broken in num_breaks
intervals. Here, hat_beta_i is the OLS estimate of beta_i and se_i is the standard error.
The g_or_clust
object should either be (i) a g-invariance function R^n -> R^n; or (ii)
a list(type, cl) where type=c("perm", "sign", "double") and cl=clustering
(see example_clustering for details).
https://www.ptoulis.com/residual-randomization
Matrix that includes the confidence interval endpoints, and the interval midpoint estimate.
This function tests the specified linear hypothesis in model
assuming the errors are distributionally invariant
with respect to stochastic function g_invar
.
rrtest(model, g_invar, num_R = 999, alpha = 0.05, val_type = "decision")
rrtest(model, g_invar, num_R = 999, alpha = 0.05, val_type = "decision")
model |
Regression model and hypothesis. See example_model for details. |
g_invar |
Stochastic function that transforms residuals. Accepts n-vector and returns n-vector. |
num_R |
Number of test statistic values to calculate in the randomization test. |
alpha |
Nominal test level (between 0 to 1). |
val_type |
The type of return value. |
For the regression y = X * beta + e, this function is testing the following linear null hypothesis:
H0: lam' beta = lam[1] * beta[1] + ... + lam[p] * beta[p] = lam0,
where y, X, lam, lam0 are specified in model
.
The assumption is that the errors, e, have some form of cluster invariance.
Specifically:
(e_1, e_2, ..., e_n) ~ g_invar(e_1, e_2, ..., e_n),
where ~ denotes equality in distribution, and g_invar
is the supplied
invariance function.
If val_type
= "decision" (default) we get the test binary decision (1=REJECT H0).
If val_type
= "pval" we get the test p-value.
If val_type
= "full" we get the full test output, i.e., a List
with elements tobs
, tvals
,
the observed and randomization values of the test statistic, respectively.
There is no guarantee that an arbitrary g_invar
will produce valid tests.
The rrtest_clust function has such guarantees under mild assumptions.
Life after bootstrap: residual randomization inference in regression models (Toulis, 2019)
https://www.ptoulis.com/residual-randomization
model = example_model(n = 100) # test H0: beta2 = 0 (here, H0 is true) g_invar = function(e) sample(e) # Assume errors are exchangeable. rrtest(model, g_invar) # same as rrtest_clust(model, "perm")
model = example_model(n = 100) # test H0: beta2 = 0 (here, H0 is true) g_invar = function(e) sample(e) # Assume errors are exchangeable. rrtest(model, g_invar) # same as rrtest_clust(model, "perm")
This function tests the specified linear hypothesis in model
assuming that the errors have some form of cluster invariance determined by type
within the clusters determined by clustering
.
rrtest_clust( model, type, clustering = NULL, num_R = 999, alpha = 0.05, val_type = "decision" )
rrtest_clust( model, type, clustering = NULL, num_R = 999, alpha = 0.05, val_type = "decision" )
model |
Regression model and hypothesis. See example_model for details. |
type |
A |
clustering |
A |
num_R |
Number of test statistic values to calculate in the test. |
alpha |
Nominal test level (between 0 to 1). |
val_type |
The type of return value. |
For the regression y = X * beta + e, this function is testing the following linear null hypothesis:
H0: lam' beta = lam[1] * beta[1] + ... + lam[p] * beta[p] = lam0,
where y, X, lam, lam0 are specified in model
.
The assumption is that the errors, e, have some form of cluster invariance.
Specifically:
If type
= "perm" then the errors are assumed exchangeable within the specified clusters:
(e_1, e_2, ..., e_n) ~ cluster_perm(e_1, e_2, ..., e_n),
where ~ denotes equality in distribution, and cluster_perm is any random permutation
within the clusters defined by clustering
. Internally, the test repeatedly calculates a test statistic
by randomly permuting the residuals within clusters.
If type
= "sign" then the errors are assumed sign-symmetric within the specified clusters:
(e_1, e_2, ..., e_n) ~ cluster_signs(e_1, e_2, ..., e_n),
where cluster_signs is a random signs flip of residuals on the cluster level. Internally, the test repeatedly calculates a test statistic by randomly flipping the signs of cluster residuals.
If type
= "double" then the errors are assumed both exchangeable and sign symmetric
within the specified clusters:
(e_1, e_2, ..., e_n) ~ cluster_signs(cluster_perm(e_1, e_2, ..., e_n)),
Internally, the test repeatedly calculates a test statistic by permuting and randomly flipping the signs of residuals on the cluster level.
If val_type
= "decision" (default) we get the test binary decision (1=REJECT H0).
If val_type
= "pval" we get the test p-value.
If val_type
= "full" we get the full test output, i.e., a List
with elements tobs
, tvals
,
the observed and randomization values of the test statistic, respectively.
If clustering
is NULL then it will be assigned a default value:
list(1:n)
if type
= "perm", where n is the number of datapoints;
as.list(1:n)
if type
= "sign" or "double".
As in bootstrap num_R
is usually between 1000-5000.
Life after bootstrap: residual randomization inference in regression models (Toulis, 2019)
https://www.ptoulis.com/residual-randomization
# 1. Validity example set.seed(123) n = 50 X = cbind(rep(1, n), 1:n/n) beta = c(0, 0) rej = replicate(200, { y = X %*% beta + rt(n, df=5) model = list(y=y, X=X, lam=c(0, 1), lam0=0) # H0: beta2 = 0 rrtest_clust(model, "perm") }) mean(rej) # Should be ~ 5% since H0 is true. # 2. Heteroskedastic example set.seed(123) n = 200 X = cbind(rep(1, n), 1:n/n) beta = c(-1, 0.2) ind = c(rep(0, 0.9*n), rep(1, .1*n)) # cluster indicator y = X %*% beta + rnorm(n, sd= (1-ind) * 0.1 + ind * 5) # heteroskedastic confint(lm(y ~ X + 0)) # normal OLS does not reject H0: beta2 = 0 cl = list(which(ind==0), which(ind==1)) model = list(y=y, X=X, lam=c(0, 1), lam0=0) rrtest_clust(model, "sign") # errors are sign symmetric regardless of cluster. # Cluster sign test does not reject because of noise. rrtest_clust(model, "perm", cl) # errors are exchangeable within clusters # Cluster permutation test rejects because inference is sharper.
# 1. Validity example set.seed(123) n = 50 X = cbind(rep(1, n), 1:n/n) beta = c(0, 0) rej = replicate(200, { y = X %*% beta + rt(n, df=5) model = list(y=y, X=X, lam=c(0, 1), lam0=0) # H0: beta2 = 0 rrtest_clust(model, "perm") }) mean(rej) # Should be ~ 5% since H0 is true. # 2. Heteroskedastic example set.seed(123) n = 200 X = cbind(rep(1, n), 1:n/n) beta = c(-1, 0.2) ind = c(rep(0, 0.9*n), rep(1, .1*n)) # cluster indicator y = X %*% beta + rnorm(n, sd= (1-ind) * 0.1 + ind * 5) # heteroskedastic confint(lm(y ~ X + 0)) # normal OLS does not reject H0: beta2 = 0 cl = list(which(ind==0), which(ind==1)) model = list(y=y, X=X, lam=c(0, 1), lam0=0) rrtest_clust(model, "sign") # errors are sign symmetric regardless of cluster. # Cluster sign test does not reject because of noise. rrtest_clust(model, "perm", cl) # errors are exchangeable within clusters # Cluster permutation test rejects because inference is sharper.
Decides to reject or not based on observed test statistic value tobs
and randomization values tvals
. The test may randomize to achieve the specified level alpha
when there are very few randomization values.
two_sided_test(tobs, tvals, alpha)
two_sided_test(tobs, tvals, alpha)
tobs |
The observed value of the test statistic (scalar). |
tvals |
Vector of randomization values of the test statistic (to compare with |
alpha |
Desired level of the test (between 0 to 1). |
Test decision (binary).
Testing Statistical Hypotheses (Ch. 15, Lehman and Romano, 2006)