Package 'YatchewTest'

Title: Yatchew (1997), De Chaisemartin & D'Haultfoeuille (2024) Linearity Test
Description: Test of linearity originally proposed by Yatchew (1997) <doi:10.1016/S0165-1765(97)00218-8> and improved by de Chaisemartin & D'Haultfoeuille (2024) <doi:10.2139/ssrn.4284811> to be robust under heteroskedasticity.
Authors: Diego Ciccia [aut, cre], Felix Knau [aut], Doulo Sow [aut], Clément de Chaisemartin [aut], Xavier D'Haultfoeuille [aut]
Maintainer: Diego Ciccia <[email protected]>
License: MIT + file LICENSE
Version: 1.1.1
Built: 2025-02-19 06:49:09 UTC
Source: CRAN


Main function

Description

Test of Linearity of a Conditional Expectation Function (Yatchew, 1997; de Chaisemartin and D'Haultfoeuille, 2024)

Usage

yatchew_test(data, ...)

Arguments

data

A data object.

...

Undocumented.

Value

Method dispatch depending on the data object class.


General yatchew_test method for unclassed dataframes

Description

General yatchew_test method for unclassed dataframes

Usage

## S3 method for class 'data.frame'
yatchew_test(data, Y, D, het_robust = FALSE, path_plot = FALSE, order = 1, ...)

Arguments

data

(data.frame) A dataframe.

Y

(char) Dependent variable.

D

(char) Independent variable.

het_robust

(logical) If FALSE, the test is performed under the assumption of homoskedasticity (Yatchew, 1997). If TRUE, the test is performed using the heteroskedasticity-robust test statistic proposed by de Chaisemartin and D'Haultfoeuille (2024).

path_plot

(logical) If TRUE and D has length 2, the assigned object will include a plot of the sequence of $(D_{1i}, D_{2i})$s that minimizes the Euclidean distance between each pair of consecutive observations (see Overview for further details).

order

(nonnegative integer k) If this option is specified, the program tests whether the conditional expectation of Y given D is a k-degree polynomial in D. With order = 0, the command tests the hypothesis that the conditional mean of Y given D is constant.

...

Undocumented.

Value

A list with test results.

Overview

This program implements the linearity test proposed by Yatchew (1997) and its heteroskedasticity-robust version proposed by de Chaisemartin and D'Haultfoeuille (2024). In this overview, we sketch the intuition behind the two tests, so as to motivate the use of the package and its options. Please refer to Yatchew (1997) and Section 3 of de Chaisemartin and D'Haultfoeuille (2024) for further details.

Yatchew (1997) proposes a useful extension of the test to multiple independent variables. The program implements this extension when the D argument has length $> 1$. It should be noted that the power and consistency of the test in the multivariate case are not backed by proven theoretical results. We implemented this extension to allow for testing and exploratory research. Future theoretical exploration of the multivariate test will depend on the demand and usage of the package.

Univariate Yatchew Test

Let $Y$ and $D$ be two random variables. Let $m(D) = E[Y|D]$. The null hypothesis of the test is that $m(D) = \alpha_0 + \alpha_1 D$ for two real numbers $\alpha_0$ and $\alpha_1$. This means that, under the null, $m(.)$ is linear in $D$. The outcome variable can be decomposed as $Y = m(D) + \varepsilon$, with $E[\varepsilon|D] = 0$ and $\Delta Y = \Delta \varepsilon$ for $\Delta D \to 0$. In a dataset with $N$ i.i.d. realisations of $(Y, D)$, one can test this hypothesis as follows:

  1. sort the dataset by $D$;

  2. denote the corresponding observations by $(Y_{(i)}, D_{(i)})$, with $i \in \{1, ..., N\}$;

  3. approximate $\hat{\sigma}^2_{\text{diff}}$, i.e. the variance of the first differenced residuals $\varepsilon_{(i)} - \varepsilon_{(i-1)}$, by the variance of $Y_{(i)} - Y_{(i-1)}$;

  4. compute $\hat{\sigma}^2_{\text{lin}}$, i.e. the variance of the residuals from an OLS regression of $Y$ on $D$.

Heuristically, the validity of step (3) derives from the fact that $Y_{(i)} - Y_{(i-1)} = m(D_{(i)}) - m(D_{(i-1)}) + \varepsilon_{(i)} - \varepsilon_{(i-1)}$ and the first term, $m(D_{(i)}) - m(D_{(i-1)})$, is close to zero for $D_{(i)} \approx D_{(i-1)}$. Sorting at step (1) ensures that consecutive $D_{(i)}$s are as close as possible, and when the sample size goes to infinity the distance between consecutive observations goes to zero. Yatchew (1997) shows that under homoskedasticity and regularity conditions

$$T := \sqrt{N}\left(\dfrac{\hat{\sigma}^2_{\text{lin}}}{\hat{\sigma}^2_{\text{diff}}}-1\right) \stackrel{d}{\longrightarrow} \mathcal{N}\left(0,1\right)$$

One can then reject the linearity of $m(.)$ at significance level $\alpha$ if $T > \Phi^{-1}(1-\alpha)$, where $\Phi$ denotes the standard normal cumulative distribution function.
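The steps above can be sketched in base R as follows. This is an illustrative sketch, not the package's implementation: the function name yatchew_sketch and its arguments degree and alpha are hypothetical, and the differencing variance is computed as the sum of squared first differences divided by 2N, a scaling assumed here because the difference of two independent errors has twice the variance of a single error.

```r
# Hypothetical helper (not part of YatchewTest): homoskedastic Yatchew-type
# statistic for H0: E[Y|D] is a polynomial of degree `degree` in D.
yatchew_sketch <- function(y, d, degree = 1, alpha = 0.05) {
  n <- length(y)
  idx <- order(d)                 # step 1: sort the data by D
  y <- y[idx]                     # step 2: sorted observations
  d <- d[idx]
  # Step 3: differencing estimator of the error variance.
  # Assumption: 1/(2N) scaling of the squared first differences.
  sigma2_diff <- sum(diff(y)^2) / (2 * n)
  # Step 4: residual variance from the OLS fit under the null.
  fit <- if (degree == 0) lm(y ~ 1) else lm(y ~ poly(d, degree, raw = TRUE))
  sigma2_lin <- mean(residuals(fit)^2)
  # Test statistic and one-sided rejection rule.
  t_stat <- sqrt(n) * (sigma2_lin / sigma2_diff - 1)
  list(t_stat = t_stat,
       p_value = 1 - pnorm(t_stat),
       reject = t_stat > qnorm(1 - alpha))
}

# Simulated data: one linear and one clearly nonlinear conditional mean.
set.seed(1)
d <- runif(2000)
y_lin <- 1 + 2 * d + rnorm(2000, sd = 0.5)
y_sin <- sin(6 * d) + rnorm(2000, sd = 0.5)
res_lin <- yatchew_sketch(y_lin, d)
res_sin <- yatchew_sketch(y_sin, d)
```

Under these assumptions the statistic for y_sin should be far larger than the one for y_lin, since the linear fit leaves the sine pattern in the residuals while the differencing estimator is barely affected by it.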

If the homoskedasticity assumption fails, this test leads to overrejection. De Chaisemartin and D'Haultfoeuille (2024) propose a heteroskedasticity-robust version of the test statistic above. This version of the Yatchew (1997) test can be implemented by running the command with the option het_robust = TRUE.

Multivariate Yatchew Test

Let $\textbf{D}$ be a vector of $K$ random variables. Let $g(\textbf{D}) = E[Y|\textbf{D}]$. Denote by $||.,.||$ the Euclidean distance between two vectors. The null hypothesis of the multivariate test is $g(\textbf{D}) = \alpha_0 + A'\textbf{D}$, with $A = (\alpha_1,..., \alpha_K)$, for $K+1$ real numbers $\alpha_0$, $\alpha_1$, ..., $\alpha_K$. This means that, under the null, $g(.)$ is linear in $\textbf{D}$. Following the same logic as the univariate case, in a dataset with $N$ i.i.d. realisations of $(Y, \textbf{D})$ we can approximate the first difference $\Delta \varepsilon$ by $\Delta Y$, evaluating $g(.)$ at consecutive observations. The program runs a nearest neighbor algorithm to find the sequence of observations such that the Euclidean distance between consecutive positions is minimized. The algorithm is written in C++ and integrated into R via the Rcpp library. The program follows a very simple nearest neighbor approach:

  1. collect all the Euclidean distances between all the possible unique pairs of rows in $\textbf{D}$ in the matrix $M$, where $M_{n,m} = ||\textbf{D}_n,\textbf{D}_m||$ with $n,m \in \{1, ..., N\}$;

  2. set up the queue $Q = \{1, ..., N\}$, the (empty) path vector $I = \{\}$ and the starting index $i = 1$;

  3. remove $i$ from $Q$ and find the column index $j$ of $M$ such that $M_{i,j} = \min_{c \in Q} M_{i,c}$;

  4. append $j$ to $I$ and start again from step 3 with $i = j$ until $Q$ is empty.
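The four steps above can be illustrated with a short base R sketch. It is not the package's C++ routine: the function name nn_path_sketch is hypothetical, the full distance matrix is built with dist() rather than only the lower triangle, and the starting row is recorded as the first element of the path, which is an assumption of this sketch.

```r
# Hypothetical helper (not part of YatchewTest): greedy nearest-neighbour
# ordering of the rows of D, following steps 1 to 4 above.
nn_path_sketch <- function(D, start = 1) {
  D <- as.matrix(D)
  M <- as.matrix(dist(D))          # step 1: pairwise Euclidean distances
  queue <- seq_len(nrow(D))        # step 2: queue of row indices
  path <- as.integer(start)        # assumption: the start row opens the path
  i <- as.integer(start)
  repeat {
    queue <- queue[queue != i]     # step 3: remove the current row from the queue
    if (length(queue) == 0) break  # stop once every row has been visited
    j <- queue[which.min(M[i, queue])]  # closest remaining row
    path <- c(path, j)             # step 4: append and continue from j
    i <- j
  }
  path
}

# Four points on a line, deliberately stored out of order.
D_toy <- rbind(c(0, 0), c(1, 0), c(0.1, 0), c(0.9, 0))
toy_path <- nn_path_sketch(D_toy)
print(toy_path)  # 1 3 4 2
```

In the toy data the path starts at row 1, moves to the closest remaining row at each step, and so visits the points from left to right.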

To improve efficiency, the program collects only the $N(N-1)/2$ Euclidean distances corresponding to the lower triangle of matrix $M$ and chooses $j$ such that $M_{i,j} = \min_{c \in Q} \left( 1\{c < i\} M_{i,c} + 1\{c > i\} M_{c,i} \right)$. The output of the algorithm, i.e. the vector $I$, is a sequence of row numbers such that the distance between the corresponding rows $\textbf{D}_i$s is minimized. The program also uses two refinements suggested in Appendix A of Yatchew (1997):

  • The entries in $\textbf{D}$ are normalized to $[0,1]$;

  • The algorithm is applied to sub-cubes, i.e. partitions of the $[0,1]^K$ space, and the full path is obtained by joining the extrema of the subpaths.

By convention, the program computes $(2\lceil \log_{10} N \rceil)^K$ subcubes, where each univariate partition is defined by grouping observations in $2\lceil \log_{10} N \rceil$ quantile bins. If $K = 2$, the user can visualize in a ggplot graph the exact path across the normalized $\textbf{D}_i$s by running the command with the option path_plot = TRUE.
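As a rough illustration of the two refinements and of the subcube count, the base R sketch below normalizes each column and assigns every row to a quantile bin per dimension. The function name subcube_sketch is hypothetical, min-max scaling is an assumed way of normalizing to the unit interval, and the sketch assumes continuous regressors without ties so that the quantile breaks are distinct. How the package joins the subpaths is not reproduced here.

```r
# Hypothetical helper (not part of YatchewTest): normalise D and assign each
# row to a quantile bin along every dimension.
subcube_sketch <- function(D) {
  D <- as.matrix(D)
  n <- nrow(D)
  k <- ncol(D)
  # Assumption: min-max scaling maps each column to [0, 1].
  D01 <- apply(D, 2, function(x) (x - min(x)) / (max(x) - min(x)))
  # Number of quantile bins per dimension: 2 * ceiling(log10(N)).
  bins <- 2 * ceiling(log10(n))
  # Bin index (1, ..., bins) of every row along every dimension.
  bin_id <- apply(D01, 2, function(x) {
    breaks <- quantile(x, probs = seq(0, 1, length.out = bins + 1))
    cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  })
  list(D01 = D01, bins_per_dim = bins, n_subcubes = bins^k, bin_id = bin_id)
}

# With N = 500 and K = 2: ceiling(log10(500)) = 3, so 6 bins per dimension.
set.seed(42)
D_sim <- cbind(rnorm(500), runif(500))
cubes <- subcube_sketch(D_sim)
cubes$bins_per_dim  # 6
cubes$n_subcubes    # 36
```

Each row's subcube is identified by its pair of bin indices in cubes$bin_id, so a nearest neighbor search could then be run separately within each subcube.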

Once the dataset is sorted by $I$, the program resumes from step (2) of the univariate case.

Contacts

If you wish to inquire about the functionalities of this package or to report bugs/suggestions, feel free to post your question in the Issues section of the yatchew_test GitHub repository.

References

de Chaisemartin, C., D'Haultfoeuille, X. (2024). Two-way Fixed Effects and Difference-in-Difference Estimators in Heterogeneous Adoption Designs.

Yatchew, A. (1997). An elementary estimator of the partial linear model.

Examples

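# Simulate 1,000 observations with a random slope b ~ U(0, 1), independent of x.
# E[y | x] = 2 + 0.5 * x is linear in x, but the error (b - 0.5) * x is heteroskedastic.
df <- as.data.frame(matrix(NA, nrow = 1E3, ncol = 0))
df$x <- rnorm(1E3)
df$b <- runif(1E3)
df$y <- 2 + df$b * df$x
# Run the default (homoskedastic) test of linearity of E[y | x].
yatchew_test(data = df, Y = "y", D = "x")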
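In the simulated data above, y equals 2 plus a uniform random slope times x, so the conditional mean is linear in x while the error variance depends on x. The sketch below shows how the documented het_robust and order arguments might be used on data of this kind. It assumes the YatchewTest package is installed and skips the calls otherwise; the object names df2, res_hr and res_k0 are illustrative.

```r
# Simulated data with a linear conditional mean and heteroskedastic errors.
set.seed(123)
df2 <- data.frame(x = rnorm(500), b = runif(500))
df2$y <- 2 + df2$b * df2$x

# Assumption: YatchewTest is installed; the calls are skipped otherwise.
if (requireNamespace("YatchewTest", quietly = TRUE)) {
  library(YatchewTest)
  # Heteroskedasticity-robust test of linearity of E[y | x].
  res_hr <- yatchew_test(data = df2, Y = "y", D = "x", het_robust = TRUE)
  # Heteroskedasticity-robust test that E[y | x] is constant (order = 0).
  res_k0 <- yatchew_test(data = df2, Y = "y", D = "x",
                         het_robust = TRUE, order = 0)
  print(res_hr)
  print(res_k0)
}
```

Because the errors here are heteroskedastic by construction, het_robust = TRUE is the setting that matches the data generating process.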