Title: | U-Statistic Permutation Tests of Independence for all Data Types |
---|---|
Description: | Implements various independence tests for discrete, continuous, and infinite-dimensional data. The tests are based on a U-statistic permutation test, the USP of Berrett, Kontoyiannis and Samworth (2020) <arXiv:2001.05513>, and shown to be minimax rate optimal in a wide range of settings. As the permutation principle is used, all tests have exact, non-asymptotic Type I error control at the nominal level. |
Authors: | Thomas B. Berrett <[email protected]> [aut,cre], Ioannis Kontoyiannis <[email protected]> [aut], Richard J. Samworth <[email protected]> [aut] |
Maintainer: | Thomas B. Berrett <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.2 |
Built: | 2024-12-10 06:55:04 UTC |
Source: | CRAN |
This function is used in InfKern to produce the kernel matrix from functional
data defined on the interval . For further details see Section 7.4
of (Berrett et al. 2021).
coeffs(X, Ntrunc)
coeffs(X, Ntrunc)
X |
The discretised functions whose coefficients are required.
This should be a matrix with one row per function, and with |
Ntrunc |
The number of coefficients that are required.
The function returns coefficients 1,..., |
The coefficients of in its expansion in terms of sine functions.
See (Berrett et al. 2021) for more detail.
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
t=seq(from=0,to=1,length.out=1000); X=t^2 U=coeffs(X,100)[1,]; L=5 plot(t,X,type="l") approx=rep(0,1000) for(l in 1:L){ approx=approx+qnorm(U[l])*sqrt(2)*sin((l-1/2)*pi*t)/((l-1/2)*pi) lines(t,approx,col=l+1) }
t=seq(from=0,to=1,length.out=1000); X=t^2 U=coeffs(X,100)[1,]; L=5 plot(t,X,type="l") approx=rep(0,1000) for(l in 1:L){ approx=approx+qnorm(U[l])*sqrt(2)*sin((l-1/2)*pi*t)/((l-1/2)*pi) lines(t,approx,col=l+1) }
This function computes the value of the test statistic measuring the strength of
dependence in a contingency table. See Section 3.1 of (Berrett et al. 2021)
for a definition.
DiscStat(freq)
DiscStat(freq)
freq |
Two-way contingency table whose strength of dependence is to be measured. |
A list containing the value of the test statistic , the table of expected
null counts, and the table of contributions to
.
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
freq=r2dtable(1,rep(10,5),rep(10,5))[[1]]; DiscStat(freq) freq=diag(1:5); DiscStat(freq) freq=r2dtable(1,rep(10,5),rep(10,5))[[1]] + 4*diag(rep(1,5)) DiscStat(freq)
freq=r2dtable(1,rep(10,5),rep(10,5))[[1]]; DiscStat(freq) freq=diag(1:5); DiscStat(freq) freq=r2dtable(1,rep(10,5),rep(10,5))[[1]] + 4*diag(rep(1,5)) DiscStat(freq)
Computes the values of the one-dimensional Fourier basis functions at a vector of locations
and with a vector of frequencies
. The scaling factor of
is included, so that
the function returns, e.g.,
.
FourierBasis(a, m, x)
FourierBasis(a, m, x)
a |
Sine or cosine; |
m |
Vector of frequencies |
x |
Vector of locations |
Returns the values of .
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
e=FourierBasis(1,1:100,0.01); plot(0.01*(1:100),e,type="l") e=FourierBasis(0,1,0.01*(1:100)); plot(0.01*(1:100),e,type="l") FourierBasis(1,1:3,0.1*(1:10))
e=FourierBasis(1,1:100,0.01); plot(0.01*(1:100),e,type="l") e=FourierBasis(0,1,0.01*(1:100)); plot(0.01*(1:100),e,type="l") FourierBasis(1,1:3,0.1*(1:10))
Calculates the kernel matrix, described in (Berrett et al. 2021) for univariate continuous data when using the Fourier basis. This function is used in USPFourier.
FourierKernel(x, M)
FourierKernel(x, M)
x |
A vector in |
M |
The maximum frequency of Fourier basis functions to compute. |
The kernel matrix , to be used in independence testing.
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
n=10; x=runif(n) FourierKernel(x,5)
n=10; x=runif(n) FourierKernel(x,5)
Function to produce the kernel matrices in the infinite dimensional example described in Section 7.4 of (Berrett et al. 2021). Here, a random function is converted to a sequence of coefficients and we use the Fourier basis on these coefficients. This function is an essential part of USPFunctional.
InfKern(X, Ntrunc, M)
InfKern(X, Ntrunc, M)
X |
Matrix giving one of the samples to be tested. Each row corresponds to a discretised function, with each column giving the values of the functions at the corresponding grid point. |
Ntrunc |
The total number of coefficients to look at in the basis expansion of the functional data. |
M |
The maximum frequency to look at in the Fourier basis. |
The kernel matrix for the sample .
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
n=10 #number of observations Ndisc=1000; t=1/Ndisc #functions represented at grid points 1/Ndisc, 2/Ndisc,...,1 X=matrix(rep(0,Ndisc*n),nrow=n) for(i in 1:n){ x=rnorm(Ndisc,mean=0,sd=1) X[i,]=cumsum(x*sqrt(t)) } InfKern(X,2,2)
n=10 #number of observations Ndisc=1000; t=1/Ndisc #functions represented at grid points 1/Ndisc, 2/Ndisc,...,1 X=matrix(rep(0,Ndisc*n),nrow=n) for(i in 1:n){ x=rnorm(Ndisc,mean=0,sd=1) X[i,]=cumsum(x*sqrt(t)) } InfKern(X,2,2)
Calculate the U-statistic measure of dependence given two kernel matrices
and
, as described in Section 7.1 of (Berrett et al. 2021). For the featured
examples considered these matrices can be calculated using FourierKernel or
InfKern. Alternatively, if a different basis is to be used, then the kernels
can be entered separately.
KernStat(J, K)
KernStat(J, K)
J |
|
K |
|
Test statistic measure the strength of dependence between the two samples.
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
x=runif(100); y=runif(100); M=3 J=FourierKernel(x,M); K=FourierKernel(y,M) KernStat(J,K)
x=runif(100); y=runif(100); M=3 J=FourierKernel(x,M); K=FourierKernel(y,M) KernStat(J,K)
Function to calculate each entry of the kernel matrix in the infinite dimensional example described in Section 7.4 of (Berrett et al. 2021). Here, a random function is converted to a sequence of coefficients and we use the Fourier basis on these coefficients. This function is only used in the function InfKern.
sumbasis(Ntrunc, M, x1, x2)
sumbasis(Ntrunc, M, x1, x2)
Ntrunc |
The total number of coefficients to look at. |
M |
The maximum frequency to look at in the Fourier basis. |
x1 |
The coefficients of the first data point. |
x2 |
The coefficients of the second data point. |
The entry of the kernel corresponding to the two data points.
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
x1=runif(5); x2=runif(5); sumbasis(5,2,x1,x2)
x1=runif(5); x2=runif(5); sumbasis(5,2,x1,x2)
Carry out an independence test of the independence of two samples, give
two kernel matrices and
, as described in Section 7.1 of (Berrett et al. 2021).
We calculate the test statistic and null statistics using the function KernStat,
before comparing them to produce a p-value. For the featured
examples considered these matrices can be calculated using FourierKernel or
InfKern. Alternatively, if a different basis is to be used, then the kernels
can be entered separately.
USP(J, K, B = 999, ties.method = "standard", nullstats = FALSE)
USP(J, K, B = 999, ties.method = "standard", nullstats = FALSE)
J |
|
K |
|
B |
The number of permutation used to calibrate the test. |
ties.method |
If "standard" then calculate the p-value as in (5) of (Berrett et al. 2021), which is slightly conservative. If "random" then break ties randomly. This preserves Type I error control. |
nullstats |
If TRUE, returns a vector of the null statistic values. |
Returns the p-value for this independence test and the value of the test statistic, ,
as defined in (Berrett et al. 2021). If nullstats=TRUE is used, then the function also
returns a vector of the null statistics.
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
x=runif(100); y=runif(100); M=3 J=FourierKernel(x,M); K=FourierKernel(y,M) USP(J,K,999) n=50; r=0.6; Ndisc=1000; t=1/Ndisc X=matrix(rep(0,Ndisc*n),nrow=n); Y=matrix(rep(0,Ndisc*n),nrow=n) for(i in 1:n){ x = rnorm(Ndisc, mean=0, sd= 1) se = sqrt(1 - r^2) #standard deviation of error e = rnorm(Ndisc, mean=0, sd=se) y = r*x + e X[i,] = cumsum(x*sqrt(t)) Y[i,] = cumsum(y*sqrt(t)) } J=InfKern(X,2,1); K=InfKern(Y,2,1) USP(J,K,999)
x=runif(100); y=runif(100); M=3 J=FourierKernel(x,M); K=FourierKernel(y,M) USP(J,K,999) n=50; r=0.6; Ndisc=1000; t=1/Ndisc X=matrix(rep(0,Ndisc*n),nrow=n); Y=matrix(rep(0,Ndisc*n),nrow=n) for(i in 1:n){ x = rnorm(Ndisc, mean=0, sd= 1) se = sqrt(1 - r^2) #standard deviation of error e = rnorm(Ndisc, mean=0, sd=se) y = r*x + e X[i,] = cumsum(x*sqrt(t)) Y[i,] = cumsum(y*sqrt(t)) } J=InfKern(X,2,1); K=InfKern(Y,2,1) USP(J,K,999)
Carry out a permutation independence test on a two-way contingency table.
The test statistic is , as described in Sections 3.1 and 7.1 of (Berrett et al. 2021).
This also appears as
in (Berrett and Samworth 2021).
The critical value is found by sampling null contingency tables,
with the same row and column totals as the input, via Patefield's algorithm, and recomputing
the test statistic.
USP.test(freq, B = 999, ties.method = "standard", nullstats = FALSE)
USP.test(freq, B = 999, ties.method = "standard", nullstats = FALSE)
freq |
Two-way contingency table whose independence is to be tested. |
B |
The number of resampled null tables to be used to calibrate the test. |
ties.method |
If "standard" then calculate the p-value as in (5) of (Berrett et al. 2021), which is slightly conservative. If "random" then break ties randomly. This preserves Type I error control. |
nullstats |
If TRUE, returns a vector of the null statistic values. |
Returns the p-value for this independence test and the value of the test statistic, ,
as defined in (Berrett et al. 2021). The third element of the list is the table of expected counts,
and the final element is the table of contributions to
. If nullstats=TRUE is used, then the function also
returns a vector of the null statistics.
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
Berrett TB, Samworth RJ (2021). “USP: an independence test that improves on Pearson’s chi-squared and the G-test.” Submitted, available at arXiv:2101.10880.
freq=r2dtable(1,rep(10,5),rep(10,5))[[1]] + 4*diag(rep(1,5)) USP.test(freq,999) freq=diag(1:5); USP.test(freq,999) freq=r2dtable(1,rep(10,5),rep(10,5))[[1]]; test=USP.test(freq,999,nullstats=TRUE) plot(density(test$NullStats,from=0, to=max(max(test$NullStats),test$TestStat)), xlim=c(min(test$NullStats),max(max(test$NullStats),test$TestStat)), main="Test Statistics") abline(v=test$TestStat,col=2); TestStats=c(test$TestStat,test$NullStats) abline(v=quantile(TestStats,probs=0.95),lty=2)
freq=r2dtable(1,rep(10,5),rep(10,5))[[1]] + 4*diag(rep(1,5)) USP.test(freq,999) freq=diag(1:5); USP.test(freq,999) freq=r2dtable(1,rep(10,5),rep(10,5))[[1]]; test=USP.test(freq,999,nullstats=TRUE) plot(density(test$NullStats,from=0, to=max(max(test$NullStats),test$TestStat)), xlim=c(min(test$NullStats),max(max(test$NullStats),test$TestStat)), main="Test Statistics") abline(v=test$TestStat,col=2); TestStats=c(test$TestStat,test$NullStats) abline(v=quantile(TestStats,probs=0.95),lty=2)
Performs a permutation test of independence between two univariate continuous random variables, using the Fourier basis to construct the test statistic, as described in (Berrett et al. 2021).
USPFourier(x, y, M, B = 999, ties.method = "standard", nullstats = FALSE)
USPFourier(x, y, M, B = 999, ties.method = "standard", nullstats = FALSE)
x |
A vector containing the first sample, with each entry in |
y |
A vector containing the second sample, with each entry in |
M |
The maximum frequency to use in the Fourier basis. |
B |
The number of permutation to use when calibrating the test. |
ties.method |
If "standard" then calculate the p-value as in (5) of (Berrett et al. 2021), which is slightly conservative. If "random" then break ties randomly. This preserves Type I error control. |
nullstats |
If TRUE, returns a vector of the null statistic values. |
Returns the p-value for this independence test and the value of the test statistic, ,
as defined in (Berrett et al. 2021). If nullstats=TRUE is used, then the function also
returns a vector of the null statistics.
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
x=runif(10); y=x^2 USPFourier(x,y,1,999) n=100; w=2; x=integer(n); y=integer(n); m=300 unifdata=matrix(runif(2*m,min=0,max=1),ncol=2); x1=unifdata[,1]; y1=unifdata[,2] unif=runif(m); prob=0.5*(1+sin(2*pi*w*x1)*sin(2*pi*w*y1)); accept=(unif<prob); Data1=unifdata[accept,]; x=Data1[1:n,1]; y=Data1[1:n,2] plot(x,y) USPFourier(x,y,2,999) x=runif(100); y=runif(100) test=USPFourier(x,y,3,999,nullstats=TRUE) plot(density(test$NullStats,from=min(test$NullStats),to=max(max(test$NullStats),test$TestStat)), xlim=c(min(test$NullStats),max(max(test$NullStats),test$TestStat)),main="Test Statistics") abline(v=test$TestStat,col=2); TestStats=c(test$TestStat,test$NullStats) abline(v=quantile(TestStats,probs=0.95),lty=2)
x=runif(10); y=x^2 USPFourier(x,y,1,999) n=100; w=2; x=integer(n); y=integer(n); m=300 unifdata=matrix(runif(2*m,min=0,max=1),ncol=2); x1=unifdata[,1]; y1=unifdata[,2] unif=runif(m); prob=0.5*(1+sin(2*pi*w*x1)*sin(2*pi*w*y1)); accept=(unif<prob); Data1=unifdata[accept,]; x=Data1[1:n,1]; y=Data1[1:n,2] plot(x,y) USPFourier(x,y,2,999) x=runif(100); y=runif(100) test=USPFourier(x,y,3,999,nullstats=TRUE) plot(density(test$NullStats,from=min(test$NullStats),to=max(max(test$NullStats),test$TestStat)), xlim=c(min(test$NullStats),max(max(test$NullStats),test$TestStat)),main="Test Statistics") abline(v=test$TestStat,col=2); TestStats=c(test$TestStat,test$NullStats) abline(v=quantile(TestStats,probs=0.95),lty=2)
We implement the adaptive version of the independence test for univariate continuous data
using the Fourier basis, as described in Section 4 of (Berrett et al. 2021). This applies
USPFourier with a range of values of , and a properly corrected significance
level.
USPFourierAdapt(x, y, alpha, B = 999, ties.method = "standard")
USPFourierAdapt(x, y, alpha, B = 999, ties.method = "standard")
x |
The vector of data points from the first sample, each entry belonging to |
y |
The vector of data points from the second sample, each entry belonging to |
alpha |
The desired significance level of the test. |
B |
Controls the number of permutations to be used. With a sample size of |
ties.method |
If "standard" then calculate the p-value as in (5) of (Berrett et al. 2021), which is slightly conservative. If "random" then break ties randomly. This preserves Type I error control. |
Returns an indicator with value 1 if the null hypothesis of independence is rejected and
0 otherwise. If the null hypothesis is rejected, the function also outputs the value of
at the which the null was rejected and the value of the test statistic.
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
n=100; w=2; x=integer(n); y=integer(n); m=300 unifdata=matrix(runif(2*m,min=0,max=1),ncol=2); x1=unifdata[,1]; y1=unifdata[,2] unif=runif(m); prob=0.5*(1+sin(2*pi*w*x1)*sin(2*pi*w*y1)); accept=(unif<prob); Data1=unifdata[accept,]; x=Data1[1:n,1]; y=Data1[1:n,2] plot(x,y) USPFourierAdapt(x,y,0.05,999)
n=100; w=2; x=integer(n); y=integer(n); m=300 unifdata=matrix(runif(2*m,min=0,max=1),ncol=2); x1=unifdata[,1]; y1=unifdata[,2] unif=runif(m); prob=0.5*(1+sin(2*pi*w*x1)*sin(2*pi*w*y1)); accept=(unif<prob); Data1=unifdata[accept,]; x=Data1[1:n,1]; y=Data1[1:n,2] plot(x,y) USPFourierAdapt(x,y,0.05,999)
We implement the permutation independence test described in (Berrett et al. 2021)
for functional data taking values in . The discretised functions are
expressed in a series expansion, and an independence test is carried out between the
coefficients of the functions, using a Fourier basis to define the test statistic.
USPFunctional(X, Y, Ntrunc, M, B = 999, ties.method = "standard")
USPFunctional(X, Y, Ntrunc, M, B = 999, ties.method = "standard")
X |
A matrix of the discretised functional data from the first sample. There are |
Y |
A matrix of the discretised functional data from the second sample. The discretisation
grid may be different to the grid used for |
Ntrunc |
The number of coefficients to retain from the series expansions of |
M |
The maximum frequency to use in the Fourier basis when testing the independence of the coefficients. |
B |
The number of permutations used to calibrate the test. |
ties.method |
If "standard" then calculate the p-value as in (5) of (Berrett et al. 2021), which is slightly conservative. If "random" then break ties randomly. This preserves Type I error control. |
A p-value for the test of the independence of and
.
Berrett TB, Kontoyiannis I, Samworth RJ (2021). “Optimal rates for independence testing via U-statistic permutation tests.” Annals of Statistics, to appear.
n=50; r=0.6; Ndisc=1000; t=1/Ndisc X=matrix(rep(0,Ndisc*n),nrow=n); Y=matrix(rep(0,Ndisc*n),nrow=n) for(i in 1:n){ x = rnorm(Ndisc, mean=0, sd= 1) se = sqrt(1 - r^2) #standard deviation of error e = rnorm(Ndisc, mean=0, sd=se) y = r*x + e X[i,] <- cumsum(x*sqrt(t)) Y[i,] <- cumsum(y*sqrt(t)) } USPFunctional(X,Y,2,1,999)
n=50; r=0.6; Ndisc=1000; t=1/Ndisc X=matrix(rep(0,Ndisc*n),nrow=n); Y=matrix(rep(0,Ndisc*n),nrow=n) for(i in 1:n){ x = rnorm(Ndisc, mean=0, sd= 1) se = sqrt(1 - r^2) #standard deviation of error e = rnorm(Ndisc, mean=0, sd=se) y = r*x + e X[i,] <- cumsum(x*sqrt(t)) Y[i,] <- cumsum(y*sqrt(t)) } USPFunctional(X,Y,2,1,999)