Title: | Nonparametric Independence Tests Based on Entropy Estimation |
---|---|
Description: | Implementations of the weighted Kozachenko-Leonenko entropy estimator and independence tests based on this estimator (Kozachenko and Leonenko (1987) <http://mi.mathnet.ru/eng/ppi797>). Also includes a goodness-of-fit test for a linear model, which is an independence test between covariates and errors. |
Authors: | Thomas B. Berrett [aut], Daniel J. Grose [cre,ctb], Richard J. Samworth [aut] |
Maintainer: | Daniel Grose |
License: | GPL |
Version: | 0.2.0 |
Built: | 2024-12-14 06:35:24 UTC |
Source: | CRAN |
Calculates the (weighted) Kozachenko–Leonenko entropy estimator studied in Berrett, Samworth and Yuan (2018), which is based on the k-nearest neighbour distances of the sample.
KLentropy(x, k, weights = FALSE, stderror = FALSE)
x | The n × d data matrix. |
k | The tuning parameter that gives the maximum number of neighbours that will be considered by the estimator. |
weights | Specifies whether a weighted or unweighted estimator is used. Setting weights=TRUE uses the weights calculated by L2OptW; alternatively, a user-specified weight vector may be supplied. |
stderror | Specifies whether an estimate of the standard error of the weighted estimate is calculated. The calculation is done using an unweighted version of the variance estimator described on page 7 of Berrett, Samworth and Yuan (2018). |
A list. The first element contains the unweighted estimators for each value of k from 1 up to the user-specified k. The second element is the weighted estimator, obtained by taking the inner product between the first element of the list and the weight vector. If stderror=TRUE, the third element is an estimate of the standard error of the weighted estimate.
Berrett, T. B., Samworth, R. J., Yuan, M. (2018). “Efficient multivariate entropy estimation via k-nearest neighbour distances.” Annals of Statistics, to appear.
n=1000; x=rnorm(n); KLentropy(x,30,stderror=TRUE) # The true value is 0.5*log(2*pi*exp(1)) = 1.42.
n=5000; x=matrix(rnorm(4*n),ncol=4) # The true value is 2*log(2*pi*exp(1)) = 5.68
KLentropy(x,30,weights=FALSE) # Unweighted estimator
KLentropy(x,30,weights=TRUE) # Weights chosen by L2OptW
w=runif(30); w=w/sum(w); KLentropy(x,30,weights=w) # User-specified weights
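To make the description of the returned list more concrete, here is a minimal base-R sketch of an unweighted k-nearest-neighbour entropy estimate for each j from 1 to k, followed by the inner product with a weight vector. The helper name kl_entropy_sketch is hypothetical and not part of the package, the brute-force distance matrix is only practical for small samples, and the results are not guaranteed to match KLentropy exactly.

```r
# Sketch only: brute-force unweighted Kozachenko-Leonenko estimates for j = 1..k.
# kl_entropy_sketch is a hypothetical helper, not a function from the package.
kl_entropy_sketch <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)
  D <- as.matrix(dist(x))   # n x n matrix of Euclidean distances
  diag(D) <- Inf            # a point is not its own neighbour
  # Column i of rho holds the distances from point i to its 1st..k-th neighbours
  rho <- matrix(apply(D, 1, function(r) sort(r)[1:k]), nrow = k)
  log_Vd <- (d / 2) * log(pi) - lgamma(1 + d / 2)  # log volume of the unit d-ball
  # Entry j averages log(rho_j^d * V_d * (n - 1)) - digamma(j) over the sample
  rowMeans(d * log(rho)) + log_Vd + log(n - 1) - digamma(1:k)
}

set.seed(1)
x <- rnorm(1000)
k <- 10
H <- kl_entropy_sketch(x, k)         # unweighted estimates for j = 1, ..., k
w <- c(rep(0, k - 1), 1)             # illustrative weights: all mass on j = k
H_weighted <- sum(w * H)             # inner product of estimates and weights
truth <- 0.5 * log(2 * pi * exp(1))  # entropy of the standard normal
```

Comparing H_weighted with truth gives a quick sanity check of the sketch. Any weight vector of length k that sums to one can be substituted for w.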
Calculates a weight vector to be used for the weighted Kozachenko–Leonenko estimator. The weight vector has minimum norm subject to the linear and sum-to-one constraints of (2) in Berrett, Samworth and Yuan (2018).
L2OptW(k, d)
k | The tuning parameter that gives the number of neighbours that will be considered by the weighted Kozachenko–Leonenko estimator. |
d | The dimension of the data. |
The weight vector that is the solution of the optimisation problem.
Berrett, T. B., Samworth, R. J., Yuan, M. (2018). “Efficient multivariate entropy estimation via k-nearest neighbour distances.” Annals of Statistics, to appear.
# When d < 4 there are no linear constraints and the returned vector is (0,0,...,0,1).
L2OptW(100,3)
w=L2OptW(100,4)
plot(w,type="l")
w=L2OptW(100,8);
# For each multiple of 4 that d increases an extra constraint is added.
plot(w,type="l")
w=L2OptW(100,12)
plot(w, type="l") # This can be seen in the shape of the plot
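The phrase "minimum norm subject to linear and sum-to-one constraints" can be illustrated with the closed-form minimum Euclidean norm solution of an underdetermined linear system. The sketch below is an illustration of that general technique, not the package's implementation. The helper name min_norm_weights is hypothetical, and the form of the constraint rows (Gamma(j + 2l/d)/Gamma(j) for l = 1, ..., floor(d/4)) is an assumption about the constraints in the cited paper rather than something stated on this page.

```r
# Sketch only: minimum Euclidean norm vector w satisfying G %*% w = b.
# min_norm_weights is a hypothetical helper; the constraint rows are an assumption.
min_norm_weights <- function(k, d) {
  L <- floor(d / 4)                        # assumed number of linear constraints
  j <- 1:k
  G <- matrix(1, nrow = L + 1, ncol = k)   # first row: sum-to-one constraint
  for (l in seq_len(L)) {
    # Assumed constraint row Gamma(j + 2l/d) / Gamma(j), computed on the log scale
    G[l + 1, ] <- exp(lgamma(j + 2 * l / d) - lgamma(j))
  }
  b <- c(1, rep(0, L))                     # weights sum to one, other rows equal zero
  # Closed form for the minimum-norm solution: w = G' (G G')^{-1} b
  drop(t(G) %*% solve(G %*% t(G), b))
}

w <- min_norm_weights(100, 8)
sum(w)   # close to 1 up to rounding error
```

When there are no linear constraints this sketch returns equal weights 1/k, which differs from the (0,0,...,0,1) vector that the example comments report for L2OptW when d < 4, so it should not be read as a drop-in replacement.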
Performs an independence test without knowledge of either marginal distribution using permutations and using a data-driven choice of k.
MINTauto(x, y, kmax, B1 = 1000, B2 = 1000)
x | The n × d_X data matrix of X values. |
y | The response vector of length n. |
kmax | The maximum value of k to be considered. |
B1 | The number of repetitions used when choosing k, set at 1000 by default. |
B2 | The number of permutations to use for the final test, set at 1000 by default. |
The p-value corresponding to the independence test carried out and the value of k used.
Berrett, T. B., Samworth, R. J. (2017). “Nonparametric independence testing via mutual information.” ArXiv e-prints. 1711.06642.
# Independent univariate normal data
x=rnorm(1000); y=rnorm(1000);
MINTauto(x,y,kmax=200,B1=100,B2=100)
# Dependent univariate normal data
library(mvtnorm)
data=rmvnorm(1000,sigma=matrix(c(1,0.5,0.5,1),ncol=2))
MINTauto(data[,1],data[,2],kmax=200,B1=100,B2=100)
# Dependent multivariate normal data
Sigma=matrix(c(1,0,0,0,0,1,0,0,0,0,1,0.5,0,0,0.5,1),ncol=4)
data=rmvnorm(1000,sigma=Sigma)
MINTauto(data[,1:3],data[,4],kmax=50,B1=100,B2=100)
Performs an independence test without knowledge of either marginal distribution using permutations and averaging over a range of values of k.
MINTav(x, y, K, B = 1000)
x | The n × d_X data matrix of X values. |
y | The n × d_Y data matrix of Y values. |
K | The vector of values of k to be considered. |
B | The number of permutations to use for the test, set at 1000 by default. |
The p-value corresponding to the independence test carried out.
Berrett, T. B., Samworth, R. J. (2017). “Nonparametric independence testing via mutual information.” ArXiv e-prints. 1711.06642.
# Independent univariate normal data
x=rnorm(1000); y=rnorm(1000);
MINTav(x,y,K=1:200,B=100)
# Dependent univariate normal data
library(mvtnorm);
data=rmvnorm(1000,sigma=matrix(c(1,0.5,0.5,1),ncol=2))
MINTav(data[,1],data[,2],K=1:200,B=100)
# Dependent multivariate normal data
Sigma=matrix(c(1,0,0,0,0,1,0,0,0,0,1,0.5,0,0,0.5,1),ncol=4);
data=rmvnorm(1000,sigma=Sigma)
MINTav(data[,1:3],data[,4],K=1:50,B=100)
Performs an independence test when it is assumed that the marginal distribution of Y is known and can be simulated from.
MINTknown(x, y, k, ky, w = FALSE, wy = FALSE, y0)
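x | The n × d_X data matrix of X values. |
y | The n × d_Y data matrix of Y values. |
k | The value of k to be used for estimation of the joint entropy. |
ky | The value of k to be used for estimation of the marginal entropy of Y. |
w | The weight vector to be used for estimation of the joint entropy, with the same options as the weights argument of KLentropy. |
wy | The weight vector to be used for estimation of the marginal entropy of Y, with the same options as the weights argument of KLentropy. |
y0 | The data matrix of simulated Y values. |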
The p-value corresponding to the independence test carried out.
Berrett, T. B., Samworth, R. J. (2017). “Nonparametric independence testing via mutual information.” ArXiv e-prints. 1711.06642.
library(mvtnorm)
# Independent univariate normal data
x=rnorm(1000); y=rnorm(1000)
MINTknown(x,y,k=20,ky=30,y0=rnorm(100000))
# Dependent univariate normal data
data=rmvnorm(1000,sigma=matrix(c(1,0.5,0.5,1),ncol=2))
MINTknown(data[,1],data[,2],k=20,ky=30,y0=rnorm(100000))
# Dependent multivariate normal data
Sigma=matrix(c(1,0,0,0,0,1,0,0,0,0,1,0.5,0,0,0.5,1),ncol=4)
data=rmvnorm(1000,sigma=Sigma)
MINTknown(data[,1:3],data[,4],k=20,ky=30,w=TRUE,wy=FALSE,y0=rnorm(100000))
Performs an independence test without knowledge of either marginal distribution using permutations.
MINTperm(x, y, k, w = FALSE, B = 1000)
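x | The n × d_X data matrix of X values. |
y | The n × d_Y data matrix of Y values. |
k | The value of k to be used for estimation of the joint entropy. |
w | The weight vector to be used for estimation of the joint entropy, with the same options as the weights argument of KLentropy. |
B | The number of permutations to use, set at 1000 by default. |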
The p-value corresponding to the independence test carried out.
Berrett, T. B., Samworth, R. J. (2017). “Nonparametric independence testing via mutual information.” ArXiv e-prints. 1711.06642.
# Independent univariate normal data
x=rnorm(1000); y=rnorm(1000)
MINTperm(x,y,k=20,B=100)
# Dependent univariate normal data
library(mvtnorm)
data=rmvnorm(1000,sigma=matrix(c(1,0.5,0.5,1),ncol=2))
MINTperm(data[,1],data[,2],k=20,B=100)
# Dependent multivariate normal data
Sigma=matrix(c(1,0,0,0,0,1,0,0,0,0,1,0.5,0,0,0.5,1),ncol=4)
data=rmvnorm(1000,sigma=Sigma)
MINTperm(data[,1:3],data[,4],k=20,w=TRUE,B=100)
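The idea of a permutation-based independence test can be sketched in base R as follows: estimate the mutual information H(X) + H(Y) - H(X,Y) with k-nearest-neighbour entropy estimates, recompute it after randomly permuting the Y sample, and report the proportion of permuted statistics at least as large as the observed one. The helper names knn_entropy, mutual_info and perm_pvalue_sketch are hypothetical, the estimator is unweighted, and the p-value convention (1 + count)/(B + 1) is a common choice assumed here; MINTperm may differ in its details.

```r
# Sketch only: permutation p-value for a k-nearest-neighbour mutual information statistic.
# All three helpers are hypothetical and not functions from the package.
knn_entropy <- function(z, k) {
  z <- as.matrix(z)
  n <- nrow(z)
  d <- ncol(z)
  D <- as.matrix(dist(z))                          # pairwise Euclidean distances
  diag(D) <- Inf                                   # exclude each point itself
  rho_k <- apply(D, 1, function(r) sort(r)[k])     # k-th nearest-neighbour distances
  log_Vd <- (d / 2) * log(pi) - lgamma(1 + d / 2)  # log volume of the unit d-ball
  mean(d * log(rho_k)) + log_Vd + log(n - 1) - digamma(k)
}

mutual_info <- function(x, y, k) {
  knn_entropy(x, k) + knn_entropy(y, k) - knn_entropy(cbind(x, y), k)
}

perm_pvalue_sketch <- function(x, y, k, B = 99) {
  y <- as.matrix(y)
  observed <- mutual_info(x, y, k)
  # Permuting the rows of y breaks any dependence on x but keeps both marginals
  permuted <- replicate(B, mutual_info(x, y[sample(nrow(y)), , drop = FALSE], k))
  (1 + sum(permuted >= observed)) / (B + 1)
}

set.seed(1)
x <- rnorm(200)
y <- 0.9 * x + sqrt(1 - 0.9^2) * rnorm(200)  # strongly dependent on x
p_dep <- perm_pvalue_sketch(x, y, k = 5, B = 99)
```

Because the marginal entropy estimates are unchanged by permutation, only the joint entropy term actually varies across permutations; the full statistic is recomputed here purely for readability.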
Performs a goodness-of-fit test of a linear model by testing whether the errors are independent of the covariates.
MINTregression(x, y, k, keps, w = FALSE, eps)
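x | The n × p design matrix. |
y | The response vector of length n. |
k | The value of k to be used for estimation of the joint entropy. |
keps | The value of k to be used for estimation of the marginal entropy of the errors. |
w | The weight vector to be used for estimation of the joint entropy, with the same options as the weights argument of KLentropy. |
eps | A vector of null errors, which should have the same distribution that the errors are assumed to have in the linear model. |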
The p-value corresponding to the independence test carried out.
Berrett, T. B., Samworth, R. J. (2017). “Nonparametric independence testing via mutual information.” ArXiv e-prints. 1711.06642.
# Correctly specified linear model
x=runif(100,min=-1.5,max=1.5); y=x+rnorm(100)
plot(lm(y~x),which=1)
MINTregression(x,y,5,10,w=FALSE,rnorm(10000))
# Misspecified mean linear model
x=runif(100,min=-1.5,max=1.5); y=x^3+rnorm(100)
plot(lm(y~x),which=1)
MINTregression(x,y,5,10,w=FALSE,rnorm(10000))
# Heteroscedastic linear model
x=runif(100,min=-1.5,max=1.5); y=x+x*rnorm(100);
plot(lm(y~x),which=1)
MINTregression(x,y,5,10,w=FALSE,rnorm(10000))
# Multivariate misspecified mean linear model
x=matrix(runif(1500,min=-1.5,max=1.5),ncol=3)
y=x[,1]^3+0.3*x[,2]-0.3*x[,3]+rnorm(500)
plot(lm(y~x),which=1)
MINTregression(x,y,30,50,w=TRUE,rnorm(50000))