Package 'eummd'

Title: Efficient Univariate Maximum Mean Discrepancy
Description: Computes the maximum mean discrepancy two-sample test for univariate data using the Laplacian kernel, as described in Bodenham and Kawahara (2023) <doi:10.1007/s11222-023-10271-x>. The p-value is computed using permutations. Also includes an implementation of the robust median difference statistic 'Q_n' from Croux and Rousseeuw (1992) <doi:10.1007/978-3-662-26811-7_58>, based on Johnson and Mizoguchi (1978) <doi:10.1137/0207013>.
Authors: Dean Bodenham [aut, cre]
Maintainer: Dean Bodenham <[email protected]>
License: GPL-2 | GPL-3
Version: 0.2.0
Built: 2024-11-14 06:16:51 UTC
Source: CRAN

Help Index


Naive computation for Energy Distance

Description

Computes energy distance, and possibly a p-value. Suitable for multivariate data. Naive approach, quadratic in number of observations.

Usage

energydist(
  X,
  Y,
  pval = TRUE,
  numperm = 200,
  seednum = 0,
  alternative = c("greater", "two.sided"),
  allowzeropval = FALSE
)

Arguments

X

Matrix (or vector) of observations in first sample.

Y

Matrix (or vector) of observations in second sample.

pval

Boolean for whether to compute p-value or not.

numperm

Number of permutations. Default is 200.

seednum

Seed number for generating permutations. Default is 0, which means seed is set randomly. For values larger than 0, results will be reproducible.

alternative

A character string specifying the alternative hypothesis, which must be either "greater" (default) or "two.sided". In Gretton et al., the MMD test statistic is specified so that if it is significantly larger than zero, then the null hypothesis that the two samples come from the same distribution should be rejected. For this reason, "greater" is recommended. The test will still work in many cases with "two.sided" specified, but this could lead to problems in certain cases.

allowzeropval

A boolean specifying whether zero p-values are allowed. Default is FALSE; in that case a threshold of 0.5 / (numperm+1) is used, and if the computed p-value is smaller than this threshold it is set to this value (for example, with numperm=200 the smallest reported p-value is 0.5/201, approximately 0.0025). This avoids the possibility of zero p-values.

Details

First checks that the number of columns (dimension) of the two samples are equal. Suppose matrix X has n rows and d columns, and matrix Y has m rows; the function checks that Y has d columns (if not, it throws an error). It then flattens the matrices to vectors (or, if d=1, they are already vectors) and calls the C++ method. If the first sample has n d-dimensional observations and the second sample has m d-dimensional observations, the algorithm computes the statistic in O((n+m)^2) time.

Random seed is set for std::mt19937 and std::shuffle in C++.
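For reference, the energy distance between two samples compares average between-sample and within-sample interpoint distances. A minimal R sketch of a naive quadratic-time estimate is given below; this is an illustration only, and the exact scaling and normalisation used internally by energydist may differ.

# naive quadratic-time energy distance (illustration only)
naive_energydist <- function(X, Y) {
  X <- as.matrix(X); Y <- as.matrix(Y)
  n <- nrow(X); m <- nrow(Y)
  D <- as.matrix(dist(rbind(X, Y)))               # Euclidean interpoint distances
  Dxy <- D[1:n, (n+1):(n+m), drop=FALSE]          # between samples
  Dxx <- D[1:n, 1:n, drop=FALSE]                  # within first sample
  Dyy <- D[(n+1):(n+m), (n+1):(n+m), drop=FALSE]  # within second sample
  2 * mean(Dxy) - mean(Dxx) - mean(Dyy)
}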

Value

A list with the following elements:

pval

The p-value of the test, if it is computed (pval=TRUE).

stat

The statistic of the test, which is always computed.

References

Baringhaus, L. and Franz, C. (2004) "On a new multivariate two-sample test." Journal of Multivariate Analysis, 88(1), 190-206.

Szekely G. J. and Rizzo M. L. (2004) "Testing for equal distributions in high dimension." InterStat 5(16.10):1249-1272

Examples

X <- matrix(c(1:12), ncol=2, byrow=TRUE)
Y <- matrix(c(13:20), ncol=2, byrow=TRUE)
energydistList <- energydist(X=X, Y=Y, pval=FALSE)

#computing p-value
energydistList <- energydist(X=X, Y=Y)

#computing p-value
#using 1000 permutations and seed 1 for reproducibility.
energydistList <- energydist(X=X, Y=Y, numperm=1000, seednum=1)

euMMD: Efficient Univariate Maximum Mean Discrepancy

Description

Computes the maximum mean discrepancy statistic with the Laplacian kernel. Suitable only for univariate data. Computing the statistic alone for n observations is O(n log n), and computing the p-value for L permutations is O(n log n + Ln).

Usage

eummd(
  x,
  y,
  beta = -0.1,
  pval = TRUE,
  numperm = 200,
  seednum = 0,
  alternative = c("greater", "two.sided"),
  allowzeropval = FALSE
)

Arguments

x

Univariate vector of observations in first sample.

y

Univariate vector of observations in second sample.

beta

kernel parameter. Must be positive; if not, computes median heuristic in quadratic time. Default value is -0.1, which will force median heuristic to be used.

pval

Boolean for whether to compute p-value or not.

numperm

Number of permutations. Default is 200.

seednum

Seed number for generating permutations. Default is 0, which means seed is set randomly. For values larger than 0, results will be reproducible.

alternative

A character string specifying the alternative hypothesis, which must be either "greater" (default) or "two.sided". In Gretton et al., the MMD test statistic is specified so that if it is significantly larger than zero, then the null hypothesis that the two samples come from the same distribution should be rejected. For this reason, "greater" is recommended. The test will still work in many cases with "two.sided" specified, but this could lead to problems in certain cases.

allowzeropval

A boolean specifying whether zero p-values are allowed. Default is FALSE; in that case a threshold of 0.5 / (numperm+1) is used, and if the computed p-value is smaller than this threshold it is set to this value. This avoids the possibility of zero p-values.

Details

If the total number of observations in both samples is n, the combined sample is first sorted in O(n log n) time; the remaining steps are linear in n.

If beta is not a positive value, the median difference m is computed as follows:

m = \textnormal{median} \{ || x_i - x_j ||_1; \,\, i>j, \,\, i=1, 2, \dots, n+m, \textnormal{ and } j=1, 2, \dots, i-1 \},

where || x_i - x_j ||_1 is the 1-norm, and so if the data are univariate then

|| x_i - x_j ||_1 = |x_i - x_j|,

and finally the median heuristic is beta = 1/m. This can be computed in O(n log n) time using the algorithms of Johnson and Mizoguchi (1978) and Croux and Rousseeuw (1992); see mediandiff for references.
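As a quick illustration (not the package's internal routine), the median heuristic for a small univariate sample can be computed naively with base R:

# naive quadratic-time median heuristic (illustration only; eummd can use
# the O(n log n) algorithm of Croux and Rousseeuw / Johnson and Mizoguchi)
x <- c(7.1, 1.2, 4.3, 0.4)
y <- c(5.5, 2.6, 8.7)
z <- c(x, y)
diffs <- as.numeric(dist(z, method="manhattan"))  # all pairwise |z_i - z_j|
beta <- 1 / median(diffs)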

The Laplacian kernel k is defined as

k(x,y) = \exp [ -\beta || x - y ||_1 ].

The random seed is set for std::mt19937 and std::shuffle in C++.

Value

A list with the following elements:

pval

The p-value of the test, if it is computed (pval=TRUE). Otherwise, it is set to NA.

stat

The statistic of the test, which is always computed.

beta

The kernel parameter used in the test. If beta was not initialised or negative, this will be the median heuristic value.

References

Bodenham, D. A. and Kawahara, Y. (2023) "euMMD: efficiently computing the MMD two-sample test statistic for univariate data." Statistics and Computing, 33(5), 110.

Croux, C. and Rousseeuw, P. J. (1992), "Time-Efficient Algorithms for Two Highly Robust Estimators of Scale" In Computational Statistics: Volume 1: Proceedings of the 10th Symposium on Computational Statistics (pp. 411-428).

Johnson, D.B. and Mizoguchi, T. (1978), "Selecting the Kth Element in X + Y and X_1 + X_2 + ... + X_m", SIAM Journal on Computing, 7, 147-153.

See Also

mediandiff

Examples

x <- c(7.1, 1.2, 4.3, 0.4)
y <- c(5.5, 2.6, 8.7)
#setting the kernel parameter to be 0.1; setting seed=1 for reproducibility
mmd_list <- eummd(x, y, beta=0.1, seednum=1)

#now using median heuristic (default)
mmd_list <- eummd(x, y, seednum=1)

#now not computing the p-value, only the statistic
mmd_list <- eummd(x, y, pval=FALSE, seednum=1)

#now using a larger number of permutations 
mmd_list <- eummd(x, y, numperm=1000, seednum=1)
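The returned value is a list, so its elements can be inspected directly; for example, after the last call above:

# statistic, p-value and the kernel parameter actually used
# (here the median heuristic, since beta was not supplied)
mmd_list$stat
mmd_list$pval
mmd_list$beta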

MEA-MMD: Multivariate Efficient Approximate Maximum Mean Discrepancy

Description

Computes maximum mean discrepancy statistics with Laplacian or Gaussian kernel. Suitable for multivariate data. The multivariate problem is reduced to univariate problems, either via random projections or via interpoint distances, so that the efficient univariate method can be applied.

Usage

meammd(
  X,
  Y,
  beta = -0.1,
  pval = TRUE,
  type = c("proj", "dist"),
  numproj = 20,
  nmethod = c(2, 1),
  distpval = c("Hommel", "Fisher"),
  numperm = 200,
  seednum = 0,
  alternative = c("greater", "two.sided"),
  allowzeropval = FALSE,
  faster = TRUE
)

Arguments

X

Matrix (or vector) of observations in first sample.

Y

Matrix (or vector) of observations in second sample.

beta

kernel parameter. Must be positive; if not, computes median heuristic in quadratic time for each projection. Default value is -0.1, which will force median heuristic to be used.

pval

Boolean for whether to compute p-value or not.

type

The type of projection used. Either "proj" for random projections (default) or "dist" for interpoint distances.

numproj

Number of projections (only used if type="proj"). Default is 20.

nmethod

Norm used for interpoint distances, if type="dist". Needs to be either 2 (for two-norm, default) or 1 (for one-norm).

distpval

The p-value combination procedure if type="dist". Options are "Hommel" (default) or "Fisher". The Hommel method is preferred since the Type I error does not seem to be controlled if the Fisher method is used.

numperm

Number of permutations. Default is 200.

seednum

Seed number for generating permutations. Default is 0, which means seed is set randomly. For values larger than 0, results will be reproducible.

alternative

A character string specifying the alternative hypothesis, which must be either "greater" (default) or "two.sided". In Gretton et al., the MMD test statistic is specified so that if it is significantly larger than zero, then the null hypothesis that the two samples come from the same distribution should be rejected. For this reason, "greater" is recommended. The test will still work in many cases with "two.sided" specified, but this could lead to problems in certain cases.

allowzeropval

A boolean specifying whether zero p-values are allowed. Default is FALSE; in that case a threshold of 0.5 / (numperm+1) is used, and if the computed p-value is smaller than this threshold it is set to this value. This avoids the possibility of zero p-values.

faster

A boolean specifying whether to use the faster algorithm when computing the p-value. Default is TRUE.

Value

A list with the following elements:

pval

The p-value of the test, if it is computed (pval=TRUE). Otherwise, it is set to NA.

stat

The statistic of the test, which is only returned when type="proj"; otherwise it is set to NA.

References

Bodenham, D. A. and Kawahara, Y. (2023) "euMMD: efficiently computing the MMD two-sample test statistic for univariate data." Statistics and Computing, 33(5), 110.

Examples

X <- matrix(c(1:12), ncol=2, byrow=TRUE)
Y <- matrix(c(13:20), ncol=2, byrow=TRUE)
# using the random projections method
mmdList <- meammd(X=X, Y=Y, pval=TRUE, type="proj", numproj=50)

# using the method where distances are computed to the various points
mmdList <- meammd(X=X, Y=Y, pval=TRUE, type="dist")

Compute the median difference between pairs of values

Description

Compute the median of all differences between distinct pairs in vectors or matrices.

Usage

mediandiff(X, Y = NULL, kernel = c("Laplacian", "Gaussian"), fast = FALSE)

Arguments

X

Numeric vector of length n, or matrix with n rows.

Y

Numeric vector of length m (or matrix with m rows), or NULL.

kernel

String, either "Laplacian" or "Gaussian".

fast

Boolean; if TRUE, will run the O(N log N) algorithm, where N = n + m, but if FALSE (default) will run the naive O(N^2 log N) algorithm.

Details

The median difference is defined as follows:

Z is the combination of the X and Y values into a single vector or matrix. The number of columns is the dimension, and this needs to be equal for X and Y. Then, if the Laplacian kernel is used,

m = \textnormal{median} \{ || z_i - z_j ||_1; \,\, i>j, \,\, i=1, 2, \dots, n+m, \textnormal{ and } j=1, 2, \dots, i-1 \},

where || z_i - z_j ||_1 is the 1-norm, and so if the data are d-dimensional then

|| z_i - z_j ||_1 = \sum_{k=1}^{d} |z_{i,k} - z_{j,k}|.

If the Gaussian kernel is specified, then the square of the two-norm is used.

The median heuristic is defined as beta = 1/m.

The naive method computes all distinct pairs, of which there are N(N-1)/2 differences. These are then sorted using a standard loglinear sorting algorithm, giving O(N^2 log N) overall.

The fast O(N log N) method is from Croux and Rousseeuw (1992), which is based on Johnson and Mizoguchi (1978).
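For small inputs the naive computation can be reproduced with base R; a minimal sketch (illustration only, not the package's implementation):

# naive median difference for univariate samples
X <- c(7.1, 1.2, 4.3, 0.4)
Y <- c(5.5, 2.6, 8.7)
Z <- c(X, Y)
md_laplacian <- median(as.numeric(dist(Z, method="manhattan")))  # 1-norm differences
md_gaussian  <- median(as.numeric(dist(Z))^2)                    # squared 2-norm differences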

Value

A scalar, the median of all pairwise differences.

References

Croux, C. and Rousseeuw, P. J. (1992), "Time-Efficient Algorithms for Two Highly Robust Estimators of Scale" In Computational Statistics: Volume 1: Proceedings of the 10th Symposium on Computational Statistics (pp. 411-428).

Johnson, D.B. and Mizoguchi, T. (1978), "Selecting the Kth Element in X + Y and X_1 + X_2 + ... + X_m", SIAM Journal on Computing, 7, 147-153.

See Also

medianheuristic

Examples

X <- c(7.1, 1.2, 4.3, 0.4)
Y <- c(5.5, 2.6, 8.7)
#using fast method, Laplacian kernel, loglinear in number of observations
md <- mediandiff(X, Y, fast=TRUE)

#using fast method, Gaussian kernel, loglinear in number of observations
md <- mediandiff(X, Y, fast=TRUE, kernel="Gaussian")

#using naive method (default), with Laplacian kernel
md <- mediandiff(X, Y)

Compute the median heuristic

Description

Computes the inverse of the median difference of all distinct pairs in vectors or matrices.

Usage

medianheuristic(X, Y = NULL, kernel = c("Laplacian", "Gaussian"), fast = FALSE)

Arguments

X

Numeric vector of length n, or matrix with n rows.

Y

Numeric vector of length m (or matrix with m rows), or NULL.

kernel

String, either "Laplacian" or "Gaussian".

fast

Boolean; if TRUE, will run the O(N log N) algorithm, where N = n + m, but if FALSE (default) will run the naive O(N^2 log N) algorithm.

Details

Computes median of differences md using mediandiff and then returns 1 / md. See mediandiff for details.
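In other words, the two functions are related by a simple reciprocal; for example:

# medianheuristic is the reciprocal of mediandiff (same kernel and method)
X <- c(7.1, 1.2, 4.3, 0.4)
Y <- c(5.5, 2.6, 8.7)
md <- mediandiff(X, Y)
mh <- medianheuristic(X, Y)
all.equal(mh, 1 / md)   # expected to be TRUE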

Value

A scalar, the inverse of the median of all pairwise differences.

See Also

mediandiff

Examples

X <- c(7.1, 1.2, 4.3, 0.4)
Y <- c(5.5, 2.6, 8.7)
mh <- medianheuristic(X, Y, kernel="Laplacian", fast=TRUE)

#using fast method, Gaussian kernel, loglinear in number of observations
mh <- medianheuristic(X, Y, fast=TRUE, kernel="Gaussian")

#using naive method (default), with Laplacian kernel
mh <- medianheuristic(X, Y)

Naive computation for Maximum Mean Discrepancy

Description

Computes maximum mean discrepancy statistics with Laplacian or Gaussian kernel. Suitable for multivariate data. Naive approach, quadratic in number of observations.

Usage

mmd(
  X,
  Y,
  beta = -0.1,
  pval = TRUE,
  kernel = c("Laplacian", "Gaussian"),
  numperm = 200,
  seednum = 0,
  alternative = c("greater", "two.sided"),
  allowzeropval = FALSE
)

Arguments

X

Matrix (or vector) of observations in first sample.

Y

Matrix (or vector) of observations in second sample.

beta

kernel parameter. Must be positive; if not, computes median heuristic in quadratic time. Default value is -0.1, which will force median heuristic to be used.

pval

Boolean for whether to compute p-value or not.

kernel

String, either "Laplacian" or "Gaussian". Default is "Laplacian".

numperm

Number of permutations. Default is 200.

seednum

Seed number for generating permutations. Default is 0, which means seed is set randomly. For values larger than 0, results will be reproducible.

alternative

A character string specifying the alternative hypothesis, which must be either "greater" (default) or "two.sided". In Gretton et al., the MMD test statistic is specified so that if it is significantly larger than zero, then the null hypothesis that the two samples come from the same distribution should be rejected. For this reason, "greater" is recommended. The test will still work in many cases with "two.sided" specified, but this could lead to problems in certain cases.

allowzeropval

A boolean specifying whether zero p-values are allowed. Default is FALSE; in that case a threshold of 0.5 / (numperm+1) is used, and if the computed p-value is smaller than this threshold it is set to this value. This avoids the possibility of zero p-values.

Details

First checks that the number of columns (dimension) of the two samples are equal. Suppose matrix X has n rows and d columns, and matrix Y has m rows; the function checks that Y has d columns (if not, it throws an error). It then flattens the matrices to vectors (or, if d=1, they are already vectors) and calls the C++ method. If the first sample has n d-dimensional observations and the second sample has m d-dimensional observations, the algorithm computes the statistic in O((n+m)^2) time.

The median difference is computed as follows:

m = \textnormal{median} \{ || x_i - x_j ||_1; \,\, i>j, \,\, i=1, 2, \dots, n+m, \textnormal{ and } j=1, 2, \dots, i-1 \},

where || x_i - x_j ||_1 is the 1-norm, and so if the data are d-dimensional then

|| x_i - x_j ||_1 = \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|,

and finally the median heuristic is beta = 1/m. This can be computed in O((n+m)^2) time.

The Laplacian kernel k is defined as

k(x,y) = \exp( -\beta || x - y ||_1 ).

Random seed is set for std::mt19937 and std::shuffle in C++.
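As a point of reference, a quadratic-time MMD^2 estimate with the Laplacian kernel can be written in a few lines of R. This sketches the standard biased estimator from Gretton et al. (2012); the exact estimator and normalisation used internally by mmd may differ.

# naive quadratic-time MMD^2 with Laplacian kernel (illustration only)
naive_mmd <- function(X, Y, beta) {
  X <- as.matrix(X); Y <- as.matrix(Y)
  n <- nrow(X); m <- nrow(Y)
  K <- exp(-beta * as.matrix(dist(rbind(X, Y), method="manhattan")))  # kernel matrix
  Kxx <- K[1:n, 1:n, drop=FALSE]
  Kyy <- K[(n+1):(n+m), (n+1):(n+m), drop=FALSE]
  Kxy <- K[1:n, (n+1):(n+m), drop=FALSE]
  mean(Kxx) + mean(Kyy) - 2 * mean(Kxy)    # biased MMD^2 estimate
}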

Value

A list with the following elements:

pval

The p-value of the test, if it is computed (pval=TRUE).

stat

The statistic of the test, which is always computed.

beta

The kernel parameter used in the test. If beta was not initialised or negative, this will be the median heuristic value.

References

Gretton, A., Borgwardt, K. M., Rasch M. J., Schölkopf, B. and Smola, A. (2012) "A kernel two-sample test." The Journal of Machine Learning Research 13, no. 1, 723-773.

Examples

X <- matrix(c(1:12), ncol=2, byrow=TRUE)
Y <- matrix(c(13:20), ncol=2, byrow=TRUE)
mmdList <- mmd(X=X, Y=Y, beta=0.1, pval=FALSE)

#using median heuristic
mmdList <- mmd(X=X, Y=Y, pval=FALSE)

#using median heuristic and computing p-value
mmdList <- mmd(X=X, Y=Y)

#using median heuristic and computing p-value
#using 1000 permutations and seed 1 for reproducibility.
mmdList <- mmd(X=X, Y=Y, numperm=1000, seednum=1)