Title: | Efficient Univariate Maximum Mean Discrepancy |
---|---|
Description: | Computes maximum mean discrepancy two-sample test for univariate data using the Laplacian kernel, as described in Bodenham and Kawahara (2023) <doi:10.1007/s11222-023-10271-x>. The p-value is computed using permutations. Also includes implementation for computing the robust median difference statistic 'Q_n' from Croux and Rousseeuw (1992) <doi:10.1007/978-3-662-26811-7_58> based on Johnson and Mizoguchi (1978) <doi:10.1137/0207013>. |
Authors: | Dean Bodenham [aut, cre] |
Maintainer: | Dean Bodenham <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 0.2.0 |
Built: | 2024-11-14 06:16:51 UTC |
Source: | CRAN |
Computes energy distance, and possibly a p-value. Suitable for multivariate data. Naive approach, quadratic in number of observations.
energydist( X, Y, pval = TRUE, numperm = 200, seednum = 0, alternative = c("greater", "two.sided"), allowzeropval = FALSE )
energydist( X, Y, pval = TRUE, numperm = 200, seednum = 0, alternative = c("greater", "two.sided"), allowzeropval = FALSE )
X |
Matrix (or vector) of observations in first sample. |
Y |
Matrix (or vector) of observations in second sample. |
pval |
Boolean for whether to compute p-value or not. |
numperm |
Number of permutations. Default is |
seednum |
Seed number for generating permutations. Default is |
alternative |
A character string specifying the alternative hypothesis,
which must be either |
allowzeropval |
A boolean, specifying whether we will allow zero
p-values or not. Default is |
First checks number of columns (dimension) are equal.
Suppose matrix has
rows and
columns,
and matrix
has
rows; checks that
has
columns (if not, then throws error).
Then flattens matrices to vectors (or, if
, they are
already vectors.
Then calls C++ method. If the first sample has
-dimensional samples and the second sample has
-dimensional samples, then the algorithm
computes the statistic in
time.
Random seed is set for std::mt19937
and std::shuffle
in C++.
A list with the following elements:
pval
The p-value of the test, if it is
computed (pval=TRUE
).
stat
The statistic of the test, which is always computed.
Baringhaus L. and Franz C. (2004) "On a new multivariate two-sample test." Journal of multivariate analysis 88(1):190-206
Szekely G. J. and Rizzo M. L. (2004) "Testing for equal distributions in high dimension." InterStat 5(16.10):1249-1272
X <- matrix(c(1:12), ncol=2, byrow=TRUE) Y <- matrix(c(13:20), ncol=2, byrow=TRUE) energydistList <- energydist(X=X, Y=Y, pval=FALSE) #computing p-value energydistList <- energydist(X=X, Y=Y) #computing p-value #using 1000 permutations and seed 1 for reproducibility. energydistList <- energydist(X=X, Y=Y, numperm=1000, seednum=1)
X <- matrix(c(1:12), ncol=2, byrow=TRUE) Y <- matrix(c(13:20), ncol=2, byrow=TRUE) energydistList <- energydist(X=X, Y=Y, pval=FALSE) #computing p-value energydistList <- energydist(X=X, Y=Y) #computing p-value #using 1000 permutations and seed 1 for reproducibility. energydistList <- energydist(X=X, Y=Y, numperm=1000, seednum=1)
Computes the maximum mean discrepancy statistic with the Laplacian kernel.
Suitable only for univariate data. Computing the statistic alone
for observations is
, and computing the
p-value for
permutations is
.
eummd( x, y, beta = -0.1, pval = TRUE, numperm = 200, seednum = 0, alternative = c("greater", "two.sided"), allowzeropval = FALSE )
eummd( x, y, beta = -0.1, pval = TRUE, numperm = 200, seednum = 0, alternative = c("greater", "two.sided"), allowzeropval = FALSE )
x |
Univariate vector of observations in first sample. |
y |
Univariate vector of observations in second sample. |
beta |
kernel parameter. Must be positive; if not, computes
median heuristic in quadratic time. Default value
is |
pval |
Boolean for whether to compute p-value or not. |
numperm |
Number of permutations. Default is |
seednum |
Seed number for generating permutations. Default is |
alternative |
A character string specifying the alternative hypothesis,
which must be either |
allowzeropval |
A boolean, specifying whether we will allow zero
p-values or not. Default is |
If the total number of observations in both samples is n
,
first sort combined sample in before remaining
steps are linear in
n
.
If beta
is not a positive value,
median difference is computed as follows:
where is the 1-norm, and
so if the data are univariate then
and finally median heuristic is beta = 1/m
.
This can be computed in time
using the algorithms of Johnson and Mizoguchi (1978)
and Croux and Rousseuw (1992); see
mediandiff
for references.
The Laplacian kernel is defined as
The random seed is set for std::mt19937
and std::shuffle
in C++.
A list with the following elements:
pval
The p-value of the test, if it is
computed (pval=TRUE
). Otherwise,
it is set to NA
.
stat
The statistic of the test, which is always computed.
beta
The kernel parameter used in the test.
If beta
was not initialised or
negative, this will be the median heuristic
value.
Bodenham, D. A., and Kawahara, Y. (2023) "euMMD: efficiently computing the MMD two-sample test statistic for univariate data." Statistics and Computing 33.5 (2023): 110.
Croux, C. and Rousseeuw, P. J. (1992), "Time-Efficient Algorithms for Two Highly Robust Estimators of Scale" In Computational Statistics: Volume 1: Proceedings of the 10th Symposium on Computational Statistics (pp. 411-428).
Johnson, D.B., and Mizoguchi, T. (1978), "Selecting the Kth Element in X + Y and X_1 + X_2 + ... + X_m", SIAM Journal of Computing, 7, 147-153.
mediandiff
x <- c(7.1, 1.2, 4.3, 0.4) y <- c(5.5, 2.6, 8.7) #setting the kernel parameter to be 0.1; setting seed=1 for reproducibility mmd_list <- eummd(x, y, beta=0.1, seednum=1) #now using median heuristic (default) mmd_list <- eummd(x, y, seednum=1) #now not computing the p-value, only the statistic mmd_list <- eummd(x, y, pval=FALSE, seednum=1) #now using a larger number of permutations mmd_list <- eummd(x, y, numperm=1000, seednum=1)
x <- c(7.1, 1.2, 4.3, 0.4) y <- c(5.5, 2.6, 8.7) #setting the kernel parameter to be 0.1; setting seed=1 for reproducibility mmd_list <- eummd(x, y, beta=0.1, seednum=1) #now using median heuristic (default) mmd_list <- eummd(x, y, seednum=1) #now not computing the p-value, only the statistic mmd_list <- eummd(x, y, pval=FALSE, seednum=1) #now using a larger number of permutations mmd_list <- eummd(x, y, numperm=1000, seednum=1)
Computes maximum mean discrepancy statistics with Laplacian or Gaussian kernel. Suitable for multivariate data. Naive approach, quadratic in number of observations.
meammd( X, Y, beta = -0.1, pval = TRUE, type = c("proj", "dist"), numproj = 20, nmethod = c(2, 1), distpval = c("Hommel", "Fisher"), numperm = 200, seednum = 0, alternative = c("greater", "two.sided"), allowzeropval = FALSE, faster = TRUE )
meammd( X, Y, beta = -0.1, pval = TRUE, type = c("proj", "dist"), numproj = 20, nmethod = c(2, 1), distpval = c("Hommel", "Fisher"), numperm = 200, seednum = 0, alternative = c("greater", "two.sided"), allowzeropval = FALSE, faster = TRUE )
X |
Matrix (or vector) of observations in first sample. |
Y |
Matrix (or vector) of observations in second sample. |
beta |
kernel parameter. Must be positive; if not, computes
median heuristic in quadratic time for each projection.
Default value
is |
pval |
Boolean for whether to compute p-value or not. |
type |
The type of projection used. Either |
numproj |
Number of projections (only used if |
nmethod |
Norm used for interpoint distances, if |
distpval |
The p-value combination procedure if |
numperm |
Number of permutations. Default is |
seednum |
Seed number for generating permutations. Default is |
alternative |
A character string specifying the alternative hypothesis,
which must be either |
allowzeropval |
A boolean, specifying whether we will allow zero
p-values or not. Default is |
faster |
A boolean, specifying if to use faster algorithm
when computing p-value. Default is |
A list with the following elements:
pval
The p-value of the test, if it is
computed (pval=TRUE
). Otherwise,
it is set to NA
.
stat
The statistic of the test, which
is only returned when type="proj"
,
otherwise it is set to NA
.
Bodenham, D. A., and Kawahara, Y. (2023) "euMMD: efficiently computing the MMD two-sample test statistic for univariate data." Statistics and Computing 33.5 (2023): 110.
X <- matrix(c(1:12), ncol=2, byrow=TRUE) Y <- matrix(c(13:20), ncol=2, byrow=TRUE) # using the random projections method mmdList <- meammd(X=X, Y=Y, pval=TRUE, type="proj", numproj=50) # using the method were distances are computed to the various points mmdList <- meammd(X=X, Y=Y, pval=TRUE, type="dist")
X <- matrix(c(1:12), ncol=2, byrow=TRUE) Y <- matrix(c(13:20), ncol=2, byrow=TRUE) # using the random projections method mmdList <- meammd(X=X, Y=Y, pval=TRUE, type="proj", numproj=50) # using the method were distances are computed to the various points mmdList <- meammd(X=X, Y=Y, pval=TRUE, type="dist")
Compute the median of all differences between distinct pairs in vectors or matrices.
mediandiff(X, Y = NULL, kernel = c("Laplacian", "Gaussian"), fast = FALSE)
mediandiff(X, Y = NULL, kernel = c("Laplacian", "Gaussian"), fast = FALSE)
X |
Numeric vector or matrix of length |
Y |
Numeric vector or matrix of length |
kernel |
String, either |
fast |
Boolean; if |
The median difference is defined as follows:
is the combined
and
values into a single vector
or matrix.
Number of columns is the dimension, and these need to be equal
for
and
. Then, if the Laplacian kernel is used,
where is the 1-norm, and so if the data
are
d
-dimensional then
If the Gaussian kernel is specified, then the square of the two-norm is used.
The median heuristic is defined as beta = 1/m
.
Naive method will compute all distinct pairs, of which there are
differences. These are then sorted using
a
algorithm, so overall
.
The fast method is is from Croux and Rousseeuw (1992),
which is based on Johnson and Mizoguchi (1978).
A scalar, the median of all pairwise differences.
Croux, C. and Rousseeuw, P. J. (1992), "Time-Efficient Algorithms for Two Highly Robust Estimators of Scale" In Computational Statistics: Volume 1: Proceedings of the 10th Symposium on Computational Statistics (pp. 411-428).
Johnson, D.B., and Mizoguchi, T. (1978), "Selecting the Kth Element in X + Y and X_1 + X_2 + ... + X_m", SIAM Journal of Computing, 7, 147-153.
medianheuristic
X <- c(7.1, 1.2, 4.3, 0.4) Y <- c(5.5, 2.6, 8.7) #using fast method, Laplacian kernel, loglinear in number of observations md <- mediandiff(X, Y, fast=TRUE) #using fast method, Gaussian kernel, loglinear in number of observations md <- mediandiff(X, Y, fast=TRUE, kernel="Gaussian") #using naive method (default), with Laplacian kernel md <- mediandiff(X, Y)
X <- c(7.1, 1.2, 4.3, 0.4) Y <- c(5.5, 2.6, 8.7) #using fast method, Laplacian kernel, loglinear in number of observations md <- mediandiff(X, Y, fast=TRUE) #using fast method, Gaussian kernel, loglinear in number of observations md <- mediandiff(X, Y, fast=TRUE, kernel="Gaussian") #using naive method (default), with Laplacian kernel md <- mediandiff(X, Y)
Computes the inverse of the median difference of all distinct pairs in vectors or matrices.
medianheuristic(X, Y = NULL, kernel = c("Laplacian", "Gaussian"), fast = FALSE)
medianheuristic(X, Y = NULL, kernel = c("Laplacian", "Gaussian"), fast = FALSE)
X |
Numeric vector or matrix of length |
Y |
Numeric vector or matrix of length |
kernel |
String, either |
fast |
Boolean; if |
Computes median of differences md
using mediandiff
and then returns 1 / md
. See mediandiff
for details.
A scalar, the inverse of the median of all pairwise differences.
mediandiff
X <- c(7.1, 1.2, 4.3, 0.4) Y <- c(5.5, 2.6, 8.7) mh <- medianheuristic(X, Y, kernel="Laplacian", fast=TRUE) #using fast method, Gaussian kernel, loglinear in number of observations mh <- medianheuristic(X, Y, fast=TRUE, kernel="Gaussian") #using naive method (default), with Laplacian kernel mh <- medianheuristic(X, Y)
X <- c(7.1, 1.2, 4.3, 0.4) Y <- c(5.5, 2.6, 8.7) mh <- medianheuristic(X, Y, kernel="Laplacian", fast=TRUE) #using fast method, Gaussian kernel, loglinear in number of observations mh <- medianheuristic(X, Y, fast=TRUE, kernel="Gaussian") #using naive method (default), with Laplacian kernel mh <- medianheuristic(X, Y)
Computes maximum mean discrepancy statistics with Laplacian or Gaussian kernel. Suitable for multivariate data. Naive approach, quadratic in number of observations.
mmd( X, Y, beta = -0.1, pval = TRUE, kernel = c("Laplacian", "Gaussian"), numperm = 200, seednum = 0, alternative = c("greater", "two.sided"), allowzeropval = FALSE )
mmd( X, Y, beta = -0.1, pval = TRUE, kernel = c("Laplacian", "Gaussian"), numperm = 200, seednum = 0, alternative = c("greater", "two.sided"), allowzeropval = FALSE )
X |
Matrix (or vector) of observations in first sample. |
Y |
Matrix (or vector) of observations in second sample. |
beta |
kernel parameter. Must be positive; if not, computes
median heuristic in quadratic time. Default value
is |
pval |
Boolean for whether to compute p-value or not. |
kernel |
String, either |
numperm |
Number of permutations. Default is |
seednum |
Seed number for generating permutations. Default is |
alternative |
A character string specifying the alternative hypothesis,
which must be either |
allowzeropval |
A boolean, specifying whether we will allow zero
p-values or not. Default is |
First checks number of columns (dimension) are equal.
Suppose matrix has
rows and
columns,
and matrix
has
rows; checks that
has
columns (if not, then throws error).
Then flattens matrices to vectors (or, if
, they are
already vectors.
Then calls C++ method. If the first sample has
-dimensional samples and the second sample has
-dimensional samples, then the algorithm
computes the statistic in
time.
Median difference is as follows:
where is the 1-norm, and so if the data
are
-dimensional then
and finally median heuristic is beta = 1/m
.
This can be computed in time.
The Laplacian kernel is defined as
Random seed is set for std::mt19937
and std::shuffle
in C++.
A list with the following elements:
pval
The p-value of the test, if it is
computed (pval=TRUE
).
stat
The statistic of the test, which is always computed.
beta
The kernel parameter used in the test.
If beta
was not initialised or
negative, this will be the median heuristic
value.
Gretton, A., Borgwardt, K. M., Rasch M. J., Schölkopf, B. and Smola, A. (2012) "A kernel two-sample test." The Journal of Machine Learning Research 13, no. 1, 723-773.
X <- matrix(c(1:12), ncol=2, byrow=TRUE) Y <- matrix(c(13:20), ncol=2, byrow=TRUE) mmdList <- mmd(X=X, Y=Y, beta=0.1, pval=FALSE) #using median heuristic mmdList <- mmd(X=X, Y=Y, pval=FALSE) #using median heuristic and computing p-value mmdList <- mmd(X=X, Y=Y) #using median heuristic and computing p-value #using 1000 permutations and seed 1 for reproducibility. mmdList <- mmd(X=X, Y=Y, numperm=1000, seednum=1)
X <- matrix(c(1:12), ncol=2, byrow=TRUE) Y <- matrix(c(13:20), ncol=2, byrow=TRUE) mmdList <- mmd(X=X, Y=Y, beta=0.1, pval=FALSE) #using median heuristic mmdList <- mmd(X=X, Y=Y, pval=FALSE) #using median heuristic and computing p-value mmdList <- mmd(X=X, Y=Y) #using median heuristic and computing p-value #using 1000 permutations and seed 1 for reproducibility. mmdList <- mmd(X=X, Y=Y, numperm=1000, seednum=1)