Title: | Computing P-Values of the One-Sample K-S Test and the Two-Sample K-S and Kuiper Tests for (Dis)Continuous Null Distribution |
---|---|
Description: | Contains functions to compute p-values for the one-sample and two-sample Kolmogorov-Smirnov (KS) tests and the two-sample Kuiper test for any fixed critical level and arbitrary (possibly very large) sample sizes. For the one-sample KS test, this package implements a novel, accurate and efficient method named Exact-KS-FFT, which allows the pre-specified cumulative distribution function under the null hypothesis to be continuous, purely discrete or mixed. In the two-sample case, it is assumed that both samples come from an unspecified (unknown) continuous, purely discrete or mixed distribution, i.e. ties (repeated observations) are allowed, and exact p-values of the KS and the Kuiper tests are computed. Note, the two-sample Kuiper test is often used when data samples are on the line or on the circle (circular data). To cite this package in publication: (for the use of the one-sample KS test) Dimitrina S. Dimitrova, Vladimir K. Kaishev, and Senren Tan. Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed, or Continuous. Journal of Statistical Software. 2020; 95(10): 1--42. <doi:10.18637/jss.v095.i10>. (for the use of the two-sample KS and Kuiper tests) Dimitrina S. Dimitrova, Yun Jia and Vladimir K. Kaishev (2024). The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests. submitted. |
Authors: | Dimitrina S. Dimitrova <[email protected]>, Yun Jia <[email protected]>, Vladimir K. Kaishev <[email protected]>, Senren Tan <[email protected]> |
Maintainer: | Dimitrina S. Dimitrova <[email protected]> |
License: | GPL (>= 2.0) |
Version: | 2.0.2 |
Built: | 2024-11-27 06:42:12 UTC |
Source: | CRAN |
This package computes p-values of the one-sample and two-sample Kolmogorov-Smirnov (KS) tests and the two-sample Kuiper test.
The one-sample two-sided Kolmogorov-Smirnov (KS) statistic is one of the most popular goodness-of-fit test statistics, used to measure how well the distribution of a random sample agrees with a prespecified theoretical distribution. Given a random sample {X_1, ..., X_n} of size n with an empirical cdf F_n(x), the two-sided KS statistic is defined as D_n = sup_x |F_n(x) - F(x)|, where F(x) is the cdf of the prespecified theoretical distribution under the null hypothesis H_0 that {X_1, ..., X_n} comes from F(x).
The package KSgeneral implements a novel, accurate and efficient Fast Fourier Transform (FFT)-based method, referred to as the Exact-KS-FFT method, to compute the complementary cdf, P(D_n >= q), at a fixed q, 0 <= q <= 1, for a given (hypothesized) purely discrete, mixed or continuous underlying cdf F(x), and arbitrary, possibly very large sample size n.
A plot of the complementary cdf P(D_n >= q), 0 <= q <= 1, can also be produced.
In other words, the package computes the p-value P(D_n >= q) for any fixed critical level q.
If an observed (data) sample, {x_1, ..., x_n}, is supplied, KSgeneral computes the p-value P(D_n >= d_n), where d_n is the value of the KS test statistic computed based on {x_1, ..., x_n}. One can also compute the (complementary) cdf for the one-sided KS statistics D_n^- or D_n^+ (cf. Dimitrova, Kaishev, Tan (2020)) by appropriately specifying, correspondingly, A_i = 0 for all i or B_i = 1 for all i, in the function ks_c_cdf_Rcpp.
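The definition above can be made concrete with a short base-R sketch (illustrative only; it does not use KSgeneral itself). The statistic D_n for a sample tested against a hypothesised continuous cdf F(x) = pexp(x) is computed directly from the order statistics and checked against the statistic reported by stats::ks.test:

```r
## Compute D_n = sup_x |F_n(x) - F(x)| by hand for a continuous null cdf
## F(x) = pexp(x), using the fact that the supremum is attained at the
## jump points of the empirical cdf F_n.
set.seed(1)
x  <- rexp(50)                      # observed sample, n = 50
n  <- length(x)
F0 <- pexp(sort(x))                 # null cdf at the order statistics
D_plus  <- max((1:n)/n - F0)        # one-sided statistic D_n^+
D_minus <- max(F0 - (0:(n - 1))/n)  # one-sided statistic D_n^-
D_n <- max(D_plus, D_minus)         # two-sided statistic D_n
## agrees with the statistic reported by stats::ks.test
all.equal(D_n, unname(ks.test(x, "pexp")$statistic))
```

The package's contribution is not the statistic itself (which is elementary, as above) but the exact distribution P(D_n >= q) for discontinuous F(x) and large n.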
The two-sample Kolmogorov-Smirnov (KS) and the Kuiper statistics are widely used to test the null hypothesis (H_0) that two data samples come from the same underlying distribution. Consider a pair of random samples X_m = (X_1, ..., X_m) and Y_n = (Y_1, ..., Y_n) of sizes m and n, with empirical cdfs F_m(t) and G_n(t) respectively, coming from some unknown cdfs F(x) and G(x). It is assumed that F(x) and G(x) could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. We want to test the null hypothesis H_0: F(x) = G(x) for all x, either against the alternative hypothesis H_1: F(x) != G(x) for at least one x, which corresponds to the two-sided test, or against H_1: F(x) > G(x) and H_1: F(x) < G(x) for at least one x, which correspond to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics that are used to test these hypotheses are generally defined as:
D_{m,n} = sup_t |(F_m(t) - G_n(t)) W(E_{m+n}(t))| for the two-sided test,
where E_{m+n}(t) is the empirical cdf of the pooled sample Z_{m,n} = (X_1, ..., X_m, Y_1, ..., Y_n) and W( ) is a strictly positive weight function defined on [0, 1]. KSgeneral implements an exact algorithm which is an extension of the Fortran 77 subroutine due to Nikiforov (1994), to calculate the exact p-value P(D_{m,n} >= q), where q in [0, 1] and D_{m,n} is the two-sample Kolmogorov-Smirnov goodness-of-fit statistic defined on the space Omega of all possible choose(m+n, m) pairs of samples, X'_m and Y'_n, of sizes m and n, that are randomly drawn from the pooled sample Z_{m,n} without replacement. If two data samples X_m and Y_n are supplied, the package computes P(D_{m,n} >= d), where d is the observed value of D_{m,n} computed based on these two observed samples. Samples may come from any continuous, discrete or mixed distribution, i.e. the test allows repeated observations to appear in the user-provided data samples X_m, Y_n and their pooled sample Z_{m,n}.
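The pooled-sample construction can be sketched in base R (illustrative only; the variable names are ours and KSgeneral itself is not used): the unweighted two-sided statistic sup_t |F_m(t) - G_n(t)| evaluated over the pooled sample, together with the size choose(m+n, m) of the space Omega of equally likely splits:

```r
## Unweighted two-sample KS statistic evaluated over the pooled sample,
## and the number choose(m+n, m) of equally likely splits forming Omega.
set.seed(2)
x <- rnorm(8)
y <- rnorm(12, mean = 0.5)
m <- length(x); n <- length(y)
pooled <- sort(c(x, y))
D_mn <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))
choose(m + n, m)        # size of Omega: 125970 possible splits
## agrees with the two-sample statistic of stats::ks.test
all.equal(D_mn, unname(ks.test(x, y)$statistic))
```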
The two-sample (unweighted) Kuiper goodness-of-fit statistic is defined as:
V_{m,n} = sup_t (F_m(t) - G_n(t)) + sup_t (G_n(t) - F_m(t)).
It is widely used when the data samples are periodic or circular (data that are measured in radians). KSgeneral calculates the exact p-value P(V_{m,n} >= q), where q in [0, 2] and V_{m,n} is the two-sample Kuiper goodness-of-fit statistic defined on the space Omega, as described above. If two data samples X_m and Y_n are supplied, the package computes P(V_{m,n} >= v), where v is the observed value of V_{m,n} computed based on these two observed samples. Similarly, as for the KS test, the two-sample Kuiper test also allows repeated observations in the user-provided data samples X_m, Y_n and their pooled sample Z_{m,n}.
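The Kuiper statistic can likewise be evaluated directly over the pooled sample. A base-R sketch (illustrative only), using discrete samples so that ties occur, as the test permits:

```r
## Two-sample Kuiper statistic V_{m,n} = sup(F_m - G_n) + sup(G_n - F_m),
## evaluated over the distinct pooled observations; ties are allowed.
set.seed(3)
x <- sample(1:10, 8, replace = TRUE)    # discrete samples with
y <- sample(1:10, 12, replace = TRUE)   # repeated observations
pts <- sort(unique(c(x, y)))
d <- ecdf(x)(pts) - ecdf(y)(pts)
V_mn <- max(d) + max(-d)                # takes values in [0, 2]
```

Both suprema are nonnegative (the difference of the two empirical cdfs is 0 beyond the largest pooled observation), so V_mn is always at least as large as the two-sided KS statistic max(abs(d)).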
One-sample KS test:
The Exact-KS-FFT method to compute p-values of the one-sample KS test in KSgeneral is based on expressing the p-value P(D_n >= q) in terms of an appropriate rectangle probability with respect to the uniform order statistics, as noted by Gleser (1985) for P(D_n < q).
The latter representation is used to express P(D_n >= q) via a double-boundary non-crossing probability for a homogeneous Poisson process, with intensity n, which is then efficiently computed using FFT, ensuring a total run-time of order O(n^2 log(n)) (see Dimitrova, Kaishev, Tan (2020) and also Moscovich and Nadler (2017) for the special case when F(x) is continuous).
The code for the one-sample KS test in KSgeneral represents an R wrapper of the original C++ code due to Dimitrova, Kaishev, Tan (2020), which is based on the C++ code developed by Moscovich and Nadler (2017).
The package includes the functions disc_ks_c_cdf, mixed_ks_c_cdf and cont_ks_c_cdf that compute the complementary cdf P(D_n >= q), for a fixed q, 0 <= q <= 1, when F(x) is purely discrete, mixed or continuous, respectively.
KSgeneral also includes the functions disc_ks_test, mixed_ks_test and cont_ks_test that compute the p-value P(D_n >= d_n), where d_n is the value of the KS test statistic computed based on a user-provided data sample {x_1, ..., x_n}, when F(x) is purely discrete, mixed or continuous, respectively.
The functions disc_ks_test and cont_ks_test represent accurate and fast (run time O(n^2 log(n))) alternatives to the function ks.test from the package dgof and the function ks.test from the package stats, which compute p-values of D_n, assuming F(x) is purely discrete or continuous, respectively.
The package also includes the function ks_c_cdf_Rcpp which gives the flexibility to compute the complementary cdf (p-value) for the one-sided KS test statistics D_n^- or D_n^+.
It also allows for faster computation time and possibly higher accuracy in computing P(D_n >= q).
Two-sample KS test and Kuiper test:
The method underlying the computation of p-values of the two-sample KS and Kuiper tests in KSgeneral is an extension of the algorithm due to Nikiforov (1994) and is based on expressing the p-value as the probability that a point sequence stays within a certain region in the two-dimensional integer-valued lattice. The algorithm for both tests uses a recursive formula to calculate the total number of point sequences within the region, which is then divided by the total number of elements in Omega, i.e. choose(m+n, m), to obtain the probability.
For a particular realization of the pooled sample Z_{m,n}, the p-values calculated by the functions KS2sample and Kuiper2sample are the probabilities P(D_{m,n} >= q) and P(V_{m,n} >= q), where D_{m,n} and V_{m,n} are the two-sample Kolmogorov-Smirnov and Kuiper test statistics respectively, for two samples X'_m and Y'_n of sizes m and n, randomly drawn from the pooled sample without replacement, i.e. they are defined on the space Omega, and q in [0, 1] for the KS test, q in [0, 2] for the Kuiper test.
Both KS2sample and Kuiper2sample implement algorithms which generalize the method due to Nikiforov (1994), and calculate the exact p-values of the KS test and the Kuiper test, respectively. Both allow the tested data samples to come from continuous, discrete or mixed distributions (ties are also allowed).
KS2sample ensures a total worst-case run-time of order O(nm). Compared with other known algorithms, it not only allows more flexible choices of weights leading to better power (see Dimitrova, Jia, Kaishev 2024), but is also more efficient and more generally applicable for large sample sizes.
Kuiper2sample is accurate and valid for large sample sizes. It ensures a total worst-case run-time of order O((mn)^2). When m and n have a large greatest common divisor (an extreme case is m = n), it ensures a total worst-case run-time of order O(m^2 n).
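For very small m and n, the probability obtained by this counting argument can be reproduced naively in base R by enumerating all choose(m+n, m) splits of the pooled sample (the helper names below are ours; this brute force is only feasible for tiny samples, which is exactly why the recursive lattice algorithm matters):

```r
## Naive exact two-sample p-value P(D_{m,n} >= d): enumerate every split
## of the pooled sample into sizes m and n, compute the KS statistic for
## each, and take the proportion at least as large as the observed value.
ks_stat <- function(x, y, pts) max(abs(ecdf(x)(pts) - ecdf(y)(pts)))
exact_p <- function(x, y) {
  m <- length(x)
  pooled <- c(x, y)
  pts <- sort(unique(pooled))
  d_obs <- ks_stat(x, y, pts)
  splits <- combn(length(pooled), m)   # all elements of Omega
  stats <- apply(splits, 2, function(idx)
    ks_stat(pooled[idx], pooled[-idx], pts))
  mean(stats >= d_obs - 1e-12)         # tolerance guards rounding
}
exact_p(c(1, 2, 3), c(4, 5, 6))        # fully separated samples: 2/20 = 0.1
```

Only the two splits that keep the samples completely separated reach the observed statistic d = 1, hence the classic p-value 2/choose(6, 3) = 0.1.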
Dimitrina S. Dimitrova <[email protected]>, Yun Jia <[email protected]>, Vladimir K. Kaishev <[email protected]>, Senren Tan <[email protected]>
Maintainer: Dimitrina S. Dimitrova <[email protected]>
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Gleser L.J. (1985). "Exact Power of Goodness-of-Fit Tests of Kolmogorov Type for Discontinuous Distributions". Journal of the American Statistical Association, 80(392), 954-958.
Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
Computes the complementary cdf P(D_n >= q) at a fixed q, 0 <= q <= 1, for the one-sample two-sided Kolmogorov-Smirnov statistic, D_n, for a given sample size n, when the cdf F(x) under the null hypothesis is continuous.
cont_ks_c_cdf(q, n)
q |
numeric value between 0 and 1, at which the complementary cdf P(D_n >= q) is computed |
n |
the sample size |
Given a random sample {X_1, ..., X_n} of size n with an empirical cdf F_n(x), the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as D_n = sup_x |F_n(x) - F(x)|, where F(x) is the cdf of a prespecified theoretical distribution under the null hypothesis H_0 that {X_1, ..., X_n} comes from F(x).
The function cont_ks_c_cdf implements the FFT-based algorithm proposed by Moscovich and Nadler (2017) to compute the complementary cdf, P(D_n >= q), at a value q, when F(x) is continuous.
This algorithm ensures a total worst-case run-time of order O(n^2 log(n)), which makes it more efficient and numerically stable than the algorithm proposed by Marsaglia et al. (2003).
The latter is used by many existing packages computing the cdf of D_n, e.g., the function ks.test in the package stats and the function ks.test in the package dgof.
More precisely, in these packages, the exact p-value, P(D_n >= q), is computed only in the case when q = d_n, where d_n is the value of the KS test statistic computed based on a user-provided sample {x_1, ..., x_n}.
Another limitation of the functions ks.test is that the sample size should be less than 100, and the computation time is O(n^3).
In contrast, the function cont_ks_c_cdf provides results with at least 10 correct digits after the decimal point for sample sizes n up to 100000 and computation time of 16 seconds on a machine with a 2.5GHz Intel Core i5 processor with 4GB RAM, running macOS X Yosemite.
For n > 100000, results can still be computed with similar accuracy, but at a higher computation time.
See Dimitrova, Kaishev, Tan (2020), Appendix C for further details and examples.
Numeric value corresponding to P(D_n >= q).
Based on the C++ code available at https://github.com/mosco/crossing-probability developed by Moscovich and Nadler (2017). See also Dimitrova, Kaishev, Tan (2020) for more details.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Marsaglia G., Tsang WW., Wang J. (2003). "Evaluating Kolmogorov's Distribution". Journal of Statistical Software, 8(18), 1-4.
Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.
## Compute the value for P(D_{100} >= 0.05)
KSgeneral::cont_ks_c_cdf(0.05, 100)

## Compute P(D_{n} >= q)
## for n = 100, q = 1/500, 2/500, ..., 500/500
## and then plot the corresponding values against q
n <- 100
q <- 1:500/500
plot(q, sapply(q, function(x) KSgeneral::cont_ks_c_cdf(x, n)), type='l')

## Compute P(D_{n} >= q) for n = 141, nq^{2} = 2.1 as shown
## in Table 18 of Dimitrova, Kaishev, Tan (2020)
KSgeneral::cont_ks_c_cdf(sqrt(2.1/141), 141)
Computes the cdf P(D_n <= q) at a fixed q, 0 <= q <= 1, for the one-sample two-sided Kolmogorov-Smirnov statistic, D_n, for a given sample size n, when the cdf F(x) under the null hypothesis is continuous.
cont_ks_cdf(q, n)
q |
numeric value between 0 and 1, at which the cdf P(D_n <= q) is computed |
n |
the sample size |
Given a random sample {X_1, ..., X_n} of size n with an empirical cdf F_n(x), the Kolmogorov-Smirnov goodness-of-fit statistic is defined as D_n = sup_x |F_n(x) - F(x)|, where F(x) is the cdf of a prespecified theoretical distribution under the null hypothesis H_0 that {X_1, ..., X_n} comes from F(x).
The function cont_ks_cdf implements the FFT-based algorithm proposed by Moscovich and Nadler (2017) to compute the cdf P(D_n <= q) at a value q, when F(x) is continuous.
This algorithm ensures a total worst-case run-time of order O(n^2 log(n)), which makes it more efficient and numerically stable than the algorithm proposed by Marsaglia et al. (2003).
The latter is used by many existing packages computing the cdf of D_n, e.g., the function ks.test in the package stats and the function ks.test in the package dgof.
More precisely, in these packages, the exact p-value, P(D_n >= q), is computed only in the case when q = d_n, where d_n is the value of the KS statistic computed based on a user-provided sample {x_1, ..., x_n}.
Another limitation of the functions ks.test is that the sample size should be less than 100, and the computation time is O(n^3).
In contrast, the function cont_ks_cdf provides results with at least 10 correct digits after the decimal point for sample sizes n up to 100000 and computation time of 16 seconds on a machine with a 2.5GHz Intel Core i5 processor with 4GB RAM, running macOS X Yosemite.
For n > 100000, results can still be computed with similar accuracy, but at a higher computation time.
See Dimitrova, Kaishev, Tan (2020), Appendix B for further details and examples.
Numeric value corresponding to P(D_n <= q).
Based on the C++ code available at https://github.com/mosco/crossing-probability developed by Moscovich and Nadler (2017). See also Dimitrova, Kaishev, Tan (2020) for more details.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Marsaglia G., Tsang WW., Wang J. (2003). "Evaluating Kolmogorov's Distribution". Journal of Statistical Software, 8(18), 1-4.
Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.
## Compute the value for P(D_{100} <= 0.05)
KSgeneral::cont_ks_cdf(0.05, 100)

## Compute P(D_{n} <= q)
## for n = 100, q = 1/500, 2/500, ..., 500/500
## and then plot the corresponding values against q
n <- 100
q <- 1:500/500
plot(q, sapply(q, function(x) KSgeneral::cont_ks_cdf(x, n)), type='l')

## Compute P(D_{n} <= q) for n = 40, nq^{2} = 0.76 as shown
## in Table 9 of Dimitrova, Kaishev, Tan (2020)
KSgeneral::cont_ks_cdf(sqrt(0.76/40), 40)
Computes the p-value P(D_n >= d_n), where d_n is the value of the KS test statistic computed based on a data sample {x_1, ..., x_n}, when F(x) is continuous.
cont_ks_test(x, y, ...)
x |
a numeric vector of data sample values {x_1, ..., x_n} |
y |
a pre-specified continuous cdf, F(x), under the null hypothesis, given as a character string naming a continuous cdf, e.g. "pexp" |
... |
values of the parameters of the cdf, F(x), specified by y |
Given a random sample {X_1, ..., X_n} of size n with an empirical cdf F_n(x), the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as D_n = sup_x |F_n(x) - F(x)|, where F(x) is the cdf of a prespecified theoretical distribution under the null hypothesis H_0 that {X_1, ..., X_n} comes from F(x).
The function cont_ks_test implements the FFT-based algorithm proposed by Moscovich and Nadler (2017) to compute the p-value P(D_n >= d_n), where d_n is the value of the KS test statistic computed based on a user-provided data sample {x_1, ..., x_n}, assuming F(x) is continuous.
This algorithm ensures a total worst-case run-time of order O(n^2 log(n)), which makes it more efficient and numerically stable than the algorithm proposed by Marsaglia et al. (2003).
The latter is used by many existing packages computing the cdf of D_n, e.g., the function ks.test in the package stats and the function ks.test in the package dgof.
A limitation of the functions ks.test is that the sample size should be less than 100, and the computation time is O(n^3).
In contrast, the function cont_ks_test provides results with at least 10 correct digits after the decimal point for sample sizes n up to 100000 and computation time of 16 seconds on a machine with a 2.5GHz Intel Core i5 processor with 4GB RAM, running macOS X Yosemite.
For n > 100000, results can still be computed with similar accuracy, but at a higher computation time.
See Dimitrova, Kaishev, Tan (2020), Appendix C for further details and examples.
A list with class "htest" containing the following components:
statistic |
the value of the statistic. |
p.value |
the p-value of the test. |
alternative |
"two-sided". |
data.name |
a character string giving the name of the data. |
Based on the C++ code available at https://github.com/mosco/crossing-probability developed by Moscovich and Nadler (2017). See also Dimitrova, Kaishev, Tan (2020) for more details.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.
## Comparing the p-values obtained by stats::ks.test
## and KSgeneral::cont_ks_test
x <- abs(rnorm(100))
p.kt <- ks.test(x, "pexp", exact = TRUE)$p
p.kt_fft <- KSgeneral::cont_ks_test(x, "pexp")$p
abs(p.kt - p.kt_fft)
Computes the complementary cdf, P(D_n >= q), at a fixed q, 0 <= q <= 1, of the one-sample two-sided Kolmogorov-Smirnov (KS) statistic, when the cdf F(x) under the null hypothesis is purely discrete, using the Exact-KS-FFT method, expressing the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)).
Moreover, for comparison purposes, disc_ks_c_cdf gives, as an option, the possibility to compute (an approximate value for) the asymptotic P(D_n >= q) using the simulation-based algorithm of Wood and Altavela (1978).
disc_ks_c_cdf(q, n, y, ..., exact = NULL, tol = 1e-08, sim.size = 1e+06, num.sim = 10)
q |
numeric value between 0 and 1, at which the complementary cdf P(D_n >= q) is computed |
n |
the sample size |
y |
a pre-specified discrete cdf, F(x), under the null hypothesis, specified as a stepfun() object |
... |
values of the parameters of the cdf, F(x), under the null hypothesis |
exact |
logical variable specifying whether one wants to compute the exact value P(D_n >= q) using the Exact-KS-FFT method (exact = TRUE) or an approximate asymptotic value based on the algorithm of Wood and Altavela (1978) (exact = FALSE); by default, exact = NULL |
tol |
the value of the tolerance parameter used by the Exact-KS-FFT method; by default, tol = 1e-08 |
sim.size |
the required number of simulated trajectories in order to produce one Monte Carlo estimate (one MC run) of the asymptotic complementary cdf using the algorithm of Wood and Altavela (1978); by default, sim.size = 1e+06 |
num.sim |
the number of MC runs, each producing one estimate (based on sim.size simulated trajectories) of the asymptotic complementary cdf, which are then averaged to produce the final estimate; by default, num.sim = 10 |
Given a random sample {X_1, ..., X_n} of size n with an empirical cdf F_n(x), the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as D_n = sup_x |F_n(x) - F(x)|, where F(x) is the cdf of a prespecified theoretical distribution under the null hypothesis H_0 that {X_1, ..., X_n} comes from F(x).
The function disc_ks_c_cdf implements the Exact-KS-FFT method, proposed by Dimitrova, Kaishev, Tan (2020), to compute the complementary cdf P(D_n >= q) at a value q, when F(x) is purely discrete.
This algorithm ensures a total worst-case run-time of order O(n^2 log(n)), which makes it more efficient and numerically stable than the only alternative algorithm developed by Arnold and Emerson (2011) and implemented as the function ks.test in the package dgof.
The latter only computes a p-value P(D_n >= d_n), corresponding to the value of the KS test statistic d_n computed based on a user-provided sample {x_1, ..., x_n}.
More precisely, in the package dgof (function ks.test), the p-value for a one-sample two-sided KS test is calculated by combining the approaches of Gleser (1985) and Niederhausen (1981). However, the function ks.test only provides exact p-values for n <= 30, since, as noted by the authors (see Arnold and Emerson (2011)), when n is large, numerical instabilities may occur. In the latter case, ks.test uses simulation to approximate p-values, which may be rather slow and inaccurate (see Table 6 of Dimitrova, Kaishev, Tan (2020)).
Thus, making use of the Exact-KS-FFT method, the function disc_ks_c_cdf provides an exact and highly computationally efficient (alternative) way of computing P(D_n >= q) at a value q, when F(x) is purely discrete.
Lastly, incorporated into the function disc_ks_c_cdf is the MC simulation-based method of Wood and Altavela (1978) for estimating the asymptotic complementary cdf of D_n. The latter method is the default method behind disc_ks_c_cdf when the sample size n is greater than 100000.
Numeric value corresponding to P(D_n >= q).
Arnold T.A., Emerson J.W. (2011). "Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions". The R Journal, 3(2), 34-39.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Gleser L.J. (1985). "Exact Power of Goodness-of-Fit Tests of Kolmogorov Type for Discontinuous Distributions". Journal of the American Statistical Association, 80(392), 954-958.
Niederhausen H. (1981). "Sheffer Polynomials for Computing Exact Kolmogorov-Smirnov and Renyi Type Distributions". The Annals of Statistics, 58-64.
Wood C.L., Altavela M.M. (1978). "Large-Sample Results for Kolmogorov-Smirnov Statistics for Discrete Distributions". Biometrika, 65(1), 235-239.
## Example to compute the exact complementary cdf for D_{n}
## when the underlying cdf F(x) is a binomial(3, 0.5) distribution,
## as shown in Example 3.4 of Dimitrova, Kaishev, Tan (2020)
binom_3 <- stepfun(c(0:3), c(0, pbinom(0:3, 3, 0.5)))
KSgeneral::disc_ks_c_cdf(0.05, 400, binom_3)

## Not run:
## Compute P(D_{n} >= q) for n = 100,
## q = 1/5000, 2/5000, ..., 5000/5000, when
## the underlying cdf F(x) is a binomial(3, 0.5) distribution,
## as shown in Example 3.4 of Dimitrova, Kaishev, Tan (2020),
## and then plot the corresponding values against q,
## i.e. plot the resulting complementary cdf of D_{n}
n <- 100
q <- 1:5000/5000
binom_3 <- stepfun(c(0:3), c(0, pbinom(0:3, 3, 0.5)))
plot(q, sapply(q, function(x) KSgeneral::disc_ks_c_cdf(x, n, binom_3)), type='l')
## End(Not run)

## Not run:
## Example to compute the asymptotic complementary cdf for D_{n}
## based on Wood and Altavela (1978),
## when the underlying cdf F(x) is a binomial(3, 0.5) distribution,
## as shown in Example 3.4 of Dimitrova, Kaishev, Tan (2020)
binom_3 <- stepfun(c(0:3), c(0, pbinom(0:3, 3, 0.5)))
KSgeneral::disc_ks_c_cdf(0.05, 400, binom_3, exact = FALSE, tol = 1e-08,
                         sim.size = 1e+06, num.sim = 10)
## End(Not run)
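The simulation-based alternative can also be sketched in base R (illustrative only; this is plain Monte Carlo over samples drawn under H_0, not Wood and Altavela's asymptotic algorithm):

```r
## Direct Monte Carlo estimate of P(D_n >= q) when F(x) is the
## binomial(3, 0.5) cdf: draw samples from F, compute D_n for each,
## and average the exceedances.
set.seed(5)
n <- 400; q <- 0.05
pts <- 0:3                                # support of binomial(3, 0.5)
F0  <- pbinom(pts, 3, 0.5)
D_one <- function() {
  x <- rbinom(n, 3, 0.5)                  # sample drawn under H_0
  ## both step functions jump only on pts, so the sup is attained there
  max(abs(ecdf(x)(pts) - F0))
}
p_mc <- mean(replicate(5000, D_one()) >= q)
p_mc                                      # crude estimate of P(D_400 >= 0.05)
```

Such an estimate carries Monte Carlo error of order 1/sqrt(5000); the point of disc_ks_c_cdf is that the same quantity is available exactly and far faster.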
Computes the p-value P(D_n >= d_n), where d_n is the value of the KS test statistic computed based on a data sample {x_1, ..., x_n}, when F(x) is purely discrete, using the Exact-KS-FFT method, expressing the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)).
disc_ks_test(x, y, ..., exact = NULL, tol = 1e-08, sim.size = 1e+06, num.sim = 10)
x |
a numeric vector of data sample values {x_1, ..., x_n} |
y |
a pre-specified discrete cdf, F(x), under the null hypothesis, specified as a stepfun() object |
... |
values of the parameters of the cdf, F(x), under the null hypothesis |
exact |
logical variable specifying whether one wants to compute the exact p-value P(D_n >= d_n) using the Exact-KS-FFT method (exact = TRUE) or an approximate asymptotic p-value based on the algorithm of Wood and Altavela (1978) (exact = FALSE); by default, exact = NULL |
tol |
the value of the tolerance parameter used by the Exact-KS-FFT method; by default, tol = 1e-08 |
sim.size |
the required number of simulated trajectories in order to produce one Monte Carlo estimate (one MC run) of the asymptotic p-value using the algorithm of Wood and Altavela (1978); by default, sim.size = 1e+06 |
num.sim |
the number of MC runs, each producing one estimate (based on sim.size simulated trajectories) of the asymptotic p-value, which are then averaged to produce the final estimate; by default, num.sim = 10 |
Given a random sample {X_1, ..., X_n} of size n with an empirical cdf F_n(x), the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as D_n = sup_x |F_n(x) - F(x)|, where F(x) is the cdf of a prespecified theoretical distribution under the null hypothesis H_0 that {X_1, ..., X_n} comes from F(x).
The function disc_ks_test implements the Exact-KS-FFT method, expressing the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)).
It represents an accurate and fast (run time O(n^2 log(n))) alternative to the function ks.test from the package dgof, which computes a p-value P(D_n >= d_n), where d_n is the value of the KS test statistic computed based on a user-provided data sample {x_1, ..., x_n}, assuming F(x) is purely discrete.
In the function ks.test, the p-value for a one-sample two-sided KS test is calculated by combining the approaches of Gleser (1985) and Niederhausen (1981). However, the function ks.test due to Arnold and Emerson (2011) only provides exact p-values for n <= 30, since, as noted by the authors, when n is large, numerical instabilities may occur. In the latter case, ks.test uses simulation to approximate p-values, which may be rather slow and inaccurate (see Table 6 of Dimitrova, Kaishev, Tan (2020)).
Thus, making use of the Exact-KS-FFT method, the function disc_ks_test provides an exact and highly computationally efficient (alternative) way of computing the p-value P(D_n >= d_n), when F(x) is purely discrete.
Lastly, incorporated into the function disc_ks_test is the MC simulation-based method of Wood and Altavela (1978) for estimating the asymptotic p-value of D_n. The latter method is the default method behind disc_ks_test when the sample size n is greater than 100000.
A list with class "htest" containing the following components:
statistic |
the value of the statistic. |
p.value |
the p-value of the test. |
alternative |
"two-sided". |
data.name |
a character string giving the name of the data. |
Arnold T.A., Emerson J.W. (2011). "Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions". The R Journal, 3(2), 34-39.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Gleser L.J. (1985). "Exact Power of Goodness-of-Fit Tests of Kolmogorov Type for Discontinuous Distributions". Journal of the American Statistical Association, 80(392), 954-958.
Niederhausen H. (1981). "Sheffer Polynomials for Computing Exact Kolmogorov-Smirnov and Renyi Type Distributions". The Annals of Statistics, 58-64.
Wood C.L., Altavela M.M. (1978). "Large-Sample Results for Kolmogorov-Smirnov Statistics for Discrete Distributions". Biometrika, 65(1), 235-239.
# Comparison of results obtained from dgof::ks.test
# and KSgeneral::disc_ks_test, when F(x) follows the discrete
# Uniform[1, 10] distribution as in Example 3.5 of
# Dimitrova, Kaishev, Tan (2020)

# When the sample size is larger than 100, the
# function dgof::ks.test will be numerically unstable
x3 <- sample(1:10, 25, replace = TRUE)
KSgeneral::disc_ks_test(x3, ecdf(1:10), exact = TRUE)
dgof::ks.test(x3, ecdf(1:10), exact = TRUE)
KSgeneral::disc_ks_test(x3, ecdf(1:10), exact = TRUE)$p -
  dgof::ks.test(x3, ecdf(1:10), exact = TRUE)$p

x4 <- sample(1:10, 500, replace = TRUE)
KSgeneral::disc_ks_test(x4, ecdf(1:10), exact = TRUE)
dgof::ks.test(x4, ecdf(1:10), exact = TRUE)
KSgeneral::disc_ks_test(x4, ecdf(1:10), exact = TRUE)$p -
  dgof::ks.test(x4, ecdf(1:10), exact = TRUE)$p

# Using stepfun() to specify the same discrete distribution
# as defined by ecdf():
steps <- stepfun(1:10, cumsum(c(0, rep(0.1, 10))))
KSgeneral::disc_ks_test(x3, steps, exact = TRUE)
Function calling directly the C++ routines that compute the complementary cdf P(D_n >= q) for the one-sample two-sided Kolmogorov-Smirnov statistic, given the sample size n
and the file "Boundary_Crossing_Time.txt" in the working directory.
The latter file contains the values A_i and B_i, i = 1, ..., n, specified in Steps 1 and 2 of the Exact-KS-FFT method (see Equation (5) in Section 2 of Dimitrova, Kaishev, Tan (2020)).
The latter values form the n-dimensional rectangular region {A_i <= u_i <= B_i, i = 1, ..., n} for the uniform order statistics U_(1) <= ... <= U_(n) (see Equations (3), (5) and (6) in Dimitrova, Kaishev, Tan (2020)); the upper boundary values B_i and the lower boundary values A_i are determined by q together with, correspondingly, the left- and right-continuous versions of the null cdf F(x).
Note that one can also compute the (complementary) cdf for the one-sided KS statistics D_n^+ or D_n^- (cf. Dimitrova, Kaishev, Tan (2020)) by appropriately specifying, correspondingly, A_i = 0 for all i, or B_i = 1 for all i, in the file read by ks_c_cdf_Rcpp.
ks_c_cdf_Rcpp(n)
n |
the sample size |
Note that all calculations here are done directly in C++ and the result is returned in R.
This leads to faster computation time, as well as, in some cases, possibly higher accuracy (depending on the accuracy of the pre-computed values A_i and B_i, i = 1, ..., n, provided in the file "Boundary_Crossing_Time.txt") compared to the functions
cont_ks_c_cdf, disc_ks_c_cdf, mixed_ks_c_cdf.
Given a random sample {X_1, ..., X_n} of size
n
with an empirical cdf F_n(x), the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as D_n = sup_x |F_n(x) - F(x)|, where F(x) is the cdf of a prespecified theoretical distribution under the null hypothesis H_0 that {X_1, ..., X_n} comes from F(x).
The one-sided KS test statistics are correspondingly defined as D_n^- = sup_x (F(x) - F_n(x)) and D_n^+ = sup_x (F_n(x) - F(x)).
The function ks_c_cdf_Rcpp
implements the Exact-KS-FFT method, proposed by Dimitrova, Kaishev, Tan (2020), to compute the complementary cdf P(D_n >= q) at a value q, when F(x) is arbitrary (i.e. purely discrete, mixed or continuous).
It is based on expressing the complementary cdf as P(D_n >= q) = 1 - P(A_i <= U_(i) <= B_i, i = 1, ..., n), where U_(1) <= ... <= U_(n) are the uniform order statistics and A_i and B_i are defined as in Step 1 of Dimitrova, Kaishev, Tan (2020).
The latter probability is then re-expressed in terms of the conditional probability that a homogeneous Poisson process with intensity n will not cross an upper and a lower boundary determined by the values B_i and A_i, given that the process has exactly n arrivals in [0, 1] (see Steps 2 and 3 in Section 2.1 of Dimitrova, Kaishev, Tan (2020)). This conditional probability is evaluated using FFT in Step 4 of the method in order to obtain the value of the complementary cdf P(D_n >= q).
This algorithm ensures a total worst-case run-time of order O(n^2 log(n)), which makes it highly computationally efficient compared to other known algorithms developed for the special cases of continuous or purely discrete F(x).
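To make the connection between the event {D_n < q} and the rectangular region concrete, the following Python sketch (an illustration only, not part of the package; it assumes the continuous case with F the Uniform(0, 1) cdf, so that A_i = i/n - q and B_i = (i-1)/n + q as in the 'Examples' section below) verifies, sample path by sample path, that D_n < q holds exactly when all uniform order statistics fall strictly inside the region:

```python
import random

def ks_stat_uniform(u_sorted):
    """Two-sided KS statistic D_n for sorted Uniform(0,1) draws,
    via the usual order-statistics formula."""
    n = len(u_sorted)
    return max(max(i / n - u, u - (i - 1) / n)
               for i, u in enumerate(u_sorted, start=1))

def in_region(u_sorted, q):
    """True iff every order statistic U_(i) lies strictly inside
    (A_i, B_i) with A_i = i/n - q and B_i = (i-1)/n + q."""
    n = len(u_sorted)
    return all(i / n - q < u < (i - 1) / n + q
               for i, u in enumerate(u_sorted, start=1))

random.seed(1)
n, q = 10, 0.3
for _ in range(1000):
    u = sorted(random.random() for _ in range(n))
    # The two events coincide for every sample path:
    assert (ks_stat_uniform(u) < q) == in_region(u, q)
```

This per-sample equivalence is why P(D_n >= q) can be computed as one minus a boundary non-crossing probability; the values A_i and B_i are exactly what the file "Boundary_Crossing_Time.txt" encodes.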
The values A_i and B_i, i = 1, ..., n, specified in Steps 1 and 2 of the Exact-KS-FFT method (see Dimitrova, Kaishev, Tan (2020), Section 2) must be pre-computed (in R or, if needed, using alternative software offering high accuracy, e.g. Mathematica) and saved in a file with the name "Boundary_Crossing_Time.txt" (in the current working directory).
The function ks_c_cdf_Rcpp
is called in R; it first reads the file "Boundary_Crossing_Time.txt", then computes the value of the complementary cdf P(D_n >= q)
in C++ and returns the result in R (or, as noted above, as a special case, computes the value of the complementary cdf P(D_n^+ >= q) or P(D_n^- >= q)).
Numeric value corresponding to P(D_n >= q) (or, as a special case, to P(D_n^+ >= q) or P(D_n^- >= q)), given a sample size
n
and the file "Boundary_Crossing_Time.txt" containing A_i and B_i, i = 1, ..., n, specified in Steps 1 and 2 of the Exact-KS-FFT method (see Dimitrova, Kaishev, Tan (2020), Section 2).
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.
## Computing the complementary cdf P(D_{n} >= q)
## for n = 10 and q = 0.1, when F(x) is continuous.
## In this case,
## B_i = (i-1)/n + q
## A_i = i/n - q

n <- 10
q <- 0.1
up_rec <- ((1:n) - 1)/n + q
low_rec <- (1:n)/n - q
df <- data.frame(rbind(up_rec, low_rec))
write.table(df, "Boundary_Crossing_Time.txt",
            sep = ", ", row.names = FALSE, col.names = FALSE)
ks_c_cdf_Rcpp(n)
Computes the p-value P(D_{m,n} >= d), where
D_{m,n} is the one- or two-sided two-sample Kolmogorov-Smirnov test statistic with weight function
weight
, and d is the observed value of the KS statistic computed based on two data samples {x_1, ..., x_m} and {y_1, ..., y_n}
that may come from continuous, discrete or mixed distributions, i.e. they may have repeated observations (ties).
KS2sample(x, y, alternative = c("two.sided", "less", "greater"), conservative = F, weight = 0, tol = 1e-08, tail = T)
x |
a numeric vector of data sample values |
y |
a numeric vector of data sample values |
alternative |
Indicates the alternative hypothesis and must be one of "two.sided" (default), "less", or "greater". One can specify just the initial letter of the string, but the argument name must be given in full. |
conservative |
logical variable indicating whether ties should be considered. See ‘Details’ for the meaning. |
weight |
either a numeric value between 0 and 1 which specifies the form of the weight function from a class of pre-defined functions, or a user-defined strictly positive function of one variable. By default, no weight function is assumed. See ‘Details’ for the meaning of the possible values. |
tol |
the value of |
tail |
logical variable indicating whether the p-value, P(D_{m,n} >= d), or its complementary value, P(D_{m,n} < d), is returned. See ‘Details’. |
Given a pair of random samples X_m = (X_1, ..., X_m) and Y_n = (Y_1, ..., Y_n), of sizes
m
and n,
with empirical cdfs F_m(t) and G_n(t)
respectively, coming from some unknown cdfs F(x)
and G(x), it is assumed that F(x)
and G(x)
could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis H_0: F(x) = G(x)
for all x
, either against the alternative hypothesis H_1: F(x) != G(x)
for at least one x
, which corresponds to the two-sided test, or against H_1: F(x) > G(x)
or H_1: F(x) < G(x)
for at least one x
, which correspond to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics that are used to test these hypotheses are generally defined as:

Delta_{m,n} = sup_t |F_m(t) - G_n(t)| W(E_{m+n}(t))   (two-sided),
Delta_{m,n}^+ = sup_t (F_m(t) - G_n(t)) W(E_{m+n}(t)),
Delta_{m,n}^- = sup_t (G_n(t) - F_m(t)) W(E_{m+n}(t))   (one-sided),

where E_{m+n}(t) is the empirical cdf of the pooled sample Z_{m+n} = (X_1, ..., X_m, Y_1, ..., Y_n), and W
is a strictly positive weight function defined on [0, 1].
Possible values of alternative
are "two.sided"
, "greater"
and "less"
, which specify the alternative hypothesis, i.e. specify the test statistic to be either Delta_{m,n},
Delta_{m,n}^+ or Delta_{m,n}^-,
respectively.
When weight
is assigned a numeric value nu between
0
and 1
, the test statistic is specified as the weighted two-sample Kolmogorov-Smirnov test with generalized Anderson-Darling weight W(t) = 1/[t(1-t)]^nu (see Finner and Gontscharuk 2018). Then, for example, the two-sided two-sample Kolmogorov-Smirnov statistic has the following form:

Delta_{m,n} = sup_t |F_m(t) - G_n(t)| / [E_{m+n}(t)(1 - E_{m+n}(t))]^nu.

The latter specification defines a family of weighted Kolmogorov-Smirnov tests, covering the unweighted test (when weight = 0), and the widely known weighted Kolmogorov-Smirnov test with Anderson-Darling weight (when
weight = 0.5
; see the definition of this statistic also in Canner 1975).
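The effect of the weight parameter can be illustrated with a small sketch (Python is used here purely for illustration; KS2sample itself performs the computation in C++, and weighted_ks2 below is an ad-hoc helper, not a package function). It evaluates the two-sided statistic sup |F_m(t) - G_n(t)| / [E_{m+n}(t)(1 - E_{m+n}(t))]^nu over the distinct pooled values with E_{m+n}(t) < 1, and checks that nu = 0 recovers the unweighted statistic:

```python
def weighted_ks2(x, y, nu):
    """Two-sided weighted two-sample KS statistic with generalized
    Anderson-Darling weight 1/[t(1-t)]^nu, evaluated at the distinct
    pooled values where the pooled edf is strictly below 1
    (at the largest pooled value the weight is undefined, but the
    unweighted difference there is always 0 anyway)."""
    m, n = len(x), len(y)
    N = m + n
    best = 0.0
    for t in sorted(set(x) | set(y)):
        fm = sum(v <= t for v in x) / m      # F_m(t)
        gn = sum(v <= t for v in y) / n      # G_n(t)
        e = sum(v <= t for v in x + y) / N   # pooled edf E_{m+n}(t)
        if e < 1.0:
            best = max(best, abs(fm - gn) / (e * (1.0 - e)) ** nu)
    return best

x, y = [1, 2], [2, 3]
assert weighted_ks2(x, y, 0.0) == 0.5                      # unweighted D_{m,n}
assert abs(weighted_ks2(x, y, 0.5) - 2 / 3**0.5) < 1e-12   # AD weight
```

With nu = 0.5 the near-boundary differences are amplified, which is what gives the Anderson-Darling-weighted test its sensitivity in the tails.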
If one wants to implement a weighted test with a user-specified weight function, for example W(t) = 1/sqrt(t(2-t)), suggested by Buning (2001), which ensures higher power when both
x
and y
come from distributions that are left-skewed and heavy-tailed, one can directly assign a univariate function with output value 1/sqrt(t*(2-t))
to weight
. See ‘Examples’ for this demonstration.
For a particular realization of the pooled sample Z_{m+n}, let there be
k
distinct values, a_1 < a_2 < ... < a_k
, in the ordered, pooled sample, where k <= m + n
, and where m_i
is the number of times a_i
, i = 1, ..., k,
appears in the pooled sample. The p-value is then defined as the probability

p = P(D_{m,n} >= d),

where D_{m,n} is the two-sample Kolmogorov-Smirnov test statistic defined according to the value of
weight
and alternative
, for two samples X'_m and Y'_n
of sizes m
and n
, randomly drawn from the pooled sample without replacement, and D_{m,n}
=
d
, the observed value of the statistic calculated based on the user-provided data samples x
and y
. By default, tail = T
, and the p-value P(D_{m,n} >= d) is returned; otherwise P(D_{m,n} < d) is returned.
Note that D_{m,n} is defined on the space Omega
of all possible pairs (F'_m, G'_n)
of edfs that correspond to the pairs of samples X'_m
and Y'_n
, randomly drawn from the pooled sample, Z_{m+n}
, as follows. First, m
observations are drawn at random without replacement, forming the first sample X'_m
, with corresponding edf F'_m
. The remaining n
observations are then assigned to the second sample Y'_n
, with corresponding edf G'_n
. Observations are then replaced back in Z_{m+n}
and re-sampling is continued until the occurrence of all the choose(m+n, m)
possible pairs of edfs (F'_m, G'_n)
. The pairs of edfs may be coincident if there are ties in the data, and each pair (F'_m, G'_n)
occurs with probability 1/choose(m+n, m)
.
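The definition of the p-value as a probability over all choose(m+n, m) equally likely splits of the pooled sample can be made concrete by brute-force enumeration for tiny samples. The following Python sketch is an illustration only (ks2_stat and exact_pvalue are ad-hoc helpers, not package functions); it enumerates every index split and computes P(D_{m,n} >= d) for the unweighted two-sided statistic:

```python
from itertools import combinations
from math import comb

def ks2_stat(x, y):
    """Unweighted two-sided two-sample KS statistic over distinct pooled values."""
    pooled = sorted(set(x) | set(y))
    return max(abs(sum(v <= t for v in x) / len(x)
                   - sum(v <= t for v in y) / len(y)) for t in pooled)

def exact_pvalue(x, y):
    """P(D_{m,n} >= d) by enumerating all choose(m+n, m) index splits of the
    pooled sample; with ties some splits yield coincident edf pairs, but each
    index split still carries probability 1/choose(m+n, m)."""
    d = ks2_stat(x, y)
    pooled = sorted(x + y)
    m, total = len(x), len(x) + len(y)
    hits = 0
    for xs in combinations(range(total), m):
        chosen = set(xs)
        xprime = [pooled[i] for i in xs]
        yprime = [pooled[i] for i in range(total) if i not in chosen]
        if ks2_stat(xprime, yprime) >= d - 1e-12:  # tolerance for float ties
            hits += 1
    return hits / comb(total, m)

# Tiny example with a tie in the pooled sample (d = 0.5):
assert exact_pvalue([1, 2], [1, 3]) == 1.0  # all 6 splits give a statistic >= 0.5
```

Enumeration is only feasible for very small m and n; the package's exact recurrence computes the same probability without listing the splits.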
conservative
is a logical variable indicating whether the test should be conducted conservatively. By default, conservative = F
, and KS2sample
returns the p-value that is defined through the conditional probability above. However, when the user has a priori knowledge that both samples come from a continuous distribution, even if ties are present (for example, repeated observations caused by rounding errors), the value conservative = T
should be assigned, since the conditional probability is no longer relevant. In this case, KS2sample
computes p-values for the Kolmogorov-Smirnov test assuming no ties are present, and returns a p-value which is an upper bound of the true p-value. Note that if the null hypothesis is rejected using the calculated upper bound for the p-value, it should also be rejected with the true p-value.
KS2sample
calculates the exact p-value of the KS test using an algorithm which generalizes the method due to Nikiforov (1994). If tail = F
, KS2sample
calculates the complementary p-value, P(D_{m,n} < d); for this purpose, an exact algorithm which generalizes the method due to Nikiforov (1994) is implemented. Alternatively, if
tail = T
, a version of Nikiforov's recurrence proposed recently by Viehmann (2021) is implemented, which computes directly the p-value with higher accuracy, giving up to 17 correct digits, but at up to 3 times higher computational cost. KS2sample
ensures a total worst-case run-time of order O(nm). In comparison with other known algorithms, it not only allows a flexible choice of weights, which in some cases improve the statistical power (see Dimitrova, Jia, Kaishev 2024), but is also more efficient and more generally applicable for large sample sizes.
A list with class "htest"
containing the following components:
statistic |
the value of the test statistic |
p.value |
the p-value of the test. |
alternative |
a character string describing the alternative hypothesis. |
data.name |
a character string giving names of the data. |
Based on the Fortran subroutine by Nikiforov (1994). See also Dimitrova, Jia, Kaishev (2024).
Buning H (2001). "Kolmogorov-Smirnov- and Cramer-von Mises Type Two-sample Tests With Various Weight Functions." Communications in Statistics - Simulation and Computation, 30(4), 847-865.
Finner H, Gontscharuk V (2018). "Two-sample Kolmogorov-Smirnov-type tests revisited: Old and new tests in terms of local levels." The Annals of Statistics, 46(6A), 3014-3037.
Paul L. Canner (1975). "A Simulation Study of One- and Two-Sample Kolmogorov-Smirnov Statistics with a Particular Weight Function". Journal of the American Statistical Association, 70(349), 209-211.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265-270.
Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
## Computes p-value of two-sided unweighted test for continuous data
data1 <- rexp(750, 1)
data2 <- rexp(800, 1)
KS2sample(data1, data2)

## Computes the complementary p-value
KS2sample(data1, data2, tail = FALSE)

## Computes p-value of one-sided test with Anderson-Darling weight function
KS2sample(data1, data2, alternative = "greater", weight = 0.5)

## Computes p-values of two-sided test with Buning's weight function for discrete data
data3 <- rnbinom(100, size = 3, prob = 0.6)
data4 <- rpois(120, lambda = 2)
f <- function(t) 1 / sqrt( t * (2 - t) )
KS2sample(data3, data4, weight = f)
Function calling directly the C++ routines that compute the exact complementary p-value P(D_{m,n} < q) for the (weighted) two-sample one- or two-sided Kolmogorov-Smirnov statistic, at a fixed q,
0 <= q <= 1
, given the sample sizes
m
and n
, the vector of weights w_vec
and the vector M
containing the number of times each distinct observation is repeated in the pooled sample.
KS2sample_c_Rcpp(m, n, kind, M, q, w_vec, tol)
m |
the sample size of first tested sample. |
n |
the sample size of second tested sample. |
kind |
an integer value (= 1, 2 or 3) which specifies the alternative hypothesis. When = 1, the test is two-sided. When = 2 or 3, the test is one-sided. See ‘Details’ for the meaning of the possible values. Other values are invalid. |
M |
an integer-valued vector containing the number of times each distinct observation appears in the pooled sample. See ‘Details’. |
q |
numeric value between 0 and 1, at which the p-value is computed. |
w_vec |
a vector with m+n-1 elements, specifying the values of the weight function used in the test statistic. See ‘Details’. |
tol |
the value of |
Given a pair of random samples X_m = (X_1, ..., X_m) and Y_n = (Y_1, ..., Y_n), of sizes
m
and n,
with empirical cdfs F_m(t) and G_n(t)
respectively, coming from some unknown cdfs F(x)
and G(x), it is assumed that F(x)
and G(x)
could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis H_0: F(x) = G(x)
for all x
, either against the alternative hypothesis H_1: F(x) != G(x)
for at least one x
, which corresponds to the two-sided test, or against H_1: F(x) > G(x)
or H_1: F(x) < G(x)
for at least one x
, which correspond to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics that are used to test these hypotheses are generally defined as:

Delta_{m,n} = sup_t |F_m(t) - G_n(t)| W(E_{m+n}(t))   (two-sided),
Delta_{m,n}^+ = sup_t (F_m(t) - G_n(t)) W(E_{m+n}(t)),
Delta_{m,n}^- = sup_t (G_n(t) - F_m(t)) W(E_{m+n}(t))   (one-sided),

where E_{m+n}(t) is the empirical cdf of the pooled sample Z_{m+n} = (X_1, ..., X_m, Y_1, ..., Y_n), and W
is a strictly positive weight function defined on [0, 1].
w_vec[i]
, 1 <= i <= m+n-1
, is then equal to W(E_{m+n}(z_i))
, where z_i
is the i-th smallest observation in the pooled sample Z_{m+n}
.
Different values of
w_vec
specify the weighted Kolmogorov-Smirnov test differently. For example, when w_vec = rep(1,m+n-1)
, KS2sample_c_Rcpp
calculates the p-value of the unweighted two-sample Kolmogorov-Smirnov test; when w_vec
= ((1:(m+n-1))*((m+n-1):1))^(-1/2)
, it calculates the p-value for the weighted two-sample Kolmogorov-Smirnov test with Anderson-Darling weight W(t) = 1/sqrt(t(1-t)).
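The correspondence between the ready-made vector ((1:(m+n-1))*((m+n-1):1))^(-1/2) and the Anderson-Darling weight W(t) = 1/sqrt(t(1-t)) can be checked directly: entry i equals 1/sqrt(i*(m+n-i)), which is proportional (with constant factor m+n) to W(i/(m+n)), the weight evaluated at the jump points of the pooled edf in the no-ties case. A small Python sketch (illustration only) verifies this proportionality:

```python
from math import sqrt, isclose

m, n = 120, 150
N = m + n

# Entry i of the R expression ((1:(m+n-1))*((m+n-1):1))^(-1/2):
# the element-wise product i * (N - i), i = 1, ..., N-1, raised to -1/2.
w_vec = [1.0 / sqrt(i * (N - i)) for i in range(1, N)]

# Anderson-Darling weight W(t) = 1/sqrt(t(1-t)) at t = i/N:
W = [1.0 / sqrt((i / N) * (1 - i / N)) for i in range(1, N)]

# The two differ only by the constant factor N, so the induced
# weighted statistics are identical up to a common scale.
for wi, Wi in zip(w_vec, W):
    assert isclose(Wi / wi, N)
```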
Possible values of kind
are 1, 2 and 3, which specify the alternative hypothesis, i.e. specify the test statistic to be either Delta_{m,n},
Delta_{m,n}^+ or Delta_{m,n}^-,
respectively.
The numeric array M
specifies the number of repeated observations in the pooled sample. For a particular realization of the pooled sample Z_{m+n}, let there be
k
distinct values, a_1 < a_2 < ... < a_k
, in the ordered, pooled sample, where k <= m + n
, and where m_i
=
M[i]
is the number of times a_i,
i = 1, ..., k,
appears in the pooled sample. The calculated complementary p-value is the conditional probability:

1 - P(D_{m,n} >= q),

where D_{m,n} is the two-sample Kolmogorov-Smirnov test statistic defined according to the value of
w_vec
and kind
, for two samples X'_m and Y'_n
of sizes m
and n
, randomly drawn from the pooled sample without replacement, i.e. D_{m,n}
is defined on the space Omega
(see further details in
KS2sample
), and 0 <= q <= 1.
KS2sample_c_Rcpp
implements an exact algorithm, extending the Fortran 77 subroutine due to Nikiforov (1994), with extended functionality allowing a more flexible choice of weights, as well as large sample sizes. This leads to faster computation time, as well as relatively high accuracy for very large m
and n
(although less accurate than KS2sample_Rcpp
). Compared with other known algorithms, it allows the data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), and it is more efficient and more generally applicable for large sample sizes. This algorithm ensures a total worst-case run-time of order O(nm).
Numeric value corresponding to 1 - P(D_{m,n} >= q), given sample sizes
m
, n
, M
and w_vec
. If the values of m
, n
are non-positive, or if the length of w_vec
is not equal to m+n-1
, then the function returns -1
; a non-permitted value of M
or a non-permitted value inside w_vec
returns -2
; a numerically unstable calculation returns -3
.
Based on the Fortran subroutine by Nikiforov (1994). See also Dimitrova, Jia, Kaishev (2024).
Paul L. Canner (1975). "A Simulation Study of One- and Two-Sample Kolmogorov-Smirnov Statistics with a Particular Weight Function". Journal of the American Statistical Association, 70(349), 209-211.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
## Computing the unweighted two-sample Kolmogorov-Smirnov test
## Example see in Nikiforov (1994)
m <- 120
n <- 150
kind <- 1
q <- 0.1
M <- c(80, 70, 40, 80)
w_vec <- rep(1, m + n - 1)
tol <- 1e-6
KS2sample_c_Rcpp(m, n, kind, M, q, w_vec, tol)
kind <- 2
KS2sample_c_Rcpp(m, n, kind, M, q, w_vec, tol)

## Computing the weighted two-sample Kolmogorov-Smirnov test
## with Anderson-Darling weight
kind <- 3
w_vec <- ((1:(m+n-1))*((m+n-1):1))^(-1/2)
KS2sample_c_Rcpp(m, n, kind, M, q, w_vec, tol)
Function calling directly the C++ routines that compute the exact p-value P(D_{m,n} >= q) for the (weighted) two-sample one- or two-sided Kolmogorov-Smirnov statistic, at a fixed q,
0 <= q <= 1
, given the sample sizes
m
and n
, the vector of weights w_vec
and the vector M
containing the number of times each distinct observation is repeated in the pooled sample.
KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)
m |
the sample size of first tested sample. |
n |
the sample size of second tested sample. |
kind |
an integer value (= 1, 2 or 3) which specifies the alternative hypothesis. When = 1, the test is two-sided. When = 2 or 3, the test is one-sided. See ‘Details’ for the meaning of the possible values. Other values are invalid. |
M |
an integer-valued vector containing the number of times each distinct observation appears in the pooled sample. See ‘Details’. |
q |
numeric value between 0 and 1, at which the p-value is computed. |
w_vec |
a vector with m+n-1 elements, specifying the values of the weight function used in the test statistic. See ‘Details’. |
tol |
the value of |
Given a pair of random samples X_m = (X_1, ..., X_m) and Y_n = (Y_1, ..., Y_n), of sizes
m
and n,
with empirical cdfs F_m(t) and G_n(t)
respectively, coming from some unknown cdfs F(x)
and G(x), it is assumed that F(x)
and G(x)
could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis H_0: F(x) = G(x)
for all x
, either against the alternative hypothesis H_1: F(x) != G(x)
for at least one x
, which corresponds to the two-sided test, or against H_1: F(x) > G(x)
or H_1: F(x) < G(x)
for at least one x
, which correspond to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics that are used to test these hypotheses are generally defined as:

Delta_{m,n} = sup_t |F_m(t) - G_n(t)| W(E_{m+n}(t))   (two-sided),
Delta_{m,n}^+ = sup_t (F_m(t) - G_n(t)) W(E_{m+n}(t)),
Delta_{m,n}^- = sup_t (G_n(t) - F_m(t)) W(E_{m+n}(t))   (one-sided),

where E_{m+n}(t) is the empirical cdf of the pooled sample Z_{m+n} = (X_1, ..., X_m, Y_1, ..., Y_n), and W
is a strictly positive weight function defined on [0, 1].
w_vec[i]
, 1 <= i <= m+n-1
, is then equal to W(E_{m+n}(z_i))
, where z_i
is the i-th smallest observation in the pooled sample Z_{m+n}
.
Different values of
w_vec
specify the weighted Kolmogorov-Smirnov test differently. For example, when w_vec = rep(1,m+n-1)
, KS2sample_Rcpp
calculates the p-value of the unweighted two-sample Kolmogorov-Smirnov test; when w_vec
= ((1:(m+n-1))*((m+n-1):1))^(-1/2)
, it calculates the p-value for the weighted two-sample Kolmogorov-Smirnov test with Anderson-Darling weight W(t) = 1/sqrt(t(1-t)).
Possible values of kind
are 1, 2 and 3, which specify the alternative hypothesis, i.e. specify the test statistic to be either Delta_{m,n},
Delta_{m,n}^+ or Delta_{m,n}^-,
respectively.
The numeric array M
specifies the number of repeated observations in the pooled sample. For a particular realization of the pooled sample Z_{m+n}, let there be
k
distinct values, a_1 < a_2 < ... < a_k
, in the ordered, pooled sample, where k <= m + n
, and where m_i
=
M[i]
is the number of times a_i,
i = 1, ..., k,
appears in the pooled sample. The p-value is then defined as the probability

p = P(D_{m,n} >= q),

where D_{m,n} is the two-sample Kolmogorov-Smirnov test statistic defined according to the value of
w_vec
and kind
, for two samples X'_m and Y'_n
of sizes m
and n
, randomly drawn from the pooled sample without replacement, i.e. D_{m,n}
is defined on the space Omega
(see further details in
KS2sample
), and 0 <= q <= 1.
KS2sample_Rcpp
implements an exact algorithm, extending the Fortran 77 subroutine due to Nikiforov (1994), with extended functionality allowing more flexible choices of weights, as well as large sample sizes. A version of Nikiforov's recurrence proposed recently by Viehmann (2021) is further incorporated, which computes directly the p-value with higher accuracy, giving up to 17 correct digits, but at up to 3 times higher computational cost than KS2sample_c_Rcpp
. Compared with other known algorithms, it allows the data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), and it is more efficient and more generally applicable for large sample sizes. This algorithm ensures a total worst-case run-time of order O(nm).
Numeric value corresponding to P(D_{m,n} >= q), given sample sizes
m
, n
, M
and w_vec
. If the values of m
, n
are non-positive, or if the length of w_vec
is not equal to m+n-1
, then the function returns -1
; a non-permitted value of M
or a non-permitted value inside w_vec
returns -2
; a numerically unstable calculation returns -3
.
Based on the Fortran subroutine by Nikiforov (1994). See also Dimitrova, Jia, Kaishev (2024).
Paul L. Canner (1975). "A Simulation Study of One- and Two-Sample Kolmogorov-Smirnov Statistics with a Particular Weight Function". Journal of the American Statistical Association, 70(349), 209-211.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.
Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
## Computing the unweighted two-sample Kolmogorov-Smirnov test
## Example see in Nikiforov (1994)
m <- 120
n <- 150
kind <- 1
q <- 0.1
M <- c(80, 70, 40, 80)
w_vec <- rep(1, m + n - 1)
tol <- 1e-6
KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)
kind <- 2
KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)

## Computing the weighted two-sample Kolmogorov-Smirnov test
## with Anderson-Darling weight
kind <- 3
w_vec <- ((1:(m+n-1))*((m+n-1):1))^(-1/2)
KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)
Computes the p-value P(V_{m,n} >= v), where
V_{m,n} is the two-sample Kuiper test statistic and V_{m,n}
=
v
, i.e. the observed value of the Kuiper statistic, computed based on two data samples {x_1, ..., x_m} and {y_1, ..., y_n}
that may come from continuous, discrete or mixed distributions, i.e. they may have repeated observations (ties).
Kuiper2sample(x, y, conservative = F, tail = T)
x |
a numeric vector of data sample values |
y |
a numeric vector of data sample values |
conservative |
logical variable indicating whether ties should be considered. See ‘Details’ for the meaning. |
tail |
logical variable indicating whether a p-value, |
Given a pair of random samples, either on the real line or the circle, denoted by X_m = (X_1, ..., X_m) and Y_n = (Y_1, ..., Y_n)
, of sizes
m
and n,
with empirical cdfs F_m(t) and G_n(t)
respectively, coming from some unknown cdfs F(x)
and G(x), it is assumed that F(x)
and G(x)
could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis H_0: F(x) = G(x)
for all x
, against the alternative hypothesis H_1: F(x) != G(x)
for at least one x
. The two-sample Kuiper goodness-of-fit statistic that is used to test this hypothesis is defined as:

V_{m,n} = sup_t (F_m(t) - G_n(t)) + sup_t (G_n(t) - F_m(t)).
For a particular realization of the pooled sample Z_{m+n}, let there be
k
distinct values, a_1 < a_2 < ... < a_k
, in the ordered, pooled sample, where k <= m + n
, and where m_i
is the number of times a_i
, i = 1, ..., k,
appears in the pooled sample. The p-value is then defined as the probability

p = P(V_{m,n} >= v),

where V_{m,n} is the two-sample Kuiper test statistic defined as above
, for two samples X'_m
and Y'_n
of sizes m
and n
, randomly drawn from the pooled sample without replacement, and V_{m,n}
=
v
, the observed value of the statistic calculated based on the user-provided data samples x
and y
. By default, tail = T
, and the p-value P(V_{m,n} >= v) is returned; otherwise P(V_{m,n} < v) is returned.
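For small samples with ties, the Kuiper statistic defined above can be computed directly; the following Python sketch (an illustration only; edf, kuiper2_stat and ks2_stat are ad-hoc helper names, not package functions) computes V_{m,n} = sup(F_m - G_n) + sup(G_n - F_m) and contrasts it with the two-sided KS statistic D_{m,n}:

```python
def edf(sample, t):
    """Empirical cdf of `sample` evaluated at t."""
    return sum(v <= t for v in sample) / len(sample)

def kuiper2_stat(x, y):
    """Two-sample Kuiper statistic V = sup(F_m - G_n) + sup(G_n - F_m),
    evaluated over the distinct pooled values (ties are allowed)."""
    pooled = sorted(set(x) | set(y))
    diffs = [edf(x, t) - edf(y, t) for t in pooled]
    return max(max(diffs), 0.0) + max(max(-d for d in diffs), 0.0)

def ks2_stat(x, y):
    """Two-sided two-sample KS statistic D = sup |F_m - G_n|."""
    pooled = sorted(set(x) | set(y))
    return max(abs(edf(x, t) - edf(y, t)) for t in pooled)

x, y = [1, 4], [2, 3]
assert kuiper2_stat(x, y) == 1.0   # sup(F-G) = 0.5 and sup(G-F) = 0.5
assert ks2_stat(x, y) == 0.5
# The Kuiper statistic dominates the two-sided KS statistic:
assert kuiper2_stat(x, y) >= ks2_stat(x, y)
```

Because V_{m,n} sums the two one-sided discrepancies rather than taking their maximum, it always dominates D_{m,n}; on the circle it is additionally invariant to the choice of origin, which is what makes it suitable for circular data.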
Note that V_{m,n} is defined on the space Omega
of all possible pairs (F'_m, G'_n)
of edfs that correspond to the pairs of samples X'_m
and Y'_n
, randomly drawn from the pooled sample, Z_{m+n}
, as follows. First, m
observations are drawn at random without replacement, forming the first sample X'_m
, with corresponding edf F'_m
. The remaining n
observations are then assigned to the second sample Y'_n
, with corresponding edf G'_n
. Observations are then replaced back in Z_{m+n}
and re-sampling is continued until the occurrence of all the choose(m+n, m)
possible pairs of edfs (F'_m, G'_n)
. The pairs of edfs may be coincident if there are ties in the data, and each pair (F'_m, G'_n)
occurs with probability 1/choose(m+n, m)
.
conservative
is a logical variable indicating whether the test should be conducted conservatively. By default, conservative = F
, and Kuiper2sample
returns the p-value that is defined through the conditional probability above. However, when the user has a priori knowledge that both samples come from a continuous distribution, even if ties are present (for example, repeated observations caused by rounding errors), the value conservative = T
should be assigned, since the conditional probability is no longer relevant. In this case, Kuiper2sample
computes p-values for the Kuiper test assuming no ties are present, and returns a p-value which is an upper bound of the true p-value. Note that if the null hypothesis is rejected using the calculated upper bound for the p-value, it should also be rejected with the true p-value.
Kuiper2sample
calculates the exact p-value of the Kuiper test using an algorithm from Dimitrova, Jia, Kaishev (2024), which is based on extending the algorithm provided by Nikiforov (1994) and generalizing the methods due to Maag and Stephens (1968) and Hirakawa (1973). If tail = F
, Kuiper2sample
calculates the complementary p-value, P(V_{m,n} < v); for this purpose, an exact algorithm which generalizes the method due to Nikiforov (1994) is implemented. Alternatively, if
tail = T
, a version of Nikiforov's recurrence proposed recently by Viehmann (2021) is further incorporated, which computes directly the p-value with up to 4 extra digits of accuracy, but at up to 3 times higher computational cost. It is accurate and valid for arbitrary (possibly large) sample sizes, and its total worst-case run-time is polynomial in m and n (see Dimitrova, Jia, Kaishev (2024) for the exact order); the run-time is further reduced when
m
and n
have a large greatest common divisor (an extreme case being m
= n
).
Kuiper2sample
is accurate and fast compared with functions based on Monte Carlo simulation. Compared to implementations using the asymptotic method, Kuiper2sample
allows the data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), and is more accurate than the asymptotic method when sample sizes are small.
A list with class "htest"
containing the following components:
statistic |
the value of the test statistic |
p.value |
the p-value of the test. |
alternative |
a character string describing the alternative hypothesis. |
data.name |
a character string giving names of the data. |
Maag, U. R., Stephens, M. A. (1968). "The V_NM Two-Sample Test". The Annals of Mathematical Statistics, 39(3), 923-935.
Hirakawa, K. (1973). The two-sample Kuiper test. TRU Mathematics, 9, 99-118.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.
Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
## Kuiper test for discrete circular data
data1 <- c(rep(pi/2, 30), rep(pi, 30), rep(3*pi/2, 30), rep(2*pi, 30))
data2 <- c(rep(pi/2, 50), rep(pi, 40), rep(3*pi/2, 10), rep(2*pi, 50))
Kuiper2sample(data1, data2)

## The calculated p-value does not change with the choice of the origin
data3 <- c(rep(pi/2, 30), rep(pi, 30), rep(3*pi/2, 30), rep(2*pi, 30))
data4 <- c(rep(pi/2, 50), rep(pi, 50), rep(3*pi/2, 40), rep(2*pi, 10))
Kuiper2sample(data3, data4)
Function calling directly the C++ routines that compute the exact complementary p-value P(V_{m,n} < q) for the two-sample Kuiper test, at a fixed q,
0 <= q <= 2
, given the sample sizes
m
, n
and the vector M
containing the number of times each distinct observation is repeated in the pooled sample.
Kuiper2sample_c_Rcpp(m, n, M, q)
m |
the sample size of first tested sample. |
n |
the sample size of second tested sample. |
M |
an integer-valued vector containing the number of times each distinct observation appears in the pooled sample. See ‘Details’. |
q |
numeric value between 0 and 2, at which the p-value is computed. |
Given a pair of random samples X_1, ..., X_m and Y_1, ..., Y_n, either on the real line or on the circle, of sizes m and n, with empirical cdfs F_m(t) and G_n(t) respectively, coming from some unknown cdfs F(x) and G(x), it is assumed that F(x) and G(x) could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis H_0: F(x) = G(x) for all x, against the alternative hypothesis H_1: F(x) != G(x) for at least one x. The two-sample Kuiper goodness-of-fit statistic that is used to test this hypothesis is defined as:

V_{m,n} = sup_t [ F_m(t) - G_n(t) ] - inf_t [ F_m(t) - G_n(t) ].
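Because the ecdf difference F_m(t) - G_n(t) is a step function that changes only at pooled data points, the statistic V_{m,n} = sup_t (F_m(t) - G_n(t)) - inf_t (F_m(t) - G_n(t)) can be evaluated directly from the two samples. A minimal illustrative sketch in Python (numpy assumed; this is not the package's C++ implementation, and the function name is hypothetical):

```python
import numpy as np

def kuiper_stat(x, y):
    """Two-sample Kuiper statistic V = sup_t (F_m - G_n) - inf_t (F_m - G_n).

    The ecdf difference changes only at pooled data points (ties included),
    so evaluating it at the distinct pooled values suffices; it equals 0
    beyond the largest pooled value, so 0 is always among the values checked.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    t = np.unique(np.concatenate([x, y]))
    d = (np.searchsorted(x, t, side="right") / x.size
         - np.searchsorted(y, t, side="right") / y.size)
    return d.max() - d.min()

# small samples with ties: V = 0.5 for these data
v = kuiper_stat([1, 2, 2, 3], [2, 3, 3, 4])
```

Note the use of side="right" in np.searchsorted, which makes both ecdfs right-continuous and therefore handles repeated observations correctly.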
The numeric array M specifies the number of repeated observations in the pooled sample. For a particular realization of the pooled sample Z_1, ..., Z_{m+n}, let there be k distinct values, a_1 < a_2 < ... < a_k, in the ordered, pooled sample Z_{(1)} <= Z_{(2)} <= ... <= Z_{(m+n)}, where k <= m + n, and where m_i = M[i] is the number of times a_i, i = 1, ..., k, appears in the pooled sample. The calculated complementary p-value is then the conditional probability:

P(V_{m,n} < q),
where V_{m,n} is the two-sample Kuiper test statistic defined as V_{m,n} = sup_t [ F_m(t) - G_n(t) ] - inf_t [ F_m(t) - G_n(t) ], for two samples X_1, ..., X_m and Y_1, ..., Y_n of sizes m and n, randomly drawn from the pooled sample without replacement, i.e. V_{m,n} is defined on the space of all possible pairs of samples drawn without replacement from the pooled sample (see further details in Kuiper2sample), and 0 <= q <= 2.
Kuiper2sample_c_Rcpp implements an algorithm from Dimitrova, Jia and Kaishev (2024) that extends the algorithm provided by Nikiforov (1994) and generalizes the methods due to Maag and Stephens (1968) and Hirakawa (1973). It is relatively accurate (less accurate than Kuiper2sample_Rcpp) and valid for arbitrary (possibly large) sample sizes. This algorithm ensures a total worst-case run-time of order O((mn)^2). When m and n have a large greatest common divisor (an extreme case is m = n), it ensures a total worst-case run-time of order O(m^2 n).
Other known implementations of the two-sample Kuiper test mainly use an approximation method or Monte Carlo simulation (see also Kuiper2sample). The former is invalid for data with ties and often gives p-values with large errors when sample sizes are small; the latter is usually slow and inaccurate. Compared with other known algorithms, Kuiper2sample_c_Rcpp allows the data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), and is more accurate and generally applicable for large sample sizes.
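The Monte Carlo alternative mentioned above can be sketched as follows: repeatedly split the pooled sample (whose tie pattern is encoded by M) into two groups of sizes m and n without replacement, and count how often the resulting Kuiper statistic reaches q. This is only an illustrative estimator in Python (numpy assumed; the function names are hypothetical), not the exact algorithm implemented by Kuiper2sample_c_Rcpp:

```python
import numpy as np

def kuiper_stat(x, y):
    # V = sup_t (F_m - G_n) - inf_t (F_m - G_n), evaluated at pooled points
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    t = np.unique(np.concatenate([x, y]))
    d = (np.searchsorted(x, t, side="right") / x.size
         - np.searchsorted(y, t, side="right") / y.size)
    return d.max() - d.min()

def kuiper_mc_pvalue(m, n, M, q, reps=2000, seed=42):
    """Monte Carlo estimate of P(V_{m,n} >= q): draw the two samples from
    the pooled sample (tie pattern encoded by M) without replacement."""
    rng = np.random.default_rng(seed)
    # distinct values 0, 1, ..., k-1 repeated M[i] times each
    pooled = np.repeat(np.arange(len(M), dtype=float), M)
    if pooled.size != m + n:
        raise ValueError("M must sum to m + n")
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        if kuiper_stat(pooled[:m], pooled[m:]) >= q:
            hits += 1
    return hits / reps

# tie pattern from the Nikiforov (1994) example used in this documentation
p_hat = kuiper_mc_pvalue(120, 150, [80, 70, 40, 80], 0.183333333)
```

Such an estimate carries sampling error of order 1/sqrt(reps), which is exactly the inaccuracy that the exact algorithm avoids.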
Numeric value corresponding to P(V_{m,n} < q), given the sample sizes m, n and the vector M. If the values of m and n are non-positive, or if their least common multiple exceeds the limit 2147483647, the function returns -1; a non-permitted value of M returns -2; a numerically unstable calculation returns -3.
Maag, U. R. and Stephens, M. A. (1968). The V_{NM} Two-Sample Test. The Annals of Mathematical Statistics, 39(3), 923-935.
Hirakawa, K. (1973). The two-sample Kuiper test. TRU Mathematics, 9, 99-118.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.
Dimitrina S. Dimitrova, Yun Jia and Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". Submitted.
## Computing the two-sample Kuiper test
## Example from Nikiforov (1994)
m <- 120
n <- 150
q <- 0.183333333
M <- c(80, 70, 40, 80)
Kuiper2sample_c_Rcpp(m, n, M, q)
Function calling directly the C++ routines that compute the exact p-value P(V_{m,n} >= q) for the two-sample Kuiper test, at a fixed q, 0 <= q <= 2, given the sample sizes m, n and the vector M containing the number of times each distinct observation is repeated in the pooled sample.
Kuiper2sample_Rcpp(m, n, M, q)
m |
the sample size of the first tested sample. |
n |
the sample size of the second tested sample. |
M |
an integer-valued vector with k cells, where M[i] is the number of times that the i-th smallest distinct value appears in the ordered pooled sample; the entries of M must sum to m + n. |
q |
numeric value between 0 and 2, at which the p-value P(V_{m,n} >= q) is computed. |
Given a pair of random samples X_1, ..., X_m and Y_1, ..., Y_n, either on the real line or on the circle, of sizes m and n, with empirical cdfs F_m(t) and G_n(t) respectively, coming from some unknown cdfs F(x) and G(x), it is assumed that F(x) and G(x) could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis H_0: F(x) = G(x) for all x, against the alternative hypothesis H_1: F(x) != G(x) for at least one x. The two-sample Kuiper goodness-of-fit statistic that is used to test this hypothesis is defined as:

V_{m,n} = sup_t [ F_m(t) - G_n(t) ] - inf_t [ F_m(t) - G_n(t) ].
The numeric array M specifies the number of repeated observations in the pooled sample. For a particular realization of the pooled sample Z_1, ..., Z_{m+n}, let there be k distinct values, a_1 < a_2 < ... < a_k, in the ordered, pooled sample Z_{(1)} <= Z_{(2)} <= ... <= Z_{(m+n)}, where k <= m + n, and where m_i = M[i] is the number of times a_i, i = 1, ..., k, appears in the pooled sample. The p-value is then defined as the probability

P(V_{m,n} >= q),
where V_{m,n} is the two-sample Kuiper test statistic defined as V_{m,n} = sup_t [ F_m(t) - G_n(t) ] - inf_t [ F_m(t) - G_n(t) ], for two samples X_1, ..., X_m and Y_1, ..., Y_n of sizes m and n, randomly drawn from the pooled sample without replacement, i.e. V_{m,n} is defined on the space of all possible pairs of samples drawn without replacement from the pooled sample (see further details in Kuiper2sample), and 0 <= q <= 2.
Kuiper2sample_Rcpp implements an algorithm from Dimitrova, Jia and Kaishev (2024) that extends the algorithm provided by Nikiforov (1994) and generalizes the methods due to Maag and Stephens (1968) and Hirakawa (1973). A version of Nikiforov's recurrence proposed recently by Viehmann (2021) is further incorporated, which computes the p-value directly, with up to 4 digits of extra accuracy, but at up to 3 times the computational cost of Kuiper2sample_c_Rcpp. It is accurate and valid for arbitrary (possibly large) sample sizes. This algorithm ensures a total worst-case run-time of order O((mn)^2). When m and n have a large greatest common divisor (an extreme case is m = n), it ensures a total worst-case run-time of order O(m^2 n).
Other known implementations of the two-sample Kuiper test mainly use an approximation method or Monte Carlo simulation (see also Kuiper2sample). The former is invalid for data with ties and often gives p-values with large errors when sample sizes are small; the latter is usually slow and inaccurate. Compared with other known algorithms, Kuiper2sample_Rcpp allows the data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), and is more accurate and generally applicable for large sample sizes.
Numeric value corresponding to P(V_{m,n} >= q), given the sample sizes m, n and the vector M. If the values of m and n are non-positive, or if their least common multiple exceeds the limit 2147483647, the function returns -1; a non-permitted value of M returns -2; a numerically unstable calculation returns -3.
Maag, U. R. and Stephens, M. A. (1968). The V_{NM} Two-Sample Test. The Annals of Mathematical Statistics, 39(3), 923-935.
Hirakawa, K. (1973). The two-sample Kuiper test. TRU Mathematics, 9, 99-118.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.
Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.
Dimitrina S. Dimitrova, Yun Jia and Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". Submitted.
## Computing the two-sample Kuiper test
## Example from Nikiforov (1994)
m <- 120
n <- 150
q <- 0.183333333
M <- c(80, 70, 40, 80)
Kuiper2sample_Rcpp(m, n, M, q)
Computes the complementary cdf, P(D_n >= q), at a fixed q, 0 <= q <= 1, of the one-sample two-sided Kolmogorov-Smirnov statistic D_n, when the cdf F(x) under the null hypothesis is mixed, using the Exact-KS-FFT method, which expresses the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process and then computes it efficiently using FFT (see Dimitrova, Kaishev, Tan (2020)).
mixed_ks_c_cdf(q, n, jump_points, Mixed_dist, ..., tol = 1e-10)
q |
numeric value between 0 and 1, at which the complementary cdf P(D_n >= q) is computed. |
n |
the sample size. |
jump_points |
a numeric vector containing the points of (jump) discontinuity, i.e. the points at which the underlying cdf F(x) has jumps. |
Mixed_dist |
a pre-specified (user-defined) mixed cdf, F(x), under the null hypothesis. |
... |
values of the parameters of the cdf, F(x), specified by Mixed_dist. |
tol |
the value of the tolerance used when computing the left limit of the cdf at a point of discontinuity, i.e. F(x-) is evaluated as F(x - tol); by default, tol = 1e-10. |
Given a random sample {X_1, ..., X_n} of size n with an empirical cdf F_n(x), the Kolmogorov-Smirnov goodness-of-fit statistic is defined as D_n = sup_x |F_n(x) - F(x)|, where F(x) is the cdf of a prespecified theoretical distribution under the null hypothesis H_0, that {X_1, ..., X_n} comes from F(x).
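When F(x) is discontinuous, the supremum D_n = sup_x |F_n(x) - F(x)| must account for both one-sided limits at each distinct sample value. A minimal illustrative sketch in Python, using the mixed cdf of Example 3.1 (the helper names are hypothetical, and this is not the package's internal algorithm):

```python
import math
from collections import Counter

def mixed_cdf_example(x):
    # Example 3.1 of Dimitrova, Kaishev, Tan (2020):
    # jumps at 0 (height 0.5) and at log(2.5) (height 0.2)
    if x < 0:
        return 0.0
    if x < math.log(2.5):
        return 1.0 - 0.5 * math.exp(-x)   # equals 0.5 at x = 0
    return 1.0

def ks_stat_mixed(sample, cdf, jumps):
    """D_n = sup_x |F_n(x) - F(x)| for a possibly discontinuous F.

    By monotonicity the supremum is attained (or approached) at a distinct
    sample value, from the left or from the right, so it suffices to compare
    the ecdf with F(x) and with the left limit F(x-) = F(x) - jump(x).
    """
    n = len(sample)
    jump_size = dict(jumps)            # {jump point: jump height}
    counts = Counter(sample)
    d, cum = 0.0, 0
    for x in sorted(counts):
        ecdf_left = cum / n            # F_n(x-)
        cum += counts[x]
        ecdf = cum / n                 # F_n(x)
        f = cdf(x)
        f_left = f - jump_size.get(x, 0.0)
        d = max(d, abs(ecdf - f), abs(ecdf_left - f_left))
    return d

# the 20-point sample used in the mixed_ks_test example below
test_data = ([0.0] * 6 + [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
             + [math.log(2.5)] * 6)
jumps = [(0.0, 0.5), (math.log(2.5), 0.2)]
d_n = ks_stat_mixed(test_data, mixed_cdf_example, jumps)
```

Here the supremum is reached just below x = 0.1, where F_n is still 0.3 while F has already risen to 1 - 0.5*exp(-0.1), which is why both one-sided comparisons are needed.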
The function mixed_ks_c_cdf implements the Exact-KS-FFT method, proposed by Dimitrova, Kaishev, Tan (2020), to compute the complementary cdf, P(D_n >= q), at a value q, when F(x) is mixed. This algorithm ensures a total worst-case run-time of order O(n^2 log(n)).

We have not been able to identify an alternative fast and accurate method (software) that has been developed/implemented for the case when the hypothesized F(x) is mixed.
Numeric value corresponding to P(D_n >= q).
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
# Compute the complementary cdf of D_{n}
# when the underlying distribution is a mixed distribution
# with two jumps at 0 and log(2.5),
# as in Example 3.1 of Dimitrova, Kaishev, Tan (2020)

## Defining the mixed distribution
Mixed_cdf_example <- function(x)
{
    result <- 0
    if (x < 0){
        result <- 0
    }
    else if (x == 0){
        result <- 0.5
    }
    else if (x < log(2.5)){
        result <- 1 - 0.5 * exp(-x)
    }
    else{
        result <- 1
    }
    return (result)
}
KSgeneral::mixed_ks_c_cdf(0.1, 25, c(0, log(2.5)), Mixed_cdf_example)

## Not run:
## Compute P(D_{n} >= q) for n = 5,
## q = 1/5000, 2/5000, ..., 5000/5000
## when the underlying distribution is a mixed distribution
## with four jumps at 0, 0.2, 0.8, 1.0,
## as in Example 2.8 of Dimitrova, Kaishev, Tan (2020)
n <- 5
q <- 1:5000/5000
Mixed_cdf_example <- function(x)
{
    result <- 0
    if (x < 0){
        result <- 0
    }
    else if (x == 0){
        result <- 0.2
    }
    else if (x < 0.2){
        result <- 0.2 + x
    }
    else if (x < 0.8){
        result <- 0.5
    }
    else if (x < 1){
        result <- x - 0.1
    }
    else{
        result <- 1
    }
    return (result)
}
plot(q, sapply(q, function(x) KSgeneral::mixed_ks_c_cdf(x, n,
     c(0, 0.2, 0.8, 1.0), Mixed_cdf_example)), type='l')
## End(Not run)
Computes the p-value P(D_n >= d_n), where d_n is the value of the KS test statistic computed based on a data sample {x_1, ..., x_n}, when F(x) is mixed, using the Exact-KS-FFT method, which expresses the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process and then computes it efficiently using FFT (see Dimitrova, Kaishev, Tan (2020)).
mixed_ks_test(x, jump_points, Mixed_dist, ..., tol = 1e-10)
x |
a numeric vector of data sample values {x_1, ..., x_n}. |
jump_points |
a numeric vector containing the points of (jump) discontinuity, i.e. the points at which the underlying cdf F(x) has jumps. |
Mixed_dist |
a pre-specified (user-defined) mixed cdf, F(x), under the null hypothesis. |
... |
values of the parameters of the cdf, F(x), specified by Mixed_dist. |
tol |
the value of the tolerance used when computing the left limit of the cdf at a point of discontinuity, i.e. F(x-) is evaluated as F(x - tol); by default, tol = 1e-10. |
Given a random sample {X_1, ..., X_n} of size n with an empirical cdf F_n(x), the Kolmogorov-Smirnov goodness-of-fit statistic is defined as D_n = sup_x |F_n(x) - F(x)|, where F(x) is the cdf of a prespecified theoretical distribution under the null hypothesis H_0, that {X_1, ..., X_n} comes from F(x).
The function mixed_ks_test implements the Exact-KS-FFT method, expressing the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)). This algorithm ensures a total worst-case run-time of order O(n^2 log(n)).

The function mixed_ks_test computes the p-value P(D_n >= d_n), where d_n is the value of the KS test statistic computed based on a user-provided data sample {x_1, ..., x_n}, when F(x) is mixed.

We have not been able to identify an alternative fast and accurate method (software) that has been developed/implemented for the case when the hypothesized F(x) is mixed.
A list with class "htest" containing the following components:
statistic |
the value of the statistic. |
p.value |
the p-value of the test. |
alternative |
"two-sided". |
data.name |
a character string giving the name of the data. |
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
# Example to compute the p-value of the one-sample two-sided KS test,
# when the underlying distribution is a mixed distribution
# with two jumps at 0 and log(2.5),
# as in Example 3.1 of Dimitrova, Kaishev, Tan (2020)

# Defining the mixed distribution
Mixed_cdf_example <- function(x)
{
    result <- 0
    if (x < 0){
        result <- 0
    }
    else if (x == 0){
        result <- 0.5
    }
    else if (x < log(2.5)){
        result <- 1 - 0.5 * exp(-x)
    }
    else{
        result <- 1
    }
    return (result)
}
test_data <- c(0,0,0,0,0,0,0.1,0.2,0.3,0.4,
               0.5,0.6,0.7,0.8,log(2.5),log(2.5),
               log(2.5),log(2.5),log(2.5),log(2.5))
KSgeneral::mixed_ks_test(test_data, c(0, log(2.5)), Mixed_cdf_example)

## Compute the p-value of a two-sided K-S test
## when F(x) follows a zero-and-one-inflated
## beta distribution, as in Example 3.3
## of Dimitrova, Kaishev, Tan (2020)
## The data set is the proportion of inhabitants
## living within a 200 kilometer wide coastal strip
## in 232 countries in the year 2010
data("Population_Data")
mu <- 0.6189
phi <- 0.6615
a <- mu * phi
b <- (1 - mu) * phi
Mixed_cdf_example <- function(x)
{
    result <- 0
    if (x < 0){
        result <- 0
    }
    else if (x == 0){
        result <- 0.1141
    }
    else if (x < 1){
        result <- 0.1141 + 0.4795 * pbeta(x, a, b)
    }
    else{
        result <- 1
    }
    return (result)
}
KSgeneral::mixed_ks_test(Population_Data, c(0, 1), Mixed_cdf_example)
This data set contains the proportion of inhabitants living within a 200 kilometer wide coastal strip in 232 countries in the year 2010. In Example 3.3 of Dimitrova, Kaishev, Tan (2020), the data set is modelled using a zero-and-one-inflated beta distribution under the null hypothesis, and a one-sample two-sided Kolmogorov-Smirnov test is performed to test whether the proposed distribution fits the data well enough.
data("Population_Data")
A data frame with 232 observations on the proportion of inhabitants living within a 200 kilometer wide coastal strip in 2010.
https://sedac.ciesin.columbia.edu/data/set/nagdc-population-landscape-climate-estimates-v3
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.