| Title: | Computing P-Values of the One-Sample K-S Test and the Two-Sample K-S and Kuiper Tests for (Dis)Continuous Null Distribution |
|---|---|
| Description: | Contains functions to compute p-values for the one-sample and two-sample Kolmogorov-Smirnov (KS) tests and the two-sample Kuiper test for any fixed critical level and arbitrary (possibly very large) sample sizes. For the one-sample KS test, this package implements a novel, accurate and efficient method named Exact-KS-FFT, which allows the pre-specified cumulative distribution function under the null hypothesis to be continuous, purely discrete or mixed. In the two-sample case, it is assumed that both samples come from an unspecified (unknown) continuous, purely discrete or mixed distribution, i.e. ties (repeated observations) are allowed, and exact p-values of the KS and the Kuiper tests are computed. Note, the two-sample Kuiper test is often used when data samples are on the line or on the circle (circular data). To cite this package in publication: (for the use of the one-sample KS test) Dimitrina S. Dimitrova, Vladimir K. Kaishev, and Senren Tan. Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed, or Continuous. Journal of Statistical Software. 2020; 95(10): 1--42. <doi:10.18637/jss.v095.i10>. (for the use of the two-sample KS and Kuiper tests) Dimitrina S. Dimitrova, Yun Jia and Vladimir K. Kaishev (2024). The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests. submitted. |
| Authors: | Dimitrina S. Dimitrova <[email protected]>, Yun Jia <[email protected]>, Vladimir K. Kaishev <[email protected]>, Senren Tan <[email protected]> |
| Maintainer: | Dimitrina S. Dimitrova <[email protected]> |
| License: | GPL (>= 2.0) |
| Version: | 2.0.2 |
| Built: | 2026-05-09 08:46:37 UTC |
| Source: | https://github.com/cran/KSgeneral |
This package computes p-values of the one-sample and two-sample Kolmogorov-Smirnov (KS) tests and the two-sample Kuiper test.
The one-sample two-sided Kolmogorov-Smirnov (KS) statistic is one of the most popular goodness-of-fit test statistics that is used to measure how well the distribution of a random sample agrees with a prespecified theoretical distribution.
Given a random sample $\{X_1, \ldots, X_n\}$ of size $n$ with an empirical cdf $F_n(x)$, the two-sided KS statistic is defined as
$D_n = \sup_x |F_n(x) - F(x)|$, where $F(x)$ is the cdf of the prespecified theoretical distribution under the null hypothesis $H_0$ that $\{X_1, \ldots, X_n\}$ comes from $F(x)$.
The package KSgeneral implements a novel, accurate and efficient Fast Fourier Transform (FFT)-based method, referred to as the Exact-KS-FFT method, to compute the complementary cdf,
$P(D_n \ge q)$, at a fixed $q \in [0, 1]$ for a given (hypothesized) purely discrete, mixed or continuous underlying cdf $F(x)$, and arbitrary, possibly very large sample size $n$.
A plot of the complementary cdf $P(D_n \ge q)$, $0 \le q \le 1$, can also be produced.
In other words, the package computes the p-value $P(D_n \ge q)$ for any fixed critical level $q \in [0, 1]$.
If an observed (data) sample $\{x_1, \ldots, x_n\}$ is supplied, KSgeneral computes the p-value $P(D_n \ge d_n)$, where $d_n$ is the value of the KS test statistic computed based on $\{x_1, \ldots, x_n\}$. One can also compute the (complementary) cdf for the one-sided KS statistics $D_n^-$ or $D_n^+$ (cf. Dimitrova, Kaishev, Tan (2020)) by specifying $A_i = 0$ for all $i$ or $B_i = 1$ for all $i$, respectively, in the function ks_c_cdf_Rcpp.
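The two-sample Kolmogorov-Smirnov (KS) and the Kuiper statistics are widely used to test the null hypothesis ($H_0$) that two data samples come from the same underlying distribution. Consider a pair of random samples $\mathbf{X}_m = (X_1, \ldots, X_m)$ and $\mathbf{Y}_n = (Y_1, \ldots, Y_n)$ of sizes $m$ and $n$, with empirical cdfs $F_m(t)$ and $G_n(t)$ respectively, coming from unknown cdfs $F(x)$ and $G(x)$. It is assumed that $F(x)$ and $G(x)$ may be continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. We want to test the null hypothesis $H_0: F(x) = G(x)$ for all $x$, either against the alternative hypothesis $H_1: F(x) \ne G(x)$ for at least one $x$, which corresponds to the two-sided test, or against $H_1: F(x) > G(x)$ and $H_1: F(x) < G(x)$ for at least one $x$, which corresponds to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics that are used to test these hypotheses are generally defined as:

$$\Delta_{m,n} = \sup_t |F_m(t) - G_n(t)| \, W(E_{m+n}(t)) \quad \text{against } H_1: F(x) \ne G(x),$$
$$\Delta_{m,n}^{+} = \sup_t [F_m(t) - G_n(t)] \, W(E_{m+n}(t)) \quad \text{against } H_1: F(x) > G(x),$$
$$\Delta_{m,n}^{-} = \sup_t [G_n(t) - F_m(t)] \, W(E_{m+n}(t)) \quad \text{against } H_1: F(x) < G(x),$$

where $E_{m+n}(t)$ is the empirical cdf of the pooled sample $\mathbf{Z}_{m+n} = (X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ and $W(\cdot)$ is a strictly positive weight function defined on $[0, 1]$. KSgeneral implements an exact algorithm, which is an extension of the Fortran 77 subroutine due to Nikiforov (1994), to calculate the exact p-value $P(D_{m,n} \ge q)$, where $q \in [0, 1]$ and $D_{m,n}$ is the two-sample Kolmogorov-Smirnov goodness-of-fit statistic defined on the space $\Omega$ of all possible pairs of samples $\mathbf{X}'_m$ and $\mathbf{Y}'_n$ of sizes $m$ and $n$ that are randomly drawn from the pooled sample $\mathbf{Z}_{m+n}$ without replacement. If two data samples $\{x_1, \ldots, x_m\}$ and $\{y_1, \ldots, y_n\}$ are supplied, the package computes $P(D_{m,n} \ge d)$, where $d$ is the observed value of $\Delta_{m,n}$ computed based on these two observed samples. Samples may come from any continuous, discrete or mixed distribution, i.e. the test allows repeated observations to appear in the user provided data samples $\{x_1, \ldots, x_m\}$, $\{y_1, \ldots, y_n\}$ and their pooled sample $\{x_1, \ldots, x_m, y_1, \ldots, y_n\}$.

The two-sample (unweighted) Kuiper goodness-of-fit statistic is defined as:

$$\varsigma_{m,n} = \sup_t [F_m(t) - G_n(t)] - \inf_t [F_m(t) - G_n(t)].$$

It is widely used when the data samples are periodic or circular (data that are measured in radians). KSgeneral calculates the exact p-value $P(V_{m,n} \ge q)$, where $q \in [0, 2]$ and $V_{m,n}$ is the two-sample Kuiper goodness-of-fit statistic defined on the space $\Omega$ described above. If two data samples $\{x_1, \ldots, x_m\}$ and $\{y_1, \ldots, y_n\}$ are supplied, the package computes $P(V_{m,n} \ge v)$, where $v$ is the observed value of $\varsigma_{m,n}$ computed based on these two observed samples. As for the KS test, the two-sample Kuiper test also allows repeated observations in the user provided data samples $\{x_1, \ldots, x_m\}$, $\{y_1, \ldots, y_n\}$ and their pooled sample $\{x_1, \ldots, x_m, y_1, \ldots, y_n\}$.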
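As a minimal illustration of the two definitions, the base-R sketch below evaluates the unweighted two-sample KS statistic, $\sup_t |F_m(t) - G_n(t)|$, and the Kuiper statistic, $\sup_t [F_m(t) - G_n(t)] - \inf_t [F_m(t) - G_n(t)]$, on the pooled sample with ties allowed. The weight function is assumed to be identically 1, and the helper name `two_sample_stats` is hypothetical (it is not part of KSgeneral). Only the statistics are computed here, not the exact p-values.

```r
# Hypothetical helper (not part of KSgeneral): unweighted two-sample statistics.
two_sample_stats <- function(x, y) {
  pooled <- sort(unique(c(x, y)))           # ties collapse to distinct jump points
  d <- ecdf(x)(pooled) - ecdf(y)(pooled)    # F_m(t) - G_n(t) at each jump point
  # Left of the smallest pooled value both empirical cdfs are 0, so the
  # difference 0 also belongs to the range over which sup and inf are taken.
  d <- c(0, d)
  c(ks = max(abs(d)), kuiper = max(d) - min(d))
}

x <- c(1, 2, 2, 3, 5)   # repeated observations are allowed
y <- c(2, 4, 4, 6)
stats_xy <- two_sample_stats(x, y)
print(stats_xy)

# A case where the two statistics differ: the difference F_m - G_n
# takes both positive and negative values.
stats_ab <- two_sample_stats(c(1, 4), c(2, 3))
print(stats_ab)
```

The Kuiper statistic is never smaller than the two-sided KS statistic, and the two coincide whenever $F_m(t) - G_n(t)$ does not change sign.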
One-sample KS test:
The Exact-KS-FFT method to compute p-values of the one-sample KS test in KSgeneral is based on expressing the p-value $P(D_n \ge q)$ in terms of an appropriate rectangle probability with respect to the uniform order statistics, as noted by Gleser (1985) for $P(D_n > q)$.
The latter representation is used to express $P(D_n \ge q)$ via a double-boundary non-crossing probability for a homogeneous Poisson process with intensity $n$, which is then efficiently computed using FFT, ensuring a total run-time of order $O(n^2 \log(n))$ (see Dimitrova, Kaishev, Tan (2020) and also Moscovich and Nadler (2017) for the special case when $F(x)$ is continuous).
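For a continuous null cdf, the rectangle-probability representation can be checked numerically. The base-R sketch below assumes the continuous-case bounds $A_i = i/n - q$ and $B_i = (i-1)/n + q$ (clipped to $[0, 1]$), which follow from writing $D_n$ in terms of the uniform order statistics $U_{(i)}$. It verifies on simulated uniform samples that the event $D_n < q$ coincides with $A_i < U_{(i)} < B_i$ for all $i$. The helper name and simulation size are illustrative choices, and this is a Monte Carlo check, not the package's FFT algorithm.

```r
set.seed(123)
n <- 20
q <- 0.25
i <- seq_len(n)
A <- pmax(i / n - q, 0)        # lower bounds for U_(i) (continuous F assumed)
B <- pmin((i - 1) / n + q, 1)  # upper bounds for U_(i)

# Two-sided KS statistic of a sorted uniform sample against F(u) = u.
ks_stat_unif <- function(u_sorted) {
  k <- seq_along(u_sorted)
  max(k / length(u_sorted) - u_sorted, u_sorted - (k - 1) / length(u_sorted))
}

n_sim <- 2000
below_q <- logical(n_sim)
in_rect <- logical(n_sim)
for (s in seq_len(n_sim)) {
  u <- sort(runif(n))
  below_q[s] <- ks_stat_unif(u) < q      # event {D_n < q}
  in_rect[s] <- all(u > A & u < B)       # event {A_i < U_(i) < B_i for all i}
}

agree <- all(below_q == in_rect)         # the two events should coincide
mc_ccdf <- 1 - mean(in_rect)             # Monte Carlo estimate of P(D_n >= q)
print(c(agree = agree, mc_ccdf = mc_ccdf))
```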
The code for the one-sample KS test in KSgeneral represents an R wrapper of the original C++ code due to Dimitrova, Kaishev, Tan (2020) and based on the C++ code developed by Moscovich and Nadler (2017).
The package includes the functions disc_ks_c_cdf, mixed_ks_c_cdf and cont_ks_c_cdf that compute the complementary cdf $P(D_n \ge q)$, for a fixed $q$, $0 \le q \le 1$, when $F(x)$ is purely discrete, mixed or continuous, respectively.
KSgeneral also includes the functions disc_ks_test, mixed_ks_test and cont_ks_test that compute the p-value $P(D_n \ge d_n)$, where $d_n$ is the value of the KS test statistic computed based on a user provided data sample $\{x_1, \ldots, x_n\}$, when $F(x)$ is purely discrete, mixed or continuous, respectively.
The functions disc_ks_test and cont_ks_test represent accurate and fast (run time $O(n^2 \log(n))$) alternatives to the function ks.test from the package dgof and the function ks.test from the package stats, which compute p-values $P(D_n \ge d_n)$, assuming $F(x)$ is purely discrete or continuous, respectively.
The package also includes the function ks_c_cdf_Rcpp, which gives the flexibility to compute the complementary cdf (p-value) for the one-sided KS test statistics $D_n^-$ or $D_n^+$.
It also allows for faster computation time and possibly higher accuracy in computing $P(D_n \ge q)$.
Two-sample KS test and Kuiper test:
The method for computing p-values of the two-sample KS and Kuiper tests in KSgeneral is an extension of the algorithm due to Nikiforov (1994) and is based on expressing the p-value as the probability that a point sequence stays within a certain region in the two-dimensional integer-valued lattice. The algorithm for both tests uses a recursive formula to calculate the total number of point sequences within the region, which is divided by the total number of elements in the space $\Omega$ of all possible pairs of samples, i.e. $\frac{(m+n)!}{m!\,n!}$, to obtain the probability.
For a particular realization of the pooled sample $\mathbf{Z}_{m+n}$, the p-values calculated by the functions KS2sample and Kuiper2sample are the probabilities:

$$P(D_{m,n} \ge q), \qquad P(V_{m,n} \ge q),$$

where $D_{m,n}$ and $V_{m,n}$ are the two-sample Kolmogorov-Smirnov and Kuiper test statistics respectively, for two samples $\mathbf{X}'_m$ and $\mathbf{Y}'_n$ of sizes $m$ and $n$, randomly drawn from the pooled sample without replacement, i.e. they are defined on the space $\Omega$, and $q \in [0, 1]$ for the KS test, $q \in [0, 2]$ for the Kuiper test.
Both KS2sample and Kuiper2sample implement algorithms which generalize the method due to Nikiforov (1994), and calculate the exact p-values of the KS test and the Kuiper test respectively. Both of them allow tested data samples to come from continuous, discrete or mixed distributions (ties are also allowed).
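The lattice-path idea can be illustrated in the simplest setting: the unweighted two-sided KS statistic with no ties, where each of the $\binom{m+n}{m}$ splits of the pooled sample corresponds to a monotone path from $(0, 0)$ to $(m, n)$. The base-R sketch below counts, by a simple recursion, the paths that stay strictly inside the region $|i/m - j/n| < d$ and converts the count into $P(D_{m,n} \ge d)$. This is a classical recursion written for illustration with a hypothetical helper name; it is not the generalized algorithm implemented in KS2sample, which also handles ties and weights.

```r
# Hypothetical helper: number of monotone lattice paths from (0, 0) to (m, n)
# whose every point (i, j) satisfies |i * n - j * m| < k (integer arithmetic).
count_paths_within <- function(m, n, k) {
  N <- matrix(0, m + 1, n + 1)           # N[i + 1, j + 1] <-> lattice point (i, j)
  for (i in 0:m) {
    for (j in 0:n) {
      if (abs(i * n - j * m) >= k) next  # point outside the region: no paths
      if (i == 0 && j == 0) {
        N[1, 1] <- 1
        next
      }
      from_left  <- if (i > 0) N[i, j + 1] else 0   # step from (i - 1, j)
      from_below <- if (j > 0) N[i + 1, j] else 0   # step from (i, j - 1)
      N[i + 1, j + 1] <- from_left + from_below
    }
  }
  N[m + 1, n + 1]
}

set.seed(1)
x <- rnorm(7)
y <- rnorm(9)                            # continuous data, so no ties
m <- length(x)
n <- length(y)
pooled <- sort(c(x, y))
d <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))
k <- round(d * m * n)                    # d is a multiple of 1 / (m * n)

p_exact <- 1 - count_paths_within(m, n, k) / choose(m + n, m)
p_ref <- stats::ks.test(x, y, exact = TRUE)$p.value
print(c(recursion = p_exact, ks.test = p_ref))
```

The full matrix is kept here for readability; a single row would suffice, since each entry depends only on its left and lower neighbours.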
KS2sample ensures a total worst-case run-time of order $O(nm)$. Compared with other known algorithms, it not only allows more flexible choices of weights leading to better power (see Dimitrova, Jia, Kaishev (2024)), but is also more efficient and more generally applicable for large sample sizes. Kuiper2sample is accurate and valid for large sample sizes. It ensures a total worst-case run-time of order $O((mn)^2)$. When $m$ and $n$ have a large greatest common divisor (an extreme case is $m = n$), it ensures a total worst-case run-time of order $O(m^2 n)$.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Gleser L.J. (1985). "Exact Power of Goodness-of-Fit Tests of Kolmogorov Type for Discontinuous Distributions". Journal of the American Statistical Association, 80(392), 954-958.
Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
Computes the complementary cdf $P(D_n \ge q)$ at a fixed $q$, $q \in [0, 1]$, for the one-sample two-sided Kolmogorov-Smirnov statistic, $D_n$, for a given sample size $n$, when the cdf $F(x)$ under the null hypothesis is continuous.

cont_ks_c_cdf(q, n)

| Argument | Description |
|---|---|
| q | numeric value between 0 and 1, at which the complementary cdf $P(D_n \ge q)$ is computed |
| n | the sample size |

Given a random sample $\{X_1, \ldots, X_n\}$ of size $n$ with an empirical cdf $F_n(x)$, the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as $D_n = \sup_x |F_n(x) - F(x)|$, where $F(x)$ is the cdf of a prespecified theoretical distribution under the null hypothesis $H_0$ that $\{X_1, \ldots, X_n\}$ comes from $F(x)$.
The function cont_ks_c_cdf implements the FFT-based algorithm proposed by Moscovich and Nadler (2017) to compute the complementary cdf $P(D_n \ge q)$ at a value $q$, when $F(x)$ is continuous.
This algorithm ensures a total worst-case run-time of order $O(n^2 \log(n))$, which makes it more efficient and numerically stable than the algorithm proposed by Marsaglia et al. (2003).
The latter is used by many existing packages computing the cdf of $D_n$, e.g., the function ks.test in the package stats and the function ks.test in the package dgof.
More precisely, in these packages, the exact p-value $P(D_n \ge q)$ is computed only in the case when $q = d_n$, where $d_n$ is the value of the KS test statistic computed based on a user provided sample $\{x_1, \ldots, x_n\}$.
Another limitation of the functions ks.test is that the sample size should be less than 100, and the computation time is $O(n^3)$.
In contrast, the function cont_ks_c_cdf provides results with at least 10 correct digits after the decimal point for sample sizes $n$ up to 100000, with a computation time of 16 seconds on a machine with a 2.5 GHz Intel Core i5 processor and 4 GB RAM, running MacOS X Yosemite.
For $n > 100000$, results of similar accuracy can still be computed, at the cost of a longer computation time.
See Dimitrova, Kaishev, Tan (2020), Appendix C for further details and examples.
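A quick way to sanity-check a value such as $P(D_{100} \ge 0.05)$ is a Monte Carlo estimate under a continuous null, since the distribution of $D_n$ then does not depend on $F(x)$. The base-R sketch below simulates uniform samples and takes the statistic from stats::ks.test; the call to cont_ks_c_cdf is attempted only if KSgeneral is installed. The number of replications is an arbitrary choice for illustration.

```r
set.seed(2020)
n <- 100
q <- 0.05

# Simulate D_n under H0 for a continuous F (uniform samples against punif).
d_sim <- replicate(5000, unname(stats::ks.test(runif(n), "punif")$statistic))

mc_estimate <- mean(d_sim >= q)                        # estimate of P(D_n >= q)
mc_se <- sqrt(mc_estimate * (1 - mc_estimate) / length(d_sim))
print(c(estimate = mc_estimate, std_error = mc_se))

# Exact value for comparison, if the package is available.
if (requireNamespace("KSgeneral", quietly = TRUE)) {
  print(KSgeneral::cont_ks_c_cdf(q, n))
}
```

The Monte Carlo standard error shrinks only with the square root of the number of replications, so such a simulation is useful as a rough check rather than as a substitute for the exact computation.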
Numeric value corresponding to $P(D_n \ge q)$.
Based on the C++ code available at https://github.com/mosco/crossing-probability developed by Moscovich and Nadler (2017). See also Dimitrova, Kaishev, Tan (2020) for more details.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Marsaglia G., Tsang WW., Wang J. (2003). "Evaluating Kolmogorov's Distribution". Journal of Statistical Software, 8(18), 1-4.
Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.
Examples:

    ## Compute the value for P(D_{100} >= 0.05)
    KSgeneral::cont_ks_c_cdf(0.05, 100)

    ## Compute P(D_{n} >= q)
    ## for n = 100, q = 1/500, 2/500, ..., 500/500
    ## and then plot the corresponding values against q
    n <- 100
    q <- 1:500/500
    plot(q, sapply(q, function(x) KSgeneral::cont_ks_c_cdf(x, n)), type='l')

    ## Compute P(D_{n} >= q) for n = 141, nq^{2} = 2.1 as shown
    ## in Table 18 of Dimitrova, Kaishev, Tan (2020)
    KSgeneral::cont_ks_c_cdf(sqrt(2.1/141), 141)

Computes the cdf $P(D_n \le q)$ at a fixed $q$, $q \in [0, 1]$, for the one-sample two-sided Kolmogorov-Smirnov statistic, $D_n$, for a given sample size $n$, when the cdf $F(x)$ under the null hypothesis is continuous.

cont_ks_cdf(q, n)

| Argument | Description |
|---|---|
| q | numeric value between 0 and 1, at which the cdf $P(D_n \le q)$ is computed |
| n | the sample size |

Given a random sample $\{X_1, \ldots, X_n\}$ of size $n$ with an empirical cdf $F_n(x)$, the Kolmogorov-Smirnov goodness-of-fit statistic is defined as $D_n = \sup_x |F_n(x) - F(x)|$, where $F(x)$ is the cdf of a prespecified theoretical distribution under the null hypothesis $H_0$ that $\{X_1, \ldots, X_n\}$ comes from $F(x)$.
The function cont_ks_cdf implements the FFT-based algorithm proposed by Moscovich and Nadler (2017) to compute the cdf $P(D_n \le q)$ at a value $q$, when $F(x)$ is continuous.
This algorithm ensures a total worst-case run-time of order $O(n^2 \log(n))$, which makes it more efficient and numerically stable than the algorithm proposed by Marsaglia et al. (2003).
The latter is used by many existing packages computing the cdf of $D_n$, e.g., the function ks.test in the package stats and the function ks.test in the package dgof.
More precisely, in these packages, the exact p-value $P(D_n \ge q)$ is computed only in the case when $q = d_n$, where $d_n$ is the value of the KS statistic computed based on a user provided sample $\{x_1, \ldots, x_n\}$.
Another limitation of the functions ks.test is that the sample size should be less than 100, and the computation time is $O(n^3)$.
In contrast, the function cont_ks_cdf provides results with at least 10 correct digits after the decimal point for sample sizes $n$ up to 100000, with a computation time of 16 seconds on a machine with a 2.5 GHz Intel Core i5 processor and 4 GB RAM, running MacOS X Yosemite.
For $n > 100000$, results of similar accuracy can still be computed, at the cost of a longer computation time.
See Dimitrova, Kaishev, Tan (2020), Appendix B for further details and examples.
Numeric value corresponding to $P(D_n \le q)$.
Based on the C++ code available at https://github.com/mosco/crossing-probability developed by Moscovich and Nadler (2017). See also Dimitrova, Kaishev, Tan (2020) for more details.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Marsaglia G., Tsang WW., Wang J. (2003). "Evaluating Kolmogorov's Distribution". Journal of Statistical Software, 8(18), 1-4.
Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.
Examples:

    ## Compute the value for P(D_{100} <= 0.05)
    KSgeneral::cont_ks_cdf(0.05, 100)

    ## Compute P(D_{n} <= q)
    ## for n = 100, q = 1/500, 2/500, ..., 500/500
    ## and then plot the corresponding values against q
    n <- 100
    q <- 1:500/500
    plot(q, sapply(q, function(x) KSgeneral::cont_ks_cdf(x, n)), type='l')

    ## Compute P(D_{n} <= q) for n = 40, nq^{2} = 0.76 as shown
    ## in Table 9 of Dimitrova, Kaishev, Tan (2020)
    KSgeneral::cont_ks_cdf(sqrt(0.76/40), 40)

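Computes the p-value $P(D_n \ge d_n)$, where $d_n$ is the value of the KS test statistic computed based on a data sample $\{x_1, \ldots, x_n\}$, when $F(x)$ is continuous.

cont_ks_test(x, y, ...)

| Argument | Description |
|---|---|
| x | a numeric vector of data sample values $\{x_1, \ldots, x_n\}$ |
| y | a pre-specified continuous cdf, $F(x)$, under the null hypothesis, given as a character string naming the cdf (e.g. "pexp", as in the example) |
| ... | values of the parameters of the cdf, $F(x)$, specified by y |
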
Given a random sample $\{X_1, \ldots, X_n\}$ of size $n$ with an empirical cdf $F_n(x)$, the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as $D_n = \sup_x |F_n(x) - F(x)|$, where $F(x)$ is the cdf of a prespecified theoretical distribution under the null hypothesis $H_0$ that $\{X_1, \ldots, X_n\}$ comes from $F(x)$.
The function cont_ks_test implements the FFT-based algorithm proposed by Moscovich and Nadler (2017) to compute the p-value $P(D_n \ge d_n)$, where $d_n$ is the value of the KS test statistic computed based on a user provided data sample $\{x_1, \ldots, x_n\}$, assuming $F(x)$ is continuous.
This algorithm ensures a total worst-case run-time of order $O(n^2 \log(n))$, which makes it more efficient and numerically stable than the algorithm proposed by Marsaglia et al. (2003).
The latter is used by many existing packages computing the cdf of $D_n$, e.g., the function ks.test in the package stats and the function ks.test in the package dgof.
A limitation of the functions ks.test is that the sample size should be less than 100, and the computation time is $O(n^3)$.
In contrast, the function cont_ks_test provides results with at least 10 correct digits after the decimal point for sample sizes $n$ up to 100000, with a computation time of 16 seconds on a machine with a 2.5 GHz Intel Core i5 processor and 4 GB RAM, running MacOS X Yosemite.
For $n > 100000$, results of similar accuracy can still be computed, at the cost of a longer computation time.
See Dimitrova, Kaishev, Tan (2020), Appendix C for further details and examples.
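The role of the `...` argument can be illustrated with a null cdf that has a parameter. In the sketch below, a sample is tested against an exponential distribution with rate 2. stats::ks.test serves as the reference, the statistic $d_n$ is also recomputed from its definition, and the KSgeneral call is guarded so the snippet runs where the package is not installed. Passing `rate = 2` through `...` to cont_ks_test is assumed here to mirror the ks.test convention.

```r
set.seed(42)
x <- rexp(50, rate = 2)

# Reference result from base R; parameters of the null cdf go through `...`.
ref <- stats::ks.test(x, "pexp", rate = 2, exact = TRUE)
d_n <- unname(ref$statistic)

# The same statistic recomputed from its definition via the order statistics.
fx <- pexp(sort(x), rate = 2)
k <- seq_along(fx)
d_manual <- max(k / length(fx) - fx, fx - (k - 1) / length(fx))
print(c(ks.test = d_n, manual = d_manual))

# Difference between the two p-values, if KSgeneral is available.
if (requireNamespace("KSgeneral", quietly = TRUE)) {
  fft_res <- KSgeneral::cont_ks_test(x, "pexp", rate = 2)
  print(abs(fft_res$p.value - ref$p.value))
}
```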
A list with class "htest" containing the following components:

| Component | Description |
|---|---|
| statistic | the value of the statistic. |
| p.value | the p-value of the test. |
| alternative | "two-sided". |
| data.name | a character string giving the name of the data. |

Based on the C++ code available at https://github.com/mosco/crossing-probability developed by Moscovich and Nadler (2017). See also Dimitrova, Kaishev, Tan (2020) for more details.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.
Examples:

    ## Comparing the p-values obtained by stats::ks.test
    ## and KSgeneral::cont_ks_test
    x <- abs(rnorm(100))
    p.kt <- ks.test(x, "pexp", exact = TRUE)$p.value
    p.kt_fft <- KSgeneral::cont_ks_test(x, "pexp")$p.value
    abs(p.kt - p.kt_fft)

Computes the complementary cdf $P(D_n \ge q)$ at a fixed $q$, $q \in [0, 1]$, of the one-sample two-sided Kolmogorov-Smirnov (KS) statistic, when the cdf $F(x)$ under the null hypothesis is purely discrete, using the Exact-KS-FFT method expressing the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)).
Moreover, for comparison purposes, disc_ks_c_cdf gives, as an option, the possibility to compute (an approximate value for) the asymptotic $P(D_n \ge q)$ using the simulation-based algorithm of Wood and Altavela (1978).

disc_ks_c_cdf(q, n, y, ..., exact = NULL, tol = 1e-08, sim.size = 1e+06, num.sim = 10)

| Argument | Description |
|---|---|
| q | numeric value between 0 and 1, at which the complementary cdf $P(D_n \ge q)$ is computed |
| n | the sample size |
| y | a pre-specified discrete cdf, $F(x)$, under the null hypothesis, supplied as a step function (e.g. created by stepfun or ecdf, as in the examples) |
| ... | values of the parameters of the cdf, $F(x)$, specified by y |
| exact | logical variable specifying whether one wants to compute the exact p-value $P(D_n \ge q)$ using the Exact-KS-FFT method (exact = TRUE) or an approximate p-value using the simulation-based algorithm of Wood and Altavela (1978) (exact = FALSE). By default, exact = NULL. |
| tol | the value of $\epsilon$ used in computing $A_i$ and $B_i$, $i = 1, \ldots, n$ (see Step 1 of Dimitrova, Kaishev, Tan (2020)). By default, tol = 1e-08. |
| sim.size | the required number of simulated trajectories in order to produce one Monte Carlo estimate (one MC run) of the asymptotic complementary cdf using the algorithm of Wood and Altavela (1978). By default, sim.size = 1e+06. |
| num.sim | the number of MC runs, each producing one estimate (based on sim.size simulated trajectories), which are then averaged to produce the final estimate. By default, num.sim = 10. |

Given a random sample $\{X_1, \ldots, X_n\}$ of size $n$ with an empirical cdf $F_n(x)$, the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as $D_n = \sup_x |F_n(x) - F(x)|$, where $F(x)$ is the cdf of a prespecified theoretical distribution under the null hypothesis $H_0$ that $\{X_1, \ldots, X_n\}$ comes from $F(x)$.
The function disc_ks_c_cdf implements the Exact-KS-FFT method, proposed by Dimitrova, Kaishev, Tan (2020), to compute the complementary cdf $P(D_n \ge q)$ at a value $q$, when $F(x)$ is purely discrete.
This algorithm ensures a total worst-case run-time of order $O(n^2 \log(n))$, which makes it more efficient and numerically stable than the only alternative algorithm, developed by Arnold and Emerson (2011) and implemented as the function ks.test in the package dgof.
The latter only computes a p-value $P(D_n \ge d_n)$, corresponding to the value $d_n$ of the KS test statistic computed based on a user provided sample $\{x_1, \ldots, x_n\}$.
More precisely, in the package dgof (function ks.test), the p-value for a one-sample two-sided KS test is calculated by combining the approaches of Gleser (1985) and Niederhausen (1981). However, the function ks.test only provides exact p-values for $n \le 30$, since, as noted by the authors (see Arnold and Emerson (2011)), numerical instabilities may occur when $n$ is large. In the latter case, ks.test uses simulation to approximate p-values, which may be rather slow and inaccurate (see Table 6 of Dimitrova, Kaishev, Tan (2020)).
Thus, making use of the Exact-KS-FFT method, the function disc_ks_c_cdf provides an exact and highly computationally efficient (alternative) way of computing $P(D_n \ge q)$ at a value $q$, when $F(x)$ is purely discrete.
Lastly, incorporated into the function disc_ks_c_cdf is the MC simulation-based method of Wood and Altavela (1978) for estimating the asymptotic complementary cdf of $D_n$. The latter method is the default method behind disc_ks_c_cdf when the sample size satisfies $n \ge 100000$.
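Because the discrete null cdf is passed as a step function in the examples for this function, it helps to confirm that the object really is the intended right-continuous cdf. The base-R sketch below builds the Binomial(3, 0.5) cdf with stepfun and checks it against pbinom at and between the jump points. The evaluation points are arbitrary, and the final call is run only if KSgeneral is installed.

```r
# Jump points 0, 1, 2, 3; the first value (0) applies left of the first jump.
binom_3 <- stepfun(0:3, c(0, pbinom(0:3, size = 3, prob = 0.5)))

test_points <- c(-1, 0, 0.5, 1, 2.9, 3, 10)
cdf_table <- cbind(
  x = test_points,
  stepfun = binom_3(test_points),
  pbinom = pbinom(test_points, size = 3, prob = 0.5)
)
print(cdf_table)

# Exact complementary cdf P(D_400 >= 0.05) for this null, if available.
if (requireNamespace("KSgeneral", quietly = TRUE)) {
  print(KSgeneral::disc_ks_c_cdf(0.05, 400, binom_3))
}
```

With the default `right = FALSE`, stepfun takes the new value at each jump point itself, which is the right-continuity a cdf requires.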
Numeric value corresponding to $P(D_n \ge q)$.
Arnold T.A., Emerson J.W. (2011). "Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions". The R Journal, 3(2), 34-39.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Gleser L.J. (1985). "Exact Power of Goodness-of-Fit Tests of Kolmogorov Type for Discontinuous Distributions". Journal of the American Statistical Association, 80(392), 954-958.
Niederhausen H. (1981). "Sheffer Polynomials for Computing Exact Kolmogorov-Smirnov and Renyi Type Distributions". The Annals of Statistics, 58-64.
Wood C.L., Altavela M.M. (1978). "Large-Sample Results for Kolmogorov-Smirnov Statistics for Discrete Distributions". Biometrika, 65(1), 235-239.
Examples:

    ## Example to compute the exact complementary cdf for D_{n}
    ## when the underlying cdf F(x) is a binomial(3, 0.5) distribution,
    ## as shown in Example 3.4 of Dimitrova, Kaishev, Tan (2020)
    binom_3 <- stepfun(c(0:3), c(0, pbinom(0:3, 3, 0.5)))
    KSgeneral::disc_ks_c_cdf(0.05, 400, binom_3)

    ## Not run:
    ## Compute P(D_{n} >= q) for n = 100,
    ## q = 1/5000, 2/5000, ..., 5000/5000, when
    ## the underlying cdf F(x) is a binomial(3, 0.5) distribution,
    ## as shown in Example 3.4 of Dimitrova, Kaishev, Tan (2020),
    ## and then plot the corresponding values against q,
    ## i.e. plot the resulting complementary cdf of D_{n}
    n <- 100
    q <- 1:5000/5000
    binom_3 <- stepfun(c(0:3), c(0, pbinom(0:3, 3, 0.5)))
    plot(q, sapply(q, function(x) KSgeneral::disc_ks_c_cdf(x, n, binom_3)), type='l')
    ## End(Not run)

    ## Not run:
    ## Example to compute the asymptotic complementary cdf for D_{n}
    ## based on Wood and Altavela (1978),
    ## when the underlying cdf F(x) is a binomial(3, 0.5) distribution,
    ## as shown in Example 3.4 of Dimitrova, Kaishev, Tan (2020)
    binom_3 <- stepfun(c(0:3), c(0, pbinom(0:3, 3, 0.5)))
    KSgeneral::disc_ks_c_cdf(0.05, 400, binom_3, exact = FALSE, tol = 1e-08,
                             sim.size = 1e+06, num.sim = 10)
    ## End(Not run)

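Computes the p-value $P(D_n \ge d_n)$, where $d_n$ is the value of the KS test statistic computed based on a data sample $\{x_1, \ldots, x_n\}$, when $F(x)$ is purely discrete, using the Exact-KS-FFT method expressing the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)).

disc_ks_test(x, y, ..., exact = NULL, tol = 1e-08, sim.size = 1e+06, num.sim = 10)

| Argument | Description |
|---|---|
| x | a numeric vector of data sample values $\{x_1, \ldots, x_n\}$ |
| y | a pre-specified discrete cdf, $F(x)$, under the null hypothesis, supplied as a step function (e.g. created by stepfun or ecdf, as in the examples) |
| ... | values of the parameters of the cdf, $F(x)$, specified by y |
| exact | logical variable specifying whether one wants to compute the exact p-value $P(D_n \ge d_n)$ using the Exact-KS-FFT method (exact = TRUE) or an approximate p-value using the simulation-based algorithm of Wood and Altavela (1978) (exact = FALSE). By default, exact = NULL. |
| tol | the value of $\epsilon$ used in computing $A_i$ and $B_i$, $i = 1, \ldots, n$ (see Step 1 of Dimitrova, Kaishev, Tan (2020)). By default, tol = 1e-08. |
| sim.size | the required number of simulated trajectories in order to produce one Monte Carlo estimate (one MC run) of the asymptotic p-value using the algorithm of Wood and Altavela (1978). By default, sim.size = 1e+06. |
| num.sim | the number of MC runs, each producing one estimate (based on sim.size simulated trajectories), which are then averaged to produce the final estimate. By default, num.sim = 10. |
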
Given a random sample $\{X_1, \ldots, X_n\}$ of size $n$ with an empirical cdf $F_n(x)$, the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as $D_n = \sup_x |F_n(x) - F(x)|$, where $F(x)$ is the cdf of a prespecified theoretical distribution under the null hypothesis $H_0$ that $\{X_1, \ldots, X_n\}$ comes from $F(x)$.
The function disc_ks_test implements the Exact-KS-FFT method expressing the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)).
It represents an accurate and fast (run time $O(n^2 \log(n))$) alternative to the function ks.test from the package dgof, which computes a p-value $P(D_n \ge d_n)$, where $d_n$ is the value of the KS test statistic computed based on a user provided data sample $\{x_1, \ldots, x_n\}$, assuming $F(x)$ is purely discrete.
In the function ks.test, the p-value for a one-sample two-sided KS test is calculated by combining the approaches of Gleser (1985) and Niederhausen (1981). However, the function ks.test due to Arnold and Emerson (2011) only provides exact p-values for $n \le 30$, since, as noted by the authors, numerical instabilities may occur when $n$ is large. In the latter case, ks.test uses simulation to approximate p-values, which may be rather slow and inaccurate (see Table 6 of Dimitrova, Kaishev, Tan (2020)).
Thus, making use of the Exact-KS-FFT method, the function disc_ks_test provides an exact and highly computationally efficient (alternative) way of computing the p-value $P(D_n \ge d_n)$, when $F(x)$ is purely discrete.
Lastly, incorporated into the function disc_ks_test is the MC simulation-based method of Wood and Altavela (1978) for estimating the asymptotic p-value of $D_n$. The latter method is the default method behind disc_ks_test when the sample size satisfies $n \ge 100000$.
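For a purely discrete null, both $F_n(x)$ and $F(x)$ are right-continuous step functions, so when the sample values lie in the support of $F(x)$ the supremum in $D_n = \sup_x |F_n(x) - F(x)|$ is attained at a support point. The base-R sketch below computes $d_n$ this way for the discrete uniform distribution on 1, ..., 10. The helper name `disc_ks_stat` is hypothetical, and the comparison with disc_ks_test runs only if KSgeneral is installed.

```r
support <- 1:10
null_cdf <- ecdf(support)   # discrete uniform on 1, ..., 10 as a step function

# Hypothetical helper: KS statistic evaluated at the support points of F.
disc_ks_stat <- function(x, cdf, support) {
  max(abs(ecdf(x)(support) - cdf(support)))
}

set.seed(7)
x <- sample(support, 25, replace = TRUE)
d_n <- disc_ks_stat(x, null_cdf, support)
print(d_n)

# A degenerate sample puts all its mass on 1, so d_n = |1 - 0.1| = 0.9.
d_degenerate <- disc_ks_stat(rep(1, 4), null_cdf, support)
print(d_degenerate)

# Statistic reported by the package for the same data, if available.
if (requireNamespace("KSgeneral", quietly = TRUE)) {
  print(KSgeneral::disc_ks_test(x, null_cdf, exact = TRUE)$statistic)
}
```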
A list with class "htest" containing the following components:

| Component | Description |
|---|---|
| statistic | the value of the statistic. |
| p.value | the p-value of the test. |
| alternative | "two-sided". |
| data.name | a character string giving the name of the data. |

Arnold T.A., Emerson J.W. (2011). "Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions". The R Journal, 3(2), 34-39.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Gleser L.J. (1985). "Exact Power of Goodness-of-Fit Tests of Kolmogorov Type for Discontinuous Distributions". Journal of the American Statistical Association, 80(392), 954-958.
Niederhausen H. (1981). "Sheffer Polynomials for Computing Exact Kolmogorov-Smirnov and Renyi Type Distributions". The Annals of Statistics, 58-64.
Wood C.L., Altavela M.M. (1978). "Large-Sample Results for Kolmogorov-Smirnov Statistics for Discrete Distributions". Biometrika, 65(1), 235-239.
Examples:

    # Comparison of results obtained from dgof::ks.test
    # and KSgeneral::disc_ks_test, when F(x) follows the discrete
    # Uniform[1, 10] distribution as in Example 3.5 of
    # Dimitrova, Kaishev, Tan (2020)

    # When the sample size is larger than 100, the
    # function dgof::ks.test will be numerically
    # unstable

    x3 <- sample(1:10, 25, replace = TRUE)
    KSgeneral::disc_ks_test(x3, ecdf(1:10), exact = TRUE)
    dgof::ks.test(x3, ecdf(1:10), exact = TRUE)
    KSgeneral::disc_ks_test(x3, ecdf(1:10), exact = TRUE)$p -
              dgof::ks.test(x3, ecdf(1:10), exact = TRUE)$p

    x4 <- sample(1:10, 500, replace = TRUE)
    KSgeneral::disc_ks_test(x4, ecdf(1:10), exact = TRUE)
    dgof::ks.test(x4, ecdf(1:10), exact = TRUE)
    KSgeneral::disc_ks_test(x4, ecdf(1:10), exact = TRUE)$p -
              dgof::ks.test(x4, ecdf(1:10), exact = TRUE)$p

    # Using stepfun() to specify the same discrete distribution as defined by ecdf():
    steps <- stepfun(1:10, cumsum(c(0, rep(0.1, 10))))
    KSgeneral::disc_ks_test(x3, steps, exact = TRUE)

Function calling directly the C++ routines that compute the complementary cdf for the one-sample two-sided Kolmogorov-Smirnov statistic, given the sample size n and the file "Boundary_Crossing_Time.txt" in the working directory.
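The latter file contains $A_i$ and $B_i$, $i = 1, \ldots, n$, specified in Steps 1 and 2 of the Exact-KS-FFT method (see Equation (5) in Section 2 of Dimitrova, Kaishev, Tan (2020)).
The latter values form the n-dimensional rectangular region for the uniform order statistics (see Equations (3), (5) and (6) in Dimitrova, Kaishev, Tan (2020)), namely
$$P(D_n \ge q) = 1 - P(A_i \le U_{(i)} \le B_i, \; 1 \le i \le n) = 1 - P(g(t) \le \eta_n(t) \le h(t), \; \forall \, 0 \le t \le 1),$$
where $U_{(1)} \le \ldots \le U_{(n)}$ are the uniform order statistics, $\eta_n(t)$ denotes the number of them not exceeding $t$, and the upper and lower boundary functions $h(t)$, $g(t)$ are defined as
$$h(t) = \sum_{i=1}^{n} 1_{(A_i < t)}, \qquad g(t) = \sum_{i=1}^{n} 1_{(B_i \le t)},$$
or equivalently, noting that $h(t)$ and $g(t)$ are correspondingly left and right continuous functions, we have
$$\sup\{t \in [0,1] : h(t) < i\} = A_i \quad \text{and} \quad \inf\{t \in [0,1] : g(t) > i - 1\} = B_i.$$
Note that one can also compute the (complementary) cdf for the one-sided KS statistics $D_n^{-}$ or $D_n^{+}$ (cf. Dimitrova, Kaishev, Tan (2020)) by specifying, correspondingly, $A_i = 0$ for all $i$ or $B_i = 1$ for all $i$ in the function ks_c_cdf_Rcpp.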
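The one-sided specification can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper ks_boundaries() and its type labels are hypothetical, F(x) is taken to be continuous so that the two-sided boundaries are those of the package example (B_i = (i-1)/n + q, A_i = i/n - q), and the file is written to a temporary directory rather than the working directory.

```r
# Hypothetical helper: lower bounds A_i and upper bounds B_i for continuous F(x)
ks_boundaries <- function(n, q, type = c("two.sided", "minus", "plus")) {
  type <- match.arg(type)
  i <- seq_len(n)
  A <- i / n - q        # lower bounds, as in the package example
  B <- (i - 1) / n + q  # upper bounds, as in the package example
  if (type == "minus") A <- rep(0, n)  # D_n^-: A_i = 0 for all i
  if (type == "plus")  B <- rep(1, n)  # D_n^+: B_i = 1 for all i
  list(up_rec = B, low_rec = A)
}

b <- ks_boundaries(n = 10, q = 0.1, type = "plus")

# Same file layout as in the package example: row 1 holds B_i, row 2 holds A_i
out_file <- file.path(tempdir(), "Boundary_Crossing_Time.txt")
write.table(data.frame(rbind(b$up_rec, b$low_rec)), out_file,
            sep = ", ", row.names = FALSE, col.names = FALSE)

# ks_c_cdf_Rcpp() looks for the file in the working directory, e.g.
# setwd(tempdir()); KSgeneral::ks_c_cdf_Rcpp(10)
```

Whether boundary values falling outside [0, 1] need to be truncated before being written is not covered by this sketch.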
ks_c_cdf_Rcpp(n)
| n | the sample size |
|---|---|
Note that all calculations here are done directly in C++ and the result is returned to R.
This leads to faster computation and, in some cases, possibly higher accuracy (depending on the accuracy of the pre-computed values $A_i$ and $B_i$, $i = 1, \ldots, n$, provided in the file "Boundary_Crossing_Time.txt") compared to the functions cont_ks_c_cdf, disc_ks_c_cdf, mixed_ks_c_cdf.
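Given a random sample $\{X_1, \ldots, X_n\}$ of size n with an empirical cdf $F_n(x)$, the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as $D_n = \sup_x |F_n(x) - F(x)|$, where $F(x)$ is the cdf of a prespecified theoretical distribution under the null hypothesis $H_0$ that $\{X_1, \ldots, X_n\}$ comes from $F(x)$.
The one-sided KS test statistics are correspondingly defined as $D_n^{-} = \sup_x (F(x) - F_n(x))$ and $D_n^{+} = \sup_x (F_n(x) - F(x))$.
The function ks_c_cdf_Rcpp implements the Exact-KS-FFT method, proposed by Dimitrova, Kaishev, Tan (2020), to compute the complementary cdf $P(D_n \ge q)$ at a value $q$, when $F(x)$ is arbitrary (i.e. purely discrete, mixed or continuous).
It is based on expressing the complementary cdf as
$$P(D_n \ge q) = 1 - P(A_i \le U_{(i)} \le B_i, \; 1 \le i \le n),$$
where $U_{(i)}$ denotes the i-th order statistic of n independent uniform(0,1) random variables, and $A_i$ and $B_i$ are defined as in Step 1 of Dimitrova, Kaishev, Tan (2020).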
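The rectangle representation can be checked numerically with a small Monte Carlo sketch in base R. It assumes a continuous F(x), for which the boundaries are those of the package example (A_i = i/n - q, B_i = (i-1)/n + q); the sample size, the value of q, the seed and the number of simulations are arbitrary illustrative choices. It is a plausibility check only and is not the Exact-KS-FFT method.

```r
set.seed(1)
n <- 10
q <- 0.3
i <- seq_len(n)
A <- i / n - q        # lower bounds A_i for continuous F(x)
B <- (i - 1) / n + q  # upper bounds B_i for continuous F(x)

n_sim <- 20000
exceeds <- logical(n_sim)   # indicator of the event D_n >= q
outside <- logical(n_sim)   # indicator of leaving the rectangle
for (s in seq_len(n_sim)) {
  u <- sort(runif(n))                               # uniform order statistics
  d_n <- max(max(i / n - u), max(u - (i - 1) / n))  # two-sided KS statistic
  exceeds[s] <- d_n >= q
  # strict inequalities: the boundaries themselves have probability zero here
  outside[s] <- !all(A < u & u < B)
}

p_direct <- mean(exceeds)      # Monte Carlo estimate of P(D_n >= q)
p_rectangle <- mean(outside)   # estimate of 1 - P(A_i <= U_(i) <= B_i for all i)
c(p_direct, p_rectangle)
```

The two estimates are computed from the same simulated samples, so they should coincide rather than merely be close.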
The complementary cdf is then re-expressed in terms of the conditional probability that a homogeneous Poisson process, $\xi_n(t)$, with intensity $n$ will not cross an upper boundary $h(t)$ and a lower boundary $g(t)$, given that $\xi_n(1) = n$ (see Steps 2 and 3 in Section 2.1 of Dimitrova, Kaishev, Tan (2020)). This conditional probability is evaluated using FFT in Step 4 of the method in order to obtain the value of the complementary cdf $P(D_n \ge q)$.
This algorithm ensures a total worst-case run-time of order $O(n^2 \log(n))$, which makes it highly computationally efficient compared to other known algorithms developed for the special cases of continuous or purely discrete $F(x)$.
The values $A_i$ and $B_i$, $i = 1, \ldots, n$, specified in Steps 1 and 2 of the Exact-KS-FFT method (see Dimitrova, Kaishev, Tan (2020), Section 2) must be pre-computed (in R or, if needed, using alternative software offering high accuracy, e.g. Mathematica) and saved in a file with the name "Boundary_Crossing_Time.txt" (in the current working directory).
The function ks_c_cdf_Rcpp is called in R. It first reads the file "Boundary_Crossing_Time.txt", then computes the value of the complementary cdf
$$P(D_n \ge q) = 1 - P(A_i \le U_{(i)} \le B_i, \; 1 \le i \le n)$$
in C++ and returns it to R (or, as noted above, as a special case, computes the value of the complementary cdf $P(D_n^{+} \ge q)$ or $P(D_n^{-} \ge q)$).
Numeric value corresponding to $P(D_n \ge q)$ (or, as a special case, to $P(D_n^{+} \ge q)$ or $P(D_n^{-} \ge q)$), given a sample size n and the file "Boundary_Crossing_Time.txt" containing $A_i$ and $B_i$, $i = 1, \ldots, n$, specified in Steps 1 and 2 of the Exact-KS-FFT method (see Dimitrova, Kaishev, Tan (2020), Section 2).
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.
    ## Computing the complementary cdf P(D_{n} >= q)
    ## for n = 10 and q = 0.1, when F(x) is continuous.
    ## In this case,
    ## B_i = (i-1)/n + q
    ## A_i = i/n - q

    n <- 10
    q <- 0.1
    up_rec <- ((1:n)-1)/n + q
    low_rec <- (1:n)/n - q
    df <- data.frame(rbind(up_rec, low_rec))
    write.table(df, "Boundary_Crossing_Time.txt", sep = ", ",
                row.names = FALSE, col.names = FALSE)
    ks_c_cdf_Rcpp(n)
Computes the p-value $P(D_{m,n} \ge q)$, where $D_{m,n}$ is the one- or two-sided two-sample Kolmogorov-Smirnov test statistic with weight function weight, when $q$ = d, i.e. the observed value of the KS statistic computed from two data samples $\{x_1, \ldots, x_m\}$ and $\{y_1, \ldots, y_n\}$ that may come from continuous, discrete or mixed distributions, i.e. they may have repeated observations (ties).
KS2sample(x, y, alternative = c("two.sided", "less", "greater"), conservative = F, weight = 0, tol = 1e-08, tail = T)
| x | a numeric vector of data sample values |
|---|---|
| y | a numeric vector of data sample values |
| alternative | Indicates the alternative hypothesis and must be one of "two.sided" (default), "less", or "greater". One can specify just the initial letter of the string, but the argument name must be given in full, e.g. alternative = "t". |
| conservative | logical variable indicating whether ties should be considered. See ‘Details’ for the meaning. |
| weight | either a numeric value between 0 and 1 which specifies the form of the weight function from a class of pre-defined functions, or a user-defined strictly positive function of one variable. By default, no weight function is assumed. See ‘Details’ for the meaning of the possible values. |
| tol | the value of the numerical tolerance used in the computation; the default is 1e-08. |
| tail | logical variable indicating whether a p-value, $P(D_{m,n} \ge q)$, or its complement, $1 - P(D_{m,n} \ge q)$, should be returned. By default tail = T and the p-value is returned. |
Consider a pair of random samples $\mathbf{X}_m = (X_1, \ldots, X_m)$ and $\mathbf{Y}_n = (Y_1, \ldots, Y_n)$ of sizes m and n with empirical cdfs $F_m(t)$ and $G_n(t)$ respectively, coming from some unknown cdfs $F(x)$ and $G(x)$. It is assumed that $F(x)$ and $G(x)$ could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis $H_0: F(x) = G(x)$ for all $x$, either against the alternative hypothesis $H_1: F(x) \neq G(x)$ for at least one $x$, which corresponds to the two-sided test, or against $H_1: F(x) > G(x)$ and $H_1: F(x) < G(x)$ for at least one $x$, which correspond to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics that are used to test these hypotheses are generally defined as:
$$\Delta_{m,n} = \sup_t |F_m(t) - G_n(t)| \, W(E_{m+n}(t)), \quad \Delta_{m,n}^{+} = \sup_t [F_m(t) - G_n(t)] \, W(E_{m+n}(t)), \quad \Delta_{m,n}^{-} = \sup_t [G_n(t) - F_m(t)] \, W(E_{m+n}(t)),$$
where $E_{m+n}(t)$ is the empirical cdf of the pooled sample $\mathbf{Z}_{m+n} = (X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ and $W(\cdot)$ is a strictly positive weight function defined on $[0, 1]$.
Possible values of alternative are "two.sided", "greater" and "less", which specify the alternative hypothesis, i.e. specify the test statistic to be either $\Delta_{m,n}$, $\Delta_{m,n}^{+}$ or $\Delta_{m,n}^{-}$ respectively.
When weight is assigned a numeric value $\nu$ between 0 and 1, the test statistic is specified as the weighted two-sample Kolmogorov-Smirnov test with generalized Anderson-Darling weight $W(t) = 1 / [t(1-t)]^{\nu}$ (see Finner and Gontscharuk 2018). Then, for example, the two-sided two-sample Kolmogorov-Smirnov statistic has the following form:
$$\Delta_{m,n} = \sup_t \frac{|F_m(t) - G_n(t)|}{[E_{m+n}(t)(1 - E_{m+n}(t))]^{\nu}}.$$
The latter specification defines a family of weighted Kolmogorov-Smirnov tests, covering the unweighted test (when weight = $\nu$ = 0) and the widely-known weighted Kolmogorov-Smirnov test with Anderson-Darling weight (when weight = 0.5, see the definition of this statistic also in Canner 1975).
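The two-sided statistic with generalized Anderson-Darling weight can be illustrated with a short base-R sketch. The function weighted_ks_stat() is a hypothetical name introduced here, not part of the package, and it makes one assumption that should be stated loudly: points where the pooled empirical cdf equals 1 are skipped, because the weight is infinite there while the difference of the two empirical cdfs is zero.

```r
# Hypothetical helper: two-sided weighted KS statistic with weight 1 / [t (1 - t)]^nu
weighted_ks_stat <- function(x, y, nu = 0) {
  z <- sort(unique(c(x, y)))           # distinct values of the pooled sample
  E <- ecdf(c(x, y))(z)                # pooled empirical cdf at these values
  d <- abs(ecdf(x)(z) - ecdf(y)(z))    # absolute difference of the two empirical cdfs
  keep <- E < 1                        # assumption: skip points with infinite weight
  if (!any(keep)) return(0)
  max(d[keep] / (E[keep] * (1 - E[keep]))^nu)
}

x <- c(1, 2)
y <- c(3, 4)
d_unweighted <- weighted_ks_stat(x, y, nu = 0)    # unweighted statistic
d_ad <- weighted_ks_stat(x, y, nu = 0.5)          # Anderson-Darling weight
```

For these toy samples the largest difference (equal to 1) occurs at the pooled value 2, where the pooled empirical cdf is 0.5, so the Anderson-Darling weight doubles the statistic.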
To use a weighted test with a user-specified weight function, for example the one suggested by Buning (2001), which ensures higher power when both x and y come from distributions that are left-skewed and heavy-tailed, one can directly assign to weight a univariate function of t with output value 1/sqrt(t*(2-t)). See ‘Examples’ for a demonstration.
For a particular realization of the pooled sample $\mathbf{Z}_{m+n}$, let there be $k$ distinct values, $a_1 < a_2 < \ldots < a_k$, in the ordered, pooled sample $(z_1 \le z_2 \le \ldots \le z_{m+n})$, where $k \le m+n$, and where $m_i$ is the number of times $a_i$, $i = 1, \ldots, k$, appears in the pooled sample. The p-value is then defined as the probability
$$p = P(D_{m,n} \ge q),$$
where $D_{m,n}$ is the two-sample Kolmogorov-Smirnov test statistic defined according to the value of weight and alternative, for two samples $\mathbf{X}'_m$ and $\mathbf{Y}'_n$ of sizes $m$ and $n$, randomly drawn from the pooled sample without replacement, and $q$ = d, the observed value of the statistic calculated from the user-provided data samples x and y. By default tail = T and the p-value is returned; otherwise $1 - p$ is returned.
Note that $D_{m,n}$ is defined on the space $\Omega$ of all possible pairs, $C = \frac{(m+n)!}{m! \, n!}$ in number, of edfs $F_m(x, \omega)$ and $G_n(x, \omega)$, $\omega \in \Omega$, that correspond to the pairs of samples $\mathbf{X}'_m$ and $\mathbf{Y}'_n$, randomly drawn from $\mathbf{Z}_{m+n}$ as follows. First, $m$ observations are drawn at random without replacement, forming the first sample $\mathbf{X}'_m$, with corresponding edf $F_m(x, \omega)$. The remaining $n$ observations are then assigned to the second sample $\mathbf{Y}'_n$, with corresponding edf $G_n(x, \omega)$. Observations are then replaced back in $\mathbf{Z}_{m+n}$ and re-sampling is continued until the occurrence of all the $C$ possible pairs of edfs $F_m(x, \omega)$ and $G_n(x, \omega)$, $\omega \in \Omega$. The pairs of edfs may be coincident if there are ties in the data, and each pair, $F_m(x, \omega)$ and $G_n(x, \omega)$, occurs with probability $1/C$.
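For very small samples, the conditional probability described above can be evaluated by brute force, enumerating every way of splitting the pooled sample. The base-R sketch below does this for the unweighted two-sided statistic. The helper names ks_stat() and perm_p_value() are hypothetical, the approach is exponential in the sample sizes, and the small tolerance only guards against floating-point ties in this sketch; it is not claimed to reproduce how the package uses its tol argument or its recursion.

```r
# Unweighted two-sided two-sample KS statistic
ks_stat <- function(x, y) {
  z <- sort(unique(c(x, y)))
  max(abs(ecdf(x)(z) - ecdf(y)(z)))
}

# Hypothetical helper: exact conditional p-value by full enumeration
perm_p_value <- function(x, y, tol = 1e-8) {
  pooled <- c(x, y)
  m <- length(x)
  d_obs <- ks_stat(x, y)
  # all choose(m + n, m) ways of choosing the positions of the first sample
  splits <- combn(length(pooled), m)
  d_all <- apply(splits, 2, function(idx) ks_stat(pooled[idx], pooled[-idx]))
  # each split is equally likely, so the p-value is a simple proportion
  mean(d_all >= d_obs - tol)
}

p_small <- perm_p_value(c(1, 2), c(3, 4))        # 2 of the 6 splits give D = 1
p_ties  <- perm_p_value(c(1, 1, 2), c(2, 3, 3))  # pooled sample with ties
```

Splits that lead to the same pair of edfs are counted separately here, which matches the description that coincident pairs may occur when ties are present.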
conservative is a logical variable indicating whether the test should be conducted conservatively. By default, conservative = F, and KS2sample returns the p-value defined through the conditional probability above. However, when the user has a priori knowledge that both samples come from a continuous distribution even though ties are present (for example, when repeated observations are caused by rounding errors), conservative = T should be assigned, since the conditional probability is no longer relevant. In this case, KS2sample computes the p-value for the Kolmogorov-Smirnov test assuming no ties are present, and returns a p-value which is an upper bound of the true p-value. Note that if the null hypothesis is rejected using the calculated upper bound for the p-value, it is also rejected with the true p-value.
KS2sample calculates the exact p-value of the KS test using an algorithm which generalizes the method due to Nikiforov (1994). If tail = F, KS2sample calculates the complementary p-value, $1 - P(D_{m,n} \ge q)$, using this exact generalization of the Nikiforov (1994) method. Alternatively, if tail = T, a version of Nikiforov's recurrence proposed recently by Viehmann (2021) is used, which computes the p-value $P(D_{m,n} \ge q)$ directly, with higher accuracy, giving up to 17 correct digits, but at up to 3 times higher computational cost. KS2sample ensures a total worst-case run-time of order $O(nm)$. In comparison with other known algorithms, it not only allows a flexible choice of weights, which in some cases improves the statistical power (see Dimitrova, Jia, Kaishev 2024), but is also more efficient and generally applicable for large sample sizes.
A list with class "htest" containing the following components:
| statistic | the value of the test statistic |
|---|---|
| p.value | the p-value of the test. |
| alternative | a character string describing the alternative hypothesis. |
| data.name | a character string giving names of the data. |
Based on the Fortran subroutine by Nikiforov (1994). See also Dimitrova, Jia, Kaishev (2024).
Buning H (2001). "Kolmogorov-Smirnov- and Cramer-von Mises Type Two-sample Tests With Various Weight Functions." Communications in Statistics - Simulation and Computation, 30(4), 847-865.
Finner H, Gontscharuk V (2018). "Two-sample Kolmogorov-Smirnov-type tests revisited: Old and new tests in terms of local levels." The Annals of Statistics, 46(6A), 3014-3037.
Paul L. Canner (1975). "A Simulation Study of One- and Two-Sample Kolmogorov-Smirnov Statistics with a Particular Weight Function". Journal of the American Statistical Association, 70(349), 209-211.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265-270.
Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
    ## Computes p-value of two-sided unweighted test for continuous data
    data1 <- rexp(750, 1)
    data2 <- rexp(800, 1)
    KS2sample(data1, data2)

    ## Computes the complementary p-value
    KS2sample(data1, data2, tail = FALSE)

    ## Computes p-value of one-sided test with Anderson-Darling weight function
    KS2sample(data1, data2, alternative = "greater", weight = 0.5)

    ## Computes p-value of two-sided test with Buning's weight function for discrete data
    data3 <- rnbinom(100, size = 3, prob = 0.6)
    data4 <- rpois(120, lambda = 2)
    f <- function(t) 1 / sqrt( t * (2 - t) )
    KS2sample(data3, data4, weight = f)
Function calling directly the C++ routines that compute the exact complementary p-value $P(D_{m,n} < q)$ for the (weighted) two-sample one- or two-sided Kolmogorov-Smirnov statistic, at a fixed $q$, $q \in [0, 1]$, given the sample sizes m and n, the vector of weights w_vec and the vector M containing the number of times each distinct observation is repeated in the pooled sample.
KS2sample_c_Rcpp(m, n, kind, M, q, w_vec, tol)
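| m | the sample size of the first tested sample. |
|---|---|
| n | the sample size of the second tested sample. |
| kind | an integer value (= 1, 2 or 3) which specifies the alternative hypothesis. When kind = 1, the test is two-sided. When kind = 2 or 3, the test is one-sided. See ‘Details’ for the meaning of the possible values. Any other value is invalid. |
| M | an integer-valued vector with one cell per distinct value in the ordered pooled sample; M[i] is the number of times the i-th smallest distinct value appears in the pooled sample. See ‘Details’. |
| q | numeric value between 0 and 1, at which the complementary p-value $P(D_{m,n} < q)$ is computed. |
| w_vec | a vector with m+n-1 cells containing the weights assigned along the ordered pooled sample. See ‘Details’ for the meaning of the values in each cell. |
| tol | the value of the numerical tolerance used in the computation. |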
Consider a pair of random samples $\mathbf{X}_m = (X_1, \ldots, X_m)$ and $\mathbf{Y}_n = (Y_1, \ldots, Y_n)$ of sizes m and n with empirical cdfs $F_m(t)$ and $G_n(t)$ respectively, coming from some unknown cdfs $F(x)$ and $G(x)$. It is assumed that $F(x)$ and $G(x)$ could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis $H_0: F(x) = G(x)$ for all $x$, either against the alternative hypothesis $H_1: F(x) \neq G(x)$ for at least one $x$, which corresponds to the two-sided test, or against $H_1: F(x) > G(x)$ and $H_1: F(x) < G(x)$ for at least one $x$, which correspond to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics that are used to test these hypotheses are generally defined as:
$$\Delta_{m,n} = \sup_t |F_m(t) - G_n(t)| \, W(E_{m+n}(t)), \quad \Delta_{m,n}^{+} = \sup_t [F_m(t) - G_n(t)] \, W(E_{m+n}(t)), \quad \Delta_{m,n}^{-} = \sup_t [G_n(t) - F_m(t)] \, W(E_{m+n}(t)),$$
where $E_{m+n}(t)$ is the empirical cdf of the pooled sample $\mathbf{Z}_{m+n} = (X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ and $W(\cdot)$ is a strictly positive weight function defined on $[0, 1]$.
w_vec[i] ($0 < i < m+n$) is then equal to $W(i/(m+n))$, the weight attached to $Z_i$, the i-th smallest observation in the pooled sample $\mathbf{Z}_{m+n}$.
Different values of w_vec specify different weighted Kolmogorov-Smirnov tests. For example, when w_vec = rep(1, m+n-1), KS2sample_c_Rcpp calculates the complementary p-value of the unweighted two-sample Kolmogorov-Smirnov test; when w_vec = ((1:(m+n-1))*((m+n-1):1))^(-1/2), it calculates the complementary p-value for the weighted two-sample Kolmogorov-Smirnov test with Anderson-Darling weight $W(t) = 1 / [t(1-t)]^{1/2}$.
Possible values of kind are 1, 2 and 3, which specify the alternative hypothesis, i.e. specify the test statistic to be either $\Delta_{m,n}$, $\Delta_{m,n}^{+}$ or $\Delta_{m,n}^{-}$ respectively.
The numeric array M specifies the number of repeated observations in the pooled sample. For a particular realization of the pooled sample $\mathbf{Z}_{m+n}$, let there be $k$ distinct values, $a_1 < a_2 < \ldots < a_k$, in the ordered, pooled sample $(z_1 \le z_2 \le \ldots \le z_{m+n})$, where $k \le m+n$, and where $m_i$ = M[i] is the number of times $a_i$, $i = 1, \ldots, k$, appears in the pooled sample. The calculated complementary p-value is the conditional probability:
$$p_c = 1 - P(D_{m,n} \ge q) = P(D_{m,n} < q),$$
where $D_{m,n}$ is the two-sample Kolmogorov-Smirnov test statistic defined according to the values of w_vec and kind, for two samples $\mathbf{X}'_m$ and $\mathbf{Y}'_n$ of sizes $m$ and $n$, randomly drawn from the pooled sample without replacement, i.e. $D_{m,n}$ is defined on the space $\Omega$ (see further details in KS2sample), and $q \in [0, 1]$.
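The inputs M and w_vec can be prepared from two raw samples with a few lines of base R. The sketch below uses small made-up samples; the weight vectors follow the expressions quoted in the text, and the call to the package function is left commented out because it requires KSgeneral to be installed and the value q = 0.5 is an arbitrary illustrative choice.

```r
x <- c(1, 1, 2)
y <- c(2, 3)
m <- length(x)
n <- length(y)

# M[i]: number of times the i-th smallest distinct value appears in the pooled sample
M <- as.integer(table(c(x, y)))

# Weight vectors of length m + n - 1, as quoted in the text
w_unweighted <- rep(1, m + n - 1)
w_ad <- ((1:(m + n - 1)) * ((m + n - 1):1))^(-1/2)

# Basic consistency checks before calling the C++ routine
inputs_ok <- sum(M) == m + n && length(w_ad) == m + n - 1 && all(w_ad > 0)

# KSgeneral::KS2sample_c_Rcpp(m, n, 1, M, 0.5, w_unweighted, 1e-6)
```

Using table() relies on the fact that it orders numeric values increasingly, so the multiplicities come out in the order of the distinct pooled values.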
KS2sample_c_Rcpp implements an exact algorithm which extends the Fortran 77 subroutine due to Nikiforov (1994) by allowing a more flexible choice of weights and by handling large sample sizes. This leads to faster computation and relatively high accuracy for very large m and n (though it is less accurate than KS2sample_Rcpp). Compared with other known algorithms, it allows data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), and it is more efficient and more generally applicable for large sample sizes. This algorithm ensures a total worst-case run-time of order $O(nm)$.
Numeric value corresponding to $P(D_{m,n} < q)$, given sample sizes m, n, M and w_vec. If m or n is non-positive, or if the length of w_vec is not equal to m+n-1, the function returns -1; a non-permitted value of M or a non-permitted value inside w_vec returns -2; a numerically unstable calculation returns -3.
Based on the Fortran subroutine by Nikiforov (1994). See also Dimitrova, Jia, Kaishev (2024).
Paul L. Canner (1975). "A Simulation Study of One- and Two-Sample Kolmogorov-Smirnov Statistics with a Particular Weight Function". Journal of the American Statistical Association, 70(349), 209-211.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
    ## Computing the unweighted two-sample Kolmogorov-Smirnov test
    ## See the example in Nikiforov (1994)

    m <- 120
    n <- 150
    kind <- 1
    q <- 0.1
    M <- c(80,70,40,80)
    w_vec <- rep(1,m+n-1)
    tol <- 1e-6
    KS2sample_c_Rcpp(m, n, kind, M, q, w_vec, tol)

    kind <- 2
    KS2sample_c_Rcpp(m, n, kind, M, q, w_vec, tol)

    ## Computing the weighted two-sample Kolmogorov-Smirnov test
    ## with Anderson-Darling weight
    kind <- 3
    w_vec <- ((1:(m+n-1))*((m+n-1):1))^(-1/2)
    KS2sample_c_Rcpp(m, n, kind, M, q, w_vec, tol)
Function calling directly the C++ routines that compute the exact p-value $P(D_{m,n} \ge q)$ for the (weighted) two-sample one- or two-sided Kolmogorov-Smirnov statistic, at a fixed $q$, $q \in [0, 1]$, given the sample sizes m and n, the vector of weights w_vec and the vector M containing the number of times each distinct observation is repeated in the pooled sample.
KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)
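| m | the sample size of the first tested sample. |
|---|---|
| n | the sample size of the second tested sample. |
| kind | an integer value (= 1, 2 or 3) which specifies the alternative hypothesis. When kind = 1, the test is two-sided. When kind = 2 or 3, the test is one-sided. See ‘Details’ for the meaning of the possible values. Any other value is invalid. |
| M | an integer-valued vector with one cell per distinct value in the ordered pooled sample; M[i] is the number of times the i-th smallest distinct value appears in the pooled sample. See ‘Details’. |
| q | numeric value between 0 and 1, at which the p-value $P(D_{m,n} \ge q)$ is computed. |
| w_vec | a vector with m+n-1 cells containing the weights assigned along the ordered pooled sample. See ‘Details’ for the meaning of the values in each cell. |
| tol | the value of the numerical tolerance used in the computation. |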
Consider a pair of random samples $\mathbf{X}_m = (X_1, \ldots, X_m)$ and $\mathbf{Y}_n = (Y_1, \ldots, Y_n)$ of sizes m and n with empirical cdfs $F_m(t)$ and $G_n(t)$ respectively, coming from some unknown cdfs $F(x)$ and $G(x)$. It is assumed that $F(x)$ and $G(x)$ could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis $H_0: F(x) = G(x)$ for all $x$, either against the alternative hypothesis $H_1: F(x) \neq G(x)$ for at least one $x$, which corresponds to the two-sided test, or against $H_1: F(x) > G(x)$ and $H_1: F(x) < G(x)$ for at least one $x$, which correspond to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics that are used to test these hypotheses are generally defined as:
$$\Delta_{m,n} = \sup_t |F_m(t) - G_n(t)| \, W(E_{m+n}(t)), \quad \Delta_{m,n}^{+} = \sup_t [F_m(t) - G_n(t)] \, W(E_{m+n}(t)), \quad \Delta_{m,n}^{-} = \sup_t [G_n(t) - F_m(t)] \, W(E_{m+n}(t)),$$
where $E_{m+n}(t)$ is the empirical cdf of the pooled sample $\mathbf{Z}_{m+n} = (X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ and $W(\cdot)$ is a strictly positive weight function defined on $[0, 1]$.
w_vec[i] ($0 < i < m+n$) is then equal to $W(i/(m+n))$, the weight attached to $Z_i$, the i-th smallest observation in the pooled sample $\mathbf{Z}_{m+n}$.
Different values of w_vec specify different weighted Kolmogorov-Smirnov tests. For example, when w_vec = rep(1, m+n-1), KS2sample_Rcpp calculates the p-value of the unweighted two-sample Kolmogorov-Smirnov test; when w_vec = ((1:(m+n-1))*((m+n-1):1))^(-1/2), it calculates the p-value for the weighted two-sample Kolmogorov-Smirnov test with Anderson-Darling weight $W(t) = 1 / [t(1-t)]^{1/2}$.
Possible values of kind are 1, 2 and 3, which specify the alternative hypothesis, i.e. specify the test statistic to be either $\Delta_{m,n}$, $\Delta_{m,n}^{+}$ or $\Delta_{m,n}^{-}$ respectively.
The numeric array M specifies the number of repeated observations in the pooled sample. For a particular realization of the pooled sample $\mathbf{Z}_{m+n}$, let there be $k$ distinct values, $a_1 < a_2 < \ldots < a_k$, in the ordered, pooled sample $(z_1 \le z_2 \le \ldots \le z_{m+n})$, where $k \le m+n$, and where $m_i$ = M[i] is the number of times $a_i$, $i = 1, \ldots, k$, appears in the pooled sample. The p-value is then defined as the probability
$$p = P(D_{m,n} \ge q),$$
where $D_{m,n}$ is the two-sample Kolmogorov-Smirnov test statistic defined according to the values of w_vec and kind, for two samples $\mathbf{X}'_m$ and $\mathbf{Y}'_n$ of sizes $m$ and $n$, randomly drawn from the pooled sample without replacement, i.e. $D_{m,n}$ is defined on the space $\Omega$ (see further details in KS2sample), and $q \in [0, 1]$.
KS2sample_Rcpp implements an exact algorithm which extends the Fortran 77 subroutine due to Nikiforov (1994) by allowing a more flexible choice of weights and by handling large sample sizes. A version of Nikiforov's recurrence proposed recently by Viehmann (2021) is further incorporated, which computes the p-value directly, with higher accuracy, giving up to 17 correct digits, but at up to 3 times higher computational cost than KS2sample_c_Rcpp. Compared with other known algorithms, it allows data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), and it is more efficient and more generally applicable for large sample sizes. This algorithm ensures a total worst-case run-time of order $O(nm)$.
Numeric value corresponding to $P(D_{m,n} \ge q)$, given sample sizes m, n, M and w_vec. If m or n is non-positive, or if the length of w_vec is not equal to m+n-1, the function returns -1; a non-permitted value of M or a non-permitted value inside w_vec returns -2; a numerically unstable calculation returns -3.
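Because failures are signalled through negative return values rather than R errors, it can be convenient to wrap the call. The following base-R sketch shows one way to do this; check_ks2_result() is a hypothetical helper introduced here, and its messages simply paraphrase the return codes listed above.

```r
# Hypothetical helper: turn the documented negative return codes into R errors
check_ks2_result <- function(p) {
  if (p == -1) stop("m or n is non-positive, or length(w_vec) != m + n - 1")
  if (p == -2) stop("non-permitted value in M or in w_vec")
  if (p == -3) stop("numerically unstable calculation")
  p  # a valid probability is passed through unchanged
}

ok  <- check_ks2_result(0.25)
err <- tryCatch(check_ks2_result(-2), error = function(e) conditionMessage(e))

# Intended use (requires KSgeneral):
# p <- check_ks2_result(KSgeneral::KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol))
```

Exact comparison with -1, -2 and -3 is safe here because valid return values are probabilities in [0, 1].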
Based on the Fortran subroutine by Nikiforov (1994). See also Dimitrova, Jia, Kaishev (2024).
Paul L. Canner (1975). "A Simulation Study of One- and Two-Sample Kolmogorov-Smirnov Statistics with a Particular Weight Function". Journal of the American Statistical Association, 70(349), 209-211.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.
Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
    ## Computing the unweighted two-sample Kolmogorov-Smirnov test
    ## See the example in Nikiforov (1994)

    m <- 120
    n <- 150
    kind <- 1
    q <- 0.1
    M <- c(80,70,40,80)
    w_vec <- rep(1,m+n-1)
    tol <- 1e-6
    KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)

    kind <- 2
    KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)

    ## Computing the weighted two-sample Kolmogorov-Smirnov test
    ## with Anderson-Darling weight
    kind <- 3
    w_vec <- ((1:(m+n-1))*((m+n-1):1))^(-1/2)
    KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)
Computes the p-value, $P(V_{m,n} \ge q)$, where $V_{m,n}$ is the two-sample Kuiper test statistic and $q$ = v, i.e. the observed value of the Kuiper statistic, computed from two data samples $\{x_1, \ldots, x_m\}$ and $\{y_1, \ldots, y_n\}$ that may come from continuous, discrete or mixed distributions, i.e. they may have repeated observations (ties).
Kuiper2sample(x, y, conservative = F, tail = T)
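| x | a numeric vector of data sample values |
|---|---|
| y | a numeric vector of data sample values |
| conservative | logical variable indicating whether ties should be considered. See ‘Details’ for the meaning. |
| tail | logical variable indicating whether a p-value, $P(V_{m,n} \ge q)$, or its complement, $1 - P(V_{m,n} \ge q)$, should be returned. By default tail = T and the p-value is returned. |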
Consider a pair of random samples, either on the real line or the circle, denoted by $\mathbf{X}_m = (X_1, \ldots, X_m)$ and $\mathbf{Y}_n = (Y_1, \ldots, Y_n)$, of sizes m and n with empirical cdfs $F_m(t)$ and $G_n(t)$ respectively, coming from some unknown cdfs $F(x)$ and $G(x)$. It is assumed that $F(x)$ and $G(x)$ could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis $H_0: F(x) = G(x)$ for all $x$, against the alternative hypothesis $H_1: F(x) \neq G(x)$ for at least one $x$. The two-sample Kuiper goodness-of-fit statistic that is used to test this hypothesis is defined as:
$$\varsigma_{m,n} = \sup_t [F_m(t) - G_n(t)] - \inf_t [F_m(t) - G_n(t)].$$
For a particular realization of the pooled sample $\mathbf{Z}_{m+n} = (X_1, \ldots, X_m, Y_1, \ldots, Y_n)$, let there be $k$ distinct values, $a_1 < a_2 < \ldots < a_k$, in the ordered, pooled sample $(z_1 \le z_2 \le \ldots \le z_{m+n})$, where $k \le m+n$, and where $m_i$ is the number of times $a_i$, $i = 1, \ldots, k$, appears in the pooled sample. The p-value is then defined as the probability
$$p = P(V_{m,n} \ge q),$$
where $V_{m,n}$ is the two-sample Kuiper test statistic defined as $\varsigma_{m,n}$, for two samples $\mathbf{X}'_m$ and $\mathbf{Y}'_n$ of sizes $m$ and $n$, randomly drawn from the pooled sample without replacement, and $q$ = v, the observed value of the statistic calculated from the user-provided data samples x and y. By default tail = T and the p-value is returned; otherwise $1 - p$ is returned.
Note that $V_{m,n}$ is defined on the space $\Omega$ of all possible pairs, $C = \frac{(m+n)!}{m! \, n!}$ in number, of edfs $F_m(x, \omega)$ and $G_n(x, \omega)$, $\omega \in \Omega$, that correspond to the pairs of samples $\mathbf{X}'_m$ and $\mathbf{Y}'_n$, randomly drawn from $\mathbf{Z}_{m+n}$ as follows. First, $m$ observations are drawn at random without replacement, forming the first sample $\mathbf{X}'_m$, with corresponding edf $F_m(x, \omega)$. The remaining $n$ observations are then assigned to the second sample $\mathbf{Y}'_n$, with corresponding edf $G_n(x, \omega)$. Observations are then replaced back in $\mathbf{Z}_{m+n}$ and re-sampling is continued until the occurrence of all the $C$ possible pairs of edfs $F_m(x, \omega)$ and $G_n(x, \omega)$, $\omega \in \Omega$. The pairs of edfs may be coincident if there are ties in the data, and each pair, $F_m(x, \omega)$ and $G_n(x, \omega)$, occurs with probability $1/C$.
conservative is a logical variable indicating whether the test should be conducted conservatively. By default, conservative = F, and Kuiper2sample returns the p-value defined through the conditional probability above. However, when the user has a priori knowledge that both samples come from a continuous distribution even though ties are present (for example, when repeated observations are caused by rounding errors), conservative = T should be assigned, since the conditional probability is no longer relevant. In this case, Kuiper2sample computes the p-value for the Kuiper test assuming no ties are present, and returns a p-value which is an upper bound of the true p-value. Note that if the null hypothesis is rejected using the calculated upper bound for the p-value, it is also rejected with the true p-value.
Kuiper2sample calculates the exact p-value of the Kuiper test using an algorithm from Dimitrova, Jia, Kaishev (2024), which is based on extending the algorithm provided by Nikiforov (1994) and generalizing the method due to Maag and Stephens (1968) and Hirakawa (1973). If tail = F, Kuiper2sample calculates the complementary p-value $1 - P(V_{m,n} \ge q)$ using this exact generalization of the Nikiforov (1994) method. Alternatively, if tail = T, a version of Nikiforov's recurrence proposed recently by Viehmann (2021) is further incorporated, which computes the p-value $P(V_{m,n} \ge q)$ directly, with up to 4 digits of extra accuracy, but at up to 3 times higher computational cost. It is accurate and valid for arbitrary (possibly large) sample sizes. This algorithm ensures a total worst-case run-time of order $O((mn)^2)$. When m and n have a large greatest common divisor (an extreme case is m = n), it ensures a total worst-case run-time of order $O(m^2 n)$.
Kuiper2sample is accurate and fast compared with implementations based on Monte Carlo simulation. Compared to implementations using the asymptotic method, Kuiper2sample allows data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), and is more accurate than the asymptotic method when sample sizes are small.
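The Kuiper statistic itself is simple to compute in base R, which also makes it easy to see why it suits circular data: moving the origin on the circle changes the supremum and the infimum of the difference of the empirical cdfs, but not their difference. The sketch below uses the discrete circular samples from the package example; kuiper_stat() is a hypothetical helper that returns only the statistic, not the p-value.

```r
# Hypothetical helper: two-sample Kuiper statistic sup(F_m - G_n) - inf(F_m - G_n)
kuiper_stat <- function(x, y) {
  z <- sort(unique(c(x, y)))
  d <- ecdf(x)(z) - ecdf(y)(z)   # F_m(t) - G_n(t) at the distinct pooled values
  # the difference equals 0 below the smallest pooled value, hence the extra 0
  max(d, 0) - min(d, 0)
}

# Discrete circular data from the package example
data1 <- c(rep(pi/2, 30), rep(pi, 30), rep(3*pi/2, 30), rep(2*pi, 30))
data2 <- c(rep(pi/2, 50), rep(pi, 40), rep(3*pi/2, 10), rep(2*pi, 50))

# Same data after moving the origin by a quarter turn (counts shifted cyclically)
data3 <- c(rep(pi/2, 30), rep(pi, 30), rep(3*pi/2, 30), rep(2*pi, 30))
data4 <- c(rep(pi/2, 50), rep(pi, 50), rep(3*pi/2, 40), rep(2*pi, 10))

v_a <- kuiper_stat(data1, data2)
v_b <- kuiper_stat(data3, data4)
c(v_a, v_b)
```

Both values equal 11/60: in the first configuration the supremum is 1/12 and the infimum is -1/10, while in the second the supremum is 0 and the infimum is -11/60.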
A list with class "htest" containing the following components:
| statistic | the value of the test statistic |
|---|---|
| p.value | the p-value of the test. |
| alternative | a character string describing the alternative hypothesis. |
| data.name | a character string giving names of the data. |
Maag, U. R., Stephens, M. A. (1968). The $V_{NM}$ Two-Sample Test. The Annals of Mathematical Statistics, 39(3), 923-935.
Hirakawa, K. (1973). The two-sample Kuiper test. TRU Mathematics, 9, 99-118.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.
Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
    ## Computes the p-value for discrete circular data
    data1 <- c(rep(pi/2,30),rep(pi,30),rep(3*pi/2,30),rep(2*pi,30))
    data2 <- c(rep(pi/2,50),rep(pi,40),rep(3*pi/2,10),rep(2*pi,50))
    Kuiper2sample(data1, data2)

    ## The calculated p-value does not change with the choice of the origin
    data3 <- c(rep(pi/2,30),rep(pi,30),rep(3*pi/2,30),rep(2*pi,30))
    data4 <- c(rep(pi/2,50),rep(pi,50),rep(3*pi/2,40),rep(2*pi,10))
    Kuiper2sample(data3, data4)
Function calling directly the C++ routines that compute the exact complementary p-value $P(V_{m,n} < q)$ for the two-sample Kuiper test, at a fixed $q$, $q \in [0, 2]$, given the sample sizes m, n and the vector M containing the number of times each distinct observation is repeated in the pooled sample.
Kuiper2sample_c_Rcpp(m, n, M, q)
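| m | the sample size of the first tested sample. |
|---|---|
| n | the sample size of the second tested sample. |
| M | an integer-valued vector with one cell per distinct value in the ordered pooled sample; M[i] is the number of times the i-th smallest distinct value appears in the pooled sample. See ‘Details’. |
| q | numeric value between 0 and 2, at which the complementary p-value $P(V_{m,n} < q)$ is computed. |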
Consider a pair of random samples, either on the real line or on the circle, denoted by $X_m = (X_1, \ldots, X_m)$ and $Y_n = (Y_1, \ldots, Y_n)$, of sizes m and n, with empirical cdfs $F_m(x)$ and $G_n(x)$ respectively, coming from some unknown cdfs $F(x)$ and $G(x)$. It is assumed that $F(x)$ and $G(x)$ can be continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis $H_0: F(x) = G(x)$ for all $x$, against the alternative hypothesis $H_1: F(x) \neq G(x)$ for at least one $x$. The two-sample Kuiper goodness-of-fit statistic that is used to test this hypothesis is defined as:

$$\varsigma_{m,n} = \sup_x \left[F_m(x) - G_n(x)\right] - \inf_x \left[F_m(x) - G_n(x)\right].$$

The numeric array M specifies the number of repeated observations in the pooled sample. For a particular realization of the pooled sample $Z_{m+n} = \{X_m, Y_n\}$, let there be $k$ distinct values, $a_1 < a_2 < \ldots < a_k$, in the ordered, pooled sample $(z_1 \le z_2 \le \ldots \le z_{m+n})$, where $k \le m+n$, and where $m_i$ = M[i] is the number of times $a_i$, $i = 1, \ldots, k$, appears in the pooled sample. The calculated complementary p-value is then the conditional probability:

$$P(V_{m,n} < q),$$

where $V_{m,n}$ is the two-sample Kuiper test statistic defined as $\varsigma_{m,n}$, for two samples $X'_m$ and $Y'_n$ of sizes $m$ and $n$, randomly drawn from the pooled sample without replacement, i.e. $V_{m,n}$ is defined on the space $\Omega$ (see further details in Kuiper2sample), and $q \in [0, 2]$.
Kuiper2sample_c_Rcpp implements an algorithm from Dimitrova, Jia, Kaishev (2024), that is based on extending the algorithm provided by Nikiforov (1994) and generalizing the method due to Maag and Stephens (1968) and Hirakawa (1973). It is relatively accurate (less accurate than Kuiper2sample_Rcpp) and valid for arbitrary (possibly large) sample sizes. This algorithm ensures a total worst-case run-time of order . When m and n have large greatest common divisor (an extreme case is m = n), it ensures a total worst-case run-time of order .
Other known implementations of the two-sample Kuiper test mainly rely on an approximation method or on Monte Carlo simulation (see also Kuiper2sample). The former is invalid for data with ties and often gives p-values with large errors when sample sizes are small; the latter is usually slow and inaccurate. Compared with other known algorithms, Kuiper2sample_c_Rcpp allows the data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), is more accurate, and is generally applicable to large sample sizes.
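The inputs m, n, M and q described above can be derived from two raw samples with base R. The following is a minimal sketch, not the package's own code: it uses the circular data from the Kuiper2sample example, and the variable names are chosen here for illustration. The final call is guarded so that the sketch also runs when KSgeneral is not installed.

```r
x <- c(rep(pi/2, 30), rep(pi, 30), rep(3*pi/2, 30), rep(2*pi, 30))
y <- c(rep(pi/2, 50), rep(pi, 40), rep(3*pi/2, 10), rep(2*pi, 50))

m <- length(x)
n <- length(y)

# Run-length encoding of the ordered pooled sample gives the distinct
# values a_1 < ... < a_k and their multiplicities M[1], ..., M[k].
runs <- rle(sort(c(x, y)))
a <- runs$values
M <- runs$lengths

# Observed Kuiper statistic. The difference of the ecdfs is 0 at a_k,
# so max(d) and min(d) already account for the value 0.
d <- ecdf(x)(a) - ecdf(y)(a)
q <- max(d) - min(d)

print(list(m = m, n = n, M = M, q = q))

if (requireNamespace("KSgeneral", quietly = TRUE)) {
  print(KSgeneral::Kuiper2sample_c_Rcpp(m, n, M, q))
}
```

With these data the sketch yields m = 120, n = 150, M = (80, 70, 40, 80) and q of about 0.1833333.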
Numeric value corresponding to the complementary p-value $P(V_{m,n} < q)$ of the two-sample Kuiper statistic $V_{m,n}$, given the sample sizes m, n and M. The function returns -1 if m or n is non-positive or if their least common multiple exceeds the limit 2147483647, -2 if M contains a non-permitted value, and -3 if the calculation is numerically unstable.
Maag, U. R., Stephens, M. A. (1968). The $V_{NM}$ Two-Sample Test. The Annals of Mathematical Statistics, 39(3), 923-935.
Hirakawa, K. (1973). The two-sample Kuiper test. TRU Mathematics, 9, 99-118.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
    library(KSgeneral)

    ## Computing the complementary p-value of the two-sample Kuiper test
    ## See the example in Nikiforov (1994)
    m <- 120
    n <- 150
    q <- 0.183333333
    M <- c(80, 70, 40, 80)
    Kuiper2sample_c_Rcpp(m, n, M, q)
Function calling directly the C++ routines that compute the exact p-value for the two-sample Kuiper test, at a fixed $q$, $q \in [0, 2]$, given the sample sizes m, n and the vector M containing the number of times each distinct observation is repeated in the pooled sample.
    Kuiper2sample_Rcpp(m, n, M, q)
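| m | the sample size of the first tested sample. |
|---|---|
| n | the sample size of the second tested sample. |
| M | an integer-valued vector with one cell per distinct value in the ordered pooled sample; M[i] is the number of times the i-th distinct value is repeated in the pooled sample. |
| q | numeric value between 0 and 2, at which the p-value is evaluated. |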
Consider a pair of random samples, either on the real line or on the circle, denoted by $X_m = (X_1, \ldots, X_m)$ and $Y_n = (Y_1, \ldots, Y_n)$, of sizes m and n, with empirical cdfs $F_m(x)$ and $G_n(x)$ respectively, coming from some unknown cdfs $F(x)$ and $G(x)$. It is assumed that $F(x)$ and $G(x)$ can be continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis $H_0: F(x) = G(x)$ for all $x$, against the alternative hypothesis $H_1: F(x) \neq G(x)$ for at least one $x$. The two-sample Kuiper goodness-of-fit statistic that is used to test this hypothesis is defined as:

$$\varsigma_{m,n} = \sup_x \left[F_m(x) - G_n(x)\right] - \inf_x \left[F_m(x) - G_n(x)\right].$$

The numeric array M specifies the number of repeated observations in the pooled sample. For a particular realization of the pooled sample $Z_{m+n} = \{X_m, Y_n\}$, let there be $k$ distinct values, $a_1 < a_2 < \ldots < a_k$, in the ordered, pooled sample $(z_1 \le z_2 \le \ldots \le z_{m+n})$, where $k \le m+n$, and where $m_i$ = M[i] is the number of times $a_i$, $i = 1, \ldots, k$, appears in the pooled sample. The p-value is then defined as the probability

$$p = P(V_{m,n} \ge q),$$

where $V_{m,n}$ is the two-sample Kuiper test statistic defined as $\varsigma_{m,n}$, for two samples $X'_m$ and $Y'_n$ of sizes $m$ and $n$, randomly drawn from the pooled sample without replacement, i.e. $V_{m,n}$ is defined on the space $\Omega$ (see further details in Kuiper2sample), and $q \in [0, 2]$.
Kuiper2sample_Rcpp implements an algorithm from Dimitrova, Jia, Kaishev (2024), that is based on extending the algorithm provided by Nikiforov (1994) and generalizing the method due to Maag and Stephens (1968) and Hirakawa (1973). A version of the Nikiforov's recurrence proposed recently by Viehmann (2021) is further incorporated, which computes directly the p-value, with up to 4 digits extra accuracy, but at up to 3 times higher computational cost than Kuiper2sample_c_Rcpp. It is accurate and valid for arbitrary (possibly large) sample sizes. This algorithm ensures a total worst-case run-time of order . When m and n have large greatest common divisor (an extreme case is m = n), it ensures a total worst-case run-time of order .
Other known implementations of the two-sample Kuiper test mainly rely on an approximation method or on Monte Carlo simulation (see also Kuiper2sample). The former is invalid for data with ties and often gives p-values with large errors when sample sizes are small; the latter is usually slow and inaccurate. Compared with other known algorithms, Kuiper2sample_Rcpp allows the data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), is more accurate, and is generally applicable to large sample sizes.
Numeric value corresponding to the p-value $P(V_{m,n} \ge q)$ of the two-sample Kuiper statistic $V_{m,n}$, given the sample sizes m, n and M. The function returns -1 if m or n is non-positive or if their least common multiple exceeds the limit 2147483647, -2 if M contains a non-permitted value, and -3 if the calculation is numerically unstable.
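Because errors are signalled through negative return values rather than R conditions, it can help to check the inputs before calling the routine and to decode the result afterwards. The sketch below is illustrative only: `precheck_two_sample` and `explain_result` are hypothetical helpers, not package functions, and the rule used for a "permitted" M (strictly positive whole numbers summing to m + n) is an assumption, since the exact condition is not spelled out here.

```r
# Greatest common divisor by the Euclidean algorithm
gcd <- function(a, b) {
  while (b != 0) {
    r <- a %% b
    a <- b
    b <- r
  }
  a
}

# Returns 0 when the inputs look valid, otherwise the documented error code.
precheck_two_sample <- function(m, n, M, lcm_limit = 2147483647) {
  if (m <= 0 || n <= 0) return(-1)
  lcm <- (m / gcd(m, n)) * n
  if (lcm > lcm_limit) return(-1)
  # Assumption: a permitted M has strictly positive whole-number entries
  # that sum to m + n.
  if (any(M <= 0) || any(M != round(M)) || sum(M) != m + n) return(-2)
  0
}

# Turns a returned value into a short description.
explain_result <- function(p) {
  if (p == -1) {
    "invalid m or n, or lcm(m, n) above the limit"
  } else if (p == -2) {
    "non-permitted value of M"
  } else if (p == -3) {
    "numerically unstable calculation"
  } else {
    "valid probability"
  }
}

ok      <- precheck_two_sample(120, 150, c(80, 70, 40, 80))
bad_m   <- precheck_two_sample(0, 150, c(80, 70))
big_lcm <- precheck_two_sample(100003, 100019, c(100003, 100019))
bad_M   <- precheck_two_sample(120, 150, c(80, 70, 40, 70))
print(c(ok = ok, bad_m = bad_m, big_lcm = big_lcm, bad_M = bad_M))
print(explain_result(-3))
```

A numerically unstable calculation (-3) cannot be predicted from the inputs in this way, so the returned value still needs to be inspected after the call.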
Maag, U. R., Stephens, M. A. (1968). The $V_{NM}$ Two-Sample Test. The Annals of Mathematical Statistics, 39(3), 923-935.
Hirakawa, K. (1973). The two-sample Kuiper test. TRU Mathematics, 9, 99-118.
Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.
Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted
    library(KSgeneral)

    ## Computing the p-value of the two-sample Kuiper test
    ## See the example in Nikiforov (1994)
    m <- 120
    n <- 150
    q <- 0.183333333
    M <- c(80, 70, 40, 80)
    Kuiper2sample_Rcpp(m, n, M, q)
Computes the complementary cdf, $P(D_n \ge q)$, at a fixed $q$, $q \in [0, 1]$, of the one-sample two-sided Kolmogorov-Smirnov statistic, when the cdf $F(x)$ under the null hypothesis is mixed, using the Exact-KS-FFT method, which expresses the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process that is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)).
    mixed_ks_c_cdf(q, n, jump_points, Mixed_dist, ..., tol = 1e-10)
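| q | numeric value between 0 and 1, at which the complementary cdf $P(D_n \ge q)$ is computed |
|---|---|
| n | the sample size |
| jump_points | a numeric vector containing the points of (jump) discontinuity, i.e. where the underlying cdf $F(x)$ has jumps |
| Mixed_dist | a pre-specified (user-defined) mixed cdf, $F(x)$, under the null hypothesis |
| ... | values of the parameters of the cdf, $F(x)$, specified by Mixed_dist |
| tol | the value of the numerical tolerance used in the computation; the default is 1e-10 |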
Given a random sample $\{X_1, \ldots, X_n\}$ of size n with an empirical cdf $F_n(x)$, the Kolmogorov-Smirnov goodness-of-fit statistic is defined as $D_n = \sup_x |F_n(x) - F(x)|$, where $F(x)$ is the cdf of a prespecified theoretical distribution under the null hypothesis $H_0$ that $\{X_1, \ldots, X_n\}$ comes from $F(x)$.
The function mixed_ks_c_cdf implements the Exact-KS-FFT method, proposed by Dimitrova, Kaishev, Tan (2020), to compute the complementary cdf $P(D_n \ge q)$ at a value $q$, when $F(x)$ is mixed.
This algorithm ensures a total worst-case run-time of order $O(n^2 \log(n))$.
We have not been able to identify an alternative fast and accurate method (software) that has been developed or implemented for the case when the hypothesized $F(x)$ is mixed.
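Since jump_points has to list exactly the discontinuities of the user-defined cdf, a quick numerical check of the jump sizes can catch mismatches before calling the function. The sketch below is a base-R illustration only: `jump_size` is a hypothetical helper, and the left limit $F(x-)$ is approximated with a small offset `eps`, which is an assumption that works for this example but not for arbitrary cdfs.

```r
# Mixed cdf with jumps at 0 and log(2.5), as in the package example
Mixed_cdf_example <- function(x) {
  if (x < 0) {
    0
  } else if (x == 0) {
    0.5
  } else if (x < log(2.5)) {
    1 - 0.5 * exp(-x)
  } else {
    1
  }
}
jump_points <- c(0, log(2.5))

# Approximate jump size F(x) - F(x-) using a small left offset
jump_size <- function(cdf, x, eps = 1e-9) cdf(x) - cdf(x - eps)

jumps   <- sapply(jump_points, function(p) jump_size(Mixed_cdf_example, p))
no_jump <- jump_size(Mixed_cdf_example, 0.5)  # a point of continuity

print(jumps)
print(no_jump)
```

Here the jumps are close to 0.5 at 0 and 0.2 at log(2.5), while the value at the continuity point 0.5 is numerically negligible.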
Numeric value corresponding to $P(D_n \ge q)$.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
    # Compute the complementary cdf of D_{n}
    # when the underlying distribution is a mixed distribution
    # with two jumps at 0 and log(2.5),
    # as in Example 3.1 of Dimitrova, Kaishev, Tan (2020)

    ## Defining the mixed distribution
    Mixed_cdf_example <- function(x) {
      result <- 0
      if (x < 0){
        result <- 0
      } else if (x == 0){
        result <- 0.5
      } else if (x < log(2.5)){
        result <- 1 - 0.5 * exp(-x)
      } else{
        result <- 1
      }
      return (result)
    }

    KSgeneral::mixed_ks_c_cdf(0.1, 25, c(0, log(2.5)), Mixed_cdf_example)

    ## Not run:
    ## Compute P(D_{n} >= q) for n = 5,
    ## q = 1/5000, 2/5000, ..., 5000/5000
    ## when the underlying distribution is a mixed distribution
    ## with four jumps at 0, 0.2, 0.8, 1.0,
    ## as in Example 2.8 of Dimitrova, Kaishev, Tan (2020)
    n <- 5
    q <- 1:5000/5000

    Mixed_cdf_example <- function(x) {
      result <- 0
      if (x < 0){
        result <- 0
      } else if (x == 0){
        result <- 0.2
      } else if (x < 0.2){
        result <- 0.2 + x
      } else if (x < 0.8){
        result <- 0.5
      } else if (x < 1){
        result <- x - 0.1
      } else{
        result <- 1
      }
      return (result)
    }

    plot(q, sapply(q, function(x) KSgeneral::mixed_ks_c_cdf(x, n,
         c(0, 0.2, 0.8, 1.0), Mixed_cdf_example)), type='l')
    ## End(Not run)
Computes the p-value $P(D_n \ge d_n)$, where $d_n$ is the value of the KS test statistic computed based on a data sample $\{x_1, \ldots, x_n\}$, when $F(x)$ is mixed, using the Exact-KS-FFT method, which expresses the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process that is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)).
    mixed_ks_test(x, jump_points, Mixed_dist, ..., tol = 1e-10)
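| x | a numeric vector of data sample values $\{x_1, \ldots, x_n\}$ |
|---|---|
| jump_points | a numeric vector containing the points of (jump) discontinuity, i.e. where the underlying cdf $F(x)$ has jumps |
| Mixed_dist | a pre-specified (user-defined) mixed cdf, $F(x)$, under the null hypothesis |
| ... | values of the parameters of the cdf, $F(x)$, specified by Mixed_dist |
| tol | the value of the numerical tolerance used in the computation; the default is 1e-10 |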
Given a random sample $\{X_1, \ldots, X_n\}$ of size n with an empirical cdf $F_n(x)$, the Kolmogorov-Smirnov goodness-of-fit statistic is defined as $D_n = \sup_x |F_n(x) - F(x)|$, where $F(x)$ is the cdf of a prespecified theoretical distribution under the null hypothesis $H_0$ that $\{X_1, \ldots, X_n\}$ comes from $F(x)$.
The function mixed_ks_test implements the Exact-KS-FFT method expressing the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)).
This algorithm ensures a total worst-case run-time of order $O(n^2 \log(n))$.
The function mixed_ks_test computes the p-value $P(D_n \ge d_n)$, where $d_n$ is the value of the KS test statistic computed from a user-provided data sample $\{x_1, \ldots, x_n\}$, when $F(x)$ is mixed.
We have not been able to identify an alternative fast and accurate method (software) that has been developed or implemented for the case when the hypothesized $F(x)$ is mixed.
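To make the definition of the statistic concrete when $F(x)$ has jumps, the observed value $d_n$ can be computed by hand in base R. This is a sketch under stated assumptions, not a description of how the package computes it: the left-limit function `F0_left` is written out explicitly for this particular cdf, and all helper names are chosen here for illustration. The supremum is found by comparing $F_n$ and $F$ at each distinct data point and just to its left.

```r
# Mixed cdf with jumps at 0 and log(2.5), as in the package example
F0 <- function(x) {
  if (x < 0) 0
  else if (x == 0) 0.5
  else if (x < log(2.5)) 1 - 0.5 * exp(-x)
  else 1
}

# Left limit F(x-), written out explicitly for this cdf
F0_left <- function(x) {
  if (x <= 0) 0
  else if (x <= log(2.5)) 1 - 0.5 * exp(-x)
  else 1
}

test_data <- c(0, 0, 0, 0, 0, 0, 0.1, 0.2, 0.3, 0.4,
               0.5, 0.6, 0.7, 0.8, log(2.5), log(2.5),
               log(2.5), log(2.5), log(2.5), log(2.5))

xs      <- sort(unique(test_data))
Fn      <- ecdf(test_data)(xs)     # F_n at each distinct point
Fn_left <- c(0, head(Fn, -1))      # F_n just to the left of each point
Fx      <- sapply(xs, F0)
Fx_left <- sapply(xs, F0_left)

# F_n is constant between data points and F is non-decreasing, so the
# supremum is attained at a data point or at its left limit.
d_n <- max(abs(Fn - Fx), abs(Fn_left - Fx_left))
print(d_n)

if (requireNamespace("KSgeneral", quietly = TRUE)) {
  print(KSgeneral::mixed_ks_test(test_data, c(0, log(2.5)), F0))
}
```

For this sample the largest deviation occurs just to the left of 0.1, where the empirical cdf is still 0.3 while $F$ has already reached $1 - 0.5 e^{-0.1}$.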
A list with class "htest" containing the following components:
| statistic | the value of the statistic. |
|---|---|
| p.value | the p-value of the test. |
| alternative | "two-sided". |
| data.name | a character string giving the name of the data. |
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.
    # Example to compute the p-value of the one-sample two-sided KS test,
    # when the underlying distribution is a mixed distribution
    # with two jumps at 0 and log(2.5),
    # as in Example 3.1 of Dimitrova, Kaishev, Tan (2020)

    # Defining the mixed distribution
    Mixed_cdf_example <- function(x) {
      result <- 0
      if (x < 0){
        result <- 0
      } else if (x == 0){
        result <- 0.5
      } else if (x < log(2.5)){
        result <- 1 - 0.5 * exp(-x)
      } else{
        result <- 1
      }
      return (result)
    }

    test_data <- c(0,0,0,0,0,0,0.1,0.2,0.3,0.4,
                   0.5,0.6,0.7,0.8,log(2.5),log(2.5),
                   log(2.5),log(2.5),log(2.5),log(2.5))
    KSgeneral::mixed_ks_test(test_data, c(0, log(2.5)), Mixed_cdf_example)

    ## Compute the p-value of a two-sided K-S test
    ## when F(x) follows a zero-and-one-inflated
    ## beta distribution, as in Example 3.3
    ## of Dimitrova, Kaishev, Tan (2020)

    ## The data set is the proportion of inhabitants
    ## living within a 200 kilometer wide coastal strip
    ## in 232 countries in the year 2010
    data("Population_Data")
    mu <- 0.6189
    phi <- 0.6615
    a <- mu * phi
    b <- (1 - mu) * phi

    Mixed_cdf_example <- function(x) {
      result <- 0
      if (x < 0){
        result <- 0
      } else if (x == 0){
        result <- 0.1141
      } else if (x < 1){
        result <- 0.1141 + 0.4795 * pbeta(x, a, b)
      } else{
        result <- 1
      }
      return (result)
    }

    KSgeneral::mixed_ks_test(Population_Data, c(0, 1), Mixed_cdf_example)
This data set contains the proportion of inhabitants living within a 200 kilometer wide coastal strip in 232 countries in the year 2010. In Example 3.3 of Dimitrova, Kaishev, Tan (2020), the data set is modelled using a zero-and-one-inflated beta distribution under the null hypothesis, and a one-sample two-sided Kolmogorov-Smirnov test is performed to test whether the proposed distribution fits the data well enough.
    data("Population_Data")
A data frame with 232 observations on the proportion of inhabitants living within a 200 kilometer wide coastal strip in 2010.
https://sedac.ciesin.columbia.edu/data/set/nagdc-population-landscape-climate-estimates-v3
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.