We remind that a random variable X following an admixture distribution has cumulative distribution function (cdf) L given by L(x) = pF(x) + (1 − p)G(x), x ∈ ℝ, where G is a mixture component whose distribution is perfectly known, whereas p and F are unknown. In this setting, if no parametric assumption is made on the unknown component distribution F, the mixture is considered as a semiparametric mixture. For an overview on semiparametric extensions of finite mixture models, see (Xiang and Yang 2018).
The goal of this vignette is to introduce the functionalities that enable to perform hypothesis tests on the unknown component distribution F. We aim to test whether F belongs to certain parametric family (e.g. the Gaussian one) in a 1-sample case, or if two different decontaminated versions of F1 and F2 (obtained from two observed samples X1 and X2) are similar in the K-sample case (K ≥ 2). All the specific tests presented hereafter can be performed using one single generic function for testing with appropriate arguments, the so-called admix_test function.
In this setting, the test to be performed is a parametric family testing, i.e. H0: F ∈ ℱ against H1: F ∉ ℱ, where ℱ = {Fθ: θ ∈ Θ}.
The support of the known component density has to be in line with the one of the unknown component density. Such tests have been introduced in (Pommeret and Vandekerkhove 2019), and the idea underlying this hypothesis test follows these steps:
Because of the use of asymptotically normal estimators, it is not possible to use the estimator provided in (Patra and Sen 2016) to perform hypothesis testing. On the contrary, Bordes and Vandekerkhove (2010) propose an estimator that can be used if the unknown component density is assumed to be symmetric. More generally, Pommeret and Vandekerkhove (2019) give more details about the distribution of the test statistic under the null (hypothesis H0), and under the alternative H1.
Here, the implemented function allows to perform the so-called Gaussianity test, meaning that the parametric family against which the unknown component is tested belongs to Gaussian distributions. Below is an example of hypothesis testing in this 1-sample case:
####### Under the null hypothesis H0.
## Simulate mixture data:
mixt1 <- twoComp_mixt(n = 300, weight = 0.6,
comp.dist = list("norm", "norm"),
comp.param = list(c("mean" = 2, "sd" = 0.5),
c("mean" = 0, "sd" = 1)))
data1 <- getmixtData(mixt1)
## Define the admixture model:
admixMod <- admix_model(knownComp_dist = mixt1$comp.dist[[2]],
knownComp_param = mixt1$comp.param[[2]])
admix_test(samples = list(data1), admixMod = list(admixMod), conf_level = 0.95,
test_method = "poly", ask_poly_param = FALSE, support = "Real")
#> Call:admix_test(samples = list(data1), admixMod = list(admixMod),
#> test_method = "poly", conf_level = 0.95, ask_poly_param = FALSE,
#> support = "Real")
#>
#> Do we reject the null hypothesis? No
#> Here is the associated p-value of the test: 0.861
The result of the test is that we cannot reject the null hypothesis H0, which is in line with the specified distribution for the unknown component. Indeed, simulated data is a Gaussian mixture with two components, i.e. F ∼ 𝒩(μ, σ) where μ = 2 and σ = 0.5.
Let us introduce two random samples X1 and X2 following admixture models, such that
The goal here is to perform the following hypothesis test: H0: F1 = F2 against H1 : F1 ≠ F2.
In this framework, we assume that F1 and F2 both have a symmetric density. This way the normally-distributed estimator of p1 and p2, proposed in (Bordes and Vandekerkhove 2010), can be used together with the testing strategy suggested in (Milhaud et al. 2022). This testing strategy is closely connected to (Pommeret and Vandekerkhove 2019), where the computation of the expansion coefficients is duplicated on each of the two samples under study.
In what follows, we simulate two samples under the null and check whether the test provides satisfactory results.
mixt1 <- twoComp_mixt(n = 600, weight = 0.8,
comp.dist = list("norm", "norm"),
comp.param = list(list("mean" = 3, "sd" = 0.5),
list("mean" = 0, "sd" = 1)))
mixt2 <- twoComp_mixt(n = 800, weight = 0.7,
comp.dist = list("norm", "norm"),
comp.param = list(list("mean" = 3, "sd" = 0.5),
list("mean" = 0, "sd" = 1)))
data1 <- getmixtData(mixt1)
data2 <- getmixtData(mixt2)
## Define the admixture models:
admixMod1 <- admix_model(knownComp_dist = mixt1$comp.dist[[2]],
knownComp_param = mixt1$comp.param[[2]])
admixMod2 <- admix_model(knownComp_dist = mixt2$comp.dist[[2]],
knownComp_param = mixt2$comp.param[[2]])
## Using expansion coefficients in orthonormal polynomial basis:
admix_test(samples = list(data1,data2), admixMod = list(admixMod1,admixMod2), conf_level = 0.95,
test_method = "poly", ask_poly_param = FALSE, support = "Real")
#> Call:admix_test(samples = list(data1, data2), admixMod = list(admixMod1,
#> admixMod2), test_method = "poly", conf_level = 0.95, ask_poly_param = FALSE,
#> support = "Real")
#>
#> Do we reject the null hypothesis? Yes
#> Here is the associated p-value of the test: 0.038
The hypothesis test concludes that the null hypothesis cannot be rejected, once again in line with what was expected given the specified parameters when simulating the data.
Note that the following arguments, involved in subroutines, were set to default values (but that the user can choose them setting parameter ‘ask_poly_param’ to TRUE):
When the unknown component distributions are not supposed to have symmetric densities, another solution to perform the test is to set ‘est_method’ to ‘PS’ following ‘ask_poly_param’ set to TRUE, but keeping in mind that plugging in such estimators for the test procedure should not be allowed in theory. However, that works quite well in practice. Another solution to perform this test in full generality is to use the IBM method (see below).
Estimation of the unknown quantities is made by the Inversion - Best Matching approach, see (Milhaud et al. 2024b). In this case, one can still use the function with same first arguments except ‘method’, and ‘method’ should be set to ‘icv’. The user also has to define the number of simulated gaussian processes used to tabulate the test statistic distribution (‘n_sim_tab’), and can accelerate computations using parallel computations and choosing an adequate number of cpus. Other arguments such as support are useless.
We introduce hereafter a natural extension of the two-sample case to the K-sample one, see (Milhaud et al. 2024a). In what follows, the K-sample test is illustrated within the framework of the IBM approach, i.e. using the associated inner convergence property. Of course, in the case when all the unknown component densities are assumed to be symmetric, one could use a pairwise version of the two sample test using the comparison of expansion coefficients in a polynomial orthonormal basis, associated to the estimation method provided by (Bordes and Vandekerkhove 2010).
Consider K samples. For i = 1, ..., K, sample X(i) = (X1(i), ..., Xni(i)) follows Li(x) = piFi(x) + (1 − pi)Gi, x ∈ ℝ. The test to perform is given by H0: F1 = ... = FK against H1: Fi ≠ Fj for some i ≠ j. We use the IBM approach to do so, where assumptions are (straightforwardly) adapted to deal with the K samples.
Basically, we apply the theoretical results of IBM for each pair of populations (i, j), and then build a series of embedded statistics.
Consider the set of pair indices: ${\cal S}(K) = \{(i,j)\in \mathbb{N}^2 ; \; 1\leq i<j \leq K\}$.\ Order ${\cal S}(K)$ lexicographically, and denote rK[(i, j)] the rank of (i, j) in the set S(K).
Then, ∀i ≠ j ∈ {1, ..., K},
We then obtain d(K) = K(K − 1)/2 comparisons that we embed in a series of statistics:
To choose automatically the right order k for testing, consider the penalization rule (mimicking Schwarz criteria procedure, see (Schwarz 1978)): S(n) = min {arg max1 ≤ k ≤ d(K)(Uk − k∑(i, j) ∈ S(K)ln(i, j) 1{rK(i, j) = k})}.
Our data-driven test statistic is given by Ũn = US(n).
It can be shown that under H0 and appropriate assumptions, S(n) converges in probablity towards 1 as n → +∞; meaning that we asymptotically choose the first element of ${\cal S}(K)$.\ Moreover, under H0, US(n) converges in law towards U0(1, 2), which is exactly the null limit distribution studied in the two-sample case. Finally, we thus consider the H0-rejection rule: Ũn ≥ q̂1 − α ⇒ H0 is rejected.
We now provide the way to perform this test with the package admix with Gaussian mixtures. First, let us study the case where we are under the null hypothesis H0, considering K = 3 different populations.
mixt1 <- twoComp_mixt(n = 450, weight = 0.4,
comp.dist = list("norm", "norm"),
comp.param = list(c("mean" = -2, "sd" = 0.5),
c("mean" = 0, "sd" = 1)))
mixt2 <- twoComp_mixt(n = 600, weight = 0.24,
comp.dist = list("norm", "norm"),
comp.param = list(c("mean" = -2, "sd" = 0.5),
c("mean" = -1, "sd" = 1)))
mixt3 <- twoComp_mixt(n = 400, weight = 0.53,
comp.dist = list("norm", "norm"),
comp.param = list(c("mean" = -2, "sd" = 0.5),
c("mean" = 2, "sd" = 1)))
data1 <- getmixtData(mixt1)
data2 <- getmixtData(mixt2)
data3 <- getmixtData(mixt3)
admixMod1 <- admix_model(knownComp_dist = mixt1$comp.dist[[2]],
knownComp_param = mixt1$comp.param[[2]])
admixMod2 <- admix_model(knownComp_dist = mixt2$comp.dist[[2]],
knownComp_param = mixt2$comp.param[[2]])
admixMod3 <- admix_model(knownComp_dist = mixt3$comp.dist[[2]],
knownComp_param = mixt3$comp.param[[2]])
admix_test(samples = list(data1, data2, data3),
admixMod = list(admixMod1, admixMod2, admixMod3),
conf_level = 0.95, test_method = "icv", n_sim_tab = 8,
tune_penalty = FALSE, parallel = FALSE, n_cpu = 2)
#> Call:admix_test(samples = list(data1, data2, data3), admixMod = list(admixMod1,
#> admixMod2, admixMod3), test_method = "icv", conf_level = 0.95,
#> n_sim_tab = 8, tune_penalty = FALSE, parallel = FALSE, n_cpu = 2)
#>
#> Do we reject the null hypothesis? No
#> Here is the associated p-value of the test: 1