Title: | VALID Inference for Clusters Separation Testing |
---|---|
Description: | Given a partition resulting from any clustering algorithm, the implemented tests allow valid post-clustering inference by testing if a given variable significantly separates two of the estimated clusters. Methods are detailed in: Hivert B, Agniel D, Thiebaut R & Hejblum BP (2022). "Post-clustering difference testing: valid inference and practical considerations", <arXiv:2210.13172>. |
Authors: | Benjamin Hivert |
Maintainer: | Benjamin Hivert |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2024-10-30 06:55:51 UTC |
Source: | CRAN |
Merged version of the selective test
merge_selective_inference(X, k1, k2, g, ndraws = 2000, cl_fun, cl)
X |
The data matrix on which the clustering is applied |
k1 |
The first cluster of interest |
k2 |
The second cluster of interest |
g |
The variables for which the test is applied |
ndraws |
The number of Monte-Carlo samples |
cl_fun |
The clustering function used to build clusters |
cl |
The labels of the data obtained with the clustering function cl_fun |
A list with the following elements
pval
: The resulting p-values of the test.
adjacent
: List of the adjacent clusters between k1 and k2
pval_adj
: The corresponding adjacent p-values that are merged
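The merged test combines the p-values of the adjacent cluster pairs lying between k1 and k2. The sketch below illustrates one way such a merge could work. It assumes that adjacency is defined by ordering the clusters by their mean on variable g and that the p-values are combined with a harmonic mean. Both choices, and the helper names adjacent_pairs() and harmonic_merge(), are illustrative assumptions rather than the package's confirmed internals, and the p-values are made up.

```r
# Hypothetical helpers, for illustration only (not part of the package API).

# List the consecutive cluster pairs between k1 and k2 once the clusters are
# ordered by their mean on variable g (assumed definition of adjacency).
adjacent_pairs <- function(X, cl, g, k1, k2) {
  means <- tapply(X[, g], cl, mean)           # per-cluster mean of variable g
  ordered <- names(sort(means))               # cluster labels ordered by mean
  pos <- sort(match(as.character(c(k1, k2)), ordered))
  path <- ordered[pos[1]:pos[2]]              # clusters lying between k1 and k2
  lapply(seq_len(length(path) - 1), function(i) path[i:(i + 1)])
}

# Harmonic mean of a vector of p-values (assumed merging rule).
harmonic_merge <- function(pvals) {
  length(pvals) / sum(1 / pvals)
}

set.seed(42)
X <- matrix(rnorm(200), ncol = 2)
cl <- as.factor(cutree(hclust(dist(X), method = "ward.D2"), k = 4))

adj <- adjacent_pairs(X, cl, g = 1, k1 = 1, k2 = 4)
print(adj)

pval_adj <- c(0.01, 0.20, 0.03)               # made-up p-values, illustration only
harmonic_merge(pval_adj)
```

In this sketch each element of adj is a pair of cluster labels, which mirrors the adjacent element of the returned list, and pval_adj plays the role of the p-values that are merged into pval.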
X <- matrix(rnorm(200), ncol = 2)
hcl_fun <- function(x) {
  as.factor(cutree(hclust(dist(x), method = "ward.D2"), k = 4))
}
cl <- hcl_fun(X)
plot(X, col = cl)
# Note that in practice the value of ndraws (the number of Monte-Carlo simulations) must be higher
test_var1 <- test_selective_inference(X, k1 = 1, k2 = 4, g = 1, ndraws = 100, cl_fun = hcl_fun, cl = cl)
Multimodality test for post clustering variable involvement
test_multimod(X, g, cl, k1, k2)
X |
The data matrix on which the clustering is applied |
g |
The variable on which the test is applied |
cl |
The labels of the data obtained thanks to a clustering algorithm |
k1 |
The first cluster of interest |
k2 |
The second cluster of interest |
A list with the following elements
data_for_test
: The data used for the test
stat_g
: The dip statistic
pval
: The resulting p-values of the test computed with the diptest function
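As a rough illustration of the idea behind this test, the sketch below pools the values of variable g from the two clusters of interest and, when the diptest package is installed, applies Hartigan's dip test to them. That test_multimod() proceeds exactly this way is an assumption, and pooled_values() is a hypothetical helper, not a package function.

```r
# Hypothetical helper: values of variable g for observations in cluster k1 or k2.
pooled_values <- function(X, g, cl, k1, k2) {
  X[cl %in% c(k1, k2), g]
}

set.seed(1)
# Simulated data: two groups shifted apart on variable 1, not on variable 2
X <- rbind(
  matrix(rnorm(100), ncol = 2),
  matrix(rnorm(100), ncol = 2) + cbind(rep(6, 50), rep(0, 50))
)
cl <- as.factor(cutree(hclust(dist(X), method = "ward.D2"), k = 2))

x1 <- pooled_values(X, g = 1, cl = cl, k1 = 1, k2 = 2)

# dip.test() comes from the diptest package; this step is skipped if the
# package is not installed.
if (requireNamespace("diptest", quietly = TRUE)) {
  res <- diptest::dip.test(x1)
  print(res$statistic)  # dip statistic
  print(res$p.value)    # small values suggest more than one mode
}
```

Here x1 corresponds conceptually to data_for_test, and the statistic and p-value of the dip test correspond to stat_g and pval.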
X <- matrix(rnorm(200), ncol = 2)
hcl_fun <- function(x) {
  as.factor(cutree(hclust(dist(x), method = "ward.D2"), k = 2))
}
cl <- hcl_fun(X)
plot(X, col = cl)
test_var1 <- test_multimod(X, g = 1, k1 = 1, k2 = 2, cl = cl)
test_var2 <- test_multimod(X, g = 2, k1 = 1, k2 = 2, cl = cl)
Selective inference for post-clustering variable involvement
test_selective_inference(X, k1, k2, g, ndraws = 2000, cl_fun, cl = NULL, sig = NULL)
X |
The data matrix on which the clustering is applied |
k1 |
The first cluster of interest |
k2 |
The second cluster of interest |
g |
The variables for which the test is applied |
ndraws |
The number of Monte-Carlo samples |
cl_fun |
The clustering function used to build clusters |
cl |
The labels of the data obtained with the clustering function cl_fun |
sig |
The estimated standard deviation. Default is NULL, in which case the standard deviation is estimated using only the observations in the two clusters of interest |
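In the examples of this manual, cl_fun is a function that takes a data matrix and returns a factor of cluster labels. Assuming this is the only contract the tests rely on, a small factory such as the hypothetical make_hcl_fun() below is one way to build such functions for different numbers of clusters or linkage methods. The factory and its argument names are illustrative and not part of the package.

```r
# Hypothetical factory: returns a clustering function with the same shape as
# hcl_fun in the examples (data matrix in, factor of labels out).
make_hcl_fun <- function(k, method = "ward.D2") {
  function(x) {
    as.factor(cutree(hclust(dist(x), method = method), k = k))
  }
}

set.seed(123)
X <- matrix(rnorm(200), ncol = 2)

hcl_fun <- make_hcl_fun(k = 3, method = "complete")
cl <- hcl_fun(X)
table(cl)  # cluster sizes

# cl_fun and cl can then be passed together so that the labels match the
# function that produced them:
# test_selective_inference(X, k1 = 1, k2 = 2, g = 1, cl_fun = hcl_fun, cl = cl)
```

Building cl from the same function that is passed as cl_fun keeps the labels and the clustering procedure consistent, which is what the descriptions of cl and cl_fun suggest.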
A list with the following elements
stat_g
: the test statistic used for the test.
pval
: The resulting p-values of the test.
stder
: The standard deviation of the p-values, computed from the Monte-Carlo samples.
clusters
: The labels of the data.
This function is adapted from clusterpval::test_clusters_approx() by Gao et al. (2022), available on GitHub: https://github.com/lucylgao/clusterpval
Gao, L. L., Bien, J., & Witten, D. (2022). Selective inference for hierarchical clustering. Journal of the American Statistical Association, (just-accepted), 1-27.
X <- matrix(rnorm(200), ncol = 2)
hcl_fun <- function(x) {
  as.factor(cutree(hclust(dist(x), method = "ward.D2"), k = 2))
}
cl <- hcl_fun(X)
plot(X, col = cl)
# Note that in practice the value of ndraws (the number of Monte-Carlo simulations) must be higher
test_var1 <- test_selective_inference(X, k1 = 1, k2 = 2, g = 1, ndraws = 100, cl_fun = hcl_fun, cl = cl)