--- title: "Clustering of unknown subpopulations in admixture models" author: "Xavier Milhaud" bibliography: references.bib output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Clustering of unknown subpopulations in admixture models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE ) ``` ```{r setup} library(admix) ``` The clustering of populations following admixture models is, for now, based on the K-sample test theory (see [@MilhaudPommeretSalhiVandekerkhove2024b]. Consider $K$ samples. For $i=1,...,K$, sample $X^{(i)} = (X_1^{(i)}, ..., X_{n_i}^{(i)})$ follows $$L_i(x) = p_i F_i(x) + (1-p_i) G_i, \qquad x \in \mathbb{R}.$$ We still use IBM approach to perform pairwise hypothesis testing. The idea is to adapt the K-sample test procedure to obtain a data-driven method that cluster the $K$ populations into $N$ subgroups, characterized by a common unknown mixture component. The advantages of such an approach is twofold: * the number $N$ of clusters is automatically chosen by the procedure, * Each subgroup is validated by the K-sample testing method, which has theoretical guarantees. This clustering technique thus allows to cluster unobserved subpopulations instead of individuals. We call this algorithm the K-sample 2-component mixture clustering (K2MC). # Algorithm We now detail the steps of the algorithm. 1. Initialization: create the first cluster to be filled, i.e. $c = 1$. By convention, $S_0=\emptyset$. 2. Select $\{x,y\}={\rm argmin}\{d_n(i,j); i \neq j \in S \setminus \bigcup_{k=1}^c S_{k-1}\}$. 3. Test $H_0$ between $x$ and $y$. If $H_0$ is not rejected then $S_1 = \{x,y\}$,\\ Else $S_1 = \{x\}$, $S_{c+1} = \{y\}$ and then $c=c+1$. 4. While $S\setminus \bigcup_{k=1}^c S_k = \emptyset$ do Select $u={\rm argmin}\{d(i,j); i\in S_c, j\in S\setminus \bigcup_{k=1}^c S_k\}$; Test $H_0$ the simultaneous equality of all the $f_j$, $j\in S_c$ :\\ If $H_0$ not rejected, then put $S_c=S_c\bigcup \{u\}$;\\ Else $S_{c+1} = \{u\}$ and $c = c+1$. # Applications ## On $\mathbb{R}^+$ We present a case study with 5 populations to cluster on $\mathbb{R}^+$, with Gamma-Exponential, Exponential-Exponential and Gamma-Gamma mixtures. ```{r} set.seed(123) ## Simulate mixture data: mixt1 <- twoComp_mixt(n = 6000, weight = 0.8, comp.dist = list("gamma", "exp"), comp.param = list(list("shape" = 16, "scale" = 1/4), list("rate" = 1/3.5))) mixt2 <- twoComp_mixt(n = 6000, weight = 0.7, comp.dist = list("gamma", "exp"), comp.param = list(list("shape" = 14, "scale" = 1/2), list("rate" = 1/5))) mixt3 <- twoComp_mixt(n = 6000, weight = 0.6, comp.dist = list("gamma", "gamma"), comp.param = list(list("shape" = 16, "scale" = 1/4), list("shape" = 12, "scale" = 1/2))) mixt4 <- twoComp_mixt(n = 6000, weight = 0.5, comp.dist = list("exp", "exp"), comp.param = list(list("rate" = 1/2), list("rate" = 1/7))) mixt5 <- twoComp_mixt(n = 6000, weight = 0.5, comp.dist = list("gamma", "exp"), comp.param = list(list("shape" = 14, "scale" = 1/2), list("rate" = 1/6))) data1 <- getmixtData(mixt1) data2 <- getmixtData(mixt2) data3 <- getmixtData(mixt3) data4 <- getmixtData(mixt4) data5 <- getmixtData(mixt5) admixMod1 <- admix_model(knownComp_dist = mixt1$comp.dist[[2]], knownComp_param = mixt1$comp.param[[2]]) admixMod2 <- admix_model(knownComp_dist = mixt2$comp.dist[[2]], knownComp_param = mixt2$comp.param[[2]]) admixMod3 <- admix_model(knownComp_dist = mixt3$comp.dist[[2]], knownComp_param = mixt3$comp.param[[2]]) admixMod4 <- admix_model(knownComp_dist = mixt4$comp.dist[[2]], knownComp_param = mixt4$comp.param[[2]]) admixMod5 <- admix_model(knownComp_dist = mixt5$comp.dist[[2]], knownComp_param = mixt5$comp.param[[2]]) ## Look for the clusters: admix_cluster(samples = list(data1,data2,data3,data4,data5), admixMod = list(admixMod1,admixMod2,admixMod3,admixMod4,admixMod5), conf_level = 0.95, n_sim_tab = 10, tune_penalty = FALSE, tabul_dist = NULL, echo = FALSE, parallel = FALSE, n_cpu = 2) ```