HeteroGGM

Table of contents

  1. Description
  2. GGMPF
  3. PGGMBC

Description

The Gaussian graphical model (GGM)-based framework, which focuses on the precision matrix and conditional dependence, is a popular paradigm for heterogeneity analysis and is more informative than approaches limited to simple distributional properties. In GGM-based analyses, determining the number of subgroups is a challenging and important task. This package contains a recently developed method based on penalized fusion that determines the number of subgroups and recovers the subgrouping structure in a fully data-dependent way. The package also includes Gaussian graphical mixture model methods that require the number of subgroups to be given. The main functions in the package are as follows.

  • GGMPF: This function implements the GGM-based heterogeneity analysis via penalized fusion (Ren et al., 2021).
  • PGGMBC: This function implements the penalized GGM-based clustering with unconstrained covariance matrices (Zhou et al., 2009).
  • summary_network: This function summarizes the characteristics of the resulting network structures, including the overlap of edges across subgroups, the connections of each node, and so on.
  • plot_network: This function visualizes the network structures.

We note that the penalties p(⋅, λ) used in Ren et al. (2021) and Zhou et al. (2009) are MCP and lasso, respectively. Our package provides a variety of penalties for both methods, including convex and concave penalties. The workflow of the package is as follows.

GGMPF

The user needs to set a relatively large number K, an upper bound on the true number of subgroups K0, which is easy to specify based on some biological knowledge. A fusion penalty shrinks the differences between the parameters of the K subgroups and encourages equality, so that a smaller number of subgroups can be obtained. Three tuning parameters λ1, λ2, and λ3 are involved. λ1 and λ2 are routine: they determine the sparsity of the means and precision matrices and regularize estimation. The conditional dependence relationships for each subgroup can be obtained by examining the nonzero estimates in the resulting precision matrices. λ3 is the pivotal parameter that controls the degree of shrinkage of the differences, which implements an effective "search" between 1 and K based on the penalized fusion technique.
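The role of the fusion penalty can be illustrated with a small base-R sketch. The helper names `mcp` and `subgroup_distance`, the concavity parameter `a = 3`, and the specific form of the distance between subgroups are assumptions made for illustration; they are not the package's internal code.

```r
# MCP penalty p(t, lambda); the concavity parameter a = 3 is an assumed value.
mcp <- function(t, lambda, a = 3) {
  t <- abs(t)
  ifelse(t <= a * lambda, lambda * t - t^2 / (2 * a), a * lambda^2 / 2)
}

# Distance between the parameters of two subgroups: Euclidean norm of the mean
# difference combined with the Frobenius norm of the precision difference.
# (Assumed form, for illustration only.)
subgroup_distance <- function(mu1, Theta1, mu2, Theta2) {
  sqrt(sum((mu1 - mu2)^2) + sum((Theta1 - Theta2)^2))
}

p <- 3
mu_a <- rep(0, p); Theta_a <- diag(p)
mu_b <- rep(0, p); Theta_b <- diag(p)        # identical to subgroup a
mu_c <- rep(2, p); Theta_c <- 1.5 * diag(p)  # clearly different from a

d_ab <- subgroup_distance(mu_a, Theta_a, mu_b, Theta_b)
d_ac <- subgroup_distance(mu_a, Theta_a, mu_c, Theta_c)

lambda3 <- 1
pen_ab <- mcp(d_ab, lambda3)  # fused subgroups incur no fusion penalty
pen_ac <- mcp(d_ac, lambda3)  # distinct subgroups incur a bounded penalty
```

Two candidate subgroups with identical parameters contribute nothing to the fusion term, so merging them costs nothing; this is how a larger λ3 pushes the K candidates toward fewer distinct subgroups.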

Data setting

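Let n denote the number of independent subjects. For subject i (= 1, …, n), a p-dimensional measurement xi is available. Further assume that the n subjects belong to K0 subgroups, where the value of K0 is unknown. For the l-th subgroup, assume the Gaussian distribution $f_l(\boldsymbol{x}; \boldsymbol{\mu}_l^*, \boldsymbol{\Sigma}_l^*) = (2\pi)^{-p/2} |\boldsymbol{\Sigma}_l^*|^{-1/2} \exp\{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_l^*)^\top (\boldsymbol{\Sigma}_l^*)^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_l^*)\}$, where the mean $\boldsymbol{\mu}_l^*$ and covariance matrix $\boldsymbol{\Sigma}_l^*$ are unknown. Overall, the xi's follow the mixture distribution $f(\boldsymbol{x}) = \sum_{l=1}^{K_0} \pi_l^* f_l(\boldsymbol{x}; \boldsymbol{\mu}_l^*, \boldsymbol{\Sigma}_l^*)$, where the mixture probabilities πl*'s are also unknown. Our goal is to determine the number of subgroups K0 and estimate the subgrouping structure in a fully data-dependent way.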
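To make the data setting concrete, the following base-R sketch draws n observations from a K0-component Gaussian mixture. All parameter values (`pi_true`, `mu_true`, `Sigma_true`) are made up for illustration and are unrelated to the package's built-in example.data.

```r
set.seed(1)
n <- 300; p <- 4; K0 <- 3
pi_true <- c(0.3, 0.3, 0.4)                         # mixture probabilities
mu_true <- list(rep(-2, p), rep(0, p), rep(2, p))   # subgroup means
Sigma_true <- list(diag(p), 0.5 * diag(p), diag(p) + 0.3)  # subgroup covariances

# Draw a subgroup label for each subject, then draw x_i from that subgroup's Gaussian.
labels <- sample(seq_len(K0), size = n, replace = TRUE, prob = pi_true)
X <- matrix(NA_real_, n, p)
for (l in seq_len(K0)) {
  idx <- which(labels == l)
  R <- chol(Sigma_true[[l]])                        # Sigma = t(R) %*% R
  Z <- matrix(rnorm(length(idx) * p), length(idx), p)
  X[idx, ] <- sweep(Z %*% R, 2, mu_true[[l]], "+")  # rows have covariance Sigma
}
dim(X)
```

In the heterogeneity analysis only `X` is observed; `labels`, K0, and all subgroup parameters are what the methods below try to recover.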

Method

GGM-based heterogeneity analysis via penalized fusion is based on a penalized objective function of (Ω, π) given X, where X denotes the collection of observed data, Ω = (Ω1, ⋯, ΩK), Ωk = vec(μk, Θk) = (μk1, …, μkp, θk11, …, θkp1, …, θk1p, …, θkpp) ∈ ℝ^{p² + p}, Θk = Σk⁻¹ is the k-th precision matrix with ij-th entry θkij, π = (π1, ⋯, πK), ∥ ⋅ ∥F is the Frobenius norm, and p(⋅, λ) is a penalty function with tuning parameter λ > 0, which can be chosen as lasso, SCAD, MCP, or others. K is a known constant that satisfies K > K0. Consider the estimate $(\widehat{\boldsymbol{\Omega}}, \widehat{\boldsymbol{\pi}})$ that optimizes this objective. Denote $\{\widehat{\boldsymbol{\Upsilon}}_1 , \cdots, \widehat{\boldsymbol{\Upsilon}}_{\widehat{K}_0} \}$ as the distinct values of $\widehat{\boldsymbol{\Omega}}$, that is, $\{k: \widehat{\boldsymbol{\Omega}}_k \equiv \widehat{\boldsymbol{\Upsilon}}_l, k=1, \cdots, K \}_{ l=1, \cdots, \widehat{K}_0 }$ constitutes a partition of {1, ⋯, K}. Then there are $\widehat{K}_0$ subgroups with estimated mean and precision parameters in $\widehat{\boldsymbol{\Omega}}$. The mixture probabilities can be extracted from $\widehat{\boldsymbol{\pi}}$.
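For orientation, an objective of the kind described above can be written as follows. This is a sketch assembled from the ingredients named in this section (a mixture log-likelihood, sparsity penalties governed by λ1 and λ2, and a fusion penalty governed by λ3 that uses the Frobenius norm). The symbols $\mathcal{L}$ and $\mathcal{P}$, the 1/n scaling, and the exact arrangement of terms are assumptions and should be checked against Ren et al. (2021).

```latex
\mathcal{L}(\boldsymbol{\Omega}, \boldsymbol{\pi} \mid \boldsymbol{X})
  = \frac{1}{n} \sum_{i=1}^{n} \log\Big( \sum_{k=1}^{K} \pi_k
      f_k(\boldsymbol{x}_i; \boldsymbol{\mu}_k, \boldsymbol{\Theta}_k^{-1}) \Big)
    - \mathcal{P}(\boldsymbol{\Omega}),

\mathcal{P}(\boldsymbol{\Omega})
  = \sum_{k=1}^{K} \sum_{j=1}^{p} p(|\mu_{kj}|, \lambda_1)
  + \sum_{k=1}^{K} \sum_{i \neq j} p(|\theta_{kij}|, \lambda_2)
  + \sum_{k < k'} p\Big( \big( \|\boldsymbol{\mu}_k - \boldsymbol{\mu}_{k'}\|_2^2
      + \|\boldsymbol{\Theta}_k - \boldsymbol{\Theta}_{k'}\|_F^2 \big)^{1/2}, \lambda_3 \Big).
```

Under this form, the first two penalty terms produce sparse means and precision matrices, while the third term is zero exactly when two candidate subgroups share all parameters, which is what allows the number of distinct subgroups to fall below K.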

Example

First, we call the built-in simulation data set (K0 = 3), and set the upper bound of K0 and the sequences of the tuning parameters (λ1, λ2, and λ3).

data(example.data)
K <- 6
lambda <- genelambda.obo(nlambda1=5,lambda1_max=0.5,lambda1_min=0.1,
                         nlambda2=15,lambda2_max=1.5,lambda2_min=0.1,
                         nlambda3=10,lambda3_max=3.5,lambda3_min=0.5)

Apply GGMPF to the data.

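res <- GGMPF(lambda, example.data$data, K, penalty = "MCP")
Theta_hat.list <- res$Theta_hat.list         # estimated precision matrices for each tuning parameter setting
Mu_hat.list <- res$Mu_hat.list               # estimated means for each tuning parameter setting
opt_num <- res$Opt_num                       # index of the selected tuning parameter setting
opt_Mu_hat <- Mu_hat.list[[opt_num]]
opt_Theta_hat <- Theta_hat.list[[opt_num]]
K_hat <- dim(opt_Theta_hat)[3]               # third dimension indexes the subgroups
K_hat  # Output the estimated K0.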
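The conditional dependence relationships are read off the nonzero off-diagonal entries of each estimated precision matrix. The base-R sketch below shows this on a toy p × p × K array that stands in for opt_Theta_hat; the array values and the names `Theta`, `edge_list`, and `n_edges` are made up for illustration.

```r
# Toy stand-in for opt_Theta_hat: a p x p x K_hat array of precision matrices.
p <- 4; K_hat <- 2
Theta <- array(0, dim = c(p, p, K_hat))
Theta[, , 1] <- diag(p)
Theta[1, 2, 1] <- Theta[2, 1, 1] <- 0.4
Theta[, , 2] <- diag(p)
Theta[3, 4, 2] <- Theta[4, 3, 2] <- -0.3
Theta[1, 2, 2] <- Theta[2, 1, 2] <- 0.2

# An edge (i, j) exists in subgroup k when the off-diagonal entry is nonzero.
# Only the upper triangle is scanned so that each edge is counted once.
edge_list <- lapply(seq_len(K_hat), function(k) {
  A <- Theta[, , k] != 0 & upper.tri(Theta[, , k])
  which(A, arr.ind = TRUE)
})
n_edges <- vapply(edge_list, nrow, integer(1))  # number of edges per subgroup
n_edges
```

Each element of `edge_list` is a two-column matrix of node index pairs, which can be compared across subgroups to see shared and subgroup-specific edges.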

Summarize the characteristics of the resulting network structures, and implement visualization of network structures.

summ <- summary_network(opt_Mu_hat, opt_Theta_hat, example.data$data)
summ$Theta_summary$overlap
va_names <- c("6")
linked_node_names(summ, va_names, num_subgroup=1)
plot_network(summ, num_subgroup = c(1:K_hat), plot.mfrow=c(1,K_hat))

PGGMBC

This method combines the Gaussian graphical mixture model with regularization of the means and precision matrices, given the number of subgroups in advance. The two tuning parameters λ1 and λ2 are the same as those in GGMPF. Moreover, users can easily implement BIC-based selection of the number of subgroups using the BIC values that the function outputs.

Data setting

It is the same as for GGMPF.

Method

Given the number of subgroups K0, penalized GGM-based clustering with unconstrained covariance matrices is based on a penalized mixture model in which Ω = (Ω1, ⋯, ΩK0), π = (π1, ⋯, πK0), and the other notation follows the Method section of GGMPF.

Example

First, we call the built-in simulation data set, and give the true K0 and the sequences of the tuning parameters (λ1 and λ2).

data(example.data)
K <- 3
lambda <- genelambda.obo(nlambda1=5,lambda1_max=0.5,lambda1_min=0.1,
                         nlambda2=15,lambda2_max=1.5,lambda2_min=0.1)

Apply PGGMBC to the data.

res <- PGGMBC(lambda, example.data$data, K, initial.selection="K-means")
Theta_hat.list <- res$Theta_hat.list
opt_num <- res$Opt_num
opt_Theta_hat <- Theta_hat.list[[opt_num]]

Summarizing the characteristics of the resulting network structures and visualizing them work the same way as for GGMPF.