Preliminary
Consider the following setting:
Gaussian graphical model (GGM) assumption:
The data matrix $X_{n \times p}$ consists of independent and identically distributed samples $X_1, \dots, X_n \sim N_p(\mu, \Sigma)$.
Disjoint group structure:
The p variables can be partitioned into disjoint groups.
Goal:
Estimate the precision matrix $\Omega = \Sigma^{-1} = (\omega_{ij})_{p \times p}$.
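To make the setting concrete, the following base R sketch simulates data that satisfy both assumptions. The group labels in `membership`, the within-group value 0.4, and the sample size are hypothetical choices made for illustration only.

```r
# Hypothetical example: p = 4 variables split into two disjoint groups.
set.seed(1)
n <- 200
membership <- c(1, 1, 2, 2)   # group label of each variable (assumed)
p <- length(membership)

# Block-diagonal precision matrix: off-diagonal entries are nonzero
# only for pairs of variables in the same group.
same_group <- outer(membership, membership, "==")
Omega <- ifelse(same_group, 0.4, 0)
diag(Omega) <- 1

# Draw X_1, ..., X_n ~ N_p(mu, Sigma) with Sigma = Omega^{-1}.
Sigma <- solve(Omega)
mu <- rep(0, p)
Z <- matrix(rnorm(n * p), n, p)
X <- sweep(Z %*% chol(Sigma), 2, mu, "+")   # each row is one sample

dim(X)
```

Because `chol(Sigma)` returns an upper-triangular factor `R` with `t(R) %*% R` equal to `Sigma`, the rows of `Z %*% R` have covariance `Sigma`.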
Sparse-Group Estimator
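One form of the estimator that is consistent with the definitions that follow is a penalized Gaussian negative log-likelihood. The index ranges of the two sums (for example, whether diagonal entries and diagonal blocks are penalized) are an assumption here, not something this section states:

$$
\hat{\Omega}(\lambda,\alpha,\gamma) = \arg\min_{\Omega \succ 0}
\left\{ -\log\det(\Omega) + \mathrm{tr}(S\Omega) + \mathcal{P}_{\lambda,\alpha,\gamma}(\Omega) \right\},
$$
$$
\mathcal{P}_{\lambda,\alpha,\gamma}(\Omega)
= \alpha\,\mathcal{P}^{\mathrm{idv}}_{\lambda,\gamma}(\Omega)
+ (1-\alpha)\,\mathcal{P}^{\mathrm{grp}}_{\lambda,\gamma}(\Omega),
$$
$$
\mathcal{P}^{\mathrm{idv}}_{\lambda,\gamma}(\Omega) = \sum_{i,j} P_{\lambda,\gamma}(\vert\omega_{ij}\vert),
\qquad
\mathcal{P}^{\mathrm{grp}}_{\lambda,\gamma}(\Omega) = \sum_{g,g'} P_{\lambda,\gamma}(\Vert\Omega_{gg'}\Vert_F),
$$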
where:
$S = n^{-1} \sum_{i=1}^n (X_i-\bar{X})(X_i-\bar{X})^\top$ is the empirical covariance matrix.
$\lambda \geq 0$ is the global regularization parameter controlling overall shrinkage.
$\alpha \in [0, 1]$ is the mixing parameter controlling the balance between element-wise and block-wise penalties.
$\gamma$ is the additional parameter for non-convex penalties, controlling the degree of non-convexity (concavity) of the penalty function.
$\mathcal{P}_{\lambda,\alpha,\gamma}(\Omega)$ is a generic bi-level penalty template that combines element-wise and block-wise regularization, allowing convex or non-convex regularizers while preserving the intrinsic group structure among variables.
$\mathcal{P}^{\mathrm{idv}}_{\lambda,\gamma}(\Omega)$ is the element-wise individual penalty component.
$\mathcal{P}^{\mathrm{grp}}_{\lambda,\gamma}(\Omega)$ is the block-wise group penalty component.
$P_{\lambda,\gamma}(\cdot)$ is the penalty function.
$\Omega_{gg'}$ is the submatrix of $\Omega$ with the rows from group $g$ and columns from group $g'$.
The Frobenius norm $\Vert\Omega\Vert_F$ is defined as $\Vert\Omega\Vert_F = \left(\sum_{i,j} \vert\omega_{ij}\vert^2\right)^{1/2} = \left[\mathrm{tr}(\Omega^\top\Omega)\right]^{1/2}$.
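The quantities defined above can be computed directly in base R. The sketch below is illustrative only: the function names are hypothetical, the group labels are made up, and it assumes that the bi-level penalty weights the element-wise component by $\alpha$ and the block-wise component by $1 - \alpha$, with all entries and all blocks (including diagonal ones) penalized. The Lasso is used as the penalty function so that $\gamma$ is not needed.

```r
set.seed(1)
n <- 50
membership <- c(1, 1, 2, 2)   # hypothetical group labels
X <- matrix(rnorm(n * length(membership)), nrow = n)

# Empirical covariance with the 1/n scaling; note that cov() divides by n - 1.
Xc <- sweep(X, 2, colMeans(X))
S <- crossprod(Xc) / n

# Frobenius norm, written both ways from the definition.
frob <- function(A) sqrt(sum(abs(A)^2))
frob_tr <- function(A) sqrt(sum(diag(crossprod(A))))

# Assumed bi-level Lasso penalty:
# alpha * element-wise part + (1 - alpha) * block-wise part.
bilevel_lasso <- function(Omega, membership, lambda, alpha) {
  idv <- lambda * sum(abs(Omega))
  groups <- unique(membership)
  grp <- 0
  for (g in groups) {
    for (h in groups) {
      block <- Omega[membership == g, membership == h, drop = FALSE]
      grp <- grp + lambda * frob(block)
    }
  }
  alpha * idv + (1 - alpha) * grp
}

Omega <- solve(S)   # unpenalized inverse, used only as an input here
bilevel_lasso(Omega, membership, lambda = 0.1, alpha = 0.5)
```

Under this assumption, setting `alpha = 1` leaves only the element-wise penalty and `alpha = 0` leaves only the block-wise penalty.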
Note:
- The parameter $\gamma$ is only relevant for non-convex penalties. The Lasso penalty can be viewed as a special case in which $\gamma$ is not required.
Penalties
- Lasso: Least absolute shrinkage and selection operator (Tibshirani 1996; Friedman et al. 2008)
$$
P_{\lambda}(\omega_{ij}) = \lambda\vert\omega_{ij}\vert.
$$
- Adaptive lasso (Zou 2006; Fan et al. 2009)
$$
P_{\lambda,\gamma}(\omega_{ij}) = \lambda\frac{\vert\omega_{ij}\vert}{v_{ij}},
$$
where $V = (v_{ij})_{p \times p} = (\vert\tilde{\omega}_{ij}\vert^{\gamma})_{p \times p}$ is a matrix of adaptive weights, and $\tilde{\omega}_{ij}$ is the initial estimate obtained using `penalty = "lasso"`.
- Atan: Arctangent type penalty (Wang and Zhu 2016)
$$
P_{\lambda,\gamma}(\omega_{ij})
= \lambda(\gamma+\frac{2}{\pi})
\arctan\left(\frac{\vert\omega_{ij}\vert}{\gamma}\right),
\quad \gamma > 0.
$$
- Exp: Exponential type penalty (Wang et al. 2018)
$$
P_{\lambda,\gamma}(\omega_{ij})
= \lambda\left[1-\exp\left(-\frac{\vert\omega_{ij}\vert}{\gamma}\right)\right],
\quad \gamma > 0.
$$
- Lq (Frank and Friedman 1993; Fu 1998; Fan and Li 2001)
$$
P_{\lambda,\gamma}(\omega_{ij}) = \lambda\vert\omega_{ij}\vert^{\gamma},
\quad 0 < \gamma < 1.
$$
- LSP: Log-sum penalty (Candès et al. 2008)
$$
P_{\lambda,\gamma}(\omega_{ij})
= \lambda\log\left(1+\frac{\vert\omega_{ij}\vert}{\gamma}\right),
\quad \gamma > 0.
$$
- MCP: Minimax concave penalty (Zhang 2010)
$$
P_{\lambda,\gamma}(\omega_{ij})
= \begin{cases}
\lambda\vert\omega_{ij}\vert - \dfrac{\omega_{ij}^2}{2\gamma},
& \text{if } \vert\omega_{ij}\vert \leq \gamma\lambda, \\
\dfrac{1}{2}\gamma\lambda^2,
& \text{if } \vert\omega_{ij}\vert > \gamma\lambda.
\end{cases}
\quad \gamma > 1.
$$
- SCAD: Smoothly clipped absolute deviation (Fan and Li 2001; Fan et al. 2009)
$$
P_{\lambda,\gamma}(\omega_{ij})
= \begin{cases}
\lambda\vert\omega_{ij}\vert
& \text{if } \vert\omega_{ij}\vert \leq \lambda, \\
\dfrac{2\gamma\lambda\vert\omega_{ij}\vert-\omega_{ij}^2-\lambda^2}{2(\gamma-1)}
& \text{if } \lambda < \vert\omega_{ij}\vert < \gamma\lambda, \\
\dfrac{\lambda^2(\gamma+1)}{2}
& \text{if } \vert\omega_{ij}\vert \geq \gamma\lambda.
\end{cases}
\quad \gamma > 2.
$$
Note:
- For Lasso, which is convex, the additional parameter $\gamma$ is not required, and the penalty function $P_{\lambda,\gamma}(\cdot)$ simplifies to $P_{\lambda}(\cdot)$.
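The formulas above translate almost line by line into vectorized base R. The following is a minimal sketch written from the displayed definitions; the function names (`pen_lasso`, `pen_mcp`, and so on) are hypothetical and are not part of any package, and the $\gamma$ values in the usage lines are arbitrary choices within each penalty's stated range.

```r
pen_lasso <- function(w, lambda) lambda * abs(w)

# Adaptive lasso: weights come from an initial estimate w_init.
# An initial estimate of exactly zero gives a zero weight, hence an
# infinite (or NaN) penalty.
pen_adapt <- function(w, lambda, gamma, w_init) lambda * abs(w) / abs(w_init)^gamma

pen_atan <- function(w, lambda, gamma) lambda * (gamma + 2 / pi) * atan(abs(w) / gamma)
pen_exp  <- function(w, lambda, gamma) lambda * (1 - exp(-abs(w) / gamma))
pen_lq   <- function(w, lambda, gamma) lambda * abs(w)^gamma
pen_lsp  <- function(w, lambda, gamma) lambda * log(1 + abs(w) / gamma)

pen_mcp <- function(w, lambda, gamma) {
  a <- abs(w)
  ifelse(a <= gamma * lambda,
         lambda * a - a^2 / (2 * gamma),
         gamma * lambda^2 / 2)
}

pen_scad <- function(w, lambda, gamma) {
  a <- abs(w)
  ifelse(a <= lambda,
         lambda * a,
         ifelse(a < gamma * lambda,
                (2 * gamma * lambda * a - a^2 - lambda^2) / (2 * (gamma - 1)),
                lambda^2 * (gamma + 1) / 2))
}

# Evaluate a few penalties on a small grid (lambda = 1).
w <- c(-4, -1, 0, 0.5, 1, 4)
cbind(w,
      lasso = pen_lasso(w, 1),
      mcp   = pen_mcp(w, 1, gamma = 3),
      scad  = pen_scad(w, 1, gamma = 3.7))
```

Both MCP and SCAD are continuous at their breakpoints, which gives a quick sanity check on the piecewise definitions: the adjacent branches agree at $\vert\omega_{ij}\vert = \lambda$ and $\vert\omega_{ij}\vert = \gamma\lambda$.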
Illustrative Visualization
Figure 1 compares the penalty functions $P(\omega)$ over a range of $\omega$ values. The main panel (right) shows their behavior for larger $\vert\omega\vert$, while the inset panel (left) magnifies the region near zero, $[-1, 1]$.
library(grasps)   ## for penalty computation
library(ggplot2)  ## for visualization

## Evaluate each penalty on a grid of omega values with lambda = 1
penalties <- c("atan", "exp", "lasso", "lq", "lsp", "mcp", "scad")
pen_df <- compute_penalty(seq(-4, 4, by = 0.01), penalties, lambda = 1)

## Plot the curves, magnifying the region near zero; legend in two rows
plot(pen_df, xlim = c(-1, 1), ylim = c(0, 1), zoom.size = 1) +
  guides(color = guide_legend(nrow = 2, byrow = TRUE))
Figure 2 displays the derivative $P'(\omega)$ of each penalty. The Lasso has a constant derivative, corresponding to uniform shrinkage. For MCP and SCAD, the derivatives are piecewise linear: SCAD matches the Lasso derivative up to $\vert\omega\vert = \lambda$ and then decreases, while MCP starts at the Lasso value at zero and decreases immediately; both reach zero at $\vert\omega\vert = \gamma\lambda$, so larger $\vert\omega\vert$ receive no shrinkage. The other non-convex penalties have smoothly diminishing derivatives as $\vert\omega\vert$ increases, reflecting their tendency to shrink small $\vert\omega\vert$ strongly while exerting little shrinkage on large ones.
## Evaluate each penalty's derivative on omega in [0, 4] with lambda = 1
deriv_df <- compute_derivative(seq(0, 4, by = 0.01), penalties, lambda = 1)
plot(deriv_df) +
  scale_y_continuous(limits = c(0, 1.5)) +
  guides(color = guide_legend(nrow = 2, byrow = TRUE))
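The piecewise behavior of the MCP and SCAD derivatives can also be checked by hand. The sketch below writes out closed-form derivatives for $\omega \geq 0$, obtained by differentiating the penalty definitions given earlier, and compares the MCP derivative with a central finite difference. The helper names and the $\gamma$ values are hypothetical choices for illustration, not part of any package.

```r
# Closed-form derivatives for omega >= 0 (hypothetical helper names).
d_lasso <- function(w, lambda) rep(lambda, length(w))
d_mcp <- function(w, lambda, gamma) pmax(lambda - w / gamma, 0)
d_scad <- function(w, lambda, gamma) {
  ifelse(w <= lambda, lambda, pmax(gamma * lambda - w, 0) / (gamma - 1))
}

# MCP itself, used to check d_mcp by central finite differences.
pen_mcp <- function(w, lambda, gamma) {
  a <- abs(w)
  ifelse(a <= gamma * lambda,
         lambda * a - a^2 / (2 * gamma),
         gamma * lambda^2 / 2)
}

w <- c(0.5, 2, 5)   # small, intermediate, and large values of omega
h <- 1e-6
fd <- (pen_mcp(w + h, 1, 3) - pen_mcp(w - h, 1, 3)) / (2 * h)

cbind(w,
      lasso  = d_lasso(w, 1),
      mcp    = d_mcp(w, 1, gamma = 3),
      scad   = d_scad(w, 1, gamma = 3.7),
      mcp_fd = fd)
```

At $\omega = 5$, beyond $\gamma\lambda$ for both penalties, the MCP and SCAD derivatives are zero while the Lasso derivative remains $\lambda$.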
Reference