| Title: | Semi-Supervised Learning under a Mixed-Missingness Mechanism in Finite Mixture Models |
|---|---|
| Description: | Implements a semi-supervised learning framework for finite mixture models under a mixed-missingness mechanism. The approach models both missing completely at random (MCAR) and entropy-based missing at random (MAR) processes using a logistic–entropy formulation. Estimation is carried out via an Expectation–-Conditional Maximisation (ECM) algorithm with robust initialisation routines for stable convergence. The methodology relates to the statistical perspective and informative missingness behaviour discussed in Ahfock and McLachlan (2020) <doi:10.1007/s11222-020-09971-5> and Ahfock and McLachlan (2023) <doi:10.1016/j.ecosta.2022.03.007>. The package provides functions for data simulation, model estimation, prediction, and theoretical Bayes error evaluation for analysing partially labelled data under a mixed-missingness mechanism. |
| Authors: | Jinran Wu [aut, cre] (ORCID: <https://orcid.org/0000-0002-2388-3614>), Geoffrey J. McLachlan [aut] (ORCID: <https://orcid.org/0000-0002-5921-3145>) |
| Maintainer: | Jinran Wu <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.0 |
| Built: | 2026-05-08 07:42:32 UTC |
| Source: | https://github.com/cran/SSLfmm |
Classifier specified by Bayes' rule.
Assigns .
bayesclassifier(dat, p, g, pi = NULL, mu = NULL, sigma = NULL, paralist = NULL)bayesclassifier(dat, p, g, pi = NULL, mu = NULL, sigma = NULL, paralist = NULL)
dat |
An |
p |
Integer; dimension of the observation vector. |
g |
Integer; number of Gaussian components/classes. |
pi |
Numeric length- |
mu |
Either:
|
sigma |
Either:
|
paralist |
Optional list with elements |
An integer vector of length with predicted class labels in .
# Minimal example with list-style mu and sigma: set.seed(1) p <- 2; g <- 2 pi <- c(0.5, 0.5) mu <- list(`1`=c(0.96, 0.02), `2`=c(-1.02, -0.03)) sigma <- list( matrix(c(0.9417379,0.5447264, 0.5447264,0.9811853), 2, 2, byrow=TRUE), matrix(c(0.9984812,0.3314474, 0.3314474,1.1316865), 2, 2, byrow=TRUE) ) X <- mvtnorm::rmvnorm(5, mean = mu[[1]], sigma = sigma[[1]]) bayesclassifier(X, p=p, g=g, pi=pi, mu=mu, sigma=sigma)# Minimal example with list-style mu and sigma: set.seed(1) p <- 2; g <- 2 pi <- c(0.5, 0.5) mu <- list(`1`=c(0.96, 0.02), `2`=c(-1.02, -0.03)) sigma <- list( matrix(c(0.9417379,0.5447264, 0.5447264,0.9811853), 2, 2, byrow=TRUE), matrix(c(0.9984812,0.3314474, 0.3314474,1.1316865), 2, 2, byrow=TRUE) ) X <- mvtnorm::rmvnorm(5, mean = mu[[1]], sigma = sigma[[1]]) bayesclassifier(X, p=p, g=g, pi=pi, mu=mu, sigma=sigma)
Computes where for two classes
under the common-covariance (LDA) model:
compute_d2(y, mu1, mu2, Sigma_inv, pi1, pi2)compute_d2(y, mu1, mu2, Sigma_inv, pi1, pi2)
y |
Numeric vector (length |
mu1, mu2
|
Numeric vectors of length |
Sigma_inv |
|
pi1, pi2
|
Positive scalars: class prior probabilities (need not sum to 1). |
If y is a vector, a single numeric . If y is a matrix,
a numeric vector of values for each row.
set.seed(1) mu1 <- c(0, 0); mu2 <- c(1, 1) S <- matrix(c(1, .2, .2, 1), 2, 2) Sigma_inv <- solve(S) x <- c(0.5, -0.2) compute_d2(x, mu1, mu2, Sigma_inv, pi1 = 0.6, pi2 = 0.4) X <- matrix(rnorm(10 * 2), ncol = 2) compute_d2(X, mu1, mu2, Sigma_inv, 0.5, 0.5)set.seed(1) mu1 <- c(0, 0); mu2 <- c(1, 1) S <- matrix(c(1, .2, .2, 1), 2, 2) Sigma_inv <- solve(S) x <- c(0.5, -0.2) compute_d2(x, mu1, mu2, Sigma_inv, pi1 = 0.6, pi2 = 0.4) X <- matrix(rnorm(10 * 2), ncol = 2) compute_d2(X, mu1, mu2, Sigma_inv, 0.5, 0.5)
Runs an EM-like procedure that models a mixed-missingness mechanism:
unlabeled indicator follows a mixed-missingness mechanism with MCAR (prob )
and entropy-based MAR via a logistic link .
Supports shared (ncov = 1) or class-specific (ncov = 2) covariance.
EM_FMM_SemiSupervised( data, g = 2, init_res, max_iter = 5, tol = 1e-06, ncov = 1, verbose = FALSE )EM_FMM_SemiSupervised( data, g = 2, init_res, max_iter = 5, tol = 1e-06, ncov = 1, verbose = FALSE )
data |
A data.frame or matrix with |
g |
Integer, number of mixture components (classes). |
init_res |
A list with initial parameters:
|
max_iter |
Integer, max EM iterations. |
tol |
Convergence tolerance on log-likelihood increase. |
ncov |
Integer covariance structure: |
verbose |
Logical; if |
This function expects the following helpers to be available:
pack_theta(pi_k, mu_k, Sigma_k, g, p, ncov)
unpack_theta(theta, g, p, ncov)
neg_loglik(theta, Y_all, m_j, Z_all, d2_yj, xi, alpha_k, unpacker)
get_entropy(dat, n, p, g, paralist) returning per-observation
entropy-like values
A list with elements:
pi, mu, Sigma, xi, alpha, loglik,
and ncov.
## Toy example using a simple partially-labelled dataset set.seed(1) ## 1) Construct an n x (p+2) partially-labelled dataset: ## first p columns = features, then 'missing' and 'z' n <- 100; p <- 2; g <- 2 X <- matrix(rnorm(n * p), nrow = n, ncol = p) ## missing: 0 = labelled, 1 = unlabelled missing <- rbinom(n, size = 1, prob = 0.3) ## z: observed class labels for labelled rows, NA for unlabelled z <- rep(NA_integer_, n) z[missing == 0] <- sample(1:g, sum(missing == 0), replace = TRUE) sim_dat <- data.frame(X, missing = missing, z = z) ## 2) Warm-up initialisation using the complete-data initializer init <- EM_FMM_SemiSupervised_Complete_Initial( data = sim_dat, g = g, ncov = 1 ) ## 3) Run the main EM algorithm (small number of iterations) fit <- EM_FMM_SemiSupervised( data = sim_dat, g = g, init_res = init, ncov = 1, max_iter = 5, verbose = FALSE ) str(fit)## Toy example using a simple partially-labelled dataset set.seed(1) ## 1) Construct an n x (p+2) partially-labelled dataset: ## first p columns = features, then 'missing' and 'z' n <- 100; p <- 2; g <- 2 X <- matrix(rnorm(n * p), nrow = n, ncol = p) ## missing: 0 = labelled, 1 = unlabelled missing <- rbinom(n, size = 1, prob = 0.3) ## z: observed class labels for labelled rows, NA for unlabelled z <- rep(NA_integer_, n) z[missing == 0] <- sample(1:g, sum(missing == 0), replace = TRUE) sim_dat <- data.frame(X, missing = missing, z = z) ## 2) Warm-up initialisation using the complete-data initializer init <- EM_FMM_SemiSupervised_Complete_Initial( data = sim_dat, g = g, ncov = 1 ) ## 3) Run the main EM algorithm (small number of iterations) fit <- EM_FMM_SemiSupervised( data = sim_dat, g = g, init_res = init, ncov = 1, max_iter = 5, verbose = FALSE ) str(fit)
Uses both labeled and unlabeled subsets of the data to obtain quick initial
estimates for mixture parameters and missingness mechanism parameters
(alpha, xi) via a warm-up EM procedure.
EM_FMM_SemiSupervised_Complete_Initial( data, g = 2, ncov = 1, alpha_init = 0.01, warm_up_iter = 200, tol = 1e-06 )EM_FMM_SemiSupervised_Complete_Initial( data, g = 2, ncov = 1, alpha_init = 0.01, warm_up_iter = 200, tol = 1e-06 )
data |
A data frame containing:
|
g |
Integer, number of mixture components (default |
ncov |
Integer, covariance structure:
|
alpha_init |
Numeric in (0,1), initial MCAR proportion (default |
warm_up_iter |
Integer, number of warm-up EM iterations (default |
tol |
Convergence tolerance on |
This function first calls initialestimate to get initial
, , .
Then it calls EM_FMM_SemiSupervised_Initial with these
values for a short warm-up run.
Covariance structure (equal vs. unequal) is determined
by ncov.
A list with initial values for EM_FMM_SemiSupervised:
pi - mixture weights.
mu - list of component mean vectors.
Sigma - covariance matrix/matrices.
alpha - MCAR proportion.
xi - logistic regression coefficients for MAR mechanism.
Provides rough initial estimates of the missingness parameters alpha
and xi, together with mixture parameters pi, mu, and
covariance matrices, using a lightweight EM-style routine. The covariance
structure is chosen automatically based on Sigma_init:
If Sigma_init is a matrix, a
shared (equal) covariance is used.
If Sigma_init is a list of length g of
matrices or a array,
class-specific (unequal) covariances are used.
If Sigma_init is NULL, a shared covariance is
estimated from the labeled data.
This function is intended as a fast, heuristic initializer rather than a final estimator for the mixed missingness model.
EM_FMM_SemiSupervised_Initial( Y_labelled, Z_labelled, Y_unlabelled, g = 2, pi_init = NULL, mu_init = NULL, Sigma_init = NULL, alpha_init = 0.01, warm_up_iter = 50, tol = 1e-06 )EM_FMM_SemiSupervised_Initial( Y_labelled, Z_labelled, Y_unlabelled, g = 2, pi_init = NULL, mu_init = NULL, Sigma_init = NULL, alpha_init = 0.01, warm_up_iter = 50, tol = 1e-06 )
Y_labelled |
Numeric matrix of labeled observations ( |
Z_labelled |
Integer vector of class labels in |
Y_unlabelled |
Numeric matrix of unlabeled observations
( |
g |
Integer, number of mixture components (default |
pi_init |
Optional numeric length- |
mu_init |
Optional list of length |
Sigma_init |
Optional initial covariance:
a |
alpha_init |
Numeric in |
warm_up_iter |
Integer, number of warm-up EM iterations used to
refine the quick initial estimates (default |
tol |
Convergence tolerance on |
A list with elements:
pi - length-g vector of mixing proportions.
mu - list of g mean vectors.
Sigma - shared matrix (equal-Sigma) or list
of g matrices (unequal-Sigma).
xi - length-2 numeric vector c(xi0, xi1) from the
logistic MAR model.
alpha - estimated MCAR proportion.
gamma - responsibility matrix.
d2_yj - numeric vector of entropy-based scores used in the
missingness model.
This is a heuristic warm-up routine. It requires helper functions
normalise_logprob() and get_entropy() to be available in the
package namespace.
Computes the Bayes'classification error rate for a two-component Gaussian
mixture given mu_hat, Sigma_hat, and pi_hat. If
mu_hat is supplied as a list of length 2, it is converted to a
matrix internally.
error_beta_classification(mu_hat, Sigma_hat, pi_hat)error_beta_classification(mu_hat, Sigma_hat, pi_hat)
mu_hat |
Either a numeric matrix of size |
Sigma_hat |
Numeric |
pi_hat |
Numeric vector of length 2 with mixing proportions
|
The linear discriminant is
and the Bayes' error is
where is the standard normal cdf.
A numeric scalar giving the theoretical Bayes' classification error rate.
mu_hat <- matrix(c(1, 0, -1, 0), nrow = 2) # columns are mu1, mu2 Sigma_hat <- diag(2) pi_hat <- c(0.5, 0.5) error_beta_classification(mu_hat, Sigma_hat, pi_hat)mu_hat <- matrix(c(1, 0, -1, 0), nrow = 2) # columns are mu1, mu2 Sigma_hat <- diag(2) pi_hat <- c(0.5, 0.5) error_beta_classification(mu_hat, Sigma_hat, pi_hat)
Compute posterior membership probabilities for each
observation under a Gaussian mixture with either a shared covariance
matrix or component-specific covariances.
get_clusterprobs( dat, n, p, g, pi = NULL, mu = NULL, sigma = NULL, paralist = NULL )get_clusterprobs( dat, n, p, g, pi = NULL, mu = NULL, sigma = NULL, paralist = NULL )
dat |
Numeric matrix |
n |
Integer. Number of observations (checked against |
p |
Integer. Number of variables (checked against |
g |
Integer. Number of components. |
pi |
Optional numeric vector length |
mu |
Optional numeric matrix |
sigma |
Optional numeric matrix |
paralist |
Optional list with elements |
Numeric matrix of posterior probabilities .
n <- 100; p <- 2; g <- 2 X <- matrix(rnorm(n*p), n, p) pi <- c(0.6, 0.4) mu <- cbind(c(0,0), c(1,1)) Sig <- array(0, dim = c(p,p,g)); Sig[,,1] <- diag(p); Sig[,,2] <- diag(p) tau <- get_clusterprobs(X, n, p, g, pi = pi, mu = mu, sigma = Sig) head(tau)n <- 100; p <- 2; g <- 2 X <- matrix(rnorm(n*p), n, p) pi <- c(0.6, 0.4) mu <- cbind(c(0,0), c(1,1)) Sig <- array(0, dim = c(p,p,g)); Sig[,,1] <- diag(p); Sig[,,2] <- diag(p) tau <- get_clusterprobs(X, n, p, g, pi = pi, mu = mu, sigma = Sig) head(tau)
Compute the Shannon entropy (in nats) of posterior membership probabilities
for each observation, given a Gaussian mixture model. Posterior probabilities
are obtained via get_clusterprobs().
get_entropy(dat, n, p, g, pi = NULL, mu = NULL, sigma = NULL, paralist = NULL)get_entropy(dat, n, p, g, pi = NULL, mu = NULL, sigma = NULL, paralist = NULL)
dat |
Numeric matrix |
n |
Integer. Number of rows in |
p |
Integer. Number of columns (features) in |
g |
Integer. Number of mixture components. |
pi |
Optional numeric vector length |
mu |
Optional numeric matrix |
sigma |
Optional numeric matrix |
paralist |
Optional list with elements |
Numeric vector of length , the entropy per observation (in nats).
# Suppose you have get_clusterprobs(), and a fitted/pi, mu, sigma: n <- 100; p <- 2; g <- 2 X <- matrix(rnorm(n * p), n, p) pi <- c(0.6, 0.4) mu <- cbind(c(0,0), c(1,1)) Sigma <- array(0, dim = c(p, p, g)) Sigma[,,1] <- diag(p); Sigma[,,2] <- diag(p) ent <- get_entropy(dat = X, n = n, p = p, g = g, pi = pi, mu = mu, sigma = Sigma) summary(ent)# Suppose you have get_clusterprobs(), and a fitted/pi, mu, sigma: n <- 100; p <- 2; g <- 2 X <- matrix(rnorm(n * p), n, p) pi <- c(0.6, 0.4) mu <- cbind(c(0,0), c(1,1)) Sigma <- array(0, dim = c(p, p, g)) Sigma[,,1] <- diag(p); Sigma[,,2] <- diag(p) ent <- get_entropy(dat = X, n = n, p = p, g = g, pi = pi, mu = mu, sigma = Sigma) summary(ent)
Builds initial estimates for a g-component Gaussian
mixture using only rows with observed labels in zm. Supports either a
shared covariance (ncov = 1) or class-specific covariances
(ncov = 2).
initialestimate(dat, zm, g, ncov = 2, ridge = 1e-06)initialestimate(dat, zm, g, ncov = 2, ridge = 1e-06)
dat |
A numeric matrix or data frame of features (n x p). |
zm |
Integer vector of length n with class labels in |
g |
Integer, number of mixture components. |
ncov |
Integer, |
ridge |
Numeric, small diagonal ridge added to covariance(s) for
numerical stability. Default |
If a class has zero or one labeled sample, its covariance is set to the global empirical covariance (from labeled data) with a small ridge. Class means for empty classes default to the global mean with a small jitter.
A list with
pi: length-g vector of mixing proportions (summing to 1).
mu: p x g matrix of class means (column i is ).
sigma: if ncov = 1, a p x p shared covariance matrix;
if ncov = 2, a p x p x g array of class-specific covariances.
set.seed(1) n <- 50; p <- 3; g <- 2 X <- matrix(rnorm(n*p), n, p) z <- sample(c(1:g, NA), n, replace = TRUE, prob = c(0.4, 0.4, 0.2)) init <- initialestimate(X, z, g, ncov = 2) str(init)set.seed(1) n <- 50; p <- 3; g <- 2 X <- matrix(rnorm(n*p), n, p) z <- sample(c(1:g, NA), n, replace = TRUE, prob = c(0.4, 0.4, 0.2)) init <- initialestimate(X, z, g, ncov = 2) str(init)
Computes in a numerically stable way
by subtracting the maximum value before exponentiation.
logsumexp(x)logsumexp(x)
x |
A numeric vector. |
A single numeric value: the log-sum-exp of x.
logsumexp(c(1000, 1001, 1002))logsumexp(c(1000, 1001, 1002))
Computes the negative log-likelihood for a semi-supervised Gaussian mixture
model under a mixed missingness mechanism (MCAR + entropy-based MAR).
Assumes a covariance matrix across all mixture
components.
neg_loglik(theta, Y, m_j, Z, d2_yj, xi, alpha_k, unpack_fn)neg_loglik(theta, Y, m_j, Z, d2_yj, xi, alpha_k, unpack_fn)
theta |
Numeric vector of packed model parameters to be unpacked by |
Y |
Numeric matrix of observations (n x p). |
m_j |
Integer or logical vector of length n indicating missingness: 0 for observed (labeled block), 1 for unlabeled/missingness block. |
Z |
Integer vector of length n with class labels for labeled samples
(1..g); use |
d2_yj |
Numeric vector of length n with the entropy-like score used in the MAR mechanism (e.g., posterior entropy or any scalar proxy). |
xi |
Numeric length-2 vector |
alpha_k |
Numeric scalar in (0,1), the MCAR mixing proportion in the missingness mechanism. |
unpack_fn |
Function that takes
|
The total log-likelihood is composed of three parts:
Labeled samples () with observed class labels .
Unlabeled samples attributed to MCAR with probability mass .
Unlabeled samples attributed to MAR with probability mass .
The MAR probability for each sample is .
Internally, the function uses a numerically stable logSumExp.
A single numeric value: the negative log-likelihood.
This implementation is for the equal covariance case (shared ).
# Minimal example (illustrative only): library(mvtnorm) set.seed(1) n <- 20; p <- 2; g <- 2 Y <- matrix(rnorm(n*p), n, p) Z <- sample(c(1:g, rep(NA, n - g)), n, replace = TRUE) m_j <- ifelse(is.na(Z), 1L, 0L) d2_yj <- runif(n) xi <- c(-1, 2) alpha_k <- 0.4 unpack_fn <- function(theta) { list(pi = c(0.6, 0.4), mu = list(c(0,0), c(1,1)), sigma = diag(p)) } theta <- numeric(1) # not used in this toy unpack_fn neg_loglik(theta, Y, m_j, Z, d2_yj, xi, alpha_k, unpack_fn)# Minimal example (illustrative only): library(mvtnorm) set.seed(1) n <- 20; p <- 2; g <- 2 Y <- matrix(rnorm(n*p), n, p) Z <- sample(c(1:g, rep(NA, n - g)), n, replace = TRUE) m_j <- ifelse(is.na(Z), 1L, 0L) d2_yj <- runif(n) xi <- c(-1, 2) alpha_k <- 0.4 unpack_fn <- function(theta) { list(pi = c(0.6, 0.4), mu = list(c(0,0), c(1,1)), sigma = diag(p)) } theta <- numeric(1) # not used in this toy unpack_fn neg_loglik(theta, Y, m_j, Z, d2_yj, xi, alpha_k, unpack_fn)
Converts log-probabilities into a probability distribution by exponentiating in a numerically stable way.
normalise_logprob(log_probs)normalise_logprob(log_probs)
log_probs |
A numeric vector of log-probabilities. |
A numeric vector summing to 1 (the normalised probabilities).
lp <- c(-1000, -999, -998) normalise_logprob(lp)lp <- c(-1000, -999, -998) normalise_logprob(lp)
Packs mixture weights, means, and covariance(s) into a single numeric vector. Uses the last component as the baseline for mixture weights (g-1 logits stored).
pack_theta(pi_k, mu_k, Sigma, g, p, ncov = 1)pack_theta(pi_k, mu_k, Sigma, g, p, ncov = 1)
pi_k |
Numeric vector of length g with mixture weights (positive, sum to 1). |
mu_k |
List of length g; each element a numeric vector of length p (component means). |
Sigma |
Covariance: if |
g |
Integer: number of components. |
p |
Integer: dimension. |
ncov |
Integer: covariance structure; 1 for shared covariance, 2 for class-specific. |
Numeric vector with parameters packed.
Generate i.i.d. samples from a finite Gaussian mixture with either a shared covariance matrix or component-specific covariance matrices.
rmix(n, pi, mu, sigma, seed_number)rmix(n, pi, mu, sigma, seed_number)
n |
Integer. Number of observations to generate. |
pi |
Numeric vector of length |
mu |
Numeric matrix |
sigma |
Either a numeric matrix |
seed_number |
Integer. Seed for reproducibility. |
A list with:
Y |
Numeric matrix |
Z |
Numeric matrix |
clust |
Integer vector |
set.seed(1) g <- 2; p <- 2 pi <- c(0.5, 0.5) mu <- cbind(c(1,0), c(-1,0)) Sigma <- diag(p) out <- rmix(500, pi, mu, Sigma, seed_number = 123) str(out)set.seed(1) g <- 2; p <- 2 pi <- c(0.5, 0.5) mu <- cbind(c(1,0), c(-1,0)) Sigma <- diag(p) out <- rmix(500, pi, mu, Sigma, seed_number = 123) str(out)
Simulate a Gaussian Mixture Dataset with a Mixed-Missingness Mechanism (MAR + MCAR)
simulate_mixed_missingness( n = 500, pi, mu, sigma, xi0 = 2, xi1 = 3, alpha = 0.1, seed_id = 123 )simulate_mixed_missingness( n = 500, pi, mu, sigma, xi0 = 2, xi1 = 3, alpha = 0.1, seed_id = 123 )
n |
Integer; sample size. |
pi |
Numeric vector; mixing proportions (sum to 1). |
mu |
Matrix (p x K); component means, columns = components. |
sigma |
Array (p x p x K); component covariance matrices. |
xi0 |
Numeric; MAR logit intercept. |
xi1 |
Numeric; MAR logit slope on entropy. |
alpha |
Numeric in |
seed_id |
Integer; seed passed to rmix() (your generator). |
Requires user-provided functions:
rmix(n, pi, mu, sigma, seed_number)
get_entropy(dat, n, p, g, paralist)
Missingness mechanism codes:
0 = fully observed
1 = MCAR
2 = MAR (entropy-based)
A list with:
data: data.frame with columns x1..xp, en, missing, label, truth
true_setup: list(pi, mu, sigma)
groups: list(mar_group, obs_group, mcar_in_mar, mcar_in_obs)
probs: vector prob_mar
raw: original rmix output dat augmented with en and labels
Unpacks mixture weights, means, and covariance(s) from a parameter vector.
unpack_theta(theta, g, p, ncov = 1)unpack_theta(theta, g, p, ncov = 1)
theta |
Numeric vector as returned by |
g |
Integer: number of components. |
p |
Integer: dimension. |
ncov |
Integer: covariance structure; 1 for shared covariance, 2 for class-specific. |
A list with:
Mixture weights (length g).
List of g mean vectors.
Shared covariance matrix (ncov=1) or list of g covariance matrices (ncov=2).