--- title: "A Tutorial on Mixtures of Quantile Regressions" subtitle: "Finding hidden subgroups and their effects, for applied researchers" author: "Kailas Venkitasubramanian, University of North Carolina at Charlotte" output: rmarkdown::html_vignette: toc: true toc_depth: 2 fig_caption: true bibliography: mixqr.bib link-citations: true vignette: > %\VignetteIndexEntry{A Tutorial on Mixtures of Quantile Regressions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, fig.width = 7, fig.height = 4.4, dpi = 150, fig.align = "center" ) set.seed(2026) ``` This tutorial is written for applied researchers across education, economics, sociology, political science, public health, and adjacent fields, who suspect that one regression line is hiding more than one story, and who care about more than the average. You do not need prior exposure to quantile regression or to mixture models. Each idea is introduced in words before any equation, every analysis step is shown in runnable code, and every result is visualized. By the end you will be able to: recognize when a *mixture of quantile regressions* is the right tool; fit one with **mixqr**; read and visualize the estimates; classify observations into the recovered subgroups; check whether the answer can be trusted; describe effects across the whole outcome distribution; and report the analysis reproducibly. This tutorial covers the **core** of the framework. **mixqr** is built on a single EM platform with an engine/extension contract, so the same machinery also supports expectile and M-quantile component families, component-specific penalized selection, and non-crossing multi-quantile estimation (see the *Extensions* article), while the companion package **mixqrgate** adds location-varying gating. ```{r libs} library(mixqr) library(ggplot2) library(quantreg) ``` ```{r theme, include = FALSE} # A clean, consistent look used throughout. theme_mixqr <- function() { theme_minimal(base_size = 12) + theme(panel.grid.minor = element_blank(), plot.title = element_text(face = "bold"), plot.subtitle = element_text(colour = "grey35"), legend.position = "top") } pal_seq <- c("#1b6ca8", "#e07b39", "#5aae61", "#9b59b6") ``` # 1. The problem: one line, two stories Most regression models a single relationship: "a one-unit change in *x* moves *y* by β." But populations are often **mixtures of unobserved subgroups**, each with its own relationship. A few examples from the social sciences: * **Schools.** The link between family socioeconomic status (SES) and a student's test score is steep in some schools and shallow in others, and you may not have a clean label for "which kind of school." * **Labor markets.** Returns to education differ between a primary, credentialed segment and a secondary, casualized one. * **Program evaluation.** A policy helps a "responder" subgroup and barely moves a "non-responder" subgroup. The average effect describes neither. When you fit one line to a mixture, you get a number that averages away the groups and mirrors none of them. A *finite mixture of regressions* instead assumes the data come from a small number of latent groups (called *components* or *regimes*) and estimates a separate regression in each, discovering the grouping and the group-specific effects *jointly*, without a pre-existing label [@mclachlanpeel2000]. `mixqr` does this for **quantile** regressions rather than mean regressions. That buys two things, explained next. # 2. Why quantiles, and why a mixture **Why quantiles.** A quantile regression at level $\tau$ models the conditional $\tau$-th quantile of the outcome (the median at $\tau = 0.5$, a lower tail at $\tau = 0.1$, an upper tail at $\tau = 0.9$) instead of the mean [@koenker1978; @koenker2005]. The median is far more robust to skew and outliers than the mean, and fitting several quantiles reveals effects that a mean model cannot: a predictor may lift the bottom of the distribution while leaving the top alone. **Why a mixture.** Within each latent regime, `mixqr` fits a quantile regression; across regimes, a *mixing probability* says how prevalent each regime is. Formally, with a latent class $Z \in \{1, \dots, m\}$ and $\Pr(Z = j) = \pi_j$, the conditional $\tau$-th quantile in regime $j$ is $$ Q_\tau(Y \mid x, Z = j) = x'\beta_j , $$ and the components are estimated by an EM algorithm so that the grouping and the $\beta_j$ are found together [@wuyao2016]. **Why not simpler tools?** | Approach | What it misses | |---|---| | OLS / single mean regression | averages the regimes away; sensitive to skew | | A single quantile regression | robust and distributional, but still **one** line | | Cluster first (k-means), then regress | two-stage; clusters ignore the *relationship*, and the second-stage SEs ignore the clustering uncertainty | | `mixqr` | finds regimes **and** their quantile-specific effects jointly, with calibrated uncertainty | # 3. An illustrative dataset To keep everything reproducible and transparent, we *simulate* a student- achievement example with a known structure, then let `mixqr` recover it. (You would use your own data; the workflow is identical.) We generate 600 students. Each belongs to one of two unobserved school regimes: * **Supportive schools** (60% of students): a high baseline and a strong SES gradient. * **Under-resourced schools** (40%): a lower baseline and a weaker gradient. Two features make this a job for quantile methods. Scores are **right-skewed** (a few unusually high achievers), so the median is a steadier summary than the mean. Within each regime the **spread of scores grows with SES**, so SES affects the top of the distribution differently from the bottom, a pattern a mean model cannot express. ```{r simulate} simulate_schools <- function(n = 600, pi_A = 0.6) { ses <- rnorm(n) # standardized SES regime <- ifelse(runif(n) < pi_A, "A", "B") err <- exp(rnorm(n, 0, 0.45)) - 1 # right-skewed, median 0 spread <- ifelse(regime == "A", 6, 9) * exp(0.30 * ses) # grows with SES score <- ifelse(regime == "A", 56 + 8.0 * ses, # supportive schools 46 + 3.0 * ses) + spread * err data.frame(score = score, ses = ses, regime = factor(regime)) } schools <- simulate_schools() head(schools) ``` The raw scatter already hints at two fans of points, but a single median line (dashed) splits the difference and represents neither regime: ```{r raw, fig.height = 4.2, fig.alt = "Scatter of score against SES with a single pooled median line."} single <- rq(score ~ ses, tau = 0.5, data = schools) ggplot(schools, aes(ses, score)) + geom_point(alpha = 0.35, colour = "grey30") + geom_abline(intercept = coef(single)[1], slope = coef(single)[2], linewidth = 1.1, linetype = "dashed") + labs(x = "Family SES (standardized)", y = "Math score", title = "One median line hides the subgroups", subtitle = "A single quantile regression averages two regimes together") + theme_mixqr() ``` # 4. Fitting your first model We fit a two-component mixture of **median** regressions. We ask for stochastic-EM standard errors (the recommended, calibrated method; see Section 8) and several restarts because the mixture likelihood is multimodal. ```{r fit} fit <- mixqr(score ~ ses, data = schools, tau = 0.5, m = 2, nstart = 25, variance = "stochEM", vcontrol = mixqr_vcontrol(B = 200, seed = 1)) fit ``` Read the output as two regressions plus the mixing probabilities. `summary()` adds standard errors, $z$-values, and the diagnostics discussed later: ```{r summary} summary(fit) ``` The estimated components recover the design: one regime with a high intercept and a steep SES slope (the supportive schools), and one with a lower intercept and a shallower slope (the under-resourced schools). Components are ordered by slope, so component 1 is always the shallower-gradient regime. # 5. Visualizing the estimates ## 5.1 The two regimes The single most useful picture overlays the component median lines on the data, colored by each student's most likely regime. ```{r est-plot, fig.alt = "Score against SES, points coloured by recovered regime, with the two component median lines."} # Components are ordered by SES slope, so component 1 is the shallower-gradient # (under-resourced) regime and component 2 the steeper (supportive) one. comp_label <- c("Under-resourced schools", "Supportive schools") pal2 <- c("Under-resourced schools" = "#e07b39", "Supportive schools" = "#1b6ca8") schools$regime_hat <- factor(comp_label[predict(fit, type = "class")], levels = comp_label) xseq <- seq(min(schools$ses), max(schools$ses), length.out = 100) lines_df <- do.call(rbind, lapply(1:2, function(j) { data.frame(ses = xseq, score = as.numeric(cbind(1, xseq) %*% fit$beta[, j]), regime = factor(comp_label[j], levels = comp_label)) })) ggplot(schools, aes(ses, score)) + geom_point(aes(colour = regime_hat), alpha = 0.45, size = 1.8) + geom_line(data = lines_df, aes(colour = regime), linewidth = 1.3) + scale_colour_manual(values = pal2, name = "Recovered regime") + labs(x = "Family SES (standardized)", y = "Math score", title = "Two SES–achievement gradients, recovered jointly", subtitle = "Component median regressions from mixqr") + theme_mixqr() ``` ## 5.2 A coefficient comparison For reporting, a coefficient plot with confidence intervals communicates the substantive contrast between regimes at a glance. We pull the component coefficients and their intervals straight from `confint()`. ```{r coef-plot, fig.height = 3.6, fig.alt = "Coefficient estimates with 95% confidence intervals for each regime, by term."} ci <- confint(fit) # Wald intervals from stochastic-EM SEs keep <- grep("^comp", rownames(ci)) coef_df <- data.frame( label = rownames(ci)[keep], est = c(as.numeric(fit$beta)), lo = ci[keep, 1], hi = ci[keep, 2] ) comp_idx <- as.integer(sub("comp(\\d+):.*", "\\1", coef_df$label)) coef_df$regime <- factor(comp_label[comp_idx], levels = comp_label) coef_df$term <- sub("comp\\d+:", "", coef_df$label) coef_df$term <- factor(coef_df$term, labels = c("Intercept", "SES slope")) ggplot(coef_df, aes(est, regime, colour = regime)) + geom_pointrange(aes(xmin = lo, xmax = hi), linewidth = 0.9, size = 0.7) + facet_wrap(~ term, scales = "free_x") + scale_colour_manual(values = pal2, guide = "none") + labs(x = "Estimate (95% CI)", y = NULL, title = "Regime-specific effects with uncertainty", subtitle = "The SES gradient is far steeper in the supportive regime") + theme_mixqr() ``` # 6. Who is in which group? Mixture models give **soft** memberships: each observation has a posterior probability of belonging to each regime. Hard labels (`type = "class"`) take the most likely regime, but the probabilities themselves are informative: many points are assigned with near-certainty, a few are ambiguous. ```{r posterior, fig.height = 3.6, fig.alt = "Histogram of the probability of belonging to regime A."} post <- predict(fit, type = "posterior") schools$p_supportive <- post[, 2] # component 2 = supportive (steeper) regime ggplot(schools, aes(p_supportive)) + geom_histogram(bins = 30, fill = "#1b6ca8", colour = "white") + labs(x = "Posterior probability of the supportive regime", y = "Students", title = "Most students are classified with confidence", subtitle = "Mass near 0 and 1 means confident assignments") + theme_mixqr() ``` Here about `r round(100 * mean(apply(post, 1, max) > 0.9))`% of students are assigned to a regime with probability above 0.9. Only a few sit in the ambiguous middle. The package summarizes the separation in one number, the **responsibility overlap** (0 = perfectly separated, 1 = fully overlapping), moderate in this example: ```{r overlap} fit$diagnostics$overlap ``` # 7. Diagnostics: can you trust it? Two checks and a caveat matter for a mixture of quantile regressions. ## 7.1 The component error densities `mixqr`'s semiparametric engine estimates each regime's error distribution non-parametrically, constrained so its $\tau$-th quantile is exactly zero [@hallpresnell1999]. Plotting these densities shows the shape of each regime's residuals and confirms the constraint (the mass below 0 equals $\tau$). ```{r densities, fig.height = 3.8, fig.alt = "Estimated component error densities with the constraint marker at zero."} fit_kd <- mixqr(score ~ ses, data = schools, tau = 0.5, m = 2, engine = "kdEM", init = "ald", nstart = 15) rng <- range(vapply(fit_kd$density, function(d) range(d$grid), numeric(2))) tt <- seq(rng[1], rng[2], length.out = 400) dens_df <- do.call(rbind, lapply(1:2, function(j) { data.frame(resid = tt, density = fit_kd$density[[j]]$eval(tt), regime = factor(comp_label[j], levels = comp_label)) })) ggplot(dens_df, aes(resid, density, colour = regime)) + geom_line(linewidth = 1.1) + geom_vline(xintercept = 0, linetype = 3, colour = "grey40") + scale_colour_manual(values = pal2, name = "Regime") + labs(x = "Residual", y = "Density", title = "Each regime has its own (skewed) error distribution", subtitle = "Estimated by the Wu & Yao kernel-density EM; tau-quantile fixed at 0") + theme_mixqr() ``` ## 7.2 Two engines agree The fast `"ald"` engine assumes a parametric (asymmetric Laplace) error; the `"kdEM"` engine estimates the error shape. When they give similar component lines, you can trust the parametric default; when they diverge, the data have structure the parametric model misses. Here they agree closely. ```{r engines} rbind(ALD = as.numeric(fit$beta), kdEM = as.numeric(fit_kd$beta)) ``` ## 7.3 A caveat worth knowing The semiparametric estimator can be **biased when the regimes overlap heavily and the errors are strongly asymmetric** (a property of the estimating equations, not a bug; see @wuyao2016, sec. 6). The overlap diagnostic above is your early warning, and cross-checking the two engines is the practical guard. With well-separated regimes (the common, useful case), the estimates are reliable. # 8. Inference done right `mixqr` offers two standard-error methods: * `variance = "sparsity"` — fast, but **conditional on the classification**: it treats group membership as known and therefore *understates* uncertainty. `summary()` flags this. * `variance = "stochEM"` — a stochastic-EM multiple-imputation estimator [@wuyao2016, Alg. 3.1] that propagates the classification and mixing uncertainty. Use this method for inference. A Monte-Carlo benchmark (see the [Validation article](https://kvenkita.github.io/mixqr/articles/mixqr-validation.html)) shows its intervals reach roughly nominal 95% coverage for the regression coefficients. ```{r confint} confint(fit) ``` One caveat: the **mixing-probability** estimate carries a finite-sample bias, so its interval is correctly sized but can sit off-center. Here the true split is 60/40, yet the estimate is about `r round(100 * fit$pi[2])`/`r round(100 * fit$pi[1])` — a live reminder to report $\pi$ as an approximate split, not to the decimal. # 9. Beyond the median: the whole distribution The real payoff of *quantile* mixtures is describing each regime across the outcome distribution, beyond its median. We refit at $\tau = 0.1, 0.5, 0.9$ and draw each regime's conditional-quantile lines. (Components are slope-ordered, so they align across $\tau$.) ```{r multitau, fig.height = 4.6, fig.alt = "Conditional quantile lines at tau = 0.1, 0.5, 0.9 for each regime."} taus <- c(0.1, 0.5, 0.9) multi <- do.call(rbind, lapply(taus, function(tt) { f <- mixqr(score ~ ses, data = schools, tau = tt, m = 2, nstart = 15) do.call(rbind, lapply(1:2, function(j) { data.frame(ses = xseq, score = as.numeric(cbind(1, xseq) %*% f$beta[, j]), regime = factor(comp_label[j], levels = comp_label), tau = tt) })) })) ggplot(schools, aes(ses, score)) + geom_point(alpha = 0.18, colour = "grey45", size = 1.2) + geom_line(data = multi, aes(colour = regime, group = interaction(regime, tau), linewidth = factor(tau))) + scale_colour_manual(values = pal2, name = "Regime") + scale_linewidth_manual(values = c("0.1" = 0.5, "0.5" = 1.2, "0.9" = 0.5), name = "Quantile") + labs(x = "Family SES (standardized)", y = "Math score", title = "Conditional quantile bands by regime", subtitle = "Wider bands at high SES reveal growing within-regime inequality") + theme_mixqr() ``` The vertical spread between a regime's 10th- and 90th-percentile lines is its *within-regime inequality* at a given SES. Where that gap widens with SES, the top and bottom of the regime pull apart — a distributional finding invisible to a mean model. (Fitting several $\tau$ separately can in principle let a regime's quantile lines cross; jointly enforcing non-crossing is the role of a companion extension to `mixqr`.) # 10. How many groups? We assumed two regimes. In practice you choose `m`. `mixqr_select()` compares component counts. Cross-validated predictive log-likelihood (`"cv"`) is engine- agnostic and penalizes complexity; `AIC`/`BIC` are available too (with the usual caveat that mixture-component counts sit on a parameter-space boundary). ```{r select, fig.height = 3.6, fig.alt = "Cross-validated score against number of components."} sel <- mixqr_select(score ~ ses, data = schools, tau = 0.5, m = 1:4, criterion = "cv", folds = 5, nstart = 8, control = mixqr_control(seed = 1)) sel$table ggplot(sel$table, aes(m, cv_negloglik)) + geom_line(colour = "#1b6ca8") + geom_point(size = 3, colour = "#1b6ca8") + geom_point(data = sel$table[sel$table$m == 2, ], size = 5, shape = 21, fill = "#e07b39", colour = "#e07b39") + labs(x = "Number of components (m)", y = "CV held-out negative log-likelihood", title = "Choosing the number of regimes", subtitle = "Lower is better; the elbow at two regimes is the parsimonious choice") + theme_mixqr() ``` The sharp drop from one to two components, then the near-flat stretch beyond, is the signature of two genuine regimes. The strict cross-validation minimum here falls at m = `r sel$best` (`r round(min(sel$table$cv_negloglik))` versus `r round(sel$table$cv_negloglik[2])` at m = 2 — a rounding-error apart, while m = 1 is far worse). Read the **elbow**, not the bare minimum: two regimes is the parsimonious, defensible choice. # 11. Reporting and reproducibility A complete, reproducible write-up needs little: * the quantile level $\tau$, the number of components `m` and how it was chosen, the engine, and the number of restarts; * component coefficients with **stochastic-EM** intervals (Section 8); * the overlap / separability diagnostic (Section 6) and, ideally, the two-engine cross-check (Section 7); * a seed. ```{r report, eval = FALSE} # everything needed to reproduce the headline fit fit <- mixqr(score ~ ses, data = schools, tau = 0.5, m = 2, nstart = 25, variance = "stochEM", vcontrol = mixqr_vcontrol(B = 200, seed = 1)) summary(fit) ``` That is the whole arc. One model call turned "one line, two stories" into two regimes, their quantile-specific effects, calibrated intervals, and diagnostics that say when to trust them. The same few lines of code scale from this teaching example to your own data. # Citation If `mixqr` supports your research, please cite the package and the underlying method: ```{r cite, eval = FALSE} citation("mixqr") ``` > Venkitasubramanian, K. (2026). *mixqr: Finite Mixtures of Quantile > Regressions*. R package version 0.1.0.9000. > > > Wu, Q. & Yao, W. (2016). Mixtures of quantile regressions. *Computational > Statistics & Data Analysis*, 93, 162–176. # References