---
title: "Introduction to rrMixture"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to rrMixture}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\renewcommand{\vec}[1]{\mathbf{#1}}
---

```{css, echo = FALSE, eval = FALSE}
body .main-container {
  max-width: 1280px !important;
  width: 1280px !important;
}
body {
  max-width: 1280px !important;
}
```

```{r, include = FALSE, eval = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, message = FALSE, include = FALSE}
options(width = 200)
```

## Background

Consider the following multivariate linear regression model, where we want to study the relationship between multiple responses and a set of predictors:
\[
Y = XB + E,
\]
where $Y$ is an $n \times q$ response matrix, $X$ is an $n \times p$ design matrix, $B$ is a $p \times q$ coefficient matrix, and $E$ is an $n \times q$ error matrix. To explore the relationship between the $p$ predictors and the $q$ responses, our interest is in estimating the coefficient matrix $B$. Traditionally, it can be estimated by
\[
\hat{B}_L = \arg\min_{B} \|Y - XB\|^2_F = (X^\top X)^{-} X^\top Y,
\]
where $\|\cdot\|_F$ denotes the Frobenius norm and $(\cdot)^{-}$ denotes a Moore-Penrose inverse. This approach is simple and fast, but it has some limitations:

* It ignores the joint structure of the multivariate response.
* It may not be feasible with high dimensionality.
* It assumes a full-rank structure of the coefficient matrix.
* It assumes homogeneity of the data, i.e., that the $n$ observations come from a single group.

To overcome these drawbacks, `rrMixture` has been developed to fit reduced-rank mixture models, which *simultaneously* take into account the *joint structure of the multivariate response*, a possible *low-rank structure* of the coefficient matrices, and *heterogeneity of the data*.

## Reduced-rank mixture models in multivariate regression

Suppose the $n$ observations belong to $K$ different subpopulations, implying heterogeneity of the data. Under the Gaussian assumption, the marginal distribution of $\vec{y}_i$ can be written as
\begin{equation}\label{mixmodel}
\vec{y}_i|\vec{x}_i \sim \sum_{k=1}^K \pi_k N_q(B_k^\top\vec{x}_i,\Sigma_k)
\end{equation}
for $i=1,\cdots,n$. Accordingly, the objective function to maximize is the log-likelihood of a $K$-component finite Gaussian mixture model, i.e.,
\begin{equation*}
\mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^n \log \left[ \sum_{k=1}^K \pi_k \phi_q(\vec{y}_i;B_k^\top\vec{x}_i,\Sigma_k) \right],~~~\vec{y}_i \in \mathbb{R}^q,
\end{equation*}
where $\boldsymbol{\theta}=(\pi_1,B_1,\Sigma_1,\cdots,\pi_K,B_K,\Sigma_K)^\top$ is the collection of unknown parameters with $\sum_{k=1}^K \pi_k=1$, and $\phi_q(\cdot;\mu,\Sigma)$ denotes the density function of $N_q(\mu,\Sigma)$. Under a reduced-rank regression framework, our objective function is the penalized log-likelihood
\begin{equation}\label{F}
\mathcal{F}(\boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta}) - \mathcal{P}_{\boldsymbol{\lambda}}(\boldsymbol{\theta}),
\end{equation}
where $\boldsymbol{\lambda}=(\lambda_1,\cdots,\lambda_K)^\top$ is the tuning parameter.
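To make the objective concrete, the following toy sketch evaluates the unpenalized mixture log-likelihood $\mathcal{L}(\boldsymbol{\theta})$ for a small two-component example with known parameters. This is only an illustration and is not part of `rrMixture`; it assumes the `mvtnorm` package is available for the multivariate normal density.

```{r, eval = FALSE}
# Toy sketch (not part of rrMixture): evaluate the K-component mixture
# log-likelihood L(theta) on simulated data with known parameters.
library(mvtnorm)
set.seed(1)
n <- 50; p <- 3; q <- 2; K <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
B     <- list(matrix( 1, p, q), matrix(-1, p, q))  # component coefficient matrices
Sigma <- list(diag(q), 2 * diag(q))                # component covariance matrices
pi.k  <- c(0.6, 0.4)                               # mixing proportions

# Simulate responses from the mixture model
z <- sample(1:K, n, replace = TRUE, prob = pi.k)
Y <- t(sapply(1:n, function(i)
  rmvnorm(1, mean = drop(X[i, ] %*% B[[z[i]]]), sigma = Sigma[[z[i]]])))

# L(theta) = sum_i log( sum_k pi_k * phi_q(y_i; B_k' x_i, Sigma_k) )
loglik <- sum(sapply(1:n, function(i) {
  log(sum(sapply(1:K, function(k)
    pi.k[k] * dmvnorm(Y[i, ], mean = drop(X[i, ] %*% B[[k]]), sigma = Sigma[[k]]))))
}))
loglik
```

The EM algorithm implemented in `rrMixture` maximizes the penalized version $\mathcal{F}(\boldsymbol{\theta})$ of this quantity.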
Two penalties are considered for $\mathcal{P}_{\boldsymbol{\lambda}}(\boldsymbol{\theta})$. The rank penalty is
\begin{equation}\label{rp}
\mathcal{P}_{\boldsymbol{\lambda}}(\boldsymbol{\theta}) = \frac{1}{2}\sum_{k=1}^K \lambda_k^2 \text{rank}(B_k)
\end{equation}
and the adaptive nuclear norm penalty is
\begin{equation}\label{annp}
\mathcal{P}_{\boldsymbol{\lambda}}(\boldsymbol{\theta}) = \sum_{k=1}^K \lambda_k ||XB_k||_{*w},
\end{equation}
where $||\cdot||_{*w}=\sum_{i=1}^{n \wedge q} w_i d_i(\cdot)$, $n \wedge q = \min(n,q)$, $w_i = d_i^{-\gamma}(X\hat{B}_L)$ following Zou (2006), $d_i(\cdot)$ denotes the $i$th largest singular value of a matrix, and $\gamma$ is a nonnegative constant. The rank penalized and adaptive nuclear norm penalized estimators were originally developed for non-mixture models by Bunea et al. (2011) and Chen et al. (2013), respectively, and are extended to mixture models in Kang et al. (2022+).

## Introduction to `rrMixture`

`rrMixture` is an R package for fitting reduced-rank mixture models in multivariate regression via an iterative algorithm. It allows users to

* find an optimal number of mixture components $K$ for a given data set;
* estimate the parameter $\boldsymbol{\theta}=(\pi_1,B_1,\Sigma_1,\cdots,\pi_K,B_K,\Sigma_K)^\top$ that maximizes $\mathcal{F}(\boldsymbol{\theta})$;
* find the best tuning parameter(s) $\boldsymbol{\lambda}=(\lambda_1,\cdots,\lambda_K)^\top$ and/or $\gamma$; and
* determine the ranks of the $K$ coefficient matrices.

For simplicity, we assume $\lambda_1=\cdots=\lambda_K=\lambda$ and try to find an optimal value of $\lambda$. As of version 0.1-2, the available methods are

* `FR`: full-ranked estimation with $\mathcal{P}_{\boldsymbol{\lambda}}(\boldsymbol{\theta})=0$;
* `RP`: rank penalized estimation with $\mathcal{P}_{\boldsymbol{\lambda}}(\boldsymbol{\theta}) = \frac{1}{2}\lambda^2\sum_{k=1}^K \text{rank}(B_k)$;
* `ANNP`: adaptive nuclear norm penalized estimation with $\mathcal{P}_{\boldsymbol{\lambda}}(\boldsymbol{\theta}) = \lambda \sum_{k=1}^K ||XB_k||_{*w}$.

This document shows how to make use of the `rrMixture` functionality. `rrMixture` also provides a set of tools to summarize and visualize the estimation results.

## Getting started: Installation

`rrMixture` is available from CRAN. Install the package and load the library:

```{r setup, eval = FALSE, message = FALSE}
install.packages("rrMixture")
library(rrMixture)
```

```{r, message = FALSE, echo = FALSE}
library(rrMixture)
```

## A real data example: `tuna` data

To describe the usage of `rrMixture`, we consider a real data set available in the R package `bayesm`. It contains the volume of weekly sales (`Move`) for seven of the top 10 U.S. brands in the canned tuna product category for $n=338$ weeks between September 1989 and May 1997, along with a measure of display activity (`Nsale`) and the log price of each brand (`Lprice`). See Chevalier et al. (2003) for details. The goal is to study the effect of prices and promotional activities on sales for these seven products. We therefore construct the $X$ and $Y$ matrices below, with $n=338$, $q=7$, and $p=15$ (an intercept plus the seven `Nsale` and seven `Lprice` variables). Try `?tuna` for details.
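Before constructing $X$ and $Y$, it can be helpful to take a quick look at the raw data. This is a minimal sketch; it only assumes that `bayesm` is installed, and its output is not shown here.

```{r, eval = FALSE}
# Quick look at the raw tuna data from the bayesm package
library(bayesm)
data(tuna)
dim(tuna)        # number of weeks and number of variables
colnames(tuna)   # variable names used below include MOVE*, NSALE*, and LPRICE*
```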
```{r}
# Load and pre-process the data set
library(bayesm)
data(tuna)
tunaY <- log(tuna[, c("MOVE1", "MOVE2", "MOVE3", "MOVE4", "MOVE5", "MOVE6", "MOVE7")])
tunaX <- tuna[, c("NSALE1", "NSALE2", "NSALE3", "NSALE4", "NSALE5", "NSALE6", "NSALE7",
                  "LPRICE1", "LPRICE2", "LPRICE3", "LPRICE4", "LPRICE5", "LPRICE6", "LPRICE7")]
tunaX <- cbind(intercept = 1, tunaX)
```

## Parameter estimation using `rrmix()`

The `rrmix()` function estimates $\boldsymbol{\theta}$ via an EM algorithm using the full-ranked, rank penalized, or adaptive nuclear norm penalized mixture model *when $K$, $\lambda$, and $\gamma$ are given*; later we discuss another function for finding optimal values of these parameters. For `rrmix()`, we set several arguments:

* `K`: number of mixture components.
* `X` and `Y`: the $X$ and $Y$ matrices of the data.
* `est`: estimation method to use, either `"FR"` (full-ranked), `"RP"` (rank penalized), or `"ANNP"` (adaptive nuclear norm penalized).
* `lambda` and `gamma`: tuning parameter(s) to set. `lambda` is only used for `"RP"` and `"ANNP"`, and `gamma` is only used for `"ANNP"`.

To run the EM algorithm, we need starting values of $\boldsymbol{\theta}$, which can be controlled by the following arguments:

* `n.init`: number of initializations to try. The final estimate of $\boldsymbol{\theta}$ depends on the starting values, so we generate starting values `n.init` times and use the set with the largest log-likelihood as the final starting values. To assign an initial mixture membership to each observation, two methods are used: K-means and random clustering.
* `seed` (optional): seed number for reproducibility of the starting values.

An example with the `tuna` data:

```{r, cache = TRUE}
# Parameter estimation with `rrmix()` using the rank penalized method
tuna.mod <- rrmix(K = 2, X = tunaX, Y = tunaY, est = "RP",
                  lambda = 3, seed = 100, n.init = 100)
```

We can obtain the estimated parameters, $\hat{\boldsymbol{\theta}}=(\hat\pi_1,\hat{B}_1,\hat\Sigma_1,\cdots,\hat\pi_K,\hat{B}_K,\hat\Sigma_K)^\top$, with the command `tuna.mod$para`.

```{r}
# Estimated parameters
tuna.mod$para
```

Besides the estimated parameter $\hat{\boldsymbol{\theta}}$, one of the fundamental tasks in reduced-rank estimation is *rank determination*. The estimated ranks of the $K$ coefficient matrices can be found with `tuna.mod$est.rank` as follows. In this case, with $\lambda = 3$, both coefficient matrices turn out to have full rank, since the full rank is $\min(p,q)=7$.

```{r}
# Estimated ranks of coefficient matrices
tuna.mod$est.rank
```

The package provides `summary()` and `plot()` methods for `rrmix` objects. The `summary()` function gives summarized information about the fitted model. We can see that the EM algorithm terminated after 57 iterations and that the BIC of the fitted model is 3115.056.

```{r}
# Summarize the estimation results
summary(tuna.mod)
```

The `plot()` function shows the log-likelihood and penalized log-likelihood values at each iteration, as seen below. The ascent property of the penalized log-likelihood can be observed.

```{r, fig.width = 7, fig.height = 5}
# Visualize the estimation results
plot(tuna.mod)
```

## Parameter tuning with `tune.rrmix()`

As shown above, the `rrmix()` function estimates parameters with fixed $K$ (number of mixture components) and fixed $\lambda$ and $\gamma$ (tuning parameters). In practice, choosing proper values of $K$, $\lambda$, and $\gamma$ is important.
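Before automating the search, a small manual experiment can build intuition: refit the rank penalized model for a few values of $\lambda$ and see how the estimated ranks react. The sketch below is only illustrative; the candidate values are arbitrary, `n.init` is reduced to keep the run time down, and the chunk is not evaluated here.

```{r, eval = FALSE}
# Manual sketch: sensitivity of the estimated ranks to lambda
lambdas <- c(1, 3, 6, 10)
ranks <- sapply(lambdas, function(l) {
  fit <- rrmix(K = 2, X = tunaX, Y = tunaY, est = "RP",
               lambda = l, seed = 100, n.init = 10)
  fit$est.rank
})
colnames(ranks) <- paste0("lambda = ", lambdas)
ranks
```

Larger values of $\lambda$ penalize rank more heavily and therefore tend to produce lower estimated ranks.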
A systematic search over these parameters can be carried out with the `tune.rrmix()` function, which performs a grid search over user-specified parameter ranges. Note that the full-ranked method (`"FR"`) has no tuning parameters, the rank penalized method (`"RP"`) has one tuning parameter, $\lambda$, and the adaptive nuclear norm penalized method (`"ANNP"`) has two tuning parameters, $\lambda$ and $\gamma$.

Suppose one wants to tune $\lambda$ for the rank penalized method with fixed $K=2$. We can try the following with a set of candidate values of $\lambda$:

```{r, cache = TRUE, results = 'hide'}
# Parameter tuning with `tune.rrmix()` using the rank penalized method
tuna.tune1 <- tune.rrmix(K = 2, X = tunaX, Y = tunaY, est = "RP",
                         lambda = exp(seq(0, log(20), length = 20)),
                         seed = 100, n.init = 100)
```

The `tune.rrmix` object also has `summary()` and `plot()` methods. Both find the tuning parameter that gives the best performance metric (here, the smallest BIC) among the given set of candidate values.

```{r}
# Summarize the results
summary(tuna.tune1)
```

When only one parameter is tuned with the other parameters fixed, the `plot()` function puts that parameter on the x-axis and the performance metric on the y-axis. Since we used a grid of 20 $\lambda$ values equally spaced on the *log* scale, we can set `transform.x = log` for a better illustration.

```{r, fig.width = 7, fig.height = 5}
# Visualize the results
plot(tuna.tune1, transform.x = log, xlab = "log(lambda)")
```

Now suppose one wants to tune both $K$ and $\lambda$. We can try the following command, setting `K.max` instead of `K`. If `K.max` is 3, the `tune.rrmix()` function tries all cases where $K$ is 1 through 3. Note that $K=1$ corresponds to a non-mixture model, so `tune.rrmix()` can automatically examine whether assuming heterogeneity of the data is appropriate for a given data set.

```{r, cache = TRUE, results = 'hide'}
# Parameter tuning with `tune.rrmix()` using the rank penalized method
tuna.tune2 <- tune.rrmix(K.max = 3, X = tunaX, Y = tunaY, est = "RP",
                         lambda = exp(seq(0, log(20), length = 20)),
                         seed = 100, n.init = 100)
```

```{r}
# Summarize the results
summary(tuna.tune2)
```

When both $K$ and $\lambda$ are tuned, the `plot()` function draws a contour plot of the performance metric (BIC) with $K$ and $\lambda$ on the x- and y-axes, respectively.

```{r, fig.width = 7, fig.height = 5}
# Visualize the results
plot(tuna.tune2, transform.y = log, ylab = "log(lambda)")
```

As we can see above, the best model is the one with $K=2$ and $\lambda = 8.728963$, implying that the data are heterogeneous and consist of two subpopulations. Note also that, if candidate values of $\lambda$ are not pre-specified, i.e., if `lambda = NULL` in `tune.rrmix()`, a data-adaptive range of $\lambda$ is determined internally. In that case, users can set the argument `n.lambda`, which specifies the number of $\lambda$ values to explore within the range. Furthermore, if candidate values of $\gamma$ are not specified with `est = "ANNP"` in `tune.rrmix()`, `gamma = 2` is used by default since it generally showed good performance in Chen et al. (2013).

## Interpretation of final model

Let's see how we can interpret the final model determined by `tune.rrmix()`. As mentioned earlier, $\hat{\boldsymbol{\theta}}=(\hat\pi_1,\hat{B}_1,\hat\Sigma_1,\cdots,\hat\pi_K,\hat{B}_K,\hat\Sigma_K)^\top$ can be obtained with `best.mod$para`.
```{r}
# The final model
best.K <- summary(tuna.tune2)$best.K
best.lambda <- summary(tuna.tune2)$best.lambda
best.mod <- rrmix(K = best.K, X = tunaX, Y = tunaY, est = "RP",
                  lambda = best.lambda, seed = 100, n.init = 100)
summary(best.mod)
best.mod$para
```

The estimated ranks of $\hat{B}_1$ and $\hat{B}_2$ are 7 and 4, respectively, as shown below. In this case the first coefficient matrix turns out to have full rank, while the second has a low-rank structure, since the full rank is $\min(p,q)=7$.

```{r}
# Estimated ranks of coefficient matrices
best.mod$est.rank
```

We can also obtain information on the mixture component membership of each observation using the following commands. The final model suggests that, among the $n=338$ observations, 299 belong to the first mixture component and the remaining 39 belong to the second.

```{r}
# Membership information of mixture components
best.mod$ind
best.mod$n.est
```

## Comparison with the existing R package `rrpack`

The package `rrpack` allows users to apply a variety of reduced-rank estimation methods in multivariate regression. For example, the `rrr()` function with the argument `penaltySVD = "ann"` provides a solution of the adaptive nuclear norm penalized estimator (Chen et al., 2013). One difference between the `rrr()` function from the `rrpack` package and the `rrmix()` function from the `rrMixture` package is that `rrmix()` incorporates the idea of mixture models. Therefore, when the number of mixture components is $K=1$, `rrmix()` produces the same solution as `rrr()` with the same tuning parameter(s). Let's see an example using the `tuna` data.

```{r, message = FALSE}
# rrpack package
require(rrpack)
rfit <- rrr(Y = as.matrix(tunaY), X = as.matrix(tunaX),
            modstr = list(lambda = 3, gamma = 2), penaltySVD = "ann")
# estimated coefficient matrix
coef(rfit)
# estimated rank of the coefficient matrix
rfit$rank
```

```{r}
# rrMixture package
rfit2 <- rrmix(K = 1, Y = tunaY, X = tunaX, lambda = 3, gamma = 2, est = "ANNP")
# estimated coefficient matrix
rfit2$para[[1]]$B
# estimated rank of the coefficient matrix
rfit2$est.rank
```

We can see that the two sets of estimation results are consistent.

## Wrapping up

This document illustrates the usage of the `rrMixture` package. For illustrative purposes, some but not all functions are described. For details of the estimation methods and additional functions of the package, see Kang et al. (2022+) and the R package manual.

## References

* Bunea, F., She, Y. & Wegkamp, M. H. (2011), ‘Optimal selection of reduced rank estimators of high-dimensional matrices’, *The Annals of Statistics* **39**(2), 1282–1309.
* Chen, K., Dong, H. & Chan, K.-S. (2013), ‘Reduced rank regression via adaptive nuclear norm penalization’, *Biometrika* **100**(4), 901–920.
* Chevalier, J. A., Kashyap, A. K. & Rossi, P. E. (2003), ‘Why don’t prices rise during periods of peak demand? Evidence from scanner data’, *American Economic Review* **93**(1), 15–37.
* Kang, S., Chen, K. & Yao, W. (2022+), ‘Reduced rank estimation in mixtures of multivariate linear regression’.
* Zou, H. (2006), ‘The adaptive lasso and its oracle properties’, *Journal of the American Statistical Association* **101**(476), 1418–1429.