The lmls package comes with the abdom dataset (which it borrows from the gamlss.data package). The dataset consists of only two variables: the size of 610 fetuses (as measurements of their abdominal circumference taken from ultrasound scans) and their gestational age ranging from 12 to 42 weeks.
(ref:abdom-plot) The abdom dataset containing the gestational age and abdominal circumference of 610 fetuses.
library(lmls)
library(ggplot2)
ggplot(abdom, aes(x, y)) +
geom_point(color = "darkgray", size = 1) +
xlab("Age [weeks]") +
ylab("Size [mm]")
As expected, Figure @ref(fig:abdom-plot) indicates that the size of the babies increases with their age. We can use a simple linear regression model to quantify this relationship:
m1 <- lm(y ~ x, data = abdom)
summary(m1)
#> 
#> Call:
#> lm(formula = y ~ x, data = abdom)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -84.579  -8.105  -0.185   8.064  54.325 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -55.1795     2.0010  -27.58   <2e-16 ***
#> x            10.3382     0.0701  147.49   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14.63 on 608 degrees of freedom
#> Multiple R-squared:  0.9728, Adjusted R-squared:  0.9728
#> F-statistic: 2.175e+04 on 1 and 608 DF,  p-value: < 2.2e-16
The simple linear regression model with normally distributed errors is defined as $$y_i = \beta_0 + x_i \beta_1 + \varepsilon_i, \text{ where } \varepsilon_i \overset{\text{i.i.d.}}{\sim}\mathcal{N}(0, \sigma^2).$$ Based on the scatterplot of the data in Figure @ref(fig:abdom-plot), the homoscedasticity (= constant variance) assumption of the simple linear regression model seems implausible. In fact, the variance of the babies’ size seems to increase with their age. We can confirm this by plotting the regression residuals against the babies’ age:
(ref:abdom-resid) The residuals of the simple linear regression model for the abdom dataset show clear heteroscedasticity and some non-linearity.
abdom$resid <- resid(m1)
ggplot(abdom, aes(x, resid)) +
geom_point(color = "darkgray", size = 1) +
geom_hline(yintercept = 0, linewidth = 0.5) +
xlab("Age [weeks]") +
ylab("Residuals")
It follows from the definition of the simple linear regression model that the response variable yi is normally distributed with mean μi = β0 + xiβ1 and standard deviation σ, yielding $$y_i \overset{\text{ind.}}{\sim}\mathcal{N}(\beta_0 + x_i \beta_1, \sigma^2).$$ From this representation, we can extend the simple linear regression model and use the explanatory variable xi to predict not only the mean μi but also the standard deviation σi of the response variable yi. We translate this idea into the model $$y_i \overset{\text{ind.}}{\sim}\mathcal{N}(\beta_0 + x_i \beta_1, (\exp(\gamma_0 + x_i \gamma_1))^2),$$ which is a minimal linear model for location and scale (LMLS). The regression coefficients γ0 and γ1 are the intercept and the slope parameter for the log-standard deviation, and the exponential function is the inverse link (or response) function. It ensures that the predictions for the standard deviation are always valid (= positive). This step is necessary because the linear predictor γ0 + xiγ1 can become zero or negative for some choices of γ0 and γ1.
Using the lmls package, we can estimate Model @ref(eq:lmls) for the abdom dataset with a maximum likelihood algorithm by running the following line of code:
(ref:abdom-lmls) The LMLS for the abdom dataset with a linear effect of the babies’ age on their average size and a linear effect on the log-standard deviation.
m2 <- lmls(y ~ x, ~ x, data = abdom)
abdom$mu <- predict(m2, type = "response", predictor = "location")
abdom$sigma <- predict(m2, type = "response", predictor = "scale")
abdom$upper <- abdom$mu + 1.96 * abdom$sigma
abdom$lower <- abdom$mu - 1.96 * abdom$sigma
ggplot(abdom, aes(x, y)) +
geom_point(color = "darkgray", size = 1) +
geom_line(aes(y = mu), linewidth = 0.7) +
geom_line(aes(y = upper), linewidth = 0.3) +
geom_line(aes(y = lower), linewidth = 0.3) +
xlab("Age [weeks]") +
ylab("Size [mm]")
The general LMLS can include multiple explanatory variables as well as transformations of them, and it does not require the explanatory variables for the mean and the standard deviation to be identical. We define the general LMLS as $$y_i \overset{\text{ind.}}{\sim}\mathcal{N}(\bm{x}_i'\bm{\beta}, (\exp(\bm{z}_i'\bm{\gamma}))^2),$$ where xi and β are the covariate vector and the vector of regression coefficients for the mean, and zi and γ are the covariate vector and the vector of regression coefficients for the standard deviation.
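For illustration, the location formula and the scale formula in a call to lmls() can use different covariates. The data frame dat and the covariates x1, x2, and x3 in the following sketch are hypothetical and only demonstrate the interface:

# hypothetical data frame `dat` with covariates x1, x2, x3:
# mean modeled with x1 and log(x2), log-standard deviation with x3
m <- lmls(y ~ x1 + log(x2), ~ x3, data = dat)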
Polynomials are a straightforward way to include transformations of explanatory variables in a model. For example, we can improve Model @ref(eq:lmls) for the abdom dataset by adding a quadratic effect of the babies’ age on their average size with this command:
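# quadratic effect of age on the mean, linear effect on the log-standard deviation
m3 <- lmls(y ~ poly(x, 2), ~ x, data = abdom)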
The AIC drops from 4861.184 to 4802.823 for this model compared to Model @ref(eq:lmls). Figure @ref(fig:abdom-poly-2) illustrates how the quadratic effect improves the fit of the model.
(ref:abdom-poly-2) The LMLS for the abdom dataset with a quadratic effect of the babies’ age on their average size and a linear effect on the log-standard deviation.
The lmls package implements two inference algorithms: a frequentist maximum likelihood (ML) algorithm, which it uses by default, and a Bayesian Markov chain Monte Carlo (MCMC) algorithm. The derivatives that are required for these inference algorithms are derived in the Appendix in Section @ref(sec:appendix).
To simplify the notation in this and the next section, we first define the response vector y = [y1, …, yn]′, the design matrices X = [x1′, …, xn′]′ and Z = [z1′, …, zn′]′, and the standard deviation of the i-th observation σi = exp (zi′γ). Finally, let W be the weight matrix $$\mathbf{W}= \operatorname{diag}(1/\sigma_1^2, \dots, 1/\sigma_n^2).$$
The ML algorithm combines weighted least squares (WLS) updates for $\hat{\bm{\beta}}$ and Fisher scoring updates for $\hat{\bm{\gamma}}$. We first discuss this update strategy and then take a look at the initialization strategy.
If we know the true γ and hence the true weight matrix W, we can estimate β with WLS: $$\hat{\bm{\beta}}^{\text{(WLS)}}= (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} \mathbf{X}'\mathbf{W}\bm{y}.$$
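As a minimal sketch (not the implementation used inside the package), this WLS update could be written in R as follows, assuming the design matrix X, the response vector y, and the vector sigma of standard deviations are available:

W <- diag(1 / sigma^2)  # weight matrix W = diag(1/sigma_1^2, ..., 1/sigma_n^2)
beta_wls <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% y))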
On the other hand, if we know the true β, we can estimate γ with the Fisher scoring algorithm. Fisher scoring is an iterative algorithm for ML estimation, similar to Newton’s method. The k-th update of the Fisher scoring algorithm is defined as $$\hat{\bm{\gamma}}^{(k)} = \hat{\bm{\gamma}}^{(k - 1)} + (\mathcal{I}(\hat{\bm{\gamma}}^{(k - 1)}))^{-1} s(\hat{\bm{\gamma}}^{(k - 1)}),$$ where ℐ is the Fisher information and s is the score of γ.
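A rough sketch of a single Fisher scoring update in R, using the score s(γ) = Z′r and the Fisher information ℐ(γ) = 2 Z′Z derived in the appendix (again assuming X, Z, y, and current estimates beta_hat and gamma_hat are in scope):

sigma <- exp(drop(Z %*% gamma_hat))                # current standard deviations
r <- ((y - drop(X %*% beta_hat)) / sigma)^2 - 1    # elements of the vector r
score <- crossprod(Z, r)                           # s(gamma) = Z'r
info <- 2 * crossprod(Z)                           # Fisher information of gamma
gamma_hat <- gamma_hat + drop(solve(info, score))  # one Fisher scoring step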
In most cases, we know neither the true β nor the true γ, but we can estimate them with an iterative algorithm alternating between a Fisher scoring update for $\hat{\bm{\gamma}}$ and a WLS update for $\hat{\bm{\beta}}$. The other vector of regression coefficients is kept fixed at each step.
For the iterative algorithm to work well, we need to find good starting values. To initialize $\hat{\bm{\beta}}$, the ordinary least squares (OLS) estimator $\hat{\bm{\beta}}^{\text{(OLS)}}= (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\bm{y}$ is a natural choice. Note that $\hat{\bm{\beta}}^{\text{(OLS)}}$ is unbiased for the LMLS, because $$ \mathbb{E}(\hat{\bm{\beta}}^{\text{(OLS)}}) = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbb{E}(\bm{y}) = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{X}\bm{\beta} = \bm{\beta}. $$ Under mild regularity conditions, $\hat{\bm{\beta}}^{\text{(OLS)}}$ is also consistent. However, it is not the best linear unbiased estimator (BLUE), because the homoscedasticity requirement of the Gauss-Markov theorem does not hold.
To initialize $\hat{\bm{\gamma}}$, we first estimate log σi with $\hat{s}_i = \log{|y_i - \bm{x}_i'\hat{\bm{\beta}}^{\text{(OLS)}}|} + 0.635$ and then regress $\hat{\bm{s}}= [\hat{s}_1, \dots, \hat{s}_n]'$ on the design matrix Z with OLS to obtain $\hat{\bm{\gamma}}^{(0)} = (\mathbf{Z}'\mathbf{Z})^{-1} \mathbf{Z}'\hat{\bm{s}}$.
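Combining both steps, the initialization could look like this in R (again only a sketch under the assumptions from above):

beta_start <- solve(crossprod(X), crossprod(X, y))       # OLS estimate of beta
s_hat <- log(abs(y - drop(X %*% beta_start))) + 0.635    # estimates of log(sigma_i)
gamma_start <- solve(crossprod(Z), crossprod(Z, s_hat))  # regress s_hat on Z with OLS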
The motivation for ŝi as an estimator for log σi is as follows: Consider $$\log{(y_i - \mu_i)^2} = 2 \log{|y_i - \mu_i|}$$ and $$\log{(y_i - \mu_i)^2} = \log\biggl(\sigma^2_i \cdot \biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr) = 2 \log{\sigma_i} + \log\underbrace{\biggl(\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr)}_{\sim \chi^2(1)}.$$ Setting the RHS of Equation @ref(eq:bias1) equal to the RHS of Equation @ref(eq:bias2) and taking expectations yields $$\mathbb{E}(\log{|y_i - \mu_i|}) - \log{\sigma_i} = \underbrace{1/2 \cdot (\psi(1/2) + \log 2)}_{\approx -0.635},$$ where ψ is the digamma function. The term ψ(1/2) + log 2 follows from the fact that a χ2(ν) distribution is identical to a Gamma (k = ν/2, θ = 2) distribution, and that for X ∼ Gamma (k, θ), we have 𝔼(log X) = ψ(k) + log θ. Rearranging the equation to $$\mathbb{E}\underbrace{(\log{|y_i - \mu_i|} + 0.635)}_{= s_i} = \log{\sigma_i}$$ shows that si is an unbiased estimator for log σi. Unfortunately, we do not know the true μi in practice and have to use the unbiased estimator $\bm{x}_i'\hat{\bm{\beta}}^{\text{(OLS)}}$ as an approximation instead.
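The numerical value of the constant can be verified directly in R:

0.5 * (digamma(0.5) + log(2))
#> [1] -0.6351814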
To estimate an LMLS with MCMC, we can call the mcmc() function on an existing model object. The sampler requires a model object that contains the design matrices, so we need to make sure the lmls() function was called with the argument light = FALSE. Finally, we can use the summary() function with the argument type = "mcmc" to output some summary statistics of the posterior samples.
m3 <- lmls(y ~ poly(x, 2), ~ x, data = abdom, light = FALSE)
m3 <- mcmc(m3)
summary(m3, type = "mcmc")
#> 
#> Call:
#> lmls(location = y ~ poly(x, 2), scale = ~x, data = abdom, light = FALSE)
#> 
#> Deviance residuals:
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> -3.188000 -0.700400 -0.063480 -0.004337  0.611500  4.060000 
#> 
#> Location coefficients (identity link):
#>                Mean    2.5%     50%   97.5%
#> (Intercept)  226.72  225.61  226.71  227.83
#> poly(x, 2)1 2160.34 2130.05 2160.87 2190.74
#> poly(x, 2)2  -99.46 -125.90  -99.24  -73.85
#> 
#> Scale coefficients (log link):
#>                Mean    2.5%     50%  97.5%
#> (Intercept) 1.36740 1.16883 1.37487  1.569
#> x           0.04206 0.03513 0.04190  0.049
#> 
#> Residual degrees of freedom: 605
#> Log-likelihood: -2399.48
#> AIC: 4808.95
#> BIC: 4831.02
The posterior samples for one regression coefficient, γ1 in this case, can be extracted and plotted as follows:
(ref:abdom-mcmc-3) Trace plot and histogram of the posterior samples for γ1.
library(patchwork)  # for combining the two plots with +

samples <- as.data.frame(m3$mcmc$scale)
samples$iteration <- 1:nrow(samples)
p1 <- ggplot(samples, aes(iteration, x)) +
geom_line() +
xlab("Iteration") +
ylab(expression(gamma[1]))
p2 <- ggplot(samples, aes(x, after_stat(density))) +
geom_histogram(bins = 15) +
xlab(expression(gamma[1])) +
ylab("Density")
p1 + p2
The MCMC algorithm assumes flat priors for β and γ. It repeats its update steps for the two coefficient vectors num_samples times, where β is drawn in a Gibbs step from the full conditional derived below.
The full conditional of β used in the Gibbs update step is given by p(β ∣ ⋅ ) ∝ exp (−1/2 ⋅ (y − Xβ)′W(y − Xβ)) ⋅ p(β) ⋅ p(γ), where the weight matrix W is defined as in Equation @ref(eq:wmat). The priors p(β) and p(γ) can be ignored, because we assume them to be flat. It can be shown with some tedious but simple linear algebra that $$(\bm{y}- \mathbf{X}\bm{\beta})'\mathbf{W}(\bm{y}- \mathbf{X}\bm{\beta}) = (\bm{\beta}- \hat{\bm{\beta}}^{\text{(WLS)}})' \mathbf{X}'\mathbf{W}\mathbf{X}(\bm{\beta}- \hat{\bm{\beta}}^{\text{(WLS)}}) + (\bm{y}- \mathbf{X}\hat{\bm{\beta}}^{\text{(WLS)}})'\mathbf{W}(\bm{y}- \mathbf{X}\hat{\bm{\beta}}^{\text{(WLS)}}),$$ where $\hat{\bm{\beta}}^{\text{(WLS)}}$ is the WLS estimator for β using the weight matrix W. As the second addend in the last equation is independent of β, the full conditional reduces to $$p(\bm{\beta}\mid \cdot\,) \propto \exp(-1/2 \cdot (\bm{\beta}- \hat{\bm{\beta}}^{\text{(WLS)}})' \mathbf{X}'\mathbf{W}\mathbf{X}(\bm{\beta}- \hat{\bm{\beta}}^{\text{(WLS)}})),$$ which implies the following multivariate normal distribution: $$\bm{\beta}\mid \cdot\, \sim \mathcal{N}(\hat{\bm{\beta}}^{\text{(WLS)}}, (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}).$$
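A minimal sketch of a single Gibbs draw from this full conditional (not the sampler implemented in the package; X, Z, y, and the current gamma_hat are assumed to be in scope):

sigma <- exp(drop(Z %*% gamma_hat))              # current standard deviations
W <- diag(1 / sigma^2)                           # weight matrix
XtWX <- t(X) %*% W %*% X                         # precision matrix X'WX
beta_wls <- drop(solve(XtWX, t(X) %*% W %*% y))  # WLS estimate = full conditional mean
# draw from N(beta_wls, (X'WX)^{-1}) via the Cholesky factor of the precision matrix
beta_draw <- beta_wls + backsolve(chol(XtWX), rnorm(ncol(X)))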
There are a number of R packages with capabilities similar to lmls. The popular mgcv package (Wood 2017), which is typically used to estimate generalized additive models (GAMs), added support for multi-predictor models, including the LMLS with a normally distributed response variable, in version 1.8. The gamlss package implements generalized additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos 2005) in a very general way, and the bamlss package (Umlauf, Klein, and Zeileis 2018) is a Bayesian alternative to the gamlss package.
Compared to these packages, the scope of the lmls package is much narrower. Its functionality is limited to the LMLS with a normally distributed response variable. Neither other response distributions nor the regularization of covariate effects is supported. Instead, lmls aims to be…
Finally, we compare the performance of the lmls package on the abdom dataset with mgcv and gamlss using the microbenchmark package. The results of the benchmark are shown in Figure @ref(fig:abdom-bench-2).
library(gamlss)
library(mgcv)
bench <- microbenchmark::microbenchmark(
lmls = lmls(y ~ poly(x, 2), ~ x, data = abdom),
mgcv = gam(list(y ~ poly(x, 2), ~ x), family = gaulss(), data = abdom),
gamlss = gamlss(y ~ poly(x, 2), ~ x, data = abdom)
)
(ref:abdom-bench-2) The lmls package is about 3.5 to 4 times faster than mgcv and gamlss on the abdom dataset.
The inference algorithms from Section @ref(sec:inference) require the score (= the gradient of the log-likelihood) and the Fisher information (= the covariance of the score) with respect to β and γ.
The score of the location coefficients β is $$ s(\bm{\beta}) = \frac{\partial\ell}{\partial\bm{\beta}} = \mathbf{X}'\bm{q}, $$ where q is an n-dimensional column vector with the elements qi = (yi − μi)/σi2. The corresponding Fisher information is given by ℐ(β) = Cov (s(β)) = Cov (X′q) = X′Cov (q)X, where the diagonal elements of the covariance matrix Cov (q) are $$ \operatorname{Var}(q_i) = \operatorname{Var}\biggl(\frac{y_i - \mu_i}{\sigma^2_i}\biggr) = \frac{1}{\sigma^2_i} \cdot \operatorname{Var}\underbrace{\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)}_{\sim \mathcal{N}(0, 1)} = \frac{1}{\sigma^2_i}, $$ and the off-diagonal elements are Cov (qi, qj) = 0 for i ≠ j, due to the independence assumption of the LMLS. In R, we can use the following efficient implementation of the Fisher information of β:
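crossprod(X / scale)  # equals X' diag(1 / scale^2) X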
Here, scale is the vector [σ1, …, σn]. This code works because R stores matrices in column-major order and recycles shorter vectors for operations like element-wise division.
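As an illustration of this claim, the shortcut can be compared with the explicit matrix product for any design matrix X and positive vector scale:

all.equal(crossprod(X / scale), t(X) %*% diag(1 / scale^2) %*% X)
#> [1] TRUE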
The score of the scale coefficients γ is $$ s(\bm{\gamma}) = \frac{\partial\ell}{\partial\bm{\gamma}} = \mathbf{Z}'\bm{r}, $$ where r is an n-dimensional column vector with the elements ri = ((yi − μi)/σi)2 − 1. The corresponding Fisher information is given by ℐ(γ) = Cov (s(γ)) = Cov (Z′r) = Z′Cov (r)Z, where the diagonal elements of the covariance matrix Cov (r) are $$ \operatorname{Var}(r_i) = \operatorname{Var}\biggl(\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2 - 1\biggr) = \operatorname{Var}\underbrace{\biggl(\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr)}_{\sim \chi^2(1)} = 2, $$ and the off-diagonal elements are Cov (ri, rj) = 0 for i ≠ j, due to the independence assumption of the LMLS. In R, we can use the following efficient implementation of the Fisher information of γ:
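2 * crossprod(Z)  # equals Z' Cov(r) Z, since Cov(r) = 2 * diag(n)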
The inference algorithms from Section @ref(sec:inference) update the location coefficients β and the scale coefficients γ separately. Would a joint update work better? Let us take a look at the Fisher information of the stacked regression coefficients $$ \mathcal{I}\biggl( \begin{bmatrix} \bm{\beta}\\ \bm{\gamma} \end{bmatrix} \biggr) = \begin{bmatrix} \mathcal{I}(\bm{\beta}) & \operatorname{Cov}(s(\bm{\beta}), s(\bm{\gamma})) \\ \operatorname{Cov}(s(\bm{\gamma}), s(\bm{\beta})) & \mathcal{I}(\bm{\gamma}) \end{bmatrix}. $$ We are still missing the off-diagonal elements Cov (s(β), s(γ)) = Cov (s(γ), s(β))′ = Cov (X′q, Z′r) = X′Cov (q, r)Z, where the diagonal elements of the cross-covariance matrix Cov (q, r) are $$ \operatorname{Cov}(q_i, r_i) = \operatorname{Cov}\biggl(\frac{y_i - \mu_i}{\sigma^2_i}, \biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2 - 1\biggr) = \frac{1}{\sigma_i} \cdot \operatorname{Cov}\biggl(\frac{y_i - \mu_i}{\sigma_i}, \biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr) = 0. $$ The third equality holds because (yi − μi)/σi is a standard normal random variable and hence uncorrelated with its square. The off-diagonal elements of Cov (q, r) are Cov (qi, rj) = 0 for i ≠ j, due to the independence assumption of the LMLS. It follows that Cov (s(β), s(γ)) = Cov (s(γ), s(β)) = 0, which means there is no additional information we can make use of for a joint update of β and γ (at least not in terms of the Fisher information).