The lmls package comes with the abdom dataset (which it borrows from the gamlss.data package). The dataset consists of only two variables: the size of 610 fetuses (as measurements of their abdominal circumference taken from ultrasound scans) and their gestational age ranging from 12 to 42 weeks.
(ref:abdom-plot) The abdom dataset containing the gestational age and abdominal circumference of 610 fetuses.
library(lmls)
library(ggplot2)
ggplot(abdom, aes(x, y)) +
geom_point(color = "darkgray", size = 1) +
xlab("Age [weeks]") +
ylab("Size [mm]")
As expected, Figure @ref(fig:abdom-plot) indicates that the size of the babies increases with their age. We can use a simple linear regression model to quantify this relationship:
m1 <- lm(y ~ x, data = abdom)
summary(m1)
#> 
#> Call:
#> lm(formula = y ~ x, data = abdom)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -84.579  -8.105  -0.185   8.064  54.325 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -55.1795     2.0010  -27.58   <2e-16 ***
#> x            10.3382     0.0701  147.49   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14.63 on 608 degrees of freedom
#> Multiple R-squared:  0.9728, Adjusted R-squared:  0.9728
#> F-statistic: 2.175e+04 on 1 and 608 DF,  p-value: < 2.2e-16
The simple linear regression model with normally distributed errors is defined as $$y_i = \beta_0 + x_i \beta_1 + \varepsilon_i, \text{ where } \varepsilon_i \overset{\text{i.i.d.}}{\sim}\mathcal{N}(0, \sigma^2).$$ Based on the scatterplot of the data in Figure @ref(fig:abdom-plot), the homoscedasticity (= constant variance) assumption of the simple linear regression model seems implausible. In fact, the variance of the babies’ size seems to increase with their age. We can confirm this by plotting the regression residuals against the babies’ age:
(ref:abdom-resid) The residuals of the simple linear regression model for the abdom dataset show clear heteroscedasticity and some non-linearity.
abdom$resid <- resid(m1)
ggplot(abdom, aes(x, resid)) +
geom_point(color = "darkgray", size = 1) +
geom_hline(yintercept = 0, linewidth = 0.5) +
xlab("Age [weeks]") +
ylab("Residuals")
It follows from the definition of the simple linear regression model that the response variable yi is normally distributed with mean μi = β0 + xiβ1 and standard deviation σ, yielding $$y_i \overset{\text{ind.}}{\sim}\mathcal{N}(\beta_0 + x_i \beta_1, \sigma^2).$$ From this representation, we can extend the simple linear regression model and use the explanatory variable xi to predict not only the mean μi but also the standard deviation σi of the response variable yi. We translate this idea into the model $$y_i \overset{\text{ind.}}{\sim}\mathcal{N}(\beta_0 + x_i \beta_1, (\exp(\gamma_0 + x_i \gamma_1))^2),$$ which is a minimal linear model for location and scale (LMLS). The regression coefficients γ0 and γ1 are the intercept and the slope parameter for the log-standard deviation, and the exponential function is the inverse link (or response) function. It ensures that the predictions for the standard deviation are always valid (= positive). This step is necessary because the linear predictor γ0 + xiγ1 can become zero or negative for some choices of γ0 and γ1.
Using the lmls package, we can estimate Model @ref(eq:lmls) for the abdom dataset with a maximum likelihood algorithm by running the following line of code:
(ref:abdom-lmls) The LMLS for the abdom dataset with a linear effect of the babies’ age on their average size and a linear effect on the log-standard deviation.
m2 <- lmls(y ~ x, ~ x, data = abdom)
abdom$mu <- predict(m2, type = "response", predictor = "location")
abdom$sigma <- predict(m2, type = "response", predictor = "scale")
abdom$upper <- abdom$mu + 1.96 * abdom$sigma
abdom$lower <- abdom$mu - 1.96 * abdom$sigma
ggplot(abdom, aes(x, y)) +
geom_point(color = "darkgray", size = 1) +
geom_line(aes(y = mu), linewidth = 0.7) +
geom_line(aes(y = upper), linewidth = 0.3) +
geom_line(aes(y = lower), linewidth = 0.3) +
xlab("Age [weeks]") +
ylab("Size [mm]")
The general LMLS can include multiple explanatory variables as well as transformations of them, and it does not require the explanatory variables for the mean and the standard deviation to be identical. We define the general LMLS as $$y_i \overset{\text{ind.}}{\sim}\mathcal{N}(\bm{x}_i'\bm{\beta}, (\exp(\bm{z}_i'\bm{\gamma}))^2),$$ where xi and β are the covariate vector and the vector of regression coefficients for the mean, and zi and γ are the covariate vector and the vector of regression coefficients for the standard deviation.
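For illustration, the location formula and the scale formula in a call to lmls() can use different covariates. The data frame dat and the covariates x1, x2, and x3 in the following sketch are hypothetical and only demonstrate the interface:

# hypothetical data frame `dat` with covariates x1, x2, x3:
# mean modeled with x1 and log(x2), log-standard deviation with x3
m <- lmls(y ~ x1 + log(x2), ~ x3, data = dat)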
Polynomials are a straightforward way to include transformations of explanatory variables in a model. For example, we can improve Model @ref(eq:lmls) for the abdom dataset by adding a quadratic effect of the babies’ age on their average size with this command:
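# quadratic effect of age on the mean, linear effect on the log-standard deviation
m3 <- lmls(y ~ poly(x, 2), ~ x, data = abdom)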
The AIC drops from 4861.184 to 4802.823 for this model compared to Model @ref(eq:lmls). Figure @ref(fig:abdom-poly-2) illustrates how the quadratic effect improves the fit of the model.
(ref:abdom-poly-2) The LMLS for the abdom dataset with a quadratic effect of the babies’ age on their average size and a linear effect on the log-standard deviation.
The lmls package implements two inference algorithms: a frequentist maximum likelihood (ML) algorithm, which it uses by default, and a Bayesian Markov chain Monte Carlo (MCMC) algorithm. The derivatives that are required for these inference algorithms are derived in the Appendix in Section @ref(sec:appendix).
To simplify the notation in this and the next section, we first define the response vector y = [y1, …, yn]′, the design matrices X = [x1′, …, xn′]′ and Z = [z1′, …, zn′]′, and the standard deviation of the i-th observation σi = exp (zi′γ). Finally, let W be the weight matrix $$\mathbf{W}= \operatorname{diag}(1/\sigma_1^2, \dots, 1/\sigma_n^2).$$
The ML algorithm combines weighted least squares (WLS) updates for $\hat{\bm{\beta}}$ and Fisher scoring updates for $\hat{\bm{\gamma}}$. We first discuss this update strategy and then take a look at the initialization strategy.
If we know the true γ and hence the true weight matrix W, we can estimate β with WLS: $$\hat{\bm{\beta}}^{\text{(WLS)}}= (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} \mathbf{X}'\mathbf{W}\bm{y}.$$
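As a minimal sketch (not the implementation used inside the package), this WLS update could be written in R as follows, assuming the design matrix X, the response vector y, and the vector sigma of standard deviations are available:

W <- diag(1 / sigma^2)  # weight matrix W = diag(1/sigma_1^2, ..., 1/sigma_n^2)
beta_wls <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% y))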
On the other hand, if we know the true β, we can estimate γ with the Fisher scoring algorithm. Fisher scoring is an iterative algorithm for ML estimation, similar to Newton’s method. The k-th update of the Fisher scoring algorithm is defined as $$\hat{\bm{\gamma}}^{(k)} = \hat{\bm{\gamma}}^{(k - 1)} + (\mathcal{I}(\hat{\bm{\gamma}}^{(k - 1)}))^{-1} s(\hat{\bm{\gamma}}^{(k - 1)}),$$ where ℐ is the Fisher information and s is the score of γ.
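A rough sketch of a single Fisher scoring update in R, using the score s(γ) = Z′r and the Fisher information ℐ(γ) = 2 Z′Z derived in the appendix (again assuming X, Z, y, and current estimates beta_hat and gamma_hat are in scope):

sigma <- exp(drop(Z %*% gamma_hat))                # current standard deviations
r <- ((y - drop(X %*% beta_hat)) / sigma)^2 - 1    # elements of the vector r
score <- crossprod(Z, r)                           # s(gamma) = Z'r
info <- 2 * crossprod(Z)                           # Fisher information of gamma
gamma_hat <- gamma_hat + drop(solve(info, score))  # one Fisher scoring step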
In most cases, we know neither the true β nor the true γ, but we can estimate them with an iterative algorithm alternating between a Fisher scoring update for $\hat{\bm{\gamma}}$ and a WLS update for $\hat{\bm{\beta}}$. The other vector of regression coefficients is kept fixed at each step.
For the iterative algorithm to work well, we need to find good starting values. To initialize $\hat{\bm{\beta}}$, the ordinary least squares (OLS) estimator $\hat{\bm{\beta}}^{\text{(OLS)}}= (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\bm{y}$ is a natural choice. Note that $\hat{\bm{\beta}}^{\text{(OLS)}}$ is unbiased for the LMLS, because $$ \mathbb{E}(\hat{\bm{\beta}}^{\text{(OLS)}}) = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbb{E}(\bm{y}) = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{X}\bm{\beta} = \bm{\beta}. $$ Under mild regularity conditions, $\hat{\bm{\beta}}^{\text{(OLS)}}$ is also consistent. However, it is not the best linear unbiased estimator (BLUE), because the homoscedasticity requirement of the Gauss-Markov theorem does not hold.
To initialize $\hat{\bm{\gamma}}$, we first estimate log σi with $\hat{s}_i = \log{|y_i - \bm{x}_i'\hat{\bm{\beta}}^{\text{(OLS)}}|} + 0.635$ and then regress $\hat{\bm{s}}= [\hat{s}_1, \dots, \hat{s}_n]'$ on the design matrix Z with OLS to obtain $\hat{\bm{\gamma}}^{(0)} = (\mathbf{Z}'\mathbf{Z})^{-1} \mathbf{Z}'\hat{\bm{s}}$.
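Combining both steps, the initialization could look like this in R (again only a sketch under the assumptions from above):

beta_start <- solve(crossprod(X), crossprod(X, y))       # OLS estimate of beta
s_hat <- log(abs(y - drop(X %*% beta_start))) + 0.635    # estimates of log(sigma_i)
gamma_start <- solve(crossprod(Z), crossprod(Z, s_hat))  # regress s_hat on Z with OLS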
The motivation for ŝi as an estimator for log σi is as follows: Consider $$\log{(y_i - \mu_i)^2} = 2 \log{|y_i - \mu_i|}$$ and $$\log{(y_i - \mu_i)^2} = \log\biggl(\sigma^2_i \cdot \biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr) = 2 \log{\sigma_i} + \log\underbrace{\biggl(\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr)}_{\sim \chi^2(1)}.$$ Setting the RHS of Equation @ref(eq:bias1) equal to the RHS of Equation @ref(eq:bias2) and taking expectations yields $$\mathbb{E}(\log{|y_i - \mu_i|}) - \log{\sigma_i} = \underbrace{1/2 \cdot (\psi(1/2) + \log 2)}_{\approx -0.635},$$ where ψ is the digamma function. The term ψ(1/2) + log 2 follows from the fact that a χ2(ν) distribution is identical to a Gamma (k = ν/2, θ = 2) distribution, and that for X ∼ Gamma (k, θ), we have 𝔼(log X) = ψ(k) + log θ. Rearranging the equation to $$\mathbb{E}\underbrace{(\log{|y_i - \mu_i|} + 0.635)}_{= s_i} = \log{\sigma_i}$$ shows that si is an unbiased estimator for log σi. Unfortunately, we do not know the true μi in practice and have to use the unbiased estimator $\bm{x}_i'\hat{\bm{\beta}}^{\text{(OLS)}}$ as an approximation instead.
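The numerical value of the constant can be verified directly in R:

0.5 * (digamma(0.5) + log(2))
#> [1] -0.6351814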
To estimate an LMLS with MCMC, we can call the mcmc() function on an existing model object. The sampler requires a model object that contains the design matrices, so we need to make sure the lmls() function was called with the argument light = FALSE. Finally, we can use the summary() function with the argument type = "mcmc" to output some summary statistics of the posterior samples.
m3 <- lmls(y ~ poly(x, 2), ~ x, data = abdom, light = FALSE)
m3 <- mcmc(m3)
summary(m3, type = "mcmc")
#> 
#> Call:
#> lmls(location = y ~ poly(x, 2), scale = ~x, data = abdom, light = FALSE)
#> 
#> Deviance residuals:
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> -3.188000 -0.700400 -0.063480 -0.004337  0.611500  4.060000 
#> 
#> Location coefficients (identity link):
#>                Mean    2.5%     50%   97.5%
#> (Intercept)  226.72  225.61  226.71  227.83
#> poly(x, 2)1 2160.34 2130.05 2160.87 2190.74
#> poly(x, 2)2  -99.46 -125.90  -99.24  -73.85
#> 
#> Scale coefficients (log link):
#>                Mean    2.5%     50%  97.5%
#> (Intercept) 1.36740 1.16883 1.37487  1.569
#> x           0.04206 0.03513 0.04190  0.049
#> 
#> Residual degrees of freedom: 605
#> Log-likelihood: -2399.48
#> AIC: 4808.95
#> BIC: 4831.02
The posterior samples for one regression coefficient, γ1 in this case, can be extracted and plotted as follows:
(ref:abdom-mcmc-3) Trace plot and histogram of the posterior samples for γ1.
library(patchwork)  # for combining the two plots with +

samples <- as.data.frame(m3$mcmc$scale)
samples$iteration <- 1:nrow(samples)
p1 <- ggplot(samples, aes(iteration, x)) +
geom_line() +
xlab("Iteration") +
ylab(expression(gamma[1]))
p2 <- ggplot(samples, aes(x, after_stat(density))) +
geom_histogram(bins = 15) +
xlab(expression(gamma[1])) +
ylab("Density")
p1 + p2
The MCMC algorithm assumes flat priors for β and γ. It repeats its update steps for the two coefficient vectors num_samples times, where β is drawn in a Gibbs step from the full conditional derived below.
The full conditional of β used in the Gibbs update step is given by p(β ∣ ⋅ ) ∝ exp (−1/2 ⋅ (y − Xβ)′W(y − Xβ)) ⋅ p(β) ⋅ p(γ), where the weight matrix W is defined as in Equation @ref(eq:wmat). The priors p(β) and p(γ) can be ignored, because we assume them to be flat. It can be shown with some tedious but simple linear algebra that $$(\bm{y}- \mathbf{X}\bm{\beta})'\mathbf{W}(\bm{y}- \mathbf{X}\bm{\beta}) = (\bm{\beta}- \hat{\bm{\beta}}^{\text{(WLS)}})' \mathbf{X}'\mathbf{W}\mathbf{X}(\bm{\beta}- \hat{\bm{\beta}}^{\text{(WLS)}}) + (\bm{y}- \mathbf{X}\hat{\bm{\beta}}^{\text{(WLS)}})'\mathbf{W}(\bm{y}- \mathbf{X}\hat{\bm{\beta}}^{\text{(WLS)}}),$$ where $\hat{\bm{\beta}}^{\text{(WLS)}}$ is the WLS estimator for β using the weight matrix W. As the second addend in the last equation is independent of β, the full conditional reduces to $$p(\bm{\beta}\mid \cdot\,) \propto \exp(-1/2 \cdot (\bm{\beta}- \hat{\bm{\beta}}^{\text{(WLS)}})' \mathbf{X}'\mathbf{W}\mathbf{X}(\bm{\beta}- \hat{\bm{\beta}}^{\text{(WLS)}})),$$ which implies the following multivariate normal distribution: $$\bm{\beta}\mid \cdot\, \sim \mathcal{N}(\hat{\bm{\beta}}^{\text{(WLS)}}, (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}).$$
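A minimal sketch of a single Gibbs draw from this full conditional (not the sampler implemented in the package; X, Z, y, and the current gamma_hat are assumed to be in scope):

sigma <- exp(drop(Z %*% gamma_hat))              # current standard deviations
W <- diag(1 / sigma^2)                           # weight matrix
XtWX <- t(X) %*% W %*% X                         # precision matrix X'WX
beta_wls <- drop(solve(XtWX, t(X) %*% W %*% y))  # WLS estimate = full conditional mean
# draw from N(beta_wls, (X'WX)^{-1}) via the Cholesky factor of the precision matrix
beta_draw <- beta_wls + backsolve(chol(XtWX), rnorm(ncol(X)))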
There are a number of R packages with capabilities similar to lmls. The popular mgcv package (Wood 2017), which is typically used to estimate generalized additive models (GAMs), added support for multi-predictor models, including the LMLS with a normally distributed response variable, in version 1.8. The gamlss package implements generalized additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos 2005) in a very general way, and the bamlss package (Umlauf, Klein, and Zeileis 2018) is a Bayesian alternative to the gamlss package.
Compared to these packages, the scope of the lmls package is much narrower. Its functionality is limited to the LMLS with a normally distributed response variable. Neither other response distributions nor the regularization of covariate effects is supported. Instead, lmls aims to be…
Finally, we compare the performance of the lmls package on the abdom dataset with mgcv and gamlss using the microbenchmark package. The results of the benchmark are shown in Figure @ref(fig:abdom-bench-2).
library(gamlss)
library(mgcv)
bench <- microbenchmark::microbenchmark(
lmls = lmls(y ~ poly(x, 2), ~ x, data = abdom),
mgcv = gam(list(y ~ poly(x, 2), ~ x), family = gaulss(), data = abdom),
gamlss = gamlss(y ~ poly(x, 2), ~ x, data = abdom)
)
(ref:abdom-bench-2) The lmls package is about 3.5 to 4 times faster than mgcv and gamlss on the abdom dataset.
The inference algorithms from Section @ref(sec:inference) require the score (= the gradient of the log-likelihood) and the Fisher information (= the covariance of the score) with respect to β and γ.
The score of the location coefficients β is $$ s(\bm{\beta}) = \frac{\partial\ell}{\partial\bm{\beta}} = \mathbf{X}'\bm{q}, $$ where q is an n-dimensional column vector with the elements qi = (yi − μi)/σi2. The corresponding Fisher information is given by ℐ(β) = Cov (s(β)) = Cov (X′q) = X′Cov (q)X, where the diagonal elements of the covariance matrix Cov (q) are $$ \operatorname{Var}(q_i) = \operatorname{Var}\biggl(\frac{y_i - \mu_i}{\sigma^2_i}\biggr) = \frac{1}{\sigma^2_i} \cdot \operatorname{Var}\underbrace{\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)}_{\sim \mathcal{N}(0, 1)} = \frac{1}{\sigma^2_i}, $$ and the off-diagonal elements are Cov (qi, qj) = 0 for i ≠ j, due to the independence assumption of the LMLS. In R, we can use the following efficient implementation of the Fisher information of β:
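crossprod(X / scale)  # equals X' diag(1 / scale^2) X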
Here, scale is the vector [σ1, …, σn]. This code works because R stores matrices in column-major order and recycles shorter vectors for operations like element-wise division.
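As an illustration of this claim, the shortcut can be compared with the explicit matrix product for any design matrix X and positive vector scale:

all.equal(crossprod(X / scale), t(X) %*% diag(1 / scale^2) %*% X)
#> [1] TRUE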
The score of the scale coefficients γ is $$ s(\bm{\gamma}) = \frac{\partial\ell}{\partial\bm{\gamma}} = \mathbf{Z}'\bm{r}, $$ where r is an n-dimensional column vector with the elements ri = ((yi − μi)/σi)2 − 1. The corresponding Fisher information is given by ℐ(γ) = Cov (s(γ)) = Cov (Z′r) = Z′Cov (r)Z, where the diagonal elements of the covariance matrix Cov (r) are $$ \operatorname{Var}(r_i) = \operatorname{Var}\biggl(\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2 - 1\biggr) = \operatorname{Var}\underbrace{\biggl(\biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr)}_{\sim \chi^2(1)} = 2, $$ and the off-diagonal elements are Cov (ri, rj) = 0 for i ≠ j, due to the independence assumption of the LMLS. In R, we can use the following efficient implementation of the Fisher information of γ:
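2 * crossprod(Z)  # equals Z' Cov(r) Z, since Cov(r) = 2 * diag(n)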
The inference algorithms from Section @ref(sec:inference) update the location coefficients β and the scale coefficients γ separately. Would a joint update work better? Let us take a look at the Fisher information of the stacked regression coefficients $$ \mathcal{I}\biggl( \begin{bmatrix} \bm{\beta}\\ \bm{\gamma} \end{bmatrix} \biggr) = \begin{bmatrix} \mathcal{I}(\bm{\beta}) & \operatorname{Cov}(s(\bm{\beta}), s(\bm{\gamma})) \\ \operatorname{Cov}(s(\bm{\gamma}), s(\bm{\beta})) & \mathcal{I}(\bm{\gamma}) \end{bmatrix}. $$ We are still missing the off-diagonal elements Cov (s(β), s(γ)) = Cov (s(γ), s(β))′ = Cov (X′q, Z′r) = X′Cov (q, r)Z, where the diagonal elements of the cross-covariance matrix Cov (q, r) are $$ \operatorname{Cov}(q_i, r_i) = \operatorname{Cov}\biggl(\frac{y_i - \mu_i}{\sigma^2_i}, \biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2 - 1\biggr) = \frac{1}{\sigma_i} \cdot \operatorname{Cov}\biggl(\frac{y_i - \mu_i}{\sigma_i}, \biggl(\frac{y_i - \mu_i}{\sigma_i}\biggr)^2\biggr) = 0. $$ The third equality holds because (yi − μi)/σi is a standard normal random variable and hence uncorrelated with its square. The off-diagonal elements of Cov (q, r) are Cov (qi, rj) = 0 for i ≠ j, due to the independence assumption of the LMLS. It follows that Cov (s(β), s(γ)) = Cov (s(γ), s(β)) = 0, which means there is no additional information we can make use of for a joint update of β and γ (at least not in terms of the Fisher information).