---
title: "Iterative linearised INLA method"
author: "Finn Lindgren and Man Ho Suen"
output:
  - rmarkdown::html_vignette
  - rmarkdown::pdf_document
vignette: >
  %\VignetteIndexEntry{Iterative linearised INLA method}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
header-includes:
  - \newcommand{\bm}[1]{\boldsymbol{#1}}
  - \newcommand{\wt}[1]{\widetilde{#1}}
  - \newcommand{\ol}[1]{\overline{#1}}
  - \newcommand{\wh}[1]{\widehat{#1}}
  - \newcommand{\OO}{\mathcal{O}}
  - \DeclareMathOperator*{\argmax}{arg\,max}
  - \DeclareMathOperator*{\tr}{tr}
  - \newcommand{\pN}{\mathsf{N}}
  - \newcommand{\pCov}{\mathsf{Cov}}
  - \newcommand{\pE}{\mathsf{E}}
  - \newcommand{\pKL}[2]{\mathsf{KL}\left(#1\,\middle\|\,#2\right)}
  - \newcommand{\mmd}{\mathrm{d}}
  - \newcommand{\md}{\,\mmd}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, include = FALSE}
library(inlabru)
library(ggplot2)
```

## The INLA method for linear predictors

The INLA method is used to compute fast approximate posterior distributions for Bayesian generalised additive models. The hierarchical structure of such a model, with latent Gaussian components $\bm{u}$, covariance parameters $\bm{\theta}$, and measured response variables $\bm{y}$, can be written as
$$
\begin{aligned}
\bm{\theta} &\sim p(\bm{\theta}) \\
\bm{u}|\bm{\theta} &\sim \mathcal{N}\!\left(\bm{\mu}_u, \bm{Q}(\bm{\theta})^{-1}\right) \\
\bm{\eta}(\bm{u}) &= \bm{A}\bm{u} \\
\bm{y}|\bm{u},\bm{\theta} & \sim p(\bm{y}|\bm{\eta}(\bm{u}),\bm{\theta})
\end{aligned}
$$
where typically each linear predictor element, $\eta_i(\bm{u})$, is linked to a location parameter of the distribution for observation $y_i$, for each $i$, via a (non-linear) link function $g^{-1}(\cdot)$. In the R-INLA implementation, the observations are assumed to be conditionally independent, given $\bm{\eta}$ and $\bm{\theta}$.

## Approximate INLA for non-linear predictors

The premise for the inlabru method for non-linear predictors is to build on the existing implementation and only add a linearisation step. The properties of the resulting approximation will depend on the nature of the non-linearity.

Let $\wt{\bm{\eta}}(\bm{u})$ be a non-linear predictor, i.e. a deterministic function of $\bm{u}$,
$$
\wt{\bm{\eta}} (\bm{u}) = \textsf{fcn} (\bm{u}),
$$
and let $\ol{\bm{\eta}}(\bm{u})$ be the 1st order Taylor approximation at $\bm{u}_0$,
$$
\ol{\bm{\eta}}(\bm{u})
= \wt{\bm{\eta}}(\bm{u}_0) + \bm{B}(\bm{u} - \bm{u}_0)
= \left[\wt{\bm{\eta}}(\bm{u}_0) - \bm{B}\bm{u}_0\right] + \bm{B}\bm{u} ,
$$
where $\bm{B}$ is the derivative matrix for the non-linear predictor, evaluated at $\bm{u}_0$.
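To make the linearisation concrete, here is a minimal R sketch that builds the derivative matrix $\bm{B}$ for a toy non-linear predictor by finite differences and evaluates the corresponding Taylor approximation $\ol{\bm{\eta}}(\bm{u})$. The predictor `eta_fun`, the helper `linearise()`, and the step size `h` are illustrative choices for this sketch only; they are not the linearisation code used internally by inlabru.

```{r, eval = FALSE}
# Minimal sketch (not the inlabru internals): linearise a toy non-linear
# predictor at u0 using a finite-difference Jacobian.
eta_fun <- function(u) {
  c(exp(u[1]) + u[1] * u[2],
    u[1] - u[2]^2)
}

linearise <- function(fun, u0, h = 1e-6) {
  eta0 <- fun(u0)
  # Finite-difference Jacobian B, one column per element of u
  B <- vapply(seq_along(u0), function(j) {
    uj <- u0
    uj[j] <- uj[j] + h
    (fun(uj) - eta0) / h
  }, numeric(length(eta0)))
  # Affine approximation eta_bar(u) = eta(u0) + B (u - u0)
  list(eta0 = eta0, B = B,
       eta_bar = function(u) eta0 + B %*% (u - u0))
}

u0 <- c(0.5, -1)
lin <- linearise(eta_fun, u0)
# The linearisation agrees with the non-linear predictor at u0:
lin$eta_bar(u0) - eta_fun(u0)
```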
Hence, we define
$$
\begin{aligned}
\bm{y} | \bm{u}, {\bm{\theta}} &\overset{d}{=} \bm{y} | \wt{\bm{\eta}}(\bm{u}), {\bm{\theta}} \\
&\sim p (\bm{y} | g^{-1}[\wt{\bm{\eta}}(\bm{u})], {\bm{\theta}}) .
\end{aligned}
$$
The non-linear observation model $p(\bm{y}|g^{-1}[\wt{\bm{\eta}}(\bm{u})],\bm{\theta})$ is approximated by replacing the non-linear predictor with its linearisation, so that the linearised model is defined by
$$
\ol{p}(\bm{y}|\bm{u},\bm{\theta})
= p(\bm{y}|\ol{\bm{\eta}}(\bm{u}),\bm{\theta})
= p(\bm{y}|g^{-1}[\ol{\bm{\eta}}(\bm{u})],\bm{\theta})
\approx p(\bm{y}|g^{-1}[\wt{\bm{\eta}}(\bm{u})],\bm{\theta})
= p(\bm{y}|\wt{\bm{\eta}}(\bm{u}),\bm{\theta})
= \wt{p}(\bm{y}|\bm{u},\bm{\theta}) .
$$
The non-linear model posterior is factorised as
$$
\wt{p}(\bm{\theta},\bm{u}|\bm{y}) = \wt{p}(\bm{\theta}|\bm{y})\wt{p}(\bm{u}|\bm{y},\bm{\theta}),
$$
and the linear model approximation is factorised as
$$
\ol{p}(\bm{\theta},\bm{u}|\bm{y}) = \ol{p}(\bm{\theta}|\bm{y})\ol{p}(\bm{u}|\bm{y},\bm{\theta}).
$$

### Fixed point iteration

The remaining step of the approximation is how to choose the linearisation point $\bm{u}_*$. For a given linearisation point $\bm{v}$, INLA will compute the posterior mode for $\bm{\theta}$,
$$
\wh{\bm{\theta}}_{\bm{v}} = \argmax_{\bm{\theta}} \ol{p}_\bm{v} ( {\bm{\theta}} | \bm{y} ),
$$
and the joint conditional posterior mode for $\bm{u}$,
$$
\wh{\bm{u}}_{\bm{v}} = \argmax_{\bm{u}} \ol{p}_\bm{v} ( \bm{u} | \bm{y}, \wh{\bm{\theta}}_{\bm{v}} ) .
$$
Define the Bayesian estimation functional^[
Other potential choices for $f(\cdot)$ include the posterior expectation $\ol{\pE}(\bm{u}|\bm{y})$ and the marginal conditional modes,
$$
\left\{\argmax_{u_i} \ol{p}_{\bm{v}}(u_i|\bm{y}),\,i=1,\dots,n\right\},
$$
which were used in `inlabru` up to version 2.1.15, but caused problems for some non-linear models. From version 2.2.3, the joint conditional mode is used,
$$
\argmax_{\bm{u}} \ol{p}_{\bm{v}}(\bm{u}|\bm{y},\wh{\bm{\theta}}_{\bm{v}}),
$$
where $\wh{\bm{\theta}}_{\bm{v}}=\argmax_{\bm{\theta}} \ol{p}_{\bm{v}}(\bm{\theta}|\bm{y})$.
]
$$
f(\ol{p}_{\bm{v}}) = (\wh{\bm{\theta}}_{\bm{v}},\wh{\bm{u}}_{\bm{v}})
$$
and let $f(p)=(\wh{\bm{\theta}},\wh{\bm{u}})$ denote the corresponding posterior modes for the true posterior distribution,
$$
\begin{aligned}
\wh{{\bm{\theta}}} &= \argmax_{\bm{\theta}} p ( {\bm{\theta}} | \bm{y} ), \\
\wh{\bm{u}} &= \argmax_{\bm{u}} p (\bm{u} | \bm{y}, \wh{{\bm{\theta}}}).
\end{aligned}
$$
The fixed point $(\bm{\theta}_*,\bm{u}_*)=f(\ol{p}_{\bm{u}_*})$ should ideally be close to $(\wh{\bm{\theta}},\wh{\bm{u}})$, i.e. close to the true marginal/conditional posterior mode. We can achieve this for the conditional latent mode, so that $\bm{u}_*=\argmax_{\bm{u}} p (\bm{u} | \bm{y}, \wh{\bm{\theta}}_{\bm{u}_*})$. We therefore seek the latent vector $\bm{u}_*$ that generates the fixed point of the functional, so that $(\bm{\theta}_*,\bm{u}_*)=f(\ol{p}_{\bm{u}_*})$.

One key to the fixed point iteration is that the observation model is linked to $\bm{u}$ only through the non-linear predictor $\wt{\bm{\eta}}(\bm{u})$, since this leads to a simplified line search method below.

0. Let $\bm{u}_0$ be an initial linearisation point for the latent variables, obtained from the initial INLA call. Iterate the following steps for $k=0,1,2,\dots$
1. Compute the predictor linearisation at $\bm{u}_0$.
2. Compute the linearised INLA posterior $\ol{p}_{\bm{u}_0}(\bm{\theta}|\bm{y})$.
3. Let $(\bm{\theta}_1,\bm{u}_1)=(\wh{\bm{\theta}}_{\bm{u}_0},\wh{\bm{u}}_{\bm{u}_0})=f(\ol{p}_{\bm{u}_0})$ be the initial candidate for the new linearisation point.
4. Let $\bm{v}_\alpha=(1-\alpha)\bm{u}_0+\alpha\bm{u}_1$, and find the value $\alpha$ that minimises $\|\wt{\bm{\eta}}(\bm{v}_\alpha)-\ol{\bm{\eta}}(\bm{u}_1)\|$.
5. Set the new linearisation point $\bm{u}_0$ equal to $\bm{v}_\alpha$ and repeat from step 1, unless the iteration has converged to a given tolerance.

A potential improvement of step 4 might be to also take into account the prior distribution for $\bm{u}$ as a minimisation penalty, to avoid moving further than would be indicated by a full likelihood optimisation.
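The iteration can be summarised by the following schematic R sketch. The helpers `linearise_predictor()`, `run_linear_inla()`, and `line_search_alpha()` are hypothetical placeholders for steps 1, 2--3, and 4; they are not part of the inlabru API, which performs all of these steps internally when `bru()` is called.

```{r, eval = FALSE}
# Schematic sketch of the fixed point iteration (not the actual inlabru code).
# linearise_predictor(), run_linear_inla() and line_search_alpha() are
# hypothetical placeholders for steps 1, 2-3, and 4, respectively.
iterate_linearisation <- function(u0, max_iter = 10, tol = 1e-6) {
  for (k in seq_len(max_iter)) {
    lin <- linearise_predictor(u0)          # step 1: eta_bar at u0
    fit <- run_linear_inla(lin)             # step 2: linearised INLA posterior
    u1 <- fit$joint_conditional_mode        # step 3: candidate latent mode u_1
    alpha <- line_search_alpha(u0, u1, lin) # step 4: minimise the predictor norm
    u_new <- (1 - alpha) * u0 + alpha * u1
    # step 5: stop when the linearisation point has converged
    if (max(abs(u_new - u0)) < tol * max(1, max(abs(u0)))) {
      return(list(u = u_new, fit = fit, iterations = k))
    }
    u0 <- u_new
  }
  list(u = u0, fit = fit, iterations = max_iter)
}
```

In practice this loop is run internally by `bru()`; see `?bru_options` for the options that control the iteration.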
#### Line search

In step 4, we would ideally want $\alpha$ to be
$$
\argmax_{\alpha} \left[\ln p(\bm{u}|\bm{y},\bm{\theta}_1)\right]_{\bm{u}=\bm{v}_\alpha}.
$$
However, since this requires access to the internal likelihood and prior density evaluation code, we instead use a simpler alternative. We consider norms of the form $\|\wt{\bm{\eta}}(\bm{v}_\alpha)-\ol{\bm{\eta}}(\bm{u}_1)\|$ that only depend on the non-linear and linearised predictor expressions, and on other known quantities given $\bm{u}_0$, such as the current INLA estimate of the component-wise predictor variances.

Let $\sigma_i^2 = \mathrm{Var}_{\bm{u}\sim \ol{p}(\bm{u}|\bm{y},\bm{\theta}_1)}(\ol{\bm{\eta}}_i(\bm{u}))$ be the current estimate of the posterior variance for each predictor element $i$. We then define an inner product on the space of predictor vectors as
$$
\langle \bm{a},\bm{b} \rangle_V = \sum_i \frac{a_i b_i}{\sigma_i^2} .
$$
The squared norm for the difference between the predictor vectors $\wt{\bm{\eta}}(\bm{v}_\alpha)$ and $\ol{\bm{\eta}}(\bm{u}_1)$, with respect to this inner product, is defined as
$$
\| \wt{\bm{\eta}}(\bm{v}_\alpha) - \ol{\bm{\eta}}(\bm{u}_1)\|^2_V
= \sum_i \frac{|\wt{\bm{\eta}}_i(\bm{v}_\alpha)-\ol{\bm{\eta}}_i(\bm{u}_1)|^2}{\sigma_i^2} .
$$
Using this norm as the target loss function for the line search avoids many potentially expensive evaluations of the true posterior conditional log-density.

We evaluate $\wt{\bm{\eta}}_1=\wt{\bm{\eta}}(\bm{u}_1)$ and make use of the linearised predictor information. Let $\wt{\bm{\eta}}_\alpha=\wt{\bm{\eta}}(\bm{v}_\alpha)$ and $\ol{\bm{\eta}}_\alpha=\ol{\bm{\eta}}(\bm{v}_\alpha)=(1-\alpha)\wt{\bm{\eta}}(\bm{u}_0)+\alpha\ol{\bm{\eta}}(\bm{u}_1)$. In other words, $\alpha=0$ corresponds to the previous linear predictor, and $\alpha=1$ is the current estimate from INLA. An exact line search would minimise $\|\wt{\bm{\eta}}_\alpha-\ol{\bm{\eta}}_1\|$. Instead, we define a quadratic approximation to the non-linear predictor as a function of $\alpha$,
$$
\breve{\bm{\eta}}_\alpha = \ol{\bm{\eta}}_\alpha + \alpha^2 (\wt{\bm{\eta}}_1 - \ol{\bm{\eta}}_1)
$$
and minimise the quartic polynomial in $\alpha$,
$$
\begin{aligned}
\|\breve{\bm{\eta}}_\alpha-\ol{\bm{\eta}}_1\|^2
&= \| (\alpha-1)(\ol{\bm{\eta}}_1 - \ol{\bm{\eta}}_0) + \alpha^2 (\wt{\bm{\eta}}_1 - \ol{\bm{\eta}}_1) \|^2 .
\end{aligned}
$$
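To illustrate this step, the following sketch minimises the quartic polynomial above with base R's `optimise()`. The inputs `eta_bar0`, `eta_bar1`, `eta_tilde1` (standing for $\ol{\bm{\eta}}_0$, $\ol{\bm{\eta}}_1$, $\wt{\bm{\eta}}_1$) and the variance weights `sigma2` are illustrative; inlabru performs the corresponding computation internally during the iteration.

```{r, eval = FALSE}
# Sketch of the simple line search: minimise the quartic polynomial
#   || (alpha - 1) (eta_bar1 - eta_bar0) + alpha^2 (eta_tilde1 - eta_bar1) ||_V^2
# over alpha, where ||.||_V weights each predictor element by 1 / sigma_i^2.
# The input vectors are illustrative; inlabru carries out this step internally.
line_search_quartic <- function(eta_bar0, eta_bar1, eta_tilde1, sigma2,
                                interval = c(0, 2)) {
  d_lin <- eta_bar1 - eta_bar0       # linear change of the predictor
  d_nonlin <- eta_tilde1 - eta_bar1  # nonlinearity at alpha = 1
  loss <- function(alpha) {
    sum(((alpha - 1) * d_lin + alpha^2 * d_nonlin)^2 / sigma2)
  }
  optimise(loss, interval = interval)$minimum
}

# Toy example: a mildly non-linear predictor gives an alpha close to 1.
line_search_quartic(
  eta_bar0 = c(0, 0, 0),
  eta_bar1 = c(1, 2, -1),
  eta_tilde1 = c(1.1, 2.2, -1.05),
  sigma2 = c(1, 1, 1)
)
```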
If initial expansion and contraction steps are carried out, leading to an initial guess of $\alpha=\gamma^k$, where $\gamma>1$ is a scaling factor (see `?bru_options`, `bru_method$factor`) and $k$ is the (signed) number of expansions and contractions, the expression to be minimised is replaced by
$$
\begin{aligned}
\|\breve{\bm{\eta}}_\alpha-\ol{\bm{\eta}}_1\|^2
&= \| (\alpha-1)(\ol{\bm{\eta}}_1 - \ol{\bm{\eta}}_0) + \frac{\alpha^2}{\gamma^{2k}} (\wt{\bm{\eta}}_{\gamma^k} - \ol{\bm{\eta}}_{\gamma^k}) \|^2 ,
\end{aligned}
$$
which is minimised on the interval $\alpha\in[\gamma^{k-1},\gamma^{k+1}]$.

## Posterior non-linearity checks

Whereas the inlabru optimisation method leads to an estimate where $\| \wt{\bm{\eta}} (\bm{u}_*) - \ol{\bm{\eta}}(\bm{u}_*)\|=0$ for a specific $\bm{u}_*$, the overall posterior approximation accuracy depends on the degree of nonlinearity in the vicinity of $\bm{u}_*$. There are two main options for evaluating this nonlinearity, using sampling from the approximate posterior distribution. The first option is
$$
\begin{aligned}
\sum_i \frac{E_{\bm{u}\sim \ol{p}(\bm{u}|\bm{y})}\left[ |\ol{\bm{\eta}}_i(\bm{u})-\wt{\bm{\eta}}_i(\bm{u})|^2\right]}{\mathrm{Var}_{\bm{u}\sim \ol{p}(\bm{u}|\bm{y})}(\ol{\bm{\eta}}_i(\bm{u}))} ,
\end{aligned}
$$
which is the posterior expectation of the component-wise variance-normalised squared deviation between the non-linear and linearised predictors. Note that the normalising variance includes the variability induced by the posterior uncertainty for $\bm{\theta}$, whereas the $\|\cdot\|_V$ norm used for the line search is based only on the posterior mode for $\bm{\theta}$. A sampling-based estimate of this quantity is sketched below. Another option is
$$
E_{(\bm{u},\bm{\theta})\sim \ol{p}(\bm{u},\bm{\theta}|\bm{y})}
\left[\ln \frac{\ol{p}(\bm{u} |\bm{y},{\bm{\theta}})}{\wt{p}(\bm{u}|\bm{y},{\bm{\theta}})}\right]
=
E_{\bm{\theta}\sim \ol{p}(\bm{\theta}|\bm{y})}
\left\{
E_{\bm{u}\sim \ol{p}(\bm{u}|\bm{y},\bm{\theta})}
\left[\ln \frac{\ol{p}(\bm{u} |\bm{y},{\bm{\theta}})}{\wt{p}(\bm{u}|\bm{y},{\bm{\theta}})}\right]
\right\} ,
$$
which is the Kullback–Leibler divergence for the conditional posterior densities, $\pKL{\ol{p}}{\wt{p}}$, integrated over the approximate posterior distribution for $\bm{\theta}$. Implementing this would require access to the likelihood and prior distribution details. The next section explores this in more detail.
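The following is a minimal sketch of the first check, assuming that `eta_lin` and `eta_nonlin` are matrices of posterior samples of the linearised and non-linear predictors (one row per sample, one column per predictor element), for example obtained by sampling a fitted model with `inlabru::generate()`. The helper name `nonlin_check()` and the toy inputs are illustrative, not part of the inlabru API.

```{r, eval = FALSE}
# Sketch: Monte Carlo estimate of
#   sum_i E[ |eta_bar_i(u) - eta_tilde_i(u)|^2 ] / Var( eta_bar_i(u) )
# given posterior samples of the linearised and non-linear predictors.
# eta_lin and eta_nonlin are (n_samples x n_predictor) matrices; how they are
# obtained (e.g. via inlabru::generate()) is left outside this sketch.
nonlin_check <- function(eta_lin, eta_nonlin) {
  sq_dev <- colMeans((eta_lin - eta_nonlin)^2)  # E |eta_bar_i - eta_tilde_i|^2
  var_lin <- apply(eta_lin, 2, var)             # Var(eta_bar_i), incl. theta uncertainty
  sum(sq_dev / var_lin)
}

# Toy illustration with simulated "samples":
set.seed(1)
eta_lin <- matrix(rnorm(1000), nrow = 100)
eta_nonlin <- eta_lin + 0.01 * eta_lin^2  # small quadratic deviation
nonlin_check(eta_lin, eta_nonlin)
```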
#### Accuracy

We wish to assess how accurate the approximation is. Thus, we compare $\wt{p}(\bm{u} | \bm{y}, \bm{\theta} )$ and $\ol{p}(\bm{u} |\bm{y},\bm{\theta})$. With Bayes' theorem,
$$
\begin{aligned}
p(\bm{u}|\bm{y},{\bm{\theta}})
&= \frac{p(\bm{u},\bm{y}|{\bm{\theta}})}{p(\bm{y}|{\bm{\theta}})} \\
&= \frac{p(\bm{y}|\bm{u},{\bm{\theta}}) p(\bm{u}|{\bm{\theta}})}{p(\bm{y}|{\bm{\theta}})},
\end{aligned}
$$
where $p(\bm{u}|\bm{\theta})$ is a Gaussian density and $p(\bm{y}|\bm{\theta})$ is a scaling factor that does not depend on $\bm{u}$. We can therefore focus on the behaviour of $\ln p(\bm{y}|\bm{\theta},\bm{u})$ for the exact and linearised observation models. Recall that the observation likelihood only depends on $\bm{u}$ through $\bm{\eta}$.

Using a Taylor expansion with respect to $\bm{\eta}$, around $\bm{\eta}^*=\wt{\bm{\eta}}(\bm{u}_*)$,
$$
\begin{aligned}
\ln p(\bm{y}|\bm{\eta},\bm{\theta})
&= \ln p (\bm{y}|{\bm{\theta}},\bm{\eta}^*) \\
&\qquad + \sum_i \left.\frac{\partial}{\partial\eta_i} \ln p (\bm{y} | {\bm{\theta}}, \bm{\eta}) \right|_{\bm{\eta}^*}\cdot (\eta_i - \eta^*_i) \\
&\qquad + \frac{1}{2}\sum_{i,j} \left.\frac{\partial^2}{\partial\eta_i\partial\eta_j} \ln p (\bm{y} | {\bm{\theta}}, \bm{\eta}) \right|_{\bm{\eta}^*}\cdot (\eta_i - \eta^*_i) (\eta_j - \eta^*_j) + \OO(\|\bm{\eta}-\bm{\eta}^*\|^3).
\end{aligned}
$$
Similarly, for each component of $\wt{\bm{\eta}}$,
$$
\begin{aligned}
\wt{\eta}_i(\bm{u})
&= \eta^*_i
+\left[\left.\nabla_{u}\wt{\eta}_i(\bm{u})\right|_{\bm{u}_*}\right]^\top (\bm{u} - \bm{u}_*)
\\&\quad
+\frac{1}{2}(\bm{u} - \bm{u}_*)^\top\left[\left.\nabla_{u}\nabla_{u}^\top\wt{\eta}_i(\bm{u})\right|_{\bm{u}_*}\right] (\bm{u} - \bm{u}_*)
+ \OO(\|\bm{u}-\bm{u}_*\|^3)
\\&=
\eta_i^* + b_i(\bm{u}) + h_i(\bm{u}) + \OO(\|\bm{u}-\bm{u}_*\|^3)
\\&=
\ol{\eta}_i(\bm{u}) + h_i(\bm{u}) + \OO(\|\bm{u}-\bm{u}_*\|^3),
\end{aligned}
$$
where $\nabla_u\nabla_u^\top$ is the Hessian with respect to $\bm{u}$, $b_i$ are linear in $\bm{u}$, and $h_i$ are quadratic in $\bm{u}$. Combining the two expansions and taking the difference between the full and linearised log-likelihoods, we get
$$
\begin{aligned}
\ln \wt{p}(\bm{y}|\bm{u},\bm{\theta}) - \ln \ol{p}(\bm{y}|\bm{u},\bm{\theta})
&= \sum_i \left.\frac{\partial}{\partial\eta_i} \ln p (\bm{y} | {\bm{\theta}}, \bm{\eta}) \right|_{\bm{\eta}^*}\cdot h_i(\bm{u}) + \OO(\|\bm{u}-\bm{u}_*\|^3) .
\end{aligned}
$$
Note that the log-likelihood Hessian difference contribution only involves third order $\bm{u}$ terms and higher, so the expression above includes all terms up to second order.

Let
$$
g_i^*=\left.\frac{\partial}{\partial\eta_i} \ln p (\bm{y} | {\bm{\theta}}, \bm{\eta}) \right|_{\bm{\eta}^*}
$$
and
$$
\bm{H}^*_i = \left.\nabla_{u}\nabla_{u}^\top\wt{\eta}_i(\bm{u})\right|_{\bm{u}_*} ,
$$
and form the sum of their products, $\bm{G}=\sum_i g_i^*\bm{H}_i^*$. Then
$$
\begin{aligned}
\ln \wt{p}(\bm{y}|\bm{u},\bm{\theta}) - \ln \ol{p}(\bm{y}|\bm{u},\bm{\theta})
&= \frac{1}{2} \sum_i g_i^* (\bm{u}-\bm{u}_*)^\top \bm{H}_i^* (\bm{u}-\bm{u}_*) + \OO(\|\bm{u}-\bm{u}_*\|^3)
\\&=
\frac{1}{2} (\bm{u}-\bm{u}_*)^\top \bm{G} (\bm{u}-\bm{u}_*) + \OO(\|\bm{u}-\bm{u}_*\|^3).
\end{aligned}
$$
With $\bm{m}=\pE_{\ol{p}}(\bm{u}|\bm{y},\bm{\theta})$ and $\bm{Q}^{-1}=\pCov_{\ol{p}}(\bm{u},\bm{u}|\bm{y},\bm{\theta})$, we obtain
$$
\begin{aligned}
\pE_{\ol{p}}\left[ \nabla_{\bm{u}} \left\{\ln \wt{p}(\bm{y}|\bm{u},\bm{\theta}) - \ln \ol{p}(\bm{y}|\bm{u},\bm{\theta})\right\}\right] &\approx \bm{G}(\bm{m}-\bm{u}_*) , \\
\pE_{\ol{p}}\left[ \nabla_{\bm{u}}\nabla_{\bm{u}}^\top \left\{\ln \wt{p}(\bm{y}|\bm{u},\bm{\theta}) - \ln \ol{p}(\bm{y}|\bm{u},\bm{\theta})\right\}\right] &\approx \bm{G} , \\
\pE_{\ol{p}}\left[ \ln \wt{p}(\bm{y}|\bm{u},\bm{\theta}) - \ln \ol{p}(\bm{y}|\bm{u},\bm{\theta})\right] &\approx \frac{1}{2} \tr(\bm{G}\bm{Q}^{-1}) + \frac{1}{2} (\bm{m}-\bm{u}_*)^\top\bm{G}(\bm{m}-\bm{u}_*) .
\end{aligned}
$$
For each $\bm{\theta}$ configuration in the INLA output, we can extract both $\bm{m}$ and the sparse precision matrix $\bm{Q}$ for the Gaussian approximation. The non-sparsity structure of $\bm{G}$ is contained in the non-sparsity of $\bm{Q}$, which allows the use of Takahashi recursion (`inla.qinv(Q)`) to compute the corresponding $\bm{Q}^{-1}$ values needed to evaluate the trace $\tr(\bm{G}\bm{Q}^{-1})$. Thus, implementing a numerical approximation of this error analysis only requires special access to the log-likelihood derivatives $g_i^*$, since $\bm{H}_i^*$ can in principle be evaluated numerically.
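As a hedged sketch of how this expected log-likelihood difference, $\frac{1}{2}\tr(\bm{G}\bm{Q}^{-1}) + \frac{1}{2}(\bm{m}-\bm{u}_*)^\top\bm{G}(\bm{m}-\bm{u}_*)$, could be evaluated for one $\bm{\theta}$ configuration, assume the objects `m`, `Q`, `G`, and `u_star` have already been assembled. The helper `expected_loglik_diff()` is illustrative and not an inlabru or INLA function; `INLA::inla.qinv()` is the Takahashi recursion mentioned above.

```{r, eval = FALSE}
# Sketch: evaluate 0.5 * tr(G Q^{-1}) + 0.5 * (m - u_star)' G (m - u_star),
# given the Gaussian approximation mean m, sparse precision Q, the matrix G
# (with non-zero pattern contained in that of Q), and the linearisation
# point u_star. Requires the INLA package for inla.qinv().
library(Matrix)

expected_loglik_diff <- function(m, Q, G, u_star) {
  # Partial inverse of Q on the sparsity pattern of Q (Takahashi recursion).
  # This is sufficient for the trace because G shares that pattern.
  Qinv <- INLA::inla.qinv(Q)
  trace_term <- sum(G * Qinv)  # elementwise product sums to tr(G Q^{-1})
  d <- m - u_star
  quad_term <- as.numeric(t(d) %*% G %*% d)
  0.5 * trace_term + 0.5 * quad_term
}
```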
For a given $\bm{\theta}$,
$$
\begin{aligned}
\pKL{\ol{p}}{\wt{p}}
&= E_{\ol{p}}\left[\ln\frac{\ol{p}(\bm{u}|\bm{y},\bm{\theta})}{\wt{p}(\bm{u}|\bm{y},\bm{\theta})}\right]
\\&= E_{\ol{p}}\left[
\ln\frac{\ol{p}(\bm{y}|\bm{u},\bm{\theta})}{\wt{p}(\bm{y}|\bm{u},\bm{\theta})}
\right]
- \ln\frac{\ol{p}(\bm{y}|\bm{\theta})}{\wt{p}(\bm{y}|\bm{\theta})} .
\end{aligned}
$$
The first term was approximated above. The second term can also be approximated using the derived quantities (to be continued...).

Summary: the form of the observation likelihood discrepancy shows that, given a linearised posterior $\pN(\bm{m},\bm{Q}^{-1})$, a Gaussian approximation to the non-linear model posterior, $\pN(\wt{\bm{m}},\wt{\bm{Q}}^{-1})$, can be obtained from $\wt{\bm{Q}}=\bm{Q}-\bm{G}$ and $\wt{\bm{Q}}\wt{\bm{m}}=\bm{Q}\bm{m}-\bm{G}\bm{u}_*$. The K-L divergence becomes
$$
\begin{aligned}
\pKL{\ol{p}}{\wt{p}}
&\approx \frac{1}{2}
\left[
\ln\det(\bm{Q}) - \ln\det(\bm{Q}-\bm{G})
- \tr\left(\bm{G}\bm{Q}^{-1}\right)
+ (\bm{m}-\bm{u}_*)^\top\bm{G}(\bm{Q}-\bm{G})^{-1}\bm{G}(\bm{m}-\bm{u}_*)
\right] .
\end{aligned}
$$
When the INLA posterior has mean $\bm{m}=\bm{u}_*$, e.g. for models with additive Gaussian observation noise, and $\bm{\theta}=\wh{\bm{\theta}}_{\bm{u}_*}$, the last term vanishes.

Note: by implementing the K-L divergence accuracy metric, a by-product would be improved posterior estimates based on $\wt{\bm{m}}$ and $\wt{\bm{Q}}$.

### Well-posedness and initialisation

On a side note, one might be concerned about initialisation at, or convergence to, a saddle point. Although such checks are not implemented in inlabru, it is worth discussing how the initial linearisation point $\bm{u}_0$ is defined. Generally speaking, any value of $\bm{u}_0$ works, except when the gradient evaluated at $\bm{u}_0$ is $\bm{0}$: if the prior mean is also $\bm{0}$, the linearisation point will then never move away from $\bm{u}_0$. In general, this tends to be a saddle point problem. In some cases the problem can be handled by changing the predictor parameterisation, or simply by changing the initialisation point using the `bru_initial` option. However, for true saddle point problems, it indicates that the predictor parameterisation may lead to a multimodal posterior distribution, or is ill-posed in some other way. This is a more fundamental problem that cannot be fixed by changing the initialisation point.

In the following examples, where $\beta$ and $\bm{u}$ are latent Gaussian components, predictors 1, 3, and 4 would typically be safe, but predictor 2 is fundamentally non-identifiable:
$$
\begin{aligned}
\bm{\eta}_1 &= \bm{u}, \\
\bm{\eta}_2 &= \beta \bm{u}, \\
\bm{\eta}_3 &= e^\beta \bm{u}, \\
\bm{\eta}_4 &= F_\beta^{-1} ( \Phi(z_\beta)) \bm{u}, \quad z_{\beta} \sim \mathsf{N}(0,1) .
\end{aligned}
$$
Note that for $\bm{\eta}_3$ and $\bm{\eta}_4$, the partial derivatives with respect to $\beta$ are zero for $\bm{u}=\bm{0}$. However, the first inlabru iteration will give a non-zero estimate of $\bm{u}$, so that subsequent iterations will involve both $\beta$ and $\bm{u}$.
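To connect this to practical usage, a non-zero initial linearisation point can be supplied through the `bru_initial` option mentioned above. The sketch below, for a predictor of the $\bm{\eta}_3$ type, uses simulated toy data; the component names (`beta`, `u`) are specific to this example, and the exact value format accepted by `bru_initial` is an assumption here and should be checked against `?bru_options`.

```{r, eval = FALSE}
# Hedged sketch: set a non-zero initial linearisation point via bru_initial
# for a predictor of the form eta_3 = exp(beta) * u. The assumed option
# format (a named list of per-component initial values) should be verified
# in ?bru_options.
library(inlabru)

set.seed(1)
n <- 20
toy_data <- data.frame(idx = seq_len(n), y = rnorm(n))

fit <- bru(
  components = ~ beta(1, model = "linear") + u(idx, model = "iid"),
  like(formula = y ~ exp(beta) * u, family = "gaussian", data = toy_data),
  options = list(bru_initial = list(beta = 1, u = rep(1, n)))
)
```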