Statistical analyses are typically concerned with modelling and estimating the distribution of some measured variable of interest Y, called the outcome, possibly conditional on the value of one or several endogenous variables X, called predictors. In the absence of endogenous variables, this process is usually called distribution fitting, and in the presence of endogenous variables it is called regression. Classical regression, such as via generalized linear models (GLMs), is concerned with the influence of endogenous variables on the mean of the outcome, i.e., E(Y|X) = f(X), and often links other parameters of the conditional outcome distribution to its mean. A gentle introduction to generalized linear models can be found in Dobson and Barnett (2018). An implementation of GLMs is available in the stats R package, which is part of R itself (R Core Team 2023). Some models also allow specification of additional parameters of the conditional outcome distribution, such as Generalized Additive Models for Location, Scale and Shape (Stasinopoulos and Rigby 2007). More recently, deep distributional regression has been proposed, which allows for flexible specification of individual outcome distribution parameters (Rügamer et al. 2023).
Statistical methods (such as those described and implemented in the previously mentioned papers) often require complete data, that is full information on all observations (X, Y) of interest. In this paper, we describe an R-package that allows for distributional regression in three common observation schemes that do not provide complete data. First of all, data with interval censoring applied to the outcome Y refers to the case where only lower and upper bounds for Y are observed, instead of the actual value. Next, truncated data misses observations for which the outcome Y falls out of a certain lower and upper truncation bound. We consider the case of random truncation, where these truncation bounds are also random variables that may vary for each observation. Finally, we consider a combination of the two, randomly truncated interval censoring.
The three scenarios can be combined into a single general scheme: instead of observing the real-valued target variable Y (with μ-density fθ and c.d.f. Fθ, where μ is a sigma-finite measure on ℝ and θ is a parameter vector in some parameter space Θ), we observe the vector (M, V, L, U), which satisfies L ≤ M ≤ V ≤ U and L < U. Its coordinates have the following interpretation: the last two coordinates, which satisfy −∞ ≤ L < U ≤ ∞, encode truncation: we only happen to observe (M, V, L, U) if L < Y ≤ U; in particular, a non-truncated observations means that L = −∞ and U = ∞. The first two coordinates, which satisfy L ≤ M ≤ V ≤ U and which may be ∓∞, encode censoring: the observation (M, V) = (m, v) means that the target variable Y satisfies Y ∈ (m, v] if m < v and Y = m if m = v; the latter corresponds to an uncensored observation of Y.
It is instructive to focus on the simpler problem of distribution parameter estimation before proceeding with distributional regression. Suppose we observe an independent sample 𝒥 = {(mi, vi, li, ui) : i = 1, …, n} of (M, V, L, U). Suggested by standard maximum (conditional) likelihood approaches for truncated (Dörre and Emura 2019) and censored observations (Sun 2006), we suggest to estimate θ by maximizing the objective function
where we use the notation Fθ((l, u]) = Fθ(u) − Fθ(l). A detailed motivation for this approach under suitable conditions ensuring that the censoring is non-informative is given in Section @ref(motivation-cml-likelihood) below. For later purposes, it is helpful to attach a weight wi to each observation (mi, vi, li, ui). Denoting the resulting sample by ℑ = {(mi, vi, li, ui, wi)}, we aim at maximizing the weighted sum of the (conditional) log-likelihoods
A practical example of random truncation arises when modelling the reporting delay of claims in general insurance. The target variable Y is the reporting delay of an accident happening at accident time T0, which is hence reported to the insurer at calendar time Y + T0. The truncation bounds (L, U) for Y will be equal to (0, τ − T0) with τ the current calendar time.
Combined random truncation with interval censoring can occur when modelling failure times when only survival data at two (or more) maintenance appointments some time after purchase is captured, and only for items that are sold. The target variable Y is the failure time of an item. Item condition (failed / functional) can be observed at maintenance times M0 and M1, which may vary for each item. For each maintained item, the production time P0 and the purchase time P1 is also known. Only items that are functional at purchase time P1 are observed at the maintenance times. This gives rise to truncation bounds (L, U) = (P1 − P0, ∞) and censoring interval bounds (M, V) ∈ {(P1 − P0, M0 − P0), (M0 − P0, M1 − P0), (M1 − P0, ∞)}, depending on the item condition at times M0 and M1.
In the setting of distributional regression, weighted samples ℑ (M, V, L, U, W) have associated predictors X ∈ 𝔛, resulting in observations of the shape ℑreg = {(mi, vi, li, ui, wi, xi) : i = 1, …, n}. We are interested in estimating a regression function g : 𝔛 → Θ given a sample ℑreg, a parameterized family ℱ = {Fθ ∣ θ ∈ Θ} and a family 𝒢 of functions from 𝔛 to Θ. It is assumed that there exists a fuction g ∈ 𝒢 such that the conditional distribution of Y|X = x is Fg(x). Distributional regression can be formulated as the maximization problem
Compared to Equation @ref(eq:cml-likelihood), the global parameter θ is replaced by the regression function g evaluated at the associated predictors x.
It is instructive to start by considering an untruncated, censored observation where l = −∞, u = ∞ and m < v. The only information we obtain from the observation (m, v, l, u) is then that Y ∈ (m, v]. For deriving the relevant likelihood contribution, we may follow the stochastic approach to interval censored observations described in : let (C1, C2) denote a random vector in ℝ2 that is independent of the target variable Y and which satisfies P(C1 < C2) = 1. Let
and define new random variables (M, V) = f(Y, C1, C2) by
Note that D can be reconstructed from (M, V): we have D = 0 if M = −∞, D = 1 if −∞ < M < V < ∞ and D = 2 if V = ∞.
It is instructive to proceed with the case where (C1, C2) and hence (M, V) is discrete. Then, for (m, v) ∈ supp(M, V) ∩ ℝ2, we have
Likewise, for (m, v) ∈ supp(M, V) ∩ ({−∞} × ℝ), we obtain
and finally, for (m, v) ∈ supp(M, V) ∩ (ℝ × {∞}),
If we assume that the distribution of the censoring variable (C1, C2) is non-informative, i.e., its distribution does not depend on θ, the likelihood of observing (M, V) = (m, v) is equal to Fθ((m, v]), up to a factor that does not depend on θ. A similar argumentation can be used in the non-discrete case. Overall, noting that F∞((−∞, ∞]) = 1, we have motivated the likelihood contribution Fθ((m, v]) ⋅ 1(m < v) for a censored, untruncated observation in @ref(eq:cml-likelihood-nw).
Next, consider an uncensored, truncated observation (m, v, l, u) where y = m = v; we may hence identify such an observation with (y, l, u). We may then proceed as in and assume that (L, U) is independent of Y and satisfies L ≤ U, with L possibly equal to −∞ and U possibly equal to ∞. Further, (L, U) shall have a density f(L, U) with respect to some dominating sigma-finite measure ν. Truncation means that we only happen to observe (Y, L, U) if L < Y ≤ U. As a consequence, any observed value with M = V can be regarded as being drawn from the (μ ⊗ ν)-density
Subsequently, we write (Y(t), L(t), U(t)) for a random vector following the above density, i.e.,
Conditioning this density on (L(t), U(t)) = (l, u), we arrive at an expression that does not involve the nuisance density f(L, U):
Overall, we arrive at the (conditional) log-likelihood contribution log fθ(y) − log Fθ((l, u]) for an uncensored, truncated observation in @ref(eq:cml-likelihood-nw).
Finally, truncation and censoring can occur at the same time, i.e., we have l ≤ m < v ≤ u with either l ≠ −∞ or u ≠ ∞. In accordance with the previous two cases, we make the assumption that Y, (C1, C2) and (L, U) are mutually independent and satisfy C1 < C2 and L < U. Define and
For simplicity, assume that all random variables are discrete. For any observation (m, v, l, u), one of the following four cases is met
In case l < m < v < u, we have
by the independence assumption. The factor in front does not depend on θ and is irrelevant for the (conditional) likelihood contribution. Likewise, in case l = m < v < u, we have
By definition of (M, V), the event in the numerator is the disjoint union of the following two sets:
By independence, we obtain that
Again, the factor in front of the fraction is independent of θ and is irrelevant for the likelihood. The two cases l < m < v = u and l = m < v = u can be treated similarly; in all cases, the likelihood contribution is equal to Fθ((m, v])/Fθ((l, u]) times a factor that does not depend on θ.
The package serves two main goals: fitting distributions to randomly truncated non-informatively interval censored data and performing (deep) distributional regression with randomly truncated non-informatively interval censored data. Four main components are integrated with each other to facilitate the analysis goals
Each of these components is described one by one in the following sections.
A sample ℑ = {(m, v, l, u, w)i}
is represented as a tibble (from package tibble). The
core function to create this tibble is trunc_obs()
. A
tibble created by trunc_obs()
consists of five columns:
x
: If observed, the exact value of the random variable,
referred to as Y in Section
@ref(introduction). Otherwise NA
.xmin
: Lower interval censoring bound (M in Section @ref(introduction)) for
the observation. If the observation is not censored, xmin
is equal to x
.xmax
: Upper interval censoring bound (V in Section @ref(introduction)) for
the observation. If the observation is not censored, xmax
is equal to x
.tmin
: Lower truncation bound (L in Section @ref(introduction)).
Only observations with x
≥ t
m
i
n
are observed. Can be −∞ to indicate no
lower truncation.tmax
: Upper truncation bound (U in Section @ref(introduction)).
Only observations with x
≤ t
m
a
x
are observed. Can be ∞ to indicate no
upper truncation.w
: The weight associated with the observation. Defaults
to 1.Note that, unlike in Section @ref(introduction), the lower bounds of
intervals in trunc_obs
are included, that is, we allow for
x
≥ t
m
i
n
rather than x
> t
m
i
n
,
and that the unknown variable of interest is called x
instead of Y. For continuous random variables,
the formulas are equivalent to the half-open formulation. For discrete
random variables, x
m
i
n
and t
m
i
n
may have to be appropriately shifted, e.g., by replacing x
m
i
n
by x
m
i
n
− 0.5
for integer valued variables. The following code defines a sample of
size 1 without truncation and censoring, with the realized value of
1.3.
## # A tibble: 1 × 6
## x xmin xmax tmin tmax w
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.3 1.3 1.3 -Inf Inf 1
Simulating randomly truncated and interval censored data from a standard normal distribution with 80% of the observations randomly interval censored and random uniform truncation L ∼ Unif[−2, 0] and U ∼ Unif[0, 2] can be simulated as follows
set.seed(123)
N <- 1000L
x <- rnorm(N)
is_censored <- rbinom(N, size = 1L, prob = 0.8) == 1L
c_lower <- runif(sum(is_censored), min = -2.0, max = 0.0)
c_upper <- c_lower + runif(sum(is_censored), min = 0, max = 1.0)
x_lower <- x
x_upper <- x
x_lower[is_censored] <- dplyr::case_when(
x[is_censored] <= c_lower ~ -Inf,
x[is_censored] <= c_upper ~ c_lower,
TRUE ~ c_upper
)
x_upper[is_censored] <- dplyr::case_when(
x[is_censored] <= c_lower ~ c_lower,
x[is_censored] <= c_upper ~ c_upper,
TRUE ~ Inf
)
t_lower <- runif(N, min = -2.0, max = 0.0)
t_upper <- runif(N, min = 0.0, max = 2.0)
is_observed <- t_lower <= x & x <= t_upper
obs <- trunc_obs(
xmin = pmax(x_lower, t_lower)[is_observed],
xmax = pmin(x_upper, t_upper)[is_observed],
tmin = t_lower[is_observed],
tmax = t_upper[is_observed]
)
Observations look like:
## # A tibble: 5 × 6
## x xmin xmax tmin tmax w
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA -0.479 1.15 -1.93 1.15 1
## 2 NA -0.177 1.79 -0.210 1.79 1
## 3 -0.556 -0.556 -0.556 -0.957 0.791 1
## 4 NA -0.379 0.616 -0.379 0.616 1
## 5 NA 0.0575 1.45 -0.437 1.45 1
The total number of observations is smaller than the base population of 1000 due to truncation:
## [1] 623
The total number of censored observations is roughly 0.8 ⋅ n
r
o
w
(
o
b
s
)
.
## [1] 496
In addition to the trunc_obs()
constructor function,
there are functions as_trunc_obs()
for coercion,
truncate_obs()
for artificially changing truncation bounds,
and repdel_obs()
for computing randomly truncated reporting
delay observations from general insurance claims data containing
accident date, reporting delay and evaluation date information. The
latter takes inputs of the form (Tacc, D, τ)
where Tacc < τ are
accident dates with corresponding reporting delays D ≥ 0 and τ is the calendar date of
observation. It returns the sample (x
m
i
n
= x
m
a
x
= D, t
m
i
n
= 0, t
m
a
x
= τ − Tacc, w
= 1)
suitable for estimating the reporting delay distribution where a claim
is only observed if it has been reported by the evaluation date, i.e.,
Tacc + D ≤ τ.
Such an analysis was performed using reservr in .
Distribution families are implemented using the R6
class system (Chang
2021). They inherit from the class Distribution
and feature a common interface to
A Distribution
object represents a distribution family
ℱ supported on a subset of the real
line and parameterized by a fixed finite-dimensional parameter space
Θ. The family may be a
singleton, in which case it is rather a distribution than a distribution
family.
reservr provides a set of basic distribution families, optionally with some fixed parameters, as well as transformations of distribution families that take one or more underlying distribution families. At the time of writing, these are:
Generator function | Description |
---|---|
dist_bdegp(n, m, u, epsilon) |
A Blended Dirac Erlang Generalized Pareto distribution family, see Section @ref(dist-bdegp) |
dist_beta(shape1, shape2, ncp) |
A (non-central) Beta distribution family |
dist_binomial(size, prob) |
A Binomial distribution family |
dist_dirac(point) |
A Dirac distribution family with full mass at
point |
dist_discrete(size, probs) |
A discrete distribution family with fixed support {1, …, size} and P(X = k) = probsk |
dist_erlangmix(shapes, scale, probs) |
An Erlang mixture distribution family, see Section @ref(dist-erlangmix) |
dist_exponential(rate) |
An Exponential distribution family |
dist_gamma(shape, rate) |
A Gamma distribution family |
dist_genpareto(u, sigmau, xi) |
A Generalized Pareto Distribution family |
dist_genpareto1(u, sigmau, xi) |
A Generalized Pareto Distribution family with the tail index ξ constrained to (0, 1) |
dist_lognormal(meanlog, sdlog) |
A Log-Normaldistribution family |
dist_negbinomial(size, mu) |
A negative Binomial distribution family |
dist_normal(mean, sd) |
A Normal distribution family |
dist_pareto(shape, scale) |
A Pareto Type I distribution family, i.e., a Generalized Pareto
distribution family with u = 0 |
dist_poisson(lambda) |
A Poisson distribution family |
dist_uniform(min, max) |
A uniform distribution family |
dist_weibull(shape, scale) |
A Weibull distribution family |
Transformation function | Description |
---|---|
dist_blended(dists, probs, breaks, bandwidths) |
A Blended mixture distribution family, see Section @ref(dist-blended) |
dist_mixture(dists, probs) |
A general Mixture distribution family, see Section @ref(dist-mixture) |
dist_translate(dist, offset, multiplier) |
The affine transformation family consisting of all distributions of
m u l t i p l i e r ⋅ X + o f f s e t
with X ∼ F ∈ d i s t |
dist_trunc(dist, min, max) |
The truncated distribution family consisting of all distributions of
X|(m i n ≤ X ≤ m a x )
with X ∼ F ∈ d i s t |
Parameters of distribution families can either be fixed to a constant
value, or free. Free parameters (placeholders) are those that
should be estimated from data whereas fixed parameters are held
constant. Most Distribution
methods have an argument
with_params
to provide values for the free parameters and
need fully specified parameters to work. For example, generating samples
from a distribution is only possible if it is fully parameterized using
fixed parameters and the with_params
argument of
Distribution$sample()
.
We have now defined dist
to be a normal distribution
family with standard deviation 1 and
free mean. Since not all parameters required for a normal distribution
are fixed, dist$sample()
will error if not provided with a
mean
parameter.
## Error in (function (n, mean = 0, sd = 1) : invalid arguments
The with_params
argument can be used both to provide
free parameters and to override fixed parameters, if necessary.
## [1] 0.01874617
## [1] 0.03749234
The two observations were drawn from a standard normal and a normal
distribution with mean zero and standard deviation 2, respectively. Since the chosen seed was
identical, the second sample is exactly double the first sample.
Whenever the output length is greater than one, such as when taking more
than one sample, with_params
can optionally contain
individual parameters for each entry.
## [1] 0.009373085 0.907873729 1.314334725
The three observations were drawn from 𝒩(μ = 0, σ = 0.5), 𝒩(μ = 1, σ = 0.5) and 𝒩(μ = 2, σ = 0.5), respectively.
Distribution
s have a set of fields and methods related
to managing parameters:
default_params
gets or sets the list
of all parameters and their fixed values, NULL
represents a
free parameter. Component families are included as
Distribution
objects.get_params()
gets the list of all parameters and their
fixed values, traversing component distribution families.get_placeholders()
gets the list of free parameters
with NULL
as values.param_bounds
gets or sets the domain
of all regular family parameters as an Interval
object.
Setting a bound via the param_bounds
active binding allows
restricting the natural parameter space of a family.get_param_bounds()
returns the bounds of all free
parameters as a list of Interval
s, traversing component
distribution families.get_param_constraint()
returns NULL
or a
function that evaluates constraints on the parameter set. The function
must return a vector of constraint values (that need to be equal to
0 for valid parameters) or a list with
elements constraints
and jacobian
. When
returning a list, the jacobian
element should contain the
jacobian of the constraint function. Used in
nloptr::slsqp(heq=)
for estimation. An example is that
mixture families require the probs
parameters to sum to
1 in addition to the box constraint
that each parameter is in [0, 1]. Note
that box constraints are handled by param_bounds
and need
not be specified as a constraint function.get_components()
returns a list of component families
for transformations or mixtures. The list is empty for basic
families.Here is an example for a normal family with fixed standard deviation σ = 1 and a mixture distribution family with two components, one of which is specified as a normal distribution family:
dist <- dist_normal(sd = 1.0)
mix <- dist_mixture(dists = list(dist_normal(), NULL))
dist$default_params
## $mean
## NULL
##
## $sd
## [1] 1
## $dists
## $dists[[1]]
## A NormalDistribution with 2 dof
##
## $dists[[2]]
## NULL
##
##
## $probs
## $probs[[1]]
## NULL
##
## $probs[[2]]
## NULL
## List of 1
## $ mean: NULL
## List of 2
## $ dists:List of 2
## ..$ :List of 2
## .. ..$ mean: NULL
## .. ..$ sd : NULL
## ..$ : NULL
## $ probs:List of 2
## ..$ : NULL
## ..$ : NULL
## List of 2
## $ mean:Classes 'Interval', 'R6' (-Inf, Inf)
## $ sd :Classes 'Interval', 'R6' (0, Inf)
## List of 2
## $ dists:List of 1
## ..$ : NULL
## $ probs:List of 1
## ..$ :Classes 'Interval', 'R6' [0, 1]
## List of 1
## $ mean:Classes 'Interval', 'R6' (-Inf, Inf)
## List of 2
## $ dists:List of 1
## ..$ :List of 2
## .. ..$ mean:Classes 'Interval', 'R6' (-Inf, Inf)
## .. ..$ sd :Classes 'Interval', 'R6' (0, Inf)
## $ probs:List of 2
## ..$ :Classes 'Interval', 'R6' [0, 1]
## ..$ :Classes 'Interval', 'R6' [0, 1]
## NULL
## function (params)
## list()
## [[1]]
## A NormalDistribution with 2 dof
##
## [[2]]
## NULL
The basic distribution functions (density, probability, hazard and
quantile function, as well as random number generation) are provided by
each distribution family. In general, the argument
with_params
can be used to both specify missing parameters
(placeholders) and to override fixed distribution parameters. If the
provided parameters are vectors of length greater than 1, they must
conform to the input dimension (e.g. length(x)
for
density
). In this case, the parameters are “vectorized” in
the sense that the ith output
element will be computed using the ith entry from the parameter
list.
density(x, log = FALSE, with_params = list())
computes
the (log-)density.probability(q, lower.tail = TRUE, log.p = FALSE, with_params = list()
computes the (log-)cumulative distribution function or (log-)survival
function.hazard(x, log = FALSE. with_params = list())
computes
the (log-)hazard function.quantile(p, lower.tail = TRUE, log.p = FALSE, with_params = list())
computes upper or lower quantiles.sample(n, with_params = list())
generates a random
sample of size n
. (with_params
can contain
length n
vectors in this case).In addition to the basic functions, there are several supporting functions useful for, e.g., estimation of parameters.
export_functions(name, with_params = list())
exports
{d,p,q,r}<name>
functions adhering to the common R
convention for distribution functions.get_type()
returns one of "continuous"
,
"discrete"
, or "mixed"
depending on whether
the distribution family has a density with respect to the Lebesgue
measure, the counting measure, or the sum of the Lebesgue measure with
one or many point measures.is_continuous()
and is_discrete()
testing
for the particular type.has_capability(caps)
gives information on whether a
specific implementation provides some or all of the features described.
Possible capabilities are "sample"
, "density"
,
"probability"
, "quantile"
,
"diff_density"
, "diff_probability"
,
"tf_logdensity"
, "tf_logprobability"
.require_capability(caps)
errors if the specified
capabilities are not implemented for the family at hand.is_discrete_at(x, with_params = list())
returns a
logical vector indicating whether the distribution has a point mass at
x
.is_in_support(x, with_params = list())
returns a
logical vector indicating whether the distribution has any mass at
x
.When working with larger data or many calls to distribution
functions, such as when performing a fit, it can be beneficial to
just-in-time compile specialized functions that avoid overhead for
dealing with the generic structure of distributions and their
parametrization. Distribution
s offer a set of “compiler”
functions that return simplified, faster, versions of the basic
distribution functions, or that analytically compute gradients. Those
functions are not necessarily implemented for all
Distribution
classes, but will be automatically used by,
e.g., fit_dist()
if useful. The input structure for
param_matrix
can be obtained by
flatten_params_matrix(dist$get_placeholders())
where
dist
is the Distribution
object in
question.
compile_density()
compiles a fast function with
signature (x, param_matrix, log = FALSE)
that will compute
the density with fixed parameters hard-coded and taking the free
parameters as a matrix with defined layout instead of a nested
list.compile_probability()
compiles a fast replacement for
probability
with signature
(q, param_matrix, lower.tail = TRUE, log.p = FALSE)
.compile_probability_interval()
compiles a fast function
with signature (qmin, qmax, param_matrix, log.p = FALSE)
computing P(X ∈ [q
m
i
n
, q
m
a
x
])
or its logarithm efficiently. This expression is necessary for computing
truncation probabilities.compile_sample()
compiles a fast replacement for
sample
with signature (n, param_matrix)
.diff_density(x, log = FALSE, with_params = list())
computes the (log-)gradients of the density function with respect to
free distribution family parameters, useful for maximum likelihood
estimation.diff_probability(q, lower.tail = TRUE, log.p = FALSE, with_params = list())
computes the (log-)gradients of the cumulative density function with
respect to free distribution family parameters. This is useful for
conditional maximum likelihood estimation in the presence of random
truncation or non-informative interval censoring.## mean sd
## [1,] NA NA
denscmp <- dist$compile_density()
if (requireNamespace("bench", quietly = TRUE)) {
bench::mark(
dist$density(-2:2, with_params = list(mean = 0.0, sd = 1.0)),
denscmp(-2:2, matrix(c(0.0, 1.0), nrow = 5L, ncol = 2L, byrow = TRUE)),
dnorm(-2:2, mean = rep(0.0, 5L), sd = rep(1.0, 5L))
)
}
## # A tibble: 3 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
## 1 dist$density(-2:2, with_params =… 26.25µs 28.65µs 34492. 0B 17.3
## 2 denscmp(-2:2, matrix(c(0, 1), nr… 3.97µs 4.6µs 212255. 0B 0
## 3 dnorm(-2:2, mean = rep(0, 5L), s… 1.62µs 1.92µs 501748. 0B 50.2
Use of distribution families from within tensorflow networks requires specialized implementations using the tensorflow APIs instead of regular R functions. These are tailored to the needs of maximizing (conditional) likelihoods of weighted, censored and randomly truncated data. Details on working with tensorflow can be found in Section @ref(tensorflow).
tf_compile_params(input, name_prefix = "")
creates
keras layers that take an input
layer and
transform it into a valid parametrization of the distribution
family.tf_is_discrete_at()
returns a
tensorflow-ready version of
is_discrete_at()
.tf_logdensity()
returns a
tensorflow-ready version of
compile_density()
with implied
log = TRUE
.tf_logprobability()
returns a
tensorflow-ready version pf
compile_probability_interval()
with implied
log.p = TRUE
.tf_make_constants()
creates a list of constant tensors
for all fixed distribution family parameters.Some of the distribution families available in reservr have tailored algorithms for parameter estimation, or are not commonly known. This section contains mathematical definitions of those function families.
A mixture distribution family is defined by a fixed number k of component families {ℱi}i = 1k via the set of distributions
An Erlang mixture distribution family is defined by its number of components k as a mixture of Erlang distributions (Gamma distributions with integer shape parameter) with common scale parameter. If Γα, θ denotes a Gamma distribution with shape α and scale θ, the erlang mixture family with k components can be defined as follows:
Note that for k → ∞, Erlang mixtures are dense in the space of distributions on (0, ∞) with respect to weak convergence (Lee and Lin 2012), making them a useful modeling choice for general positive continuous distributions. However, the tail index of all Erlang mixture distributions is always zero due to the exponential decay of Gamma densities.
A Blended distribution is defined in as follows: Given two underlying distributions P, Q on ℝ with cdfs F(⋅) = P((−∞, ⋅]) and G(⋅) = Q((−∞, ⋅]), respectively, and parameters κ ∈ ℝ, ε ∈ (0, ∞), p1, p2 ∈ [0, 1], p1 + p2 = 1 such that F(κ) > 0 and G(κ) < 1, we define the Blended Distribution B = Blended (P, Q; p, κ, ε) of P and Q with blending interval [κ − ε, κ + ε] and mixture probabilities p via its cdf FB:
The following illustration shows the components of a Blended(𝒩(μ = −1, σ = 1), Exp(λ = 1); p = (0.5, 0.5), κ = 0, ε = 1) distribution.
dist1 <- dist_normal(mean = -1.0, sd = 1.0)
dist2 <- dist_exponential(rate = 1.0)
distb <- dist_blended(
dists = list(dist1, dist2),
breaks = list(0.0),
bandwidths = list(1.0),
probs = list(0.5, 0.5)
)
The transformation of the original component distributions (𝒩 and Exp)
can be illustrated by first right- and left-truncating at κ = 0 respectively, and then
applying the blending transformations pκ, ε
and qκ, ε.
The latter distributions can be obtained in reservr by
setting the probability weights of the blended distribution to p = (1, 0) and p = (0, 1) respectively.
Intermediate truncated distributions are obtained via
trunc_dist()
, with κ as upper or lower bound
respectively.
distt1 <- dist_trunc(dist1, min = -Inf, max = 0.0)
distt2 <- dist_trunc(dist2, min = 0.0, max = Inf)
distb1 <- distb$clone()
distb1$default_params$probs <- list(1.0, 0.0)
distb2 <- distb$clone()
distb2$default_params$probs <- list(0.0, 1.0)
We show the resulting density at each of the steps, and the final blended density obtained by weighting the blended component densities.
The definition of a blended distribution leads to the definition of a blended distribution family by allowing P, Q, κ and ε to vary:
Given two families ℱ, 𝒢 of distributions on ℝ, and parameters κ ∈ ℝ, ε ∈ (0, ∞), we define the Blended Distribution family as the family of Distributions
Blended distribution families can be generalized to a number of components k by letting κ and ε become vectors of dimension k − 1 such that κi + εi ≤ κi + 1 − εi + 1 for i = 1, …, k − 2. Compared to piecewise distribution families obtained by mixture of truncated distribution families with supports (−∞, κ] and [κ, ∞) such as those commonly used for extreme value modelling, blended distribution families exhibit a continuous density within the blending region (κ − ε, κ + ε).
reservr provides an implementation via
dist_blended()
, with limited support for more than two
component families.
Using the construction of a Blended distribution family, we can define the Blended Dirac Erlang Generalized Pareto (BDEGP) family as follows, see .
Given parameters n ∈ ℕ, m ∈ ℕ, κ ∈ ℝ and ε ∈ (0, ∞), we define the Blended Dirac Erlang Generalized Pareto family as the family of distributions
where δk is the dirac distribution at k and GPD is the generalized Pareto distribution. Note the constraint on the tail index ξ ∈ [0, 1), guaranteeing finite expectation.
This distribution family has three features making it useful in modelling very general heavy-tailed distributions on (0, ∞):
This section describes the functions for the problem of estimating a parameter θ ∈ Θ given a sample ℑ and a parameterized family ℱ = {Fθ ∣ θ ∈ Θ}. Sometimes, the conditional log-likelihood in @ref(eq:cml-likelihood) can be directly maximized, yielding an estimate for θ. This is the default behavior in reservr if no specialized estimation routine for the provided family ℱθ is defined. Depending on whether there are box constraints, nonlinear constraints or no constraints on the parameter space Θ, different implementations of nonlinear optimization algorithms from nloptr (Johnson 2007), in particular truncated Newton (Dembo and Steihaug 1983) for unconstrained families, L-BFGS (Liu and Nocedal 1989) for box-constrained families and SLSQP (Kraft 1994) for general constrained families are employed.
In addition to the naive direct optimization approach, some families lend themselves to specialized estimation algorithms which usually show faster convergence due to making use of special structures in the parameter space Θ.
Estimating distribution parameters from truncated observations is
handled using the generic fit()
method. It delegates to
fit_dist()
, which is also generic with signature:
dist
: The distribution family to be fitobs
: The trunc_obs
object, or a vector of
observed valuesstart
: Starting parameters, as a list compatible with
dist$get_placeholders()
.At the time of writing there are specialized algorithms for six types of families:
u
(estimated by the minimum of xmin
over the
sample)offset
and
multiplier
(transform the sample via $\tfrac{\cdot-\text{offset}}{\text{multiplier}}$
and fit the component distribution family to the transformed
sample)min
or upper bound max
(estimated by the minimum of
xmin
, for min
, and the maximum of
xmax
, for max
, over the sample)If not present, the start
parameter is obtained via the
fit_dist_start()
generic. This generic implements a family
specific method of generating valid starting values for all placeholder
parameters. A notable implementation is
fit_dist_start.ErlangMixtureDistribution()
for Erlang
mixture distribution families. If the shape parameters are free, there
are different initialization strategies that can be chosen using
additional arguments to fit_dist_start()
:
init = "shapes"
paired with
shapes = c(...)
manually specifies starting shape
parameters αinit = "fan"
paired with spread = d
uses
α = (1, 1 + d, …, 1 + (k − 1) ⋅ d)
with a default of d = 1
resulting in α = (1, …, k)init = "kmeans"
uses 1-dimensional K-means based
clustering of the sample observations such that each cluster corresponds
to a unique shapeinit = "cmm"
uses the centralized method of moments
procedure described in Re-using dist <- dist_normal(sd = 1.0)
from above and
the generated sample obs
, we can fit the free parameter
mean
:
## List of 3
## $ params:List of 1
## ..$ mean: num 0.0822
## $ opt :List of 5
## ..$ par : Named num 0.0822
## .. ..- attr(*, "names")= chr "mean"
## ..$ value : num 341
## ..$ iter : int 7
## ..$ convergence: int 1
## ..$ message : chr "NLOPT_SUCCESS: Generic success return value."
## $ logLik:Class 'logLik' : -341 (df=1)
Using the function plot_distributions()
we can also
assess the quality of the fit.
plot_distributions(
true = dist,
fitted = dist,
empirical = dist_empirical(0.5 * (obs$xmin + obs$xmax)),
.x = seq(-5, 5, length.out = 201),
plots = "density",
with_params = list(
true = list(mean = 0.0, sd = 1.0),
fitted = the_fit$params
)
)
Here, the density labelled empirical
corresponds to a
kernel density estimate with automatic bandwidth selection.
We follow with an example of fitting an ErlangMixture(3) distribution family using
various initialization strategies. Note that both, "kmeans"
and "cmm"
use the random number generator for internal
K-means clustering. This necessitates setting a constant seed before
running fit_dist_start()
and fit()
to ensure
the chosen starting parameters are the same for both calls.
dist <- dist_erlangmix(list(NULL, NULL, NULL))
params <- list(
shapes = list(1L, 4L, 12L),
scale = 2.0,
probs = list(0.5, 0.3, 0.2)
)
set.seed(1234)
x <- dist$sample(100L, with_params = params)
set.seed(32)
init_true <- fit_dist_start(dist, x, init = "shapes",
shapes = as.numeric(params$shapes))
init_fan <- fit_dist_start(dist, x, init = "fan", spread = 3L)
init_kmeans <- fit_dist_start(dist, x, init = "kmeans")
init_cmm <- fit_dist_start(dist, x, init = "cmm")
rbind(
flatten_params(init_true),
flatten_params(init_fan),
flatten_params(init_kmeans),
flatten_params(init_cmm)
)
## shapes[1] shapes[2] shapes[3] scale probs[1] probs[2] probs[3]
## [1,] 1 4 12 1.590800 0.43 0.33 0.24
## [2,] 1 4 7 2.688103 0.55 0.32 0.13
## [3,] 1 5 13 1.484960 0.43 0.36 0.21
## [4,] 2 10 24 1.010531 0.56 0.27 0.17
## List of 4
## $ params :List of 3
## ..$ probs :List of 3
## .. ..$ : num 0.43
## .. ..$ : num 0.33
## .. ..$ : num 0.24
## ..$ shapes:List of 3
## .. ..$ : num 1
## .. ..$ : num 4
## .. ..$ : num 13
## ..$ scale : num 1.59
## $ params_hist: list()
## $ iter : int 1
## $ logLik :Class 'logLik' : -290 (df=6)
## 'log Lik.' -292.0026 (df=6)
## 'log Lik.' -289.2834 (df=6)
## 'log Lik.' -293.1273 (df=6)
It should be noted that the different initialization methods had a
considerable impact on the outcome in the example due to the discrete
nature of Erlang mixture distribution shape parameters and thus the
combinatorial difficulty of picking optimal shapes α. The fit()
result for
Erlang mixture distribution families contains an element named
"params_hist"
. This can be populated by passing
trace = TRUE
to fit()
and will record
parameters after all ECME steps in the ECME-based estimation algorithms
from and . The element "iter"
contains the number of full
ECME-Iterations that were performed.
The maximization problem @ref(eq:cml-regression-likelihood) is
delegated to tensorflow, which supplies ample
stochastic optimization algorithms. Functions in
reservr are necessary to create a suitable output layer
for tensorflow that maps onto Θ and to provide an implementation
of the (negative) log-likelihood in @ref(eq:cml-regression-likelihood)
as a loss function. These two tasks are combined in
tf_compile_model()
. The function returns an object of class
reservr_keras_model
, which can be used for the estimation
procedure.
Given input layers inputs
and an intermediate output
layer intermediate_output
as well as a family of
distributions dist
, the function
dist
defined by
@ref(eq:cml-regression-likelihood) as $l(g) =
-\tfrac{1}{\#(\mathfrak{I}_{\mathrm{reg}})}
\ell(g|\mathfrak{I}_{\mathrm{reg}})$, optionally disabling
censoring
or truncation
for efficiency.intermediate_output
onto the parameter space Θ of dist
using
Distribution$tf_compile_params()
. This step adds additional
degrees of freedom to the overall model, and the approach is described
in keras3::compile()
on the underlying
keras.src.models.model.Model
.The following example defines a linear model with homoskedasticity assumption and fits it using 100 iterations of the Adam optimization algorithm (Kingma and Ba 2015). First, we simulate data (Y, X) from the model defined by X ∼ Unif(10, 20) and Y|X = x ∼ 𝒩(μ = 2x, σ = 1).
set.seed(1431L)
keras3::set_random_seed(1432L)
dataset <- tibble::tibble(
x = runif(100, min = 10, max = 20),
y = 2 * x + rnorm(100)
)
Next, we specify the distribution family ℱ = {𝒩(μ, σ = 1)|μ ∈ ℝ}, incorporating the homoskedasticity assumption.
Using keras, we define an empty neural network, just taking x as an input and performing no transformation.
Then, tf_compile_model()
adapts the input layer to the
free parameter space Θ = ℝ.
This introduces two parameters to the function family 𝒢 and implies the functional relationship
μ = g(x) := θ1 ⋅ x + θ0.
Since our sample is fully observed, we disable censoring and truncation,
leading to the simplified loss
where fμ(y) is the density of 𝒩(μ = μ, σ = 1) evaluated at y.
nnet <- tf_compile_model(
inputs = list(nnet_input),
intermediate_output = nnet_output,
dist = dist,
optimizer = keras3::optimizer_adam(learning_rate = 0.1),
censoring = FALSE,
truncation = FALSE
)
nnet$dist
nnet$model
The fit can now be performed, modifying the parameters (weights) of
nnet
in-place. Note that the argument y
of fit
accepts a trunc_obs
object. In our example, the vector
y
is silently converted to an untruncated, uncensored
trunc_obs
object. fit()
returns the
keras_training_history
of the underlying call to
fit()
on the keras.src.models.model.Model
.
nnet_fit <- fit(
nnet,
x = dataset$x,
y = dataset$y,
epochs = 100L,
batch_size = 100L,
shuffle = FALSE,
verbose = FALSE
)
The training history can be plotted, displaying the loss by epoch (black circles), and a blue smoothing line.
# Fix weird behavior of keras3
nnet_fit$params$epochs <- max(nnet_fit$params$epochs, length(nnet_fit$metrics$loss))
plot(nnet_fit)
The predict()
method of reservr_keras_model
takes input tensors and returns the predicted distribution parameters as
a list compatible with dist$get_placeholders()
. We can thus
extract the only parameter mean
and compare it to an OLS
fit for the same dataset:
pred_params <- predict(nnet, data = list(keras3::as_tensor(dataset$x, keras3::config_floatx())))
lm_fit <- lm(y ~ x, data = dataset)
dataset$y_pred <- pred_params$mean
dataset$y_lm <- predict(lm_fit, newdata = dataset, type = "response")
library(ggplot2)
ggplot(dataset, aes(x = x, y = y)) +
geom_point() +
geom_line(aes(y = y_pred), color = "blue") +
geom_line(aes(y = y_lm), linetype = 2L, color = "green")
Since a reservr_keras_model
includes the underlying
keras.src.models.model.Model
, its parameters can also be
extracted and compared to the OLS coefficients
coef_nnet <- rev(as.numeric(nnet$model$get_weights()))
coef_lm <- unname(coef(lm_fit))
str(coef_nnet)
str(coef_lm)
We now discuss a more complex example involving censoring, using the
right-censored ovarian
dataset bundled with the
survival package (R Core Team 2023). Our goal is to predict
the rate parameter of an exponential survival time distribution in
cancer patients given four features X = (a
g
e
, r
e
s
i
d
.
d
s
, r
x
, e
c
o
g
.
p
s
)
collected in the study. The variables r
e
s
i
d
.
d
s
, r
x
and e
c
o
g
.
p
s
are indicator variables coded in {1, 2}. a
g
e
is
a continuous variable with values in (38, 75). Due to the different scale of the
a
g
e
variable, it is useful to separate it from the other variables in order
to perform normalization. Normalization using
keras3::layer_normalization()
transforms its input
variables to zero mean and unit variance. This step is not necessary for
the categorical features.
set.seed(1219L)
tensorflow::set_random_seed(1219L)
keras3::config_set_floatx("float32")
dist <- dist_exponential()
ovarian <- survival::ovarian
dat <- list(
y = trunc_obs(
xmin = ovarian$futime,
xmax = ifelse(ovarian$fustat == 1, ovarian$futime, Inf)
),
x = list(
age = keras3::as_tensor(ovarian$age, keras3::config_floatx(), shape = nrow(ovarian)),
flags = k_matrix(ovarian[, c("resid.ds", "rx", "ecog.ps")] - 1.0)
)
)
Next, we define the input layers and shapes, conforming to our input
predictor list dat$x
.
nnet_inputs <- list(
keras3::keras_input(shape = 1L, name = "age"),
keras3::keras_input(shape = 3L, name = "flags")
)
age
will be normalized and then concatenated to the
other features, stored in flags
, resulting in a
4-dimensional representation. We then add two hidden ReLU-layers each
with 5 neurons to the network and
compile the result, adapting the 5-dimensional hidden output to the
parameter space Θ = (0, ∞) for
the rate parameter of an exponential distribution. This is accomplished
using a dense layer with 1 neuron and
the softplus activation function.
hidden1 <- keras3::layer_concatenate(
keras3::layer_normalization(nnet_inputs[[1L]]),
nnet_inputs[[2L]]
)
hidden2 <- keras3::layer_dense(
hidden1,
units = 5L,
activation = keras3::activation_relu
)
nnet_output <- keras3::layer_dense(
hidden2,
units = 5L,
activation = keras3::activation_relu
)
nnet <- tf_compile_model(
inputs = nnet_inputs,
intermediate_output = nnet_output,
dist = dist,
optimizer = keras3::optimizer_adam(learning_rate = 0.01),
censoring = TRUE,
truncation = FALSE
)
nnet$model
For stability reasons, the default weight initialization is not optimal. To circumvent this, we estimate a global exponential distribution fit on the observations and initialize the final layer weights such that the global fit is the initial prediction of the network.
str(predict(nnet, dat$x))
global_fit <- fit(dist, dat$y)
tf_initialise_model(nnet, params = global_fit$params, mode = "zero")
str(predict(nnet, dat$x))
Finally, we can train the network and visualize the predictions.
nnet_fit <- fit(
nnet,
x = dat$x,
y = dat$y,
epochs = 100L,
batch_size = nrow(dat$y),
shuffle = FALSE,
verbose = FALSE
)
nnet_fit$params$epochs <- max(nnet_fit$params$epochs, length(nnet_fit$metrics$loss))
plot(nnet_fit)
ovarian$expected_lifetime <- 1.0 / predict(nnet, dat$x)$rate
A plot of expected lifetime by (a
g
e
, r
x
)
shows that the network learned longer expected lifetimes for lower a
g
e
and for treatment group (r
x
) 2. The global
fit is included as a dashed blue line.
Individual predictions and observations can also be plotted on a subject level.
We presented reservr, a package that supports distribution parameter estimation and distributional regression using R. Both tasks are supported for samples with or without interval censoring and with or without random truncation, a more general form of truncation than what typical packages support. The package includes facilities for (1) description of randomly truncated non-informatively interval censored samples, (2) definition of distribution families to consider, (3) global distribution parameter estimation under an i.i.d. assumption on the sample and (4) distributional regression - employing the tensorflow package for flexibility and speed.
The Author would like to thank Axel Bücher for proofreading and valuable comments on an earlier version of this article.