This vignette provides complementary information to the R
Documentation for the ReturnCurves
package. It summarises
the key methodologies implemented in the package and is heavily based on
the works of Murphy-Barltrop, Wadsworth, and
Eastoe (2023) and Murphy-Barltrop, Wadsworth, and Eastoe (2024); for full details we
refer the user to these articles.
The ReturnCurves
package aims to estimate the p-probability return curve (Murphy-Barltrop,
Wadsworth, and Eastoe 2023), for small p > 0, while implementing
pointwise and smooth approaches to estimate the so-called angular
dependence function first introduced by Wadsworth
and Tawn (2013).
To illustrate the functionality of the package, we use the data set
airdata,
which contains air pollution data collected from
Marylebone, London (UK). The data set contains 1427 daily measurements of air pollutant
concentrations of NOx and PM10.
The estimation of the angular dependence function and/or of the return curve is implemented for a bivariate vector (XE, YE) marginally distributed as standard exponential, i.e., XE, YE ∼ Exp(1). Thus, the original data (X, Y) need to be marginally transformed, which is achieved via the probability integral transform. We follow the procedure of S. G. Coles and Tawn (1991), where the empirical cumulative distribution function F̃ is fitted below a threshold u and a generalised Pareto distribution (GPD) is fitted above it, giving the following estimate of the marginal cumulative distribution function (cdf) of X or Y:

$$\hat{F}(x) = \begin{cases} 1 - \left(1 - \tilde{F}(u)\right)\left[1 + \hat{\xi}\,\frac{x - u}{\hat{\sigma}}\right]_+^{-1/\hat{\xi}}, & x > u, \\ \tilde{F}(x), & x \leq u, \end{cases}$$

where σ̂ and ξ̂ are the estimated scale and shape parameters of the GPD. Exponential margins are obtained by applying −log (1 − F̂(⋅)) to each margin, where F̂(⋅) is estimated separately for each margin.
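As an illustration of this transform, the following is a minimal sketch, not the package implementation, that fits a GPD above the 95% empirical quantile by maximum likelihood and maps a single numeric vector to standard exponential margins; the helper name semiparam_exp and the use of optim are assumptions made purely for illustration.

# Minimal sketch of the semi-parametric transform to Exp(1) margins.
# Not the ReturnCurves implementation; `semiparam_exp` is a hypothetical helper.
semiparam_exp <- function(x, q = 0.95) {
  u <- quantile(x, q)                         # GPD threshold
  exc <- x[x > u] - u                         # threshold exceedances
  # negative log-likelihood of the GPD, parameterised as (log(sigma), xi)
  nll <- function(par) {
    sigma <- exp(par[1]); xi <- par[2]
    z <- 1 + xi * exc / sigma
    if (any(z <= 0)) return(1e10)
    length(exc) * log(sigma) + (1 + 1 / xi) * sum(log(z))
  }
  fit <- optim(c(log(sd(exc)), 0.1), nll)
  sigma <- exp(fit$par[1]); xi <- fit$par[2]
  Femp <- ecdf(x)                             # empirical cdf below u
  Fhat <- ifelse(x <= u, Femp(x),
                 1 - (1 - Femp(u)) * pmax(1 + xi * (x - u) / sigma, 0)^(-1 / xi))
  -log(1 - Fhat)                              # standard exponential margins
}
# e.g. apply(airdata, 2, semiparam_exp)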
In the ReturnCurves package, this transformation is implemented in the function margtransf, which takes as inputs a matrix containing the original data, a vector of the marginal quantiles used to fit the GPD, and a boolean value constrainedshape deciding whether ξ > −1 (TRUE, the default) or ξ ∈ ℝ (FALSE).

Function margtransf returns an object of S4 class margtransf.class with 6 attributes:

- data: matrix with the data on the original margins,
- qmarg: vector of marginal quantiles used to fit the GPD,
- constrainedshape: whether ξ > −1 (TRUE) or ξ ∈ ℝ (FALSE),
- parameters: matrix containing the estimates of the GPD parameters (σ̂, ξ̂),
- thresh: vector containing the thresholds u above which the GPD is fitted,
- dataexp: matrix with the data on standard exponential margins.

# qmarg and constrainedshape set to the default values
expdata <- margtransf(data = airdata, qmarg = rep(0.95, 2), constrainedshape = T)
# attributes of the S4 object
str(expdata)
#> Formal class 'margtransf.class' [package "ReturnCurves"] with 6 slots
#> ..@ data : num [1:1427, 1:2] 154 132 120 105 175 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : NULL
#> .. .. ..$ : chr [1:2] "nox" "pm10"
#> ..@ qmarg : num [1:2] 0.95 0.95
#> ..@ constrainedshape: logi TRUE
#> ..@ parameters : num [1:2, 1:2] 58.3333 -0.0521 9.984 -0.1428
#> ..@ thresh : num [1:2] 320.5 54.4
#> ..@ dataexp : num [1:1427, 1:2] 0.594 0.415 0.369 0.29 0.747 ...
# head of the data on standard exponential margins
head(expdata@dataexp)
#> [,1] [,2]
#> [1,] 0.5941359 0.1035535
#> [2,] 0.4150041 0.4307285
#> [3,] 0.3692105 0.1596506
#> [4,] 0.2900289 0.1827463
#> [5,] 0.7465895 0.2808641
#> [6,] 1.1461374 0.9017124
It is possible to plot an S4 object of margtransf.class with plot. By setting the argument which = "hist", histograms of each variable on the original and standard exponential margins are shown. Setting which = "ts" shows the time series of each variable on both sets of margins, while which = "joint" shows the joint distribution on both sets of margins. Finally, setting which = "all", which is the default for this argument, produces all of these plots together.
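For example, assuming the object expdata created above, these views can be produced as follows:

plot(expdata, which = "hist")   # histograms on original and exponential margins
plot(expdata, which = "ts")     # time series on both sets of margins
plot(expdata, which = "joint")  # joint distribution on both sets of margins
plot(expdata, which = "all")    # all of the above (default)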
When transforming the data onto standard exponential margins as described above, it is assumed that, above a threshold u, the data follow a GPD. It is possible to assess whether this is a reasonable assumption by checking the agreement between model and empirical GPD quantiles. This is done via QQ plots in the ReturnCurves package by plotting the points

$$\left(F_{GPD}^{-1}\left(\frac{i}{n_{exc} + 1}\right) + u, \; X^{GPD}_{(i)} + u\right),$$

where $X^{GPD}_{(i)}$ denotes the i-th increasing order statistic ($i = 1, \ldots, n_{exc}$) of the exceedances, i.e., $X^{GPD} = (X - u \mid X > u)$, $n_{exc}$ denotes the sample size of these exceedances, and $F_{GPD}^{-1}$ denotes the inverse of the cumulative distribution function of the fitted GPD. Finally, the uncertainty of the empirical quantiles is quantified using a bootstrap approach. If temporal dependence is present in the data, then a block bootstrap approach is required, i.e., blocksize > 1.
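To make the construction of these QQ points concrete, a minimal sketch for a single margin is given below; it assumes GPD estimates sigma and xi and a threshold u (for instance, taking the parameters slot to store (σ̂, ξ̂) by column for each variable) and is not the marggpd implementation.

# Minimal sketch of the GPD QQ points for one margin; not the package code.
gpd_qq <- function(x, u, sigma, xi) {
  exc <- sort(x[x > u] - u)                        # ordered exceedances
  n_exc <- length(exc)
  p <- (1:n_exc) / (n_exc + 1)
  model <- u + sigma * ((1 - p)^(-xi) - 1) / xi    # GPD quantiles, shifted by u
  cbind(model = model, empirical = exc + u)
}
# e.g. gpd_qq(airdata[, "nox"], u = expdata@thresh[1],
#             sigma = expdata@parameters[1, 1], xi = expdata@parameters[2, 1])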
In the package, this is done with the function marggpd, which takes as inputs an S4 object of class margtransf.class, the size of the blocks for the bootstrap procedure and the corresponding number of samples, and the significance level α for the tolerance intervals. It then returns an S4 object of class marggpd.class with an extra attribute marggpd containing a list with:

- model: a list containing the model quantiles for each variable,
- empirical: a list containing the empirical quantiles for each variable,
- lower: a list containing the lower bounds of the tolerance intervals for each variable,
- upper: a list containing the upper bounds of the tolerance intervals for each variable.

# nboot and alpha are set to the default values
# blocksize is set to 10 to account for temporal dependence
uncgpd <- marggpd(margdata = expdata, blocksize = 10, nboot = 250, alpha = 0.05)
# attributes of the S4 object
str(uncgpd)
#> Formal class 'marggpd.class' [package "ReturnCurves"] with 5 slots
#> ..@ margdata :Formal class 'margtransf.class' [package "ReturnCurves"] with 6 slots
#> .. .. ..@ data : num [1:1427, 1:2] 154 132 120 105 175 ...
#> .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. ..$ : NULL
#> .. .. .. .. ..$ : chr [1:2] "nox" "pm10"
#> .. .. ..@ qmarg : num [1:2] 0.95 0.95
#> .. .. ..@ constrainedshape: logi TRUE
#> .. .. ..@ parameters : num [1:2, 1:2] 58.3333 -0.0521 9.984 -0.1428
#> .. .. ..@ thresh : num [1:2] 320.5 54.4
#> .. .. ..@ dataexp : num [1:1427, 1:2] 0.594 0.415 0.369 0.29 0.747 ...
#> ..@ blocksize: num 10
#> ..@ nboot : num 250
#> ..@ alpha : num 0.05
#> ..@ marggpd :List of 4
#> .. ..$ model :List of 2
#> .. .. ..$ : num [1:71] 321 322 323 324 325 ...
#> .. .. ..$ : num [1:71] 54.6 54.7 54.9 55 55.1 ...
#> .. ..$ empirical:List of 2
#> .. .. ..$ : num [1:71] 321 321 322 322 323 ...
#> .. .. ..$ : num [1:71] 54.5 54.5 54.5 54.9 55 ...
#> .. ..$ lower :List of 2
#> .. .. ..$ : num [1:71] 321 321 321 321 322 ...
#> .. .. ..$ : num [1:71] 54.5 54.5 54.5 54.5 54.5 ...
#> .. ..$ upper :List of 2
#> .. .. ..$ : num [1:71] 322 323 324 324 332 ...
#> .. .. ..$ : num [1:71] 54.5 54.9 55.1 55.1 55.2 ...
# head of the list elements of slot marggpd for variable X
head(uncgpd@marggpd$model[[1]])
#> [1] 321.2739 322.1004 322.9382 323.7876 324.6489 325.5225
head(uncgpd@marggpd$empirical[[1]])
#> [1] 320.5833 321.4167 321.8333 322.4167 323.0417 323.5000
head(uncgpd@marggpd$lower[[1]])
#> [1] 320.5833 320.5833 321.4167 321.4167 321.8333 322.4167
head(uncgpd@marggpd$upper[[1]])
#> [1] 321.8333 323.0417 323.5000 323.5000 331.8750 332.8750
It is possible to plot an S4 object of marggpd.class
with plot
, where the QQ plots with the model and empirical
quantiles for each variable are shown. The points should lie close to
the line y = x; for a
good fit and agreement between these quantiles, the line y = x should mainly lie
within the (1 − α)% tolerance
intervals.
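For example, with the object uncgpd created above:

plot(uncgpd)   # QQ plots with tolerance intervals for both margins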
In bivariate extremes, interest may lie in studying regions where both variables are extreme or where only one is extreme. For this, methods that aim at characterising the joint tail behaviour in both scenarios, such as the one introduced by Wadsworth and Tawn (2013), are required. Given standard exponentially distributed variables XE and YE and a function ℒ(⋅; ω) that is slowly varying at infinity, the joint tail behaviour of (XE, YE) is captured through λ(ω) via the assumption

$$\Pr\left(X_E > u\omega, \, Y_E > u(1 - \omega)\right) = \mathcal{L}(e^u; \omega) \, e^{-u\lambda(\omega)}, \quad \text{as } u \to \infty,$$

which can be rewritten as

$$\Pr\left(\min\left\{\frac{X_E}{\omega}, \, \frac{Y_E}{1-\omega}\right\} > u\right) = \mathcal{L}(e^u; \omega) \, e^{-u\lambda(\omega)}, \quad \text{as } u \to \infty,$$

where ω ∈ [0, 1] and λ(ω) ≥ max {ω, 1 − ω} is called the angular dependence function (ADF). In the case of asymptotic dependence (see, for instance, S. Coles, Heffernan, and Tawn (1999)), λ(ω) = max {ω, 1 − ω} for all ω ∈ [0, 1].
Lastly, defining the min-projection variable at ω, $T_\omega = \min\left\{\frac{X_E}{\omega}, \, \frac{Y_E}{1-\omega}\right\},$ the assumption above implies that

$$\Pr(T_\omega > u + t \mid T_\omega > u) \to e^{-\lambda(\omega)t}, \quad \text{as } u \to \infty,$$

for any ω ∈ [0, 1] and t > 0. In other words, for all ω ∈ [0, 1] and, as $u_\omega \to \infty$, $T^1_\omega := (T_\omega - u_\omega \mid T_\omega > u_\omega) \sim \text{Exp}(\lambda(\omega))$. Estimation of the ADF can be done in different ways; Murphy-Barltrop, Wadsworth, and Eastoe (2024) present a few.
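To illustrate the min-projection and the exponential tail result above, the following is a minimal pointwise sketch of a Hill-type estimate of λ(ω) at a single ray, assuming data already on standard exponential margins; the helper lambda_hill is hypothetical and mirrors, but is not, the estimator implemented in adf_est.

# Minimal sketch of a pointwise Hill-type estimate of lambda(omega); not the
# package implementation. `xe` and `ye` are assumed to be on Exp(1) margins.
lambda_hill <- function(xe, ye, w, q = 0.95) {
  t_w <- pmin(xe / w, ye / (1 - w))   # min-projection T_omega
  u_w <- quantile(t_w, q)             # high threshold u_omega
  exc <- t_w[t_w > u_w] - u_w         # exceedances, approximately Exp(lambda(omega))
  max(1 / mean(exc), w, 1 - w)        # reciprocal mean excess, respecting the lower bound
}
# e.g. lambda_hill(expdata@dataexp[, 1], expdata@dataexp[, 2], w = 0.5)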
For the ReturnCurves
package, two approaches are
implemented: a pointwise estimator using the Hill estimator (Hill 1975), λ̂H, and a
smoother estimator based on Bernstein-Bézier polynomials estimated via
composite likelihood methods, λ̂CL.
For the latter, Murphy-Barltrop, Wadsworth, and
Eastoe (2024) propose
using a family of Bernstein-Bézier polynomials to improve the estimation
of the ADF. Given k ∈ ℕ, it is
assumed that λ(ω) = λ(ω; β) can be represented by an element of a family ℬk of degree-k Bernstein-Bézier polynomials with coefficient vector β; the exact form of this family is given in Murphy-Barltrop, Wadsworth, and Eastoe (2024).
As $T^1_\omega$ is exponentially distributed in the limit as $u_\omega \to \infty$, the parameter vector β can be estimated using a composite likelihood function defined as

$$L_C(\beta) = \prod_{\omega \in \Omega} \prod_{t \in \mathbf{t}^1_\omega} \lambda(\omega; \beta)\, e^{-\lambda(\omega; \beta)\, t} = \prod_{\omega \in \Omega} \lambda(\omega; \beta)^{|\mathbf{t}^1_\omega|} \exp\Big(-\lambda(\omega; \beta) \sum_{t \in \mathbf{t}^1_\omega} t\Big),$$

where $|\mathbf{t}^1_\omega|$ represents the cardinality of the set $\mathbf{t}^1_\omega := \{t_\omega - u_\omega \mid t_\omega \in \mathbf{t}_\omega, \, t_\omega > u_\omega\}$ for some large values $u_\omega$, and Ω is a finite subset spanning the interval [0, 1]. The estimator of the ADF obtained through composite likelihood methods is given by $\lambda(\cdot; \hat\beta_{CL})$, where $\hat\beta_{CL}$ maximises the composite likelihood above.
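A minimal sketch of the corresponding composite log-likelihood (the logarithm of the expression above) is given below; the arguments lambda_fun, ws and exc_list are illustrative assumptions, not objects in the package.

# Minimal sketch of the composite log-likelihood over a grid of rays; the
# arguments lambda_fun, ws and exc_list are illustrative, not package objects.
comp_loglik <- function(beta, ws, exc_list, lambda_fun) {
  sum(mapply(function(w, exc) {
    lam <- lambda_fun(w, beta)
    length(exc) * log(lam) - lam * sum(exc)   # Exp(lambda) log-likelihood at ray w
  }, ws, exc_list))
}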
Finally, Murphy-Barltrop, Wadsworth, and Eastoe (2024) showed that incorporating knowledge of the conditional extremes (Heffernan and Tawn 2004) parameters $\alpha_{y \mid x}$ and $\alpha_{x \mid y}$ improves the estimation of the ADF. In particular, the authors show that, in order to satisfy theoretical properties of λ(ω), λ(ω) = max {ω, 1 − ω} for all $\omega \in [0, \alpha^1_{x \mid y}] \cup [\alpha^1_{y \mid x}, 1]$, with $\alpha^1_{x \mid y} = \alpha_{x \mid y}/(1 + \alpha_{x \mid y})$ and $\alpha^1_{y \mid x} = 1/(1 + \alpha_{y \mid x})$. Thus, after estimating the conditional extremes parameters $\alpha_{y \mid x}$ and $\alpha_{x \mid y}$ through maximum likelihood estimation, we can set λ(ω) = max {ω, 1 − ω} for $\omega \in [0, \hat\alpha^1_{x \mid y}) \cup (\hat\alpha^1_{y \mid x}, 1]$. For the Hill estimator, $\lambda(\omega) = \hat\lambda_H$ is then used only on $\omega \in [\hat\alpha^1_{x \mid y}, \hat\alpha^1_{y \mid x}]$. For the composite likelihood estimator, the family ℬk is rescaled to a family ℬk1 defined on $[\hat\alpha^1_{x \mid y}, \hat\alpha^1_{y \mid x}]$ so that continuity is ensured at $\hat\alpha^1_{x \mid y}$ and $\hat\alpha^1_{y \mid x}$, and λ(ω) = λ(ω; β) is assumed to be represented by an element of ℬk1 on this interval. Finally, the resulting estimators are post-processed so as to satisfy the theoretical conditions on λ identified in Murphy-Barltrop, Wadsworth, and Eastoe (2024).
Estimation of the ADF can be done using the function adf_est, which takes as inputs:

- margdata: an S4 object of class margtransf.class representing the marginal transformation of the data,
- w: a vector of rays ω in [0, 1],
- method: the estimator to use, $\hat\lambda_H$ ("hill") or $\hat\lambda_{CL}$ ("cl"),
- constrained: whether to incorporate the conditional extremes parameters $\alpha_{y \mid x}$ and $\alpha_{x \mid y}$ in the estimation.

Additional arguments have default values that can be overridden; these include the marginal quantile used for the min-projection variable $T^1_\omega$ at each ray ω, the marginal quantiles used to fit the conditional extremes method if constrained = TRUE, and, if method = "cl", the polynomial degree k, the initial values of β for the composite maximum likelihood procedure, and the convergence tolerance. Convergence is declared when the difference in log-likelihood values between iterations does not exceed the value of tol. This repeated optimisation helps to avoid convergence to local maxima, although it does not guarantee that the global maximum is found.
Function adf_est returns an object of S4 class adf_est.class with 11 attributes, where the first 9 are the inputs of the function and the last 2 are vectors:

- interval: contains the maximum likelihood estimates from the conditional extremes model, $\hat\alpha^1_{x \mid y}$ and $\hat\alpha^1_{y \mid x}$, if constrained = TRUE. Otherwise, it contains the values 0 and 1; these have no meaningful interpretation as the estimation is performed on an unconstrained interval.
- adf: contains the estimates of λ(ω).

# Estimation using Hill estimator without conditional extremes parameters
whill <- seq(0, 1, by = 0.001)
## q and constrained are set to the default values here
lambdah <- adf_est(margdata = expdata, w = whill, method = "hill",
q = 0.95, constrained = F)
# Estimation using Hill estimator with conditional extremes parameters
## q and qalphas are set to the default values
lambdah2 <- adf_est(margdata = expdata, w = whill, method = "hill", q = 0.95,
qalphas = rep(0.95, 2), constrained = T)
# Estimation using CL method without conditional extremes parameters
## w, q and constrained are set to the default values here
lambdacl <- adf_est(margdata = expdata, w = seq(0, 1, by = 0.01), method = "cl",
q = 0.95, constrained = F)
# Estimation using CL method with conditional extremes parameters
## w, q and qalphas are set to the default values
lambdacl2 <- adf_est(margdata = expdata, w = seq(0, 1, by = 0.01), method = "cl",
q = 0.95, qalphas = rep(0.95, 2), constrained = T)
# attributes of the S4 object
str(lambdah)
#> Formal class 'adf_est.class' [package "ReturnCurves"] with 11 slots
#> ..@ dataexp : num [1:1408, 1:2] 0.594 0.415 0.369 0.29 0.747 ...
#> ..@ w : num [1:1001] 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 ...
#> ..@ method : chr "hill"
#> ..@ q : num 0.95
#> ..@ qalphas : num [1:2] 0.95 0.95
#> ..@ k : num 7
#> ..@ constrained: logi FALSE
#> ..@ tol : num 1e-04
#> ..@ par_init : num [1:6] 0 0 0 0 0 0
#> ..@ interval : num [1:2] 0 1
#> ..@ adf : num [1:1001] 1 0.999 0.998 0.997 0.996 0.995 0.994 0.993 0.992 0.991 ...
# head of the vector with adf estimates for the first estimator
head(lambdah@adf)
#> [1] 1.000 0.999 0.998 0.997 0.996 0.995
It is possible to plot an S4 object of adf_est.class
with plot
, where a comparison of the estimated ADF and its
lower bound, max {ω, 1 − ω}, is
shown.
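For example, each of the fitted ADFs above can be visualised as follows:

plot(lambdah)     # Hill estimator, unconstrained
plot(lambdacl2)   # composite likelihood estimator with conditional extremes constraints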
After estimation of the ADF, it is important to assess its goodness-of-fit. Noting that $T^1_\omega = (T_\omega - u_\omega \mid T_\omega > u_\omega) \sim \text{Exp}(\lambda(\omega)) \Leftrightarrow \lambda(\omega)T^1_\omega \sim \text{Exp}(1)$, we can investigate whether there is agreement between model and empirical exponential quantiles. This is done in the ReturnCurves package through QQ plots by plotting the points $\left(F_E^{-1}\left(\frac{i}{n+1}\right), T^1_{(i)}\right)$, where $F_E^{-1}$ denotes the inverse of the cumulative distribution function of a standard exponential distribution and $T^1_{(i)}$ is the i-th increasing order statistic, i = 1, …, n. The uncertainty of the empirical quantiles is quantified using a bootstrap approach. If temporal dependence is present in the data, a block bootstrap approach should be used, i.e., blocksize > 1.
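As a sketch of this check at a single ray, the following hypothetical helper rescales the min-projection exceedances by an ADF estimate, using the equivalence $\lambda(\omega)T^1_\omega \sim \text{Exp}(1)$, and pairs them with standard exponential quantiles; it is illustrative only and not the adf_gof implementation.

# Minimal sketch of the exponential QQ comparison at one ray; not adf_gof.
qq_adf <- function(t1, lam) {
  n <- length(t1)
  cbind(model = qexp((1:n) / (n + 1)),   # standard exponential quantiles
        empirical = sort(lam * t1))      # rescaled, ordered exceedances
}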
The assessment of the goodness-of-fit of λ(ω) can be done using the
function adf_gof
which takes an S4 object of class
adf_est.class
, a ray ω to be considered, the size of the
blocks for the bootstrap procedure and the corresponding number of
samples, and the significance level α for the tolerance intervals as
inputs. In turn, it returns an S4 object of class
adf_gof.class
with an extra attribute gof
containing a list with the model and empirical quantiles, and the lower
and upper bounds of the tolerance interval.
We note that this function is implemented to evaluate the fit at a single ray ω; therefore, we recommend repeating the procedure for a few rays to have a better representation. In addition, if the ray provided by the user was not used for the estimation of the ADF, then the closest ω in the grid is used instead.
# Goodness of fit of the adf for two rays w
rays <- c(0.25, 0.75)
## nboot and alpha are set to the default values
## blocksize is set to 10 to account for temporal dependence
gofh <- sapply(rays, adf_gof, adf = lambdah, blocksize = 10, nboot = 250, alpha = 0.05)
# attributes of the S4 object
str(gofh[[1]])
#> Formal class 'adf_gof.class' [package "ReturnCurves"] with 6 slots
#> ..@ adf :Formal class 'adf_est.class' [package "ReturnCurves"] with 11 slots
#> .. .. ..@ dataexp : num [1:1408, 1:2] 0.594 0.415 0.369 0.29 0.747 ...
#> .. .. ..@ w : num [1:1001] 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 ...
#> .. .. ..@ method : chr "hill"
#> .. .. ..@ q : num 0.95
#> .. .. ..@ qalphas : num [1:2] 0.95 0.95
#> .. .. ..@ k : num 7
#> .. .. ..@ constrained: logi FALSE
#> .. .. ..@ tol : num 1e-04
#> .. .. ..@ par_init : num [1:6] 0 0 0 0 0 0
#> .. .. ..@ interval : num [1:2] 0 1
#> .. .. ..@ adf : num [1:1001] 1 0.999 0.998 0.997 0.996 0.995 0.994 0.993 0.992 0.991 ...
#> ..@ ray : num 0.25
#> ..@ blocksize: num 10
#> ..@ nboot : num 250
#> ..@ alpha : num 0.05
#> ..@ gof :List of 4
#> .. ..$ model : num [1:70] 0.0142 0.0286 0.0432 0.058 0.073 ...
#> .. ..$ empirical: num [1:70] 0.0177 0.0297 0.0393 0.0612 0.0918 ...
#> .. ..$ lower : num [1:70] 0.0177 0.0177 0.0177 0.0297 0.0297 ...
#> .. ..$ upper : num [1:70] 0.0393 0.1414 0.1929 0.195 0.2016 ...
# head of the list elements of slot gof
head(gofh[[1]]@gof$model)
#> [1] 0.01418463 0.02857337 0.04317217 0.05798726 0.07302514 0.08829261
head(gofh[[1]]@gof$empirical)
#> [1] 0.01769958 0.02967577 0.03930040 0.06115842 0.09180755 0.10471095
head(gofh[[1]]@gof$lower)
#> [1] 0.01769958 0.01769958 0.01769958 0.02967577 0.02967577 0.03930040
head(gofh[[1]]@gof$upper)
#> [1] 0.0393004 0.1414311 0.1928653 0.1950342 0.2015994 0.2015994
As before, it is possible to plot an S4 object of
adf_gof.class
with plot
, where the QQ plot
with the model and empirical quantiles is shown. The points should lie
close to the line y = x; for a good fit and
agreement between these quantiles, the line y = x should mainly lie
within the (1 − α)% tolerance
intervals.
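For example, the QQ plots for the two rays considered above can be drawn with:

plot(gofh[[1]])   # ray w = 0.25
plot(gofh[[2]])   # ray w = 0.75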
Given a probability p and the joint survivor function Pr(X > x, Y > y) of the bivariate vector (X, Y), the p-probability return curve is defined as

$$\text{RC}(p) := \left\{(x, y) \in \mathbb{R}^2 : \Pr(X > x, Y > y) = p\right\}.$$

The interest lies in values of p close to 0, as these are the ones characterising rare joint exceedance events. Given any point (x, y) ∈ RC(p), the event {X > x, Y > y} is expected to happen once per return period 1/p, on average. This is equivalent to an expected number of np points in the region (x, ∞) × (y, ∞) for a sample of size n from (X, Y).
Since the probability p is close to 0, methods that can accurately capture the behaviour of the joint tail are necessary in order to realistically extrapolate and estimate RC(p) for values of p outside of the observation period. Murphy-Barltrop, Wadsworth, and Eastoe (2023) consider a couple of methods to achieve this, one of which uses the ADF λ(ω) introduced above to characterise the joint tail behaviour.
Estimation of RC(p) is done on standard exponential margins; therefore, the first step is to transform the original data onto standard exponential margins via the marginal transformation described above and, after estimating RC(p), to back-transform the curve onto the original margins. Estimates of RC(p) are obtained through estimates of t and u for the min-projection variable $T_\omega$, and the rays ω. In particular, the value of t > 0 is obtained by first estimating u as the (1 − p*)-th quantile of $T_\omega$, where p* > p is a small probability, and then ensuring that $\Pr(T_\omega > t + u) = p$. Since u is estimated as the (1 − p*)-th quantile of $T_\omega$, we have $\Pr(T_\omega > u) = p^*$; thus,

$$p = \Pr(T_\omega > t + u) = \Pr(T_\omega > t + u \mid T_\omega > u)\Pr(T_\omega > u) \approx e^{-\hat\lambda(\omega) t} \, p^*,$$

which leads to $t = -\log(p/p^*)/\hat\lambda(\omega)$. Finally, the estimate of the return curve $\hat{\text{RC}}(p)$ is obtained by setting (x, y) := (ω(t + u), (1 − ω)(t + u)) for each ray ω.
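As an illustration of this construction, the following minimal sketch computes a single return-curve point on standard exponential margins for a given ray, ADF estimate and probabilities p and p*; the helper rc_point is hypothetical and does not reproduce rc_est, which also back-transforms the curve onto the original margins.

# Minimal sketch of one return-curve point on exponential margins; the helper
# rc_point is illustrative and is not the rc_est implementation.
rc_point <- function(xe, ye, w, lam, p, pstar = 0.05) {
  t_w <- pmin(xe / w, ye / (1 - w))    # min-projection T_omega
  u <- quantile(t_w, 1 - pstar)        # (1 - p*)-th quantile of T_omega
  t <- -log(p / pstar) / lam           # so that Pr(T_omega > t + u) is approximately p
  c(x = w * (t + u), y = (1 - w) * (t + u))
}
# e.g. rc_point(expdata@dataexp[, 1], expdata@dataexp[, 2], w = 0.5,
#               lam = lambdah@adf[which.min(abs(whill - 0.5))], p = 10 / nrow(airdata))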
In the ReturnCurves
package, estimation of the return
curve is done through function rc_est
which shares the same
inputs as function adf_est
with an additional argument
p
representing the curve survival probability. This
probability value should be smaller than 1 − q, where q is the quantile for the
min-projection variables Tω1,
and, when applicable, smaller than 1 − qα, where
qα are the
quantiles used in the conditional extremes method.
Function rc_est returns an S4 object of class rc_est.class with 14 attributes, with a vector and a matrix in the last 2 slots:

- interval: vector with the maximum likelihood estimates from the conditional extremes model, $\hat\alpha^1_{x \mid y}$ and $\hat\alpha^1_{y \mid x}$, if constrained = TRUE. Otherwise, it contains the values 0 and 1; these have no meaningful interpretation as the estimation is performed on an unconstrained interval.
- rc: matrix with the estimates of the return curve on the original margins.

n <- dim(airdata)[1]
prob <- 10/n
# Estimation using Hill estimator without conditional extremes parameters
whill <- seq(0, 1, by = 0.001)
## q and constrained are set to the default values here
rch <- rc_est(margdata = expdata, w = whill, p = prob, method = "hill",
q = 0.95, constrained = F)
# Estimation using Hill estimator with conditional extremes parameters
## q and qalphas are set to the default values
rch2 <- rc_est(margdata = expdata, w = whill, p = prob, method = "hill", q = 0.95,
qalphas = rep(0.95, 2), constrained = T)
# Estimation using CL method without conditional extremes parameters
## w, q and constrained are set to the default values here
rccl <- rc_est(margdata = expdata, w = seq(0, 1, by = 0.01), p = prob, method = "cl",
q = 0.95, constrained = F)
# Estimation using CL method with conditional extremes parameters
## w, q and qalphas are set to the default values
rccl2 <- rc_est(margdata = expdata, w = seq(0, 1, by = 0.01), p = prob, method = "cl",
q = 0.95, qalphas = rep(0.95, 2), constrained = T)
# attributes of the S4 object
str(rch)
#> Formal class 'rc_est.class' [package "ReturnCurves"] with 14 slots
#> ..@ data : num [1:1408, 1:2] 154 132 120 105 175 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : NULL
#> .. .. ..$ : chr [1:2] "nox" "pm10"
#> ..@ qmarg : num [1:2] 0.95 0.95
#> ..@ constrainedshape: logi TRUE
#> ..@ w : num [1:1001] 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 ...
#> ..@ p : num 0.00701
#> ..@ method : chr "hill"
#> ..@ q : num 0.95
#> ..@ qalphas : num [1:2] 0.95 0.95
#> ..@ k : num 7
#> ..@ constrained : logi FALSE
#> ..@ tol : num 0.001
#> ..@ par_init : num [1:6] 0 0 0 0 0 0
#> ..@ interval : num [1:2] 0 1
#> ..@ rc : num [1:1001, 1:2] 0 26.2 29.6 33.1 34.2 ...
# head of the matrix with return curve estimates for the first estimator
head(rch@rc)
#> [,1] [,2]
#> [1,] 0.00000 71.54081
#> [2,] 26.19885 71.50107
#> [3,] 29.61038 71.50107
#> [4,] 33.08634 71.50107
#> [5,] 34.20411 71.50107
#> [6,] 36.17626 71.50107
It is possible to plot an S4 object of rc_est.class
with
plot
, where the original data is plotted with the estimated
line for the return curve $\hat{\text{RC}}(p).$
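For example:

plot(rch)   # data on original margins with the estimated return curve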
Murphy-Barltrop, Wadsworth, and Eastoe (2023) propose a procedure to assess the uncertainty of the return curve estimates. For a large positive m ∈ ℕ, let Θ ⊂ (0, π/2) denote a set of m angles. For each θ ∈ Θ, the line $L_\theta := \{(x, y) \in \mathbb{R}^2_+ \mid y = x\tan(\theta)\}$ intersects the estimated $\hat{\text{RC}}(p)$ exactly once, i.e., $\{(\hat{x}_\theta, \hat{y}_\theta)\} := \hat{\text{RC}}(p) \cap L_\theta$ with $(\hat{x}_\theta, \hat{y}_\theta) \in \hat{\text{RC}}(p)$. Moreover, let $\hat{d}_\theta := (\hat{x}_\theta^2 + \hat{y}_\theta^2)^{1/2}$ denote the L2-norm of this point estimate. Uncertainty in the return curve estimates is quantified using the distribution of $\hat{d}_\theta$ at each angle θ ∈ Θ as follows: for each bootstrap sample k = 1, …, nboot, the data are resampled (using a block bootstrap when temporal dependence is present), the return curve is re-estimated, and the corresponding norm $\hat{d}_{\theta, k}$ is computed for each θ ∈ Θ. Finally, given θ ∈ Θ, empirical estimates of the mean, median and (1 − α)% confidence intervals for $\hat{d}_\theta$ can be obtained using the sample of $\hat{d}_{\theta, k}$.
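A minimal sketch of $\hat{d}_\theta$ for an estimated curve, assuming the two-column matrix of curve points stored in the rc slot and taking the curve point whose angle is closest to θ, is given below; the helper d_theta is illustrative only.

# Minimal sketch of the L2-norm of the curve point at angle theta; illustrative.
d_theta <- function(rc, theta) {
  ang <- atan2(rc[, 2], rc[, 1])       # angle of each estimated curve point
  i <- which.min(abs(ang - theta))     # closest point on the curve to L_theta
  sqrt(sum(rc[i, ]^2))                 # its distance from the origin
}
# e.g. d_theta(rch@rc, pi / 4)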
These summaries are available through the function rc_unc, which takes as inputs:

- retcurve: an S4 object of class rc_est.class containing the return curve estimates,
- blocksize: size of the blocks for the block bootstrap procedure; if no temporal dependence is present, set blocksize = 1 (default),
- nboot: number of bootstrap samples to be taken,
- nangles: number of angles m,
- alpha: significance level to compute the (1 − α)% confidence intervals.

Function rc_unc returns an S4 object of class rc_unc.class with 6 attributes, where the last slot unc contains a list with:

- median: a matrix containing the empirical median estimates of the return curve,
- mean: a matrix containing the empirical mean estimates of the return curve,
- lower: a matrix containing the lower bound of the confidence interval,
- upper: a matrix containing the upper bound of the confidence interval.

For simplicity, only the uncertainty of the return curve obtained with the unconstrained Hill estimator is computed here.
# nangles and alpha set to default
# nboot set to 50 for simplicity
# blocksize is set to 10 to account for temporal dependence
rch_unc <- rc_unc(rch, blocksize = 10, nboot = 50, nangles = 150, alpha = 0.05)
# attributes of the S4 object
str(rch_unc)
#> Formal class 'rc_unc.class' [package "ReturnCurves"] with 6 slots
#> ..@ retcurve :Formal class 'rc_est.class' [package "ReturnCurves"] with 14 slots
#> .. .. ..@ data : num [1:1408, 1:2] 154 132 120 105 175 ...
#> .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. ..$ : NULL
#> .. .. .. .. ..$ : chr [1:2] "nox" "pm10"
#> .. .. ..@ qmarg : num [1:2] 0.95 0.95
#> .. .. ..@ constrainedshape: logi TRUE
#> .. .. ..@ w : num [1:1001] 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 ...
#> .. .. ..@ p : num 0.00701
#> .. .. ..@ method : chr "hill"
#> .. .. ..@ q : num 0.95
#> .. .. ..@ qalphas : num [1:2] 0.95 0.95
#> .. .. ..@ k : num 7
#> .. .. ..@ constrained : logi FALSE
#> .. .. ..@ tol : num 0.001
#> .. .. ..@ par_init : num [1:6] 0 0 0 0 0 0
#> .. .. ..@ interval : num [1:2] 0 1
#> .. .. ..@ rc : num [1:1001, 1:2] 0 26.2 29.6 33.1 34.2 ...
#> ..@ blocksize: num 10
#> ..@ nboot : num 50
#> ..@ nangles : num 150
#> ..@ alpha : num 0.05
#> ..@ unc :List of 4
#> .. ..$ median: num [1:150, 1:2] 0.651 1.303 1.955 2.607 3.259 ...
#> .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. ..$ : chr [1:150] "50%" "50%" "50%" "50%" ...
#> .. .. .. ..$ : NULL
#> .. ..$ mean : num [1:150, 1:2] 0.651 1.302 1.953 2.605 3.257 ...
#> .. ..$ lower : num [1:150, 1:2] 0.598 1.197 1.795 2.394 2.994 ...
#> .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. ..$ : chr [1:150] "2.5%" "2.5%" "2.5%" "2.5%" ...
#> .. .. .. ..$ : NULL
#> .. ..$ upper : num [1:150, 1:2] 0.701 1.403 2.105 2.807 3.51 ...
#> .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. ..$ : chr [1:150] "97.5%" "97.5%" "97.5%" "97.5%" ...
#> .. .. .. ..$ : NULL
# head of the list elements of slot unc
head(rch_unc@unc$median)
#> [,1] [,2]
#> 50% 0.6513796 71.36459
#> 50% 1.3028624 71.36278
#> 50% 1.9545895 71.36096
#> 50% 2.6067022 71.35914
#> 50% 3.2593421 71.35732
#> 50% 3.9126513 71.35550
head(rch_unc@unc$mean)
#> [,1] [,2]
#> [1,] 0.6508434 71.31305
#> [2,] 1.3017981 71.31163
#> [3,] 1.9530050 71.31020
#> [4,] 2.6046055 71.30878
#> [5,] 3.2567411 71.30736
#> [6,] 3.9095536 71.30593
head(rch_unc@unc$lower)
#> [,1] [,2]
#> 2.5% 0.5983115 66.26336
#> 2.5% 1.1967084 66.26124
#> 2.5% 1.7953204 66.25913
#> 2.5% 2.3942772 66.25701
#> 2.5% 2.9937089 66.25488
#> 2.5% 3.5937459 66.25276
head(rch_unc@unc$upper)
#> [,1] [,2]
#> 97.5% 0.7013612 76.16913
#> 97.5% 1.4028569 76.16830
#> 97.5% 2.1046304 76.16719
#> 97.5% 2.8068154 76.16556
#> 97.5% 3.5095750 76.16394
#> 97.5% 4.2130623 76.16231
It is possible to plot an instance of the S4 class rc_unc.class with the function plot, which takes the S4 object and an extra argument which as inputs. If which = "rc" (default), the estimated return curve is plotted; setting which = "median" shows the empirical median estimates of the return curve, while setting which = "mean" shows the empirical mean estimates of the return curve. All plots show the uncertainty associated with the estimated return curve as dashed lines. Finally, setting which = "all" plots the estimated return curve, the empirical median and mean estimates, and the associated uncertainty together.
library(gridExtra)
grid.arrange(plot(rch_unc, which = "rc"), plot(rch_unc, which = "median"),
plot(rch_unc, which = "mean"), plot(rch_unc, which = "all"), nrow = 2)
It is important to assess the goodness-of-fit of the return curve
estimates, given that the true return curve is unknown in reality. This
is implemented in the ReturnCurves
package based on the
approach proposed by Murphy-Barltrop, Wadsworth,
and Eastoe (2023).
Given the return curve RC(p), the probability of lying in the survival region (x, ∞) × (y, ∞) is p. Given the same set of angles Θ as before, for each $\theta_j \in \Theta$, the empirical probability $\hat{p}_j$ of lying in $(\hat{x}_{\theta_j}, \infty) \times (\hat{y}_{\theta_j}, \infty)$, where $(\hat{x}_{\theta_j}, \hat{y}_{\theta_j})$ is the corresponding point on $\hat{\text{RC}}(p)$, is given by the proportion of points in that region. The goodness-of-fit of the estimated return curve is then assessed via a bootstrap procedure: for each angle $\theta_j \in \Theta$, the original data set is bootstrapped and empirical probability estimates $\hat{p}_j$ are obtained. When temporal dependence is present in the data, a block bootstrap approach should be taken and the size of the blocks must be defined. We note that, for each j, nboot empirical probabilities are estimated, and so the median and the (1 − α)% pointwise confidence intervals for the probabilities can be obtained by taking the 50%, (α/2)% and (1 − α/2)% quantiles of the set of empirical probabilities for each j, respectively.
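For a single curve point, the empirical probability is simply the proportion of observations in the corresponding survival region, as in the following sketch (the helper emp_prob is illustrative only):

# Minimal sketch of the empirical exceedance probability for one curve point.
emp_prob <- function(data, x, y) {
  mean(data[, 1] > x & data[, 2] > y)   # proportion of points in (x, Inf) x (y, Inf)
}
# e.g. emp_prob(airdata, rch@rc[500, 1], rch@rc[500, 2])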
The goodness-of-fit for an estimated return curve is implemented through the function rc_gof. This shares the same input arguments as the rc_unc function and returns an S4 object with 5 attributes, with the last slot gof containing a list with:

- median: a vector with the median of the empirical probabilities,
- lower: a vector with the lower bound of the confidence interval,
- upper: a vector with the upper bound of the confidence interval.

For simplicity, only the goodness-of-fit of the return curve obtained with the unconstrained Hill estimator is computed here.
# nboot, nangles and alpha set to default
# blocksize is set to 10 to account for temporal dependence
rch_gof <- rc_gof(rch, blocksize = 10, nboot = 250, nangles = 150, alpha = 0.05)
# attributes of the S4 object
str(rch_gof)
#> Formal class 'rc_gof.class' [package "ReturnCurves"] with 5 slots
#> ..@ retcurve :Formal class 'rc_est.class' [package "ReturnCurves"] with 14 slots
#> .. .. ..@ data : num [1:1408, 1:2] 154 132 120 105 175 ...
#> .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. ..$ : NULL
#> .. .. .. .. ..$ : chr [1:2] "nox" "pm10"
#> .. .. ..@ qmarg : num [1:2] 0.95 0.95
#> .. .. ..@ constrainedshape: logi TRUE
#> .. .. ..@ w : num [1:1001] 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 ...
#> .. .. ..@ p : num 0.00701
#> .. .. ..@ method : chr "hill"
#> .. .. ..@ q : num 0.95
#> .. .. ..@ qalphas : num [1:2] 0.95 0.95
#> .. .. ..@ k : num 7
#> .. .. ..@ constrained : logi FALSE
#> .. .. ..@ tol : num 0.001
#> .. .. ..@ par_init : num [1:6] 0 0 0 0 0 0
#> .. .. ..@ interval : num [1:2] 0 1
#> .. .. ..@ rc : num [1:1001, 1:2] 0 26.2 29.6 33.1 34.2 ...
#> ..@ blocksize: num 10
#> ..@ nboot : num 250
#> ..@ alpha : num 0.05
#> ..@ gof :List of 3
#> .. ..$ median: Named num [1:150] 0.00639 0.00639 0.00639 0.00639 0.00639 ...
#> .. .. ..- attr(*, "names")= chr [1:150] "50%" "50%" "50%" "50%" ...
#> .. ..$ lower : Named num [1:150] 0.00213 0.00213 0.00213 0.00213 0.00213 ...
#> .. .. ..- attr(*, "names")= chr [1:150] "2.5%" "2.5%" "2.5%" "2.5%" ...
#> .. ..$ upper : Named num [1:150] 0.0119 0.0119 0.0119 0.0119 0.0119 ...
#> .. .. ..- attr(*, "names")= chr [1:150] "97.5%" "97.5%" "97.5%" "97.5%" ...
# head of the list elements of slot gof
head(rch_gof@gof$median)
#> 50% 50% 50% 50% 50% 50%
#> 0.006392045 0.006392045 0.006392045 0.006392045 0.006392045 0.006392045
head(rch_gof@gof$lower)
#> 2.5% 2.5% 2.5% 2.5% 2.5% 2.5%
#> 0.002130682 0.002130682 0.002130682 0.002130682 0.002130682 0.002130682
head(rch_gof@gof$upper)
#> 97.5% 97.5% 97.5% 97.5% 97.5% 97.5%
#> 0.01191406 0.01191406 0.01191406 0.01191406 0.01191406 0.01191406
It is possible to plot an instance of the S4 class
rc_gof.class
with function plot
, where a
comparison between the true probability p (in red) and the empirical median
estimates (in black) is shown. Ideally, p should be contained in the
confidence region, shaded in grey. Finally, in practice, the value of
p should be within the range
of the data and not too extreme, given the nature of empirical
probabilities.
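For example:

plot(rch_gof)   # empirical median estimates against the true probability p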