--- title: "Frequentist Likelihood-Based Inference for Time Series Extremes" author: "Paul Northrop" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Frequentist Likelihood-Based Inference for Time Series Extremes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: lite.bib csl: taylor-and-francis-chicago-author-date.csl --- ```{r, include = FALSE} knitr::opts_chunk$set(comment = "#>", collapse = TRUE) required <- c("exdex") if (!all(unlist(lapply(required, function(pkg) requireNamespace(pkg, quietly = TRUE))))) knitr::opts_chunk$set(eval = FALSE) ``` ```{r, include = FALSE} knitr::opts_chunk$set(comment = "#>", collapse = TRUE) ``` The **lite** package performs likelihood-based inference for stationary time series extremes. This vignette concentrates on frequentist inference, using maximum likelihood estimation. See the vignette [Bayesian Likelihood-Based Inference for Time Series Extremes](lite-2-bayesian.html) for information about Bayesian inference. The general approach follows @FW2012. There are 3 independent parts to the inference, all performed using maximum likelihood estimation. 1. A Bernoulli$(p_u)$ model for whether a given observation exceeds the threshold $u$. 2. A generalised Pareto, GP$(\sigma_u, \xi)$, model for the marginal distribution of threshold excesses. 3. The $K$-gaps model for the extremal index $\theta$, based on inter-exceedance times. For parts 1 and 2, inferences based on a mis-specified independence log-likelihood are adjusted to account for clustering in the data. The adjustment does not affect the point estimates of $(p_u, \sigma_u, \xi)$ but it ensures that estimates of uncertainty associated with these estimates account for dependence in the data. Otherwise, uncertainty in these parameters will tend to be underestimated. Here, we follow @CB2007 to estimate adjusted log-likelihood functions for $p_u$ and for ($\sigma_u, \xi$), using the **chandwich** package [@chandwich]. For details of the log-likelihood adjustments see the [introductory vignette for the chandwich package](https://cran.r-project.org/package=chandwich). We will see below that although there is more than one way to perform this adjustment there is a strong argument in favour of the so-called **vertical** adjustment. A vertical adjustment adjusts on the log-likelihood scale, thereby respecting constraints on the parameter values, whereas the **horizontal** adjustment adjusts on the parameter scale. For part 3, the methodology described in @SD2010 is used, implemented by the function `kgaps` in the **exdex** package [@exdex]. The $K$-gaps model involves a tuning parameter $K$ that defines how distant in time threshold exceedances need to be before they are considered to be from separate clusters of threshold exceedances. In addition to providing an estimator of $\theta$, @SD2010 provides a diagnostic to inform the choice of $K$ and the threshold $u$. We use this diagnostic below, but it would also be advisable also to examine graphically the effect of $(u, K)$ on extremes value inferences, by plotting point estimates and confidence intervals for key quantities such as the generalised Pareto shape parameter $\xi$ or return levels of interest. The (adjusted) log-likelihoods produced in parts 1, 2 and 3 are combined to form a log-likelihood for $(p_u, \sigma_u, \xi, \theta)$ from which extreme value inferences can be made. 
## Cheeseboro wind gust data

We use an example dataset to illustrate the use of the two main functions in **lite**: `flite` and `returnLevel`. The 744 by 10 matrix `cheeseboro`, available from the **exdex** package, contains hourly maximum wind gusts (in miles per hour) recorded at the Cheeseboro weather station near Thousand Oaks, Southern California, USA during the month of January over the period 2000-2009. Column $i$ contains the hourly values for year $1999+i$. These data are sourced from the [Cheeseboro page](https://raws.dri.edu/cgi-bin/rawMAIN.pl?caCCHB) of the Remote Automated Weather Stations USA Climate Archive. The following plot shows the general behaviour of these data.

```{r cheesy, echo = FALSE, fig.width = 7, fig.height = 5}
oldpar <- par(mar = c(4, 4, 1, 1))
cmat <- exdex::cheeseboro
NAmat <- matrix(NA, ncol = 10, nrow = 250)
cmat <- rbind(cmat, NAmat)
cvec <- c(cmat)
plot(cvec, xlab = "year", ylab = "hourly wind gusts, mph", axes = FALSE,
     ylim = c(0, max(cvec, na.rm = TRUE)), type = "l")
axis(1, at = c(-1000, seq(from = 372, by = 744 + 250, length = 10), 100000),
     labels = c(NA, 2000:2009, NA))
axis(2)
par(oldpar)
```

For estimating $(p_u, \sigma_u, \xi)$ we treat observations from different years as being in separate clusters. In the code below we do not need to set the argument `cluster` explicitly because, when the data are supplied as a matrix, the function `flite` sets `cluster` automatically to achieve this. A total of 42 observations are missing in `cheeseboro`. This causes no problems for inferences about $(p_u, \sigma_u, \xi)$. For making inferences about $\theta$ using the $K$-gaps model, the `kgaps` function splits the data further into sequences of non-missing values and constructs the sample $K$-gaps separately for each sequence.

## Inferences for model parameters

We need to choose values for the tuning parameters: the threshold $u$ and the $K$-gaps run parameter $K$. There is no definitive way to do this, but the information matrix test developed in @SD2010 can help to make this choice. Based on the $K$-gaps section of the [introductory vignette for the exdex package](https://cran.r-project.org/package=exdex) we choose $u = 45$ mph and $K = 3$.

```{r cheeseboro}
library(lite)
cdata <- exdex::cheeseboro
# Each column of the matrix cdata corresponds to data from a different year
# flite() sets cluster automatically to correspond to column (year)
cfit <- flite(cdata, u = 45, k = 3)
summary(cfit)
summary(cfit, adjust = FALSE)
```

The effect of the adjustment of the log-likelihood for clustering is to increase the estimated standard errors for the parameters $p_u$ and $(\sigma_u, \xi)$ in comparison to those obtained with no adjustment.

A plot method for objects inheriting from class `"flite"` enables us to inspect the log-likelihood functions for $p_u$, $(\sigma_u, \xi)$ and $\theta$. This first set of plots uses the default vertical adjustment to the log-likelihoods for $p_u$ and $(\sigma_u, \xi)$.

```{r vertical, fig.width = 7, fig.height = 5, message = FALSE}
plot(cfit)
```

The line superimposed on the plot on the top right shows the boundary imposed on the parameter space by the constraint $\xi > -\sigma_u / x_{(n)}$, where $x_{(n)}$ is the largest threshold excess. Using the vertical adjustment ensures that this constraint is respected after adjustment.
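To see the practical effect of the adjustment on interval estimates, approximate Wald-type confidence intervals could be computed directly from the fitted object, as in the sketch below. This sketch assumes that `coef` and `vcov` methods are available for `"flite"` objects and return, respectively, the maximum likelihood estimates and the estimated (adjusted) variance-covariance matrix of $(p_u, \sigma_u, \xi, \theta)$.

```{r wald-sketch, eval = FALSE}
# Approximate 95% Wald-type confidence intervals from the adjusted fit,
# assuming coef() returns the MLEs and vcov() the adjusted covariance matrix
mle <- coef(cfit)
se <- sqrt(diag(vcov(cfit)))
cbind(lower = mle - qnorm(0.975) * se,
      estimate = mle,
      upper = mle + qnorm(0.975) * se)
```

Symmetric intervals like these may perform poorly when a log-likelihood is markedly asymmetric, which is one motivation for the profile likelihood-based intervals used for return levels below.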
In contrast, under one of the horizontal adjustments, **cholesky** or **spectral**, the adjusted log-likelihood takes finite values in regions of the parameter space that this constraint excludes. This is why the vertical adjustment is the default.

```{r spectral, fig.width = 7, fig.height = 5, message = FALSE}
plot(cfit, which = "gp", adj_type = "spectral")
```

## Inferences for return levels

The $m$-year return level is the value that is exceeded with probability $1/m$ in any given year. To infer this from time series data we need to take account of the (intended) sampling frequency of the data. In the context of the Cheeseboro data, a "year" is effectively one month, because we are making inferences about extremal behaviour in January. If there are no missing values then there are $31 \times 24 = 744$ hourly values per January. We denote the number of observations per year by $n_y$.

For $x > u$, the model gives $P(X > x) = p_u [1 + \xi (x - u) / \sigma_u]^{-1/\xi}$ and the probability that the maximum over a year of observations is at most $x$ is approximately $\{1 - P(X > x)\}^{n_y \theta}$. Setting this probability equal to $1 - 1/m$ and solving for $x$ gives the $m$-year return level
$$x_m = u + \frac{\sigma_u}{\xi} \left[ p_u^\xi \left\{ 1 - \left( 1 - \frac{1}{m} \right)^{1/(n_y \theta)} \right\}^{-\xi} - 1 \right].$$
If $m n_y \theta$ is large then
$$x_m \approx u + \frac{\sigma_u}{\xi} \left[ (m \, n_y \, \theta \, p_u)^\xi - 1 \right].$$

The `returnLevel` function performs inferences about return levels. It uses the first of these expressions for $x_m$. By default it calculates two types of (95\%) confidence intervals: (a) intervals based on approximate large-sample normality of the maximum likelihood estimator for the return level, which are symmetric about the point estimate, and (b) profile likelihood-based intervals based on an (adjusted) log-likelihood.

```{r returnlevels, fig.width = 7, fig.height = 5}
rl <- returnLevel(cfit, m = 100, level = 0.95, ny = 31 * 24)
summary(rl)
rl
plot(rl)
```

## References