---
title: "Introduction to tsibble"
author: "Earo Wang"
biblio-style: authoryear-comp
link-citations: yes
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to tsibble}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r initial, echo = FALSE, cache = FALSE, results = 'hide'}
knitr::opts_chunk$set(
  warning = FALSE, message = FALSE, echo = TRUE, collapse = TRUE,
  fig.width = 7, fig.height = 6, fig.align = 'center', comment = "#>"
)
options(tibble.print_min = 5)
```

The **tsibble** package extends the [tidyverse](https://www.tidyverse.org) to temporal data. Built on top of the [tibble](https://tibble.tidyverse.org/), a tsibble (or `tbl_ts`) is a data- and model-oriented object. Compared to the conventional time series objects in R, for example `ts`, `zoo`, and `xts`, the tsibble preserves the time index as an essential data column and makes heterogeneous data structures possible. Beyond the tibble-like representation, a **key**, comprising one or more variables, is introduced to uniquely identify observational units over time (**index**). The tsibble package aims at managing temporal data and getting analysis done in a fluent workflow.

## Contextual semantics: index and key

`tsibble()` creates a tsibble object, and `as_tsibble()` is an S3 generic that coerces other objects to a tsibble. An object built on a vector or matrix, such as `ts` or `mts`, can be converted to a tsibble with `as_tsibble()` without any further specification. If it is a tibble or data frame, `as_tsibble()` requires a little more setup in order to declare the index and key variables.

```{r weather}
library(dplyr)
library(lubridate)
library(tsibble)
weather <- nycflights13::weather %>%
  select(origin, time_hour, temp, humid, precip)
weather
```

The `weather` data included in the package `nycflights13` contains the hourly meteorological records (such as temperature, humidity and precipitation) over the year of 2013 at three stations (i.e.
JFK, LGA and EWR) in New York City. Since `time_hour` is the only column that holds timestamps, `as_tsibble()` uses it as the index variable by default; alternatively, the index can be specified explicitly with the argument `index = time_hour`, which suppresses the verbose message. In addition to the index, a tsibble requires a "key", which defines the subjects or individuals measured over time. In this example, the `origin` variable is the identifier, which is passed to the argument `key` in `as_tsibble()`. **Each observation should be uniquely identified by index and key** in a valid tsibble. The remaining columns (`temp`, `humid` and `precip`) are referred to as measured variables. When a tsibble is created, the rows are sorted by key first, and then by time from past to recent.

```{r weather-ts, message = TRUE}
weather_tsbl <- as_tsibble(weather, key = origin)
weather_tsbl
```

An interval is automatically obtained based on the corresponding time representation:

* `integer`/`numeric`/`ordered`: either "unit" or "year" (`Y`)
* `yearquarter`/`yearqtr`: "quarter" (`Q`)
* `yearmonth`/`yearmon`: "month" (`M`)
* `yearweek`: "week" (`W`)
* `Date`: "day" (`D`)
* `difftime`: "week" (`W`), "day" (`D`), "hour" (`h`), "minute" (`m`), "second" (`s`)
* `POSIXct`/`hms`: "hour" (`h`), "minute" (`m`), "second" (`s`), "millisecond" (`ms`), "microsecond" (`us`)
* `nanotime`: "nanosecond" (`ns`)

That is, a tsibble of monthly intervals expects the `yearmonth`/`yearmon` class in the index column. Neither `Date` nor `POSIXct` gives a monthly tsibble.

The print display is data-centric and contextually informative, showing the data dimension, the time interval, and the number of time-based units. The output above shows `weather_tsbl` with its one-hour interval (`[1h]`) and `origin [3]` as the key, indicating three time series in the table.

## Data pipeline

This tidy data representation naturally supports thinking of operations on the data as building blocks, forming part of a "data pipeline" in a time-based context.
Users who are familiar with the tidyverse will find it easier to perform common temporal analysis tasks. For example, `index_by()` is the counterpart of `group_by()` in a temporal context, but it only groups the time index. Below, `index_by()` + `summarise()` is used to summarise daily highs and lows at each station. As a result, the index changes from `time_hour` to `date` with a one-day interval, and two new variables are computed for the daily maximum and minimum temperatures.

```{r weather-tsum}
weather_tsbl %>%
  group_by_key() %>%
  index_by(date = ~ as_date(.)) %>%
  summarise(
    temp_high = max(temp, na.rm = TRUE),
    temp_low = min(temp, na.rm = TRUE)
  )
```

## Irregular time interval

Note that the tsibble handles regularly-spaced temporal data well, from seconds to years, based on its time representation (see `?tsibble`). The option `regular` is set to `TRUE` by default in `as_tsibble()`. Set `regular = FALSE` to create a tsibble for data collected at irregular time intervals. The chunk below builds the scheduled departure date-time of the flights in New York:

```{r flights}
flights <- nycflights13::flights %>%
  mutate(sched_dep_datetime = make_datetime(year, month, day, hour, minute, tz = "America/New_York"))
```

The key consists of the columns `carrier` and `flight`, which identify observational units over time from a passenger's point of view. With `regular = FALSE`, the result is an irregularly-spaced tsibble, where `[!]` highlights the irregularity.

```{r flights-ts}
flights_tsbl <- flights %>%
  as_tsibble(
    key = c(carrier, flight),
    index = sched_dep_datetime,
    regular = FALSE
  )
flights_tsbl
```

An irregular tsibble can be regularised with `index_by()` + `summarise()`.
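
As a minimal sketch of that idea, the chunk below aggregates an irregular tsibble to hourly totals. The `events` data, its `sensor` and `value` columns, and the choice of `floor_date(., "hour")` as the new index are hypothetical and only for illustration; the same pattern could be applied to `flights_tsbl`.

```{r regularise-sketch}
library(dplyr)
library(lubridate)
library(tsibble)

# Hypothetical toy data: two sensors reporting at irregular times.
events <- tibble(
  sensor = c("a", "a", "a", "b", "b"),
  time = ymd_hms(c(
    "2013-01-01 00:05:00", "2013-01-01 00:40:00", "2013-01-01 02:10:00",
    "2013-01-01 00:15:00", "2013-01-01 01:30:00"
  )),
  value = c(1, 2, 3, 4, 5)
)

# Declare the data as irregularly spaced.
events_tsbl <- as_tsibble(events, key = sensor, index = time, regular = FALSE)

# Floor each timestamp to the hour and summarise within key and hour,
# so that `hour` becomes the new index.
hourly <- events_tsbl %>%
  group_by_key() %>%
  index_by(hour = ~ floor_date(., "hour")) %>%
  summarise(n = n(), total = sum(value))
hourly

# Check whether the aggregated tsibble is treated as regularly spaced.
is_regular(hourly)
```

Aggregating to a coarser time unit trades detail for a regular grid; hours with no events are simply absent from the result rather than filled with zeros.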