---
title: "Extending diseasystore"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Extending diseasystore}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette gives you the knowledge you need to create your own `diseasystore`.

# The diseasy data model

To begin, we go through the data model used within the `diseasystores`.
It is this data model that enables the automatic coupling of features and powers the package.

## A bitemporal data model

The data created by `diseasystores` are so-called "bitemporal" data.
This means we have two temporal dimensions: one representing the validity of the record, and one representing the availability of the record.

### `valid_from` and `valid_until`

The validity dimension indicates when a given data point is "valid", e.g. a hospitalisation is valid between admission and discharge date.
This temporal dimension should be familiar to you: it is simply "regular" time.

We encode the validity information into the columns `valid_from` and `valid_until` such that a record is valid for any time `t` which satisfies `valid_from <= t < valid_until`.
For many features, the validity is a single day (such as a test result) and the `valid_until` column will be the day after `valid_from`.

By convention, we place these columns as the last columns of the table[^1].

### `from_ts` and `until_ts`

`diseasystore` uses `{SCDB}` in the background to store the computed features.
`SCDB` implements the second temporal dimension, which indicates when a record was present in the data.
This information is encoded in the columns `from_ts` and `until_ts`.
Normally, you don't see these columns when working with `diseasystore` since they are masked by `SCDB`.
However, if you inspect the tables created in the database by `diseasystore`, you will find they are present.
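
To make the two temporal dimensions concrete, the following is a minimal base-R sketch on a hypothetical toy table. The rows, the `key_pnr` column and the `n_hospital` feature are invented for illustration, and the sketch assumes that a record which is still current has `until_ts` set to `NA`. The helpers `available_at()` and `valid_at()` are not part of `diseasystore` or `SCDB`; they only spell out the two interval rules by hand. With these rows, the same question ("was the patient admitted on 2020-03-06?") gets a different answer depending on which version of the data is consulted.

```r
# Hypothetical toy table: one hospitalisation whose discharge date was
# corrected in a later update of the data.
records <- data.frame(
  key_pnr     = c(1, 1),
  n_hospital  = c(1, 1),
  valid_from  = as.Date(c("2020-03-01", "2020-03-01")),
  valid_until = as.Date(c("2020-03-05", "2020-03-08")),
  from_ts     = as.POSIXct(c("2020-03-06", "2020-03-10"), tz = "UTC"),
  until_ts    = as.POSIXct(c("2020-03-10", NA), tz = "UTC")
)

# Availability: keep the version of each record present in the data at `tau`.
# Assumption: `until_ts` is NA for records that are still current.
available_at <- function(data, tau) {
  data[data$from_ts <= tau & (is.na(data$until_ts) | tau < data$until_ts), ]
}

# Validity: keep the records valid on `day` (valid_from <= day < valid_until).
valid_at <- function(data, day) {
  data[data$valid_from <= day & day < data$valid_until, ]
}

day <- as.Date("2020-03-06")

# Data as available on 2020-03-07: discharge was recorded as 2020-03-05,
# so no record is valid on `day`.
old <- valid_at(available_at(records, as.POSIXct("2020-03-07", tz = "UTC")), day)

# Data as available on 2020-03-11: discharge was corrected to 2020-03-08,
# so the hospitalisation is valid on `day`.
new <- valid_at(available_at(records, as.POSIXct("2020-03-11", tz = "UTC")), day)
```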
For our purposes, it is sufficient to know that these columns give a time-versioned database from which we can extract previous versions through the `slice_ts` argument.
By supplying any time `τ` as `slice_ts`, we get the data as they were available at that time.
This allows us to build continuous integration of our features while preserving previously computed features.

## Automatic data-coupling

A primary feature of `diseasystore` is its ability to automatically couple and aggregate features.
This coupling requires common "key_\*" columns between the features.
Any feature in a `diseasystore` therefore must have at least one "key_\*" column.
By convention, we place these columns as the first columns of the table.

## Features

Finally, we come to the main data of the `diseasystore`, namely the features.
First, a reminder that "feature" here comes from machine learning and is any individual piece of information.

We subdivide features into two categories: "observables" and "stratifications".
On most levels, these are indistinguishable, but their purposes differ and hence we need to handle them individually.

### Observables

In `diseasystore`, any feature whose name either starts with "n_" or ends with "_temperature" is treated as an "observable".
From a modelling perspective, these observables are typically the metrics you want to model or take as inputs to inform your model.

### Stratifications

Conversely, any other feature is a "stratification" feature.
These features are the variables used to subdivide your analysis to match the structure of your model (hence the name "stratification" features).

A prominent example for most disease models would be a stratification feature like "age_group", since most diseases show a strong dependency on the age of the affected individuals.
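
The naming rule above can be sketched in a few lines of base R. The helper `is_observable()` is hypothetical and not part of `diseasystore`; it only mirrors the rule as stated in this vignette (names starting with "n_" or ending with "_temperature"). The feature names are illustrative.

```r
# Hypothetical helper (not part of diseasystore): TRUE for feature names that
# follow the naming rule for observables described above.
is_observable <- function(feature_names) {
  grepl("^n_|_temperature$", feature_names)
}

# Illustrative feature names
features <- c("n_hospital", "min_temperature", "age_group", "country_id")

# Everything that is not an observable is a stratification
observables     <- features[is_observable(features)]
stratifications <- features[!is_observable(features)]
```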

### Naming convention

While there is no formal requirement for the naming of the observables or stratifications, it is considered best practice to use the same names as other `diseasystores` for features where possible[^2].
This simplifies the process of adapting analyses and disease models to new `diseasystores`.

# Creating FeatureHandlers

To facilitate the automatic coupling and aggregation of features, we use the `?FeatureHandler` class.
Each feature[^3] in the `diseasystore` has an associated `FeatureHandler` which implements the computation, retrieval and aggregation of the feature.

## Computing features

The `FeatureHandler` defines a `compute` function which must be of the form:

```
compute = function(start_date, end_date, slice_ts, source_conn)
```

The arguments `start_date` and `end_date` indicate the period for which features should be computed.
The `diseasystores` are [dynamically expanded](diseasystore.html#dynamically-expanded), so feature computation is often restricted to limited time intervals as indicated by `start_date` and `end_date`.

As mentioned [above](#from_ts-and-until_ts), `slice_ts` specifies the date for which the features should be computed.
E.g. if `slice_ts` is the current date, the current features should be computed.
Conversely, if `slice_ts` is some past date, features corresponding to this date should be computed.

Lastly, `source_conn` is a flexible argument passed to the `FeatureHandler` indicating where the source data needed to compute the features is stored (e.g. a database connection or directory).

Note that multiple features can be computed by a single `FeatureHandler`.
For example, you may decide that it is more convenient to compute multiple different features simultaneously (e.g. a hospitalisation and the classification of said hospitalisation, or a test and the associated test result).
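
As a rough illustration of the `compute` signature, the sketch below computes a single-day feature from an in-memory `data.frame`. Everything specific here is an assumption made for the example: `source_conn` is taken to be a plain `data.frame` of test results with hypothetical columns `pnr`, `test_date`, `result` and `registered`, `end_date` is treated as inclusive, and the handling of `slice_ts` (dropping rows registered after that date) is only one possible choice. A real implementation would more likely query a database or read files.

```r
# Hypothetical compute function following the signature described above.
# Assumption: `source_conn` is an in-memory data.frame of test results.
compute_test_results <- function(start_date, end_date, slice_ts, source_conn) {
  # Only use rows that were available at `slice_ts`
  available <- source_conn[source_conn$registered <= slice_ts, ]

  # Restrict to the requested period (end_date assumed inclusive)
  in_period <- available$test_date >= start_date & available$test_date <= end_date
  available <- available[in_period, ]

  # key_* columns first, then the feature, then the validity columns last.
  # A test result is valid for a single day.
  data.frame(
    key_pnr     = available$pnr,
    test_result = available$result,
    valid_from  = available$test_date,
    valid_until = available$test_date + 1
  )
}

# Hypothetical source data
tests <- data.frame(
  pnr        = c(1, 2, 3),
  test_date  = as.Date(c("2020-03-01", "2020-03-02", "2020-04-01")),
  result     = c("positive", "negative", "positive"),
  registered = as.Date(c("2020-03-02", "2020-03-20", "2020-04-02"))
)

out <- compute_test_results(
  start_date  = as.Date("2020-03-01"),
  end_date    = as.Date("2020-03-31"),
  slice_ts    = as.Date("2020-03-10"),
  source_conn = tests
)
```

Only the first test is both inside the requested period and registered before `slice_ts`, so `out` contains a single row. The output follows the column conventions of the data model: "key_\*" columns first, `valid_from` and `valid_until` last.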

## Retrieving features

The `FeatureHandler` defines a `$get()` function which must be of the form:

```
get = function(target_table, slice_ts, target_conn)
```

Typically, you do not need to specify this function since the default (a variant of `SCDB::get_table()`) generally suffices.
However, in the case that you do need to specify it, the `target_table` argument will be a `DBI::Id` specifying the location of the database table where the features are stored.
`target_conn` is the connection to the database.
And as above, `slice_ts` is the time-keeping variable.

## Aggregators

The `FeatureHandler` defines a `key_join` function which must be of the form:

```
key_join = function(.data, feature)
```

In most cases, you should be able to use the bundled `key_join_*` functions (see `?aggregators` for a full list).

In the event that you need to create your own aggregator, the arguments are as follows:

* `.data` is a grouped `data.frame` whose groups are those specified by the `stratification` argument (see [Automatic aggregation](diseasystore.html#automatic-aggregation)).
* `feature` is the name of the feature(s) to aggregate.

Your aggregator should return a `dplyr::summarise()` call that operates on all columns specified in the `feature` argument.

## Putting it all together

By now, you should know the basics of creating your own `FeatureHandlers`.

To see some `FeatureHandlers` in action, you can consult a few of those bundled with the `diseasystore` package.

For example:

* [DiseasystoreGoogleCovid19: index](https://github.com/ssi-dk/diseasystore/blob/ceedbe1/R/DiseasystoreGoogleCovid19.R#L161)
* [DiseasystoreGoogleCovid19: min temperature](https://github.com/ssi-dk/diseasystore/blob/ceedbe1/R/DiseasystoreGoogleCovid19.R#L253)

# Creating a `diseasystore`

With the knowledge of how to build custom `FeatureHandlers`, we turn our attention to the remaining parts of the `diseasystore`'s anatomy.
The `diseasystores` are [R6 classes](https://r6.r-lib.org/index.html), which are an implementation of object-oriented (OO) programming.
To those unfamiliar with OO programming, the `diseasystores` are single "objects" with a number of "public" and "private" functions and variables.
The public functions and variables are visible to the user of the `diseasystore`, while the private functions and variables are visible only to us (the developers).

When extending `diseasystore`, we are only writing private functions and variables.
The public functions and variables are handled elsewhere[^4].

## ds_map

The `ds_map` field of the `diseasystore` tells the `diseasystore` which `FeatureHandler` is responsible for each feature, thus allowing the `diseasystore` to retrieve the features specified in the `observable` and `stratification` arguments of calls to `$get_feature()`.

In other words, it maps the names of features to their corresponding `FeatureHandlers`.

As we saw above, a `FeatureHandler` may compute more than a single feature.
Each feature should be mapped to its `FeatureHandler` here, or else the `diseasystore` will not be able to automatically interact with it.

By convention, the name of the `FeatureHandler` should be snake_case and contain a `diseasystore`-specific prefix (e.g. for `DiseasystoreGoogleCovid19`, all `FeatureHandlers` are named with the prefix "google_covid_19_").
These names are used as the table names when storing the features in the database, and the prefix helps structure the database accordingly.
This latter part becomes important when [clean up](diseasystore.html#dropping-computed-features) of the database needs to be performed.

## Key join filter

The `diseasystores` are made to be as flexible as possible, which means that they can incorporate both individual-level data and semi-aggregated data.
For semi-aggregated data, it is often the case that the data includes aggregations at different levels, nested within the data.
For example, the Google COVID-19 data repository contains information at both country level and region level in the same data files.
When the user of `DiseasystoreGoogleCovid19` asks to get a feature stratified by, for example, "country_id", we need to filter out the data aggregated at the region level.

This is the purpose of `$key_join_filter()`.
It takes as input the requested stratifications and filters the data accordingly after the features have been joined inside the `diseasystore`.

For an example, you can consult [DiseasystoreGoogleCovid19: key_join_filter](https://github.com/ssi-dk/diseasystore/blob/ceedbe1/R/DiseasystoreGoogleCovid19.R#L75).

## Testing your `diseasystore`

The `diseasystore` package includes the function `test_diseasystore()` to test the `diseasystores`.

You can see the testing suite in action, with `DiseasystoreGoogleCovid19` as an example, [here](https://github.com/ssi-dk/diseasystore/blob/main/tests/testthat/test-DiseasystoreGoogleCovid19.R).

[^1]: The `SCDB` package places `checksum`, `from_ts`, and `until_ts` as the last columns, but `valid_from` and `valid_until` should be the last columns in the output passed to `SCDB`.

[^2]: In practice, this means that the names of features should be in `snake_case`.

[^3]: Or "coupled" set of features as we will soon see.

[^4]: By the `DiseasystoreBase` class.