--- title: "An overview of features" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{An overview of features} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` As the [introductory vignette](eider.html) shows, writing R code with `eider` simply consists of a call to `run_pipeline()`. Most of the time spent using this library will be spent defining the features themselves using JSON. ## Features as JSON Features are JSON _objects_, which are an association of _keys_ and _values_ tied together within curly braces. Keys are always strings, and values can be strings, numbers, booleans, arrays, or objects themselves, as shown in the example object below. Conceptually, JSON objects are similar to R lists. ```json { "key_1": "a string", "key_2": 1, "key_3": true, "key_4": [1, 2, 3], "key_5": { "nested_key_1": "a string", "nested_key_2": 1 } } ``` To be correctly parsed by `eider`, each feature must contain a specific set of keys. The keys that are shared across all features are: * **(string) `source_table`** The name of the table to be read in. Note that this is not a filename: it is a unique identifier which is passed as part of the `data_sources` argument to `run_pipeline()`. Please see [the introductory vignette](eider.html) for an explanation of this. * **(string) `output_feature_name`** This determines the name of the column in the final dataframe that `eider` produces. It can be any string you like, as long as there are no clashes between multiple features; the feature name `"id"` is also reserved and cannot be used. * **(string) `grouping_column`** This is the name of the column in the input dataframe that the feature will be calculated over. The feature column will contain the result of the calculation for each unique value in this column. In the case of medical data, this would typically be the name of the column containing the patient ID. In the remainder of this vignette, we will refer to the values within this column as "IDs". * **(number; optional) `absent_default_value`** This is the value that will be used for IDs that are not present in the input dataframe. Because `eider` only calculates numeric features, this has to be a number. If omitted, `eider` will insert `NA` values for missing IDs. * **(string) `transformation_type`** This defines the _type_ of calculation performed for the feature. Each transformation type may require an extra set of keys to be specified for the feature to be correctly calculated. ## Transformation types The available transformation types can be split into a few groups: ### Counting **Transformation types:** `"count"`, `"present"` The two simplest features are `"count"`, which counts the number of occurrences of each ID in the dataset, and `"present"`, which outputs `1` if the ID was found in the dataset and `0` if not. Examples of the `"count"` feature type are provided in [A&E features 1 and 2](examples_ae.html#feature-1-total-number-of-attendances), as well as [SMR04 feature 1](examples_smr04.html#feature-1-number-of-episodes-associated-with-a-psychotherapy-specialty). The `"present"` feature type is showcased in [A&E feature 3](examples_ae.html#feature-3-whether-a-patient-has-ever-attended-ae), as well as [LTC features 2 and 3](examples_ltc.html#feature-2-whether-a-patient-has-asthma-or-not). 
### Summaries

**Transformation types:** `"sum"`, `"nunique"`, `"mean"`, `"median"`, `"sd"`, `"first"`, `"last"`, `"min"`, `"max"`

As these features act with respect to the values in a specific column, they require a single extra key to be specified:

* **(string) `aggregation_column`**

  The name of the column containing the values to be aggregated.
  The feature will be calculated for each unique ID by aggregating the values in this column.

Example features with summary functions include all the [PIS features](examples_pis.html), and also [SMR04 features 2, 3, and 4](examples_smr04.html#feature-2-number-of-stays-associated-with-a-psychotherapy-specialty).
These cover the transformation types `"nunique"`, `"sum"`, and `"max"`.

### Time-based

**Transformation types:** `"time_since"`

The `time_since` transformation type calculates the period of time between a given date and the first (or last) date in the dataset for each ID.
This feature requires a few more keys:

* **(string) `date_column`**

  The name of the column containing the dates to be used in the calculation.

* **(string) `cutoff_date`**

  The date to be used as the reference point for the calculation.
  This should be in the format `YYYY-MM-DD`.

* **(boolean) `from_first`**

  If `true`, the feature will calculate the time between the cutoff date and the first date in the dataset for each ID.
  If `false`, it will calculate the time between the cutoff date and the last date.

* **(string) `time_units`**

  The unit of time to be used in the calculation.
  This can be either `"days"` or `"years"`: a year is defined as 365.25 days.

Examples of `"time_since"` features are given in [A&E feature 4](examples_ae.html#feature-4-the-number-of-days-since-the-last-ae-attendance) and [LTC feature 1](examples_ltc.html#feature-1-number-of-years-with-asthma).

### Combination features

**Transformation types:** `"combine_linear"`, `"combine_min"`, `"combine_max"`

Combination features are a way of combining the results of multiple features into a single feature.
They have a slightly different structure to the rest: broadly speaking, these transformation types require a `subfeature` key, which is itself an object containing the features to be combined.
Combination features are covered in a [separate vignette](combination.html).

## Preprocessing and filtering

While the above may seem like a large number of possible calculations, on their own they offer no way of controlling which parts of the input data are to be considered.
In addition to the keys shown above, (non-combination) features may also contain the `preprocess` and `filter` keys, which perform transformations on the input table before the feature is calculated from it.

Preprocessing refers to the modification of values within a table, whereas filtering does not modify the values, but only allows rows that pass a set of criteria to be considered when calculating the feature.
Preprocessing is performed prior to filtering: thus, if both are specified, filtering is performed on the already-preprocessed values.

Both the `preprocess` and `filter` keys are themselves JSON objects, and are detailed respectively in the [preprocessing](preprocessing.html) and [filtering](filtering.html) vignettes.
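As a closing illustration, here is a sketch of a `"time_since"` feature assembled from the keys described in the time-based section above.
As before, the table name `"attendances"`, the column names, and the cutoff date are hypothetical choices made for the example.

```json
{
  "source_table": "attendances",
  "transformation_type": "time_since",
  "grouping_column": "patient_id",
  "output_feature_name": "days_since_last_attendance",
  "date_column": "attendance_date",
  "cutoff_date": "2021-03-25",
  "from_first": false,
  "time_units": "days"
}
```

If such a feature needed to modify values first, or to consider only a subset of rows, a `preprocess` or `filter` object would be added alongside the keys shown here, with its contents following the corresponding vignette.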