--- title: "Introduction to eider" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to eider} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(eider) ``` `eider` is a lightweight package for processing tabular data in a declarative fashion. Users may specify a set of operations to be performed on a table using JSON, which are then executed by the package. The primary use case of `eider` is for extraction of machine learning features from health data, but `eider` can in principle be used for any kind of data. The usage of `eider` in R source code itself is straightforward, and consists of a single call to `run_pipeline()`. ## Data To illustrate this, we will construct some very simplistic data, which may be, for example, a record of patients who attended their GP and their associated complaints. ```{r} example_table <- data.frame( patient_id = c(1, 1, 1, 2, 2, 3, 3, 3), attendance_reason = c(6, 6, 7, 6, 6, 7, 7, 7) ) data_sources <- list(attendances = example_table) ``` In practice, it is more likely that you will be reading in data from a file instead. For example, if you had a CSV file called `attendances.csv` in the current working directory, you could just do: ```{r} data_sources_2 <- list(attendances = "attendances.csv") ``` `eider` allows you to mix and match data sources, so you could have some data in a CSV file and some in an R data frame: ```{r} data_sources_3 <- list( attendances = example_table, # A variable which has already been constructed other_data = "other_data.csv" # A file to be read in ) ``` This allows the user to, for example, perform preprocessing on a portion of their data if so needed. ## Feature specification Suppose we want to extract a feature corresponding to the total number of times a patient attended for reason 6. `eider` requires that the feature is specified as JSON, which looks like this: ```{r comment='', echo=FALSE} writeLines(readLines("json_examples/eider.json")) ``` - `transformation_type` tells you what kind of overall operation is being performed. This determines which other fields are required in the JSON. - `source_table` specifies the name of the data source to be used in the list of data sources. - `grouping_column` specifies the columns to group by. - `absent_default_value` specifies what to do if there is no data for a particular patient ID. - `output_feature_name` specifies the name of the column to be created in the output table. - `filter` is a filter object which is used to select rows from the input table which match particular conditions. Subsequent vignettes will go into more detail about the different types of transformations and the required JSON fields for each of them. ## Performing the transformation To obtain the desired feature, we can place the JSON above in a file (here [`json_examples/eider.json`](https://github.com/alan-turing-institute/eider/blob/main/vignettes/json_examples/eider.json)) and simply do: ```{r} run_pipeline( data_sources = data_sources, feature_filenames = "json_examples/eider.json" ) ``` As expected, both patients 1 and 2 have attended for reason 6 twice, and patient 3 has not. `run_pipeline()` returns a list of two data frames, called _features_ and _responses_ respectively. These refer to data used for training machine learning models: _features_ are the independent variables (i.e. `X`), and _responses_ are the dependent variables (i.e. `y`). For consistency, `eider` always returns both of these data frames and ensures that both of them have the same list of IDs. Responses may be specified in exactly the same way as features, but using the `response_filenames` argument instead of `feature_filenames`. As an alternative to placing the JSON in a file and providing the filename to `run_pipeline()`, you can also provide the JSON directly as a string: ```{r} json_string <- '{ "transformation_type": "count", "source_table": "attendances", "grouping_column": "patient_id", "absent_default_value": 0, "output_feature_name": "total_attendances", "filter": { "column": "attendance_reason", "type": "in", "value": [6] } }' run_pipeline( data_sources = data_sources, feature_filenames = json_string ) ``` ## Next steps If you want to read a more step-by-step guide to the features of `eider`, move on to [the next vignette](features.html), which covers the different types of features that `eider` lets you define. Alternatively, jump ahead to the [gallery section](examples_ae.html) to see some examples of features that you might use `eider` to calculate.