---
title: "Examples: LTC data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Examples: LTC data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(eider)
library(magrittr)
```

This series of vignettes in the _Gallery_ section aims to demonstrate the functionality of `eider` through examples that are similar to real-life usage.
To do this, we have created a series of randomly generated datasets that are stored with the package.
You can access these datasets using the `eider_example()` function, which returns the path to where the dataset is stored in your installation of R.

```{r}
ltc_data_filepath <- eider_example("random_ltc_data.csv")

ltc_data_filepath
```

## The data

In this specific vignette, we are using simulated long-term condition (LTC) data.
Our dataset does not contain every column specified in here, but serves as a useful example of how real-life data may be treated using `eider`.

```{r}
ltc_data <- utils::read.csv(ltc_data_filepath) %>%
  dplyr::mutate(
    asthma = lubridate::ymd(asthma),
    diabetes = lubridate::ymd(diabetes),
    parkinsons = lubridate::ymd(parkinsons)
  )

dplyr::glimpse(ltc_data)
```

(Note that when the data is loaded by `eider`, the date columns are automatically converted to the date type for you: you do not need to do the manual processing above.)

This simplified table has 4 columns:

* `id`, which is a numeric patient ID;
* `asthma`, `diabetes`, and `parkinsons`, which are columns with dates indicating when a patient was first diagnosed with the corresponding condition.
  If the patient has never been diagnosed with this condition, the value is `NA`.

## Feature 1: Number of years with asthma

In this example, we will calculate the number of years since each patient was first diagnosed with asthma.

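
To make the target quantity concrete, here is a minimal sketch in plain R of "time since diagnosis in years", taken as the number of days divided by 365.25. The toy data frame and the `reference_date` are hypothetical and chosen purely for illustration; this does not show how `eider` itself picks its reference date.

```{r}
# Toy data: hypothetical patients, not the dataset shipped with the package
toy <- data.frame(
  id = 1:3,
  asthma = as.Date(c("2010-01-01", NA, "2020-01-01"))
)

# Hypothetical reference date, assumed only for this sketch
reference_date <- as.Date("2024-01-01")

# Subtracting two dates gives a difference in days;
# dividing by 365.25 converts it to years
toy$years_with_asthma <- as.numeric(reference_date - toy$asthma) / 365.25

toy
```

In this sketch the patient diagnosed on 2020-01-01 comes out at exactly 4 years (1461 days), and the patient with no diagnosis date simply yields `NA`.
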
```{r}
years_asthma_filepath <- eider_example("years_with_asthma.json")
writeLines(readLines(years_asthma_filepath))
```

This is very similar to [one of the examples in the A&E data vignette](examples_ae.html#feature-4-the-number-of-days-since-the-last-ae-attendance).
Here, we use a `"time_since"` transformation type, and additionally specify `"time_units"` as `"years"` to obtain the result as a number of years (formally, the number of days divided by 365.25).

In this particular example, the `"from_first"` parameter is set to `true`.
Because each patient has only one row in the table, there is no distinct 'first' row, and thus this parameter could equally well be set to `false`.
(However, it cannot be omitted, as it is a required parameter for the `"time_since"` transformation type.)

```{r}
res <- run_pipeline(
  data_sources = list(ltc = ltc_data_filepath),
  feature_filenames = years_asthma_filepath
)

dplyr::glimpse(res$features)
```

## Feature 2: Whether a patient has asthma or not

This example is slightly more interesting because it involves a more ingenious filter operation.
We would like a binary feature here which has value 1 if the patient has asthma, and 0 otherwise.
However, we cannot simply use a `"present"` or `"count"` transformation type without filtering, because every patient appears in this table.

We first need to filter the table so that all rows with an `NA` value in the `asthma` column are removed.
However, `eider`'s filter operation does not support filtering based on `NA` values!
To work around this, we can filter based on the dates: if we keep only the rows where the date is greater than some _sentinel value_ a long time in the past, any genuine dates in the table will pass this test, but `NA` values will not.

Thus, what we need is a `"date_gt"` filter with a value that is suitably far in the past such that any real date in the table will come after it.

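
The reason the sentinel trick works can be sketched in plain R: comparing `NA` with a date gives `NA` rather than `TRUE`, and `dplyr::filter()` keeps only rows where the condition is `TRUE`. The toy data frame and the sentinel date below are hypothetical, and this only mirrors the idea; it is not a description of `eider`'s internals.

```{r}
# Toy data: hypothetical patients, not the dataset shipped with the package
toy <- data.frame(
  id = 1:3,
  asthma = as.Date(c("2010-01-01", NA, "2020-01-01"))
)

# Hypothetical sentinel: far enough in the past that any real date exceeds it
sentinel <- as.Date("1800-01-01")

# Comparing NA with a date yields NA, not TRUE
toy$asthma > sentinel

# dplyr::filter() drops rows where the condition is NA or FALSE
with_asthma <- dplyr::filter(toy, asthma > sentinel)

# 1 if the patient survived the filter, 0 otherwise
toy$has_asthma <- as.integer(toy$id %in% with_asthma$id)

toy
```

Patients 1 and 3 pass the date comparison and are marked 1, while the patient with `NA` is filtered out and marked 0.
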
```{r}
has_asthma_filepath <- eider_example("has_asthma.json")
writeLines(readLines(has_asthma_filepath))
```

```{r}
res <- run_pipeline(
  data_sources = list(ltc = ltc_data_filepath),
  feature_filenames = has_asthma_filepath
)

dplyr::glimpse(res$features)
```

## Feature 3: Number of conditions

As a final example, we will calculate the number of long-term conditions each patient has a diagnosis for.
This essentially involves calculating one binary 0/1 feature for each condition (much like [Feature 2](#feature-2-whether-a-patient-has-asthma-or-not)), and then summing them up.
Thus, we need to use a `"combine_linear"` transformation type, with the weight of each individual feature set to 1 (see the [combination feature vignette](combination.html) for more information).

The full JSON looks like this:

```{r}
num_ltcs_filepath <- eider_example("number_of_ltcs.json")
writeLines(readLines(num_ltcs_filepath))
```

The `subfeature` object contains a named list of the individual features that we want to combine: each of these has exactly the same structure as before, except that the filtering is performed on a different column each time.
Each individual subfeature is also given a `"weight": 1`, as described previously.
Finally, the `"output_feature_name"` field is lifted to the top level of the JSON instead of appearing in each individual subfeature.

```{r}
res <- run_pipeline(
  data_sources = list(ltc = ltc_data_filepath),
  feature_filenames = num_ltcs_filepath
)

dplyr::glimpse(res$features)
```

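
As a rough cross-check outside `eider`, the same quantity can be sketched in base R: one 0/1 indicator per condition, summed with a weight of 1 each. The toy data frame and the column name `number_of_ltcs` are hypothetical and used only for illustration.

```{r}
# Toy data: hypothetical patients, not the dataset shipped with the package
toy <- data.frame(
  id = 1:3,
  asthma = as.Date(c("2010-01-01", NA, NA)),
  diabetes = as.Date(c("2015-06-01", NA, "2018-03-15")),
  parkinsons = as.Date(c(NA, NA, NA))
)

conditions <- c("asthma", "diabetes", "parkinsons")

# !is.na() gives a TRUE/FALSE indicator per condition;
# rowSums() adds them up, which is a linear combination with all weights 1
toy$number_of_ltcs <- unname(rowSums(!is.na(toy[conditions])))

toy
```

Here the three toy patients have two, zero, and one diagnosed conditions respectively. Applying the same expression to `ltc_data` gives a quick way to sanity-check the pipeline output.
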