--- title: "Data Collation and Synthesis Protocol" output: bookdown::html_vignette2: code_folding: show bibliography: references.json link-citations: true vignette: > %\VignetteIndexEntry{Data Collation and Synthesis Protocol} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r, load-epiparameter, echo=FALSE} library(epiparameter) ``` ## About the package The [`{epiparameter}` R package](https://github.com/epiverse-trace/epiparameter) contains a library of epidemiological parameter data and functions that read and handle this data. The delay distributions describe the time between two events in epidemiology, for example incubation period, serial interval and onset-to-death; while the offspring distributions describe the number of secondary infections from a primary infection in disease transmission. The library is compiled by a process of collecting, reviewing and extracting data from peer-reviewed literature[^1], including research articles, systematic reviews and meta-analyses. The `{epiparameter}` package will act as a 'living systematic review' (*sensu* @elliottLivingSystematicReviews2014) which will be actively updated and maintained to provide a reliable source of data on epidemiological distributions. To prevent bias in the collection or assessment of the data, a well-defined methodology of searching and refining is required. This document aims to provide transparency on the methodology used by the `{epiparameter}` maintainers by outlining the steps taken at each stage of the data handling. It can also serve as a guide to contributors wanting to search and provide epidemiological parameters currently missing from the library. This protocol should also facilitate reproducibility in the searches, results and appraisal steps. There is a large body of work on the methods to best conduct literature searches and data collection as part of systematic reviews and meta-analyses[^2], which we use as the basis for our protocol. These sources are: - [Cochrane Handbook](https://training.cochrane.org/handbook/current/) [@higginsCochraneHandbookSystematic2019] - [PRISMA](https://www.prisma-statement.org/) [@pagePRISMA2020Statement2021] ## Objective of `{epiparameter}` As defined by the PRISMA guidelines, having a clearly stated objective helps to refine the goal of the project. `{epiparameter}`'s objective is to provide information for a collection of distributions for a range of infectious diseases that is as accurate, unbiased and as comprehensive as possible. Such distributions will enable outbreak analysts to easily access these distributions for routine analysis. For example, delay distributions are necessary for: calculating case fatality rates adjusting for delay to outcome, quantifying implications of different screening measures and quarantine periods, estimating reproduction numbers, and scenario modelling using transmission dynamic models. ## Contributing to the package To contribute to the `{epiparameter}` library of epidemiological parameter information, added your data to [this google sheet](https://docs.google.com/spreadsheets/d/1zVpaaKkQ7aeMdRN2r0p-W4d2TtccL5HcIOp_w-lfkEQ/edit?usp=sharing). This will then be integrated into the `{epiparameter}` library by the package maintainers, and the information will then be accessible to all `{epiparameter}` package users. ## Scope of package The `{epiparameter}` package spans a range of infectious diseases, including several distributions for each disease when available. The pathogens and diseases that are currently systematically searched for and included in the package library are: ```{r show-diseases, echo=FALSE} db <- epiparameter_db() disease <- vapply(db, "[[", FUN.VALUE = character(1), "disease") pathogen <- vapply(db, "[[", FUN.VALUE = character(1), "pathogen") knitr::kable( data.frame( Disease = disease, Pathogen = pathogen ), align = "cc" ) ``` The distributions currently included in the literature search for each pathogen/disease are: ```{r, show-epiparameters, echo=FALSE} db <- epiparameter_db() epi_name <- epiparameter:::.clean_string(unique( # nolint undesirable_operator_linter vapply(db, "[[", FUN.VALUE = character(1), "epi_name") )) knitr::kable( data.frame(epi_name), align = "cc", col.names = "Epidemiological Parameter" ) ``` ## Guide to identifying distributions in the literature 1. **Key word searches**: when searching the literature, the use of specific search phrases to ensure the correct literature is procured is required. We use a search schema that includes searching for the pathogen or the disease, and the desired distribution. The search phrase can optionally include a specific variant/strain/subtype. The search is not constrained based on year of publication. Examples of searches: - "SARS-CoV-2 incubation period" - "ebola serial interval" - "influenza H7N9 onset to admission" However, these simple search phrases can return a large number of irrelevant papers. Using a more specific search schema depending on the search engine used. For example, if using Google Scholar a schema like: - ("Middle East Respiratory Syndrome" OR MERS) AND "onset to death" AND (estimation OR inference OR calculation) - (ebola OR EVD) AND "onset to death" AND (estimation OR inference OR calculation) Or if Web of Science is being used: - ("Middle East Respiratory Syndrome" OR MERS) AND "onset to death" AND estimat* - (ebola OR EVD) AND "onset to death" AND estimat* This should refine the results to a more suitable set of literature. 1. **Literature search engines**: using a selection of search engines to prevent one source potentially omitting papers. Suggested search sites are: Google Scholar, Web of Science, PubMed, and Scopus. 1. **Adding papers**: in addition to the database entries from papers that have been identified in a literature search, entries can be supplemented by recommendations (i.e. from the community) or through being cited by a paper in the literature search. Papers may be recommended by experts in research or public health communities. We plan to use two methods of community engagement. Firstly a [open-access Google sheet](https://docs.google.com/spreadsheets/d/1zVpaaKkQ7aeMdRN2r0p-W4d2TtccL5HcIOp_w-lfkEQ/edit?usp=sharing) allows people to add their distribution data which will then be reviewed by one of the `{epiparameter}` maintainers and incorporated if it meets quality checks. The second method - not yet implemented - involves community members uploading their data to [zenodo](https://zenodo.org/), which can then be read and loaded into R using `{epiparameter}` once checked. 1. **Language restrictions**: papers in English or Spanish are currently supported in `{epiparameter}`. Papers written in another language but verified by an expert can also be included in the database. However, these are not evaluated with the review process described below and as a result are flagged to the user when loaded in `{epiparameter}`. ## Guide to data refinement once sources identified 1. **Removing duplicates**: the library of parameters does not contain duplicates of studies, but multiple entries per study can be included if a paper reports multiple results (e.g. from the full data set and then a subset of the data). Studies that use the same data, subsets or supersets of data used in other papers in the library are included. 1. **Abstract and methods screening**: once a number of unique sources have been identified, each should be reviewed for its suitability by reviewing the abstract and searching for words or phrases in the paper that indicate it reports the parameters or summary statistics of a distribution, this can include searching the methods section for words on types of distributions (e.g. lognormal), fitting procedures (e.g. maximum likelihood or bayesian), or searching the results for parameter estimates. The `{epiparameter}` library includes entries where parameters or summary statistics are reported but a distribution is not specified, and entries where the distribution is specified but the parameters are not reported. A database of unsuitable papers will be kept to remind maintainers which papers have not been included and aids in the updating of the database (see below) by preventing redundant reviewing a previously rejected paper. 1. **Stopping criteria**: for many searches, the number of results is far larger than could be reasonably evaluated outside a full systematic review. After refining which papers contain the required information (abstract and methods screening), around 10 papers per pathogen are screened for each search (per search round, see updating section below for details). If the number of papers that pass the abstract and methods screening is fewer than 10, all suitable papers are reviewed. 1. **Full paper screening**: after the abstract and methods screening, those papers not excluded should be reviewed in full to verify they indeed contain the required information on distribution parameters and information on methodology used. It is acceptable to include a secondary source that contains information on the delay distribution when the primary source is unavailable or does not report the distribution. The inference of the delay distribution does not have to be a primary subject of the research article, for example if it was inferred to be used in estimation of $R_0$ it can still be included in the database. Additionally, distribution parameters based on illustrative values for use in simulations - rather than inferred from data - are considered unsuitable and should be excluded. Again, any papers excluded at this stage should be recorded in the database of unsuitable sources with reasoning to prevent having to reassess when updating the database. 1. **Post hoc removal**: if any `{epiparameter}` parameters are later identified as being inappropriate then they can be removed from the database. In most cases this is unlikely as limitations can be appended onto data entries to make users aware of limitations (e.g. around assumptions used to infer the distribution), only in the most extreme cases will data be completely removed from the database. Note: systematic reviews focusing on effect sizes can be subject to publication bias (e.g. more positive or significant results in the literature). However, distribution inference does not focus on significance testing or effect sizes, so this bias is not considered in the collection process. ## Guide to extracting parameters **Extracting parameters**: for any underlying distributions (e.g. gamma, lognormal), parameters (e.g. shape/scale, meanlog/sdlog), and summary statistics (e.g. mean, standard deviation, median, range or quantiles) given in the paper, values should be recorded verbatim from the paper into the database. When these are read into R using the `{epiparameter}` package, other aspects of the distribution are automatically calculated and available. For example if the mean and standard deviation of a gamma distribution is reported for a serial interval these values will be stored in the database. But once in R, the shape and scale parameters of the gamma distribution will be automatically reconstructed and the resulting distribution available for use. The `{epiparameter}` library exactly reflects the literature. By which we mean that information not present in the paper should not be imputed from prior knowledge (e.g. vector of a disease is known but not stated), or by performing calculating with reported values. This prevents the issue of not having clear provenance for the data in the library. The requirements of each entry in the database is defined by the data dictionary. Here we outline the minimal dataset that is required to be included in the `{epiparameter}` library is: - Name of disease - Type of distribution - Citation information (author(s) of the paper, the year of publication, publication title and journal, and DOI) - Whether the distribution is extrinsic (e.g. extrinsic incubation period). If the disease is not vector-borne this should be NA. - Whether the distribution fitted was discretised, this is a boolean (true or false). All other information for each database entry is non-essential. See the data dictionary included in `{epiparameter}` for all database fields with a description of each and the range of possible values each field can take. ## Data quality assessment in `{epiparameter}` The inference of parameters for a delay distribution often requires methodological adjustments to correct for factors that would otherwise bias the estimates. These includes accounting for interval-censoring of the data when the timing of an event (e.g. exposure to a pathogen) is not know with certainty, but rather within a time window. Or adjusting for phase bias when the distribution is estimated during a growing or shrinking stage of an epidemic. The aim of `{epiparameter}` is not to make a judgement on which parameters are 'better' than others, but to notify and warn the user of the potential limitations of the data. The aspects assessed are: 1) whether the method includes single or double interval-censoring when the exposure or onset times are not known with certainty (i.e. on a single day); 2) does the method adjust for phase bias when the outbreak is in an ascending or descending phase. These are indicated by boolean values to indicate whether they are reported in the paper and users are recommended to refer back to the paper to determine whether estimates are biased. ## Guide to the `{epiparameter}` review process For a set of parameters to be included in the database it must pass the abstract and methods screening and full screening and subsequently a review by one of the `{epiparameter}` maintainers. This process involves running diagnostic checks and cross-referencing the reported parameters with those in the paper to ensure they match exactly and that the results plot of the PDF/CDF/PMF matches anything plotted in the paper, if available. This prevents a possible misinterpretation (e.g. serial interval for incubation period). This check also includes making sure the unique identifiers for the paper match the author's name, publication year and other data recorded in the database. ## Updating parameters in the database Because search and review stages are time consuming and cannot be continuously carried out, we aim to keep the `{epiparameter}` library up-to-date as a living data library by conducting regular searches (i.e. every 3-4 months) to fill in any missing papers or new publication since the last search. The epidemiological literature can expand rapidly, especially during a new outbreak. Therefore we can optionally include new studies that are of use to the epidemiological community in between regular updates. These small additions will still be subject to the data quality assessment and diagnostics to ensure accuracy, and will likely be picked up in the subsequent literature searches. It is likely that for existing pathogens that have not had a major increase in incidence since the last update few new papers will be reporting delay distributions. In these cases papers that were not previously reviewed due to limited reviewing time for each round of updates are now checked. We particularly value community contributions to the database, so everyone can benefit from analysis that has already been conducted, and duplicated of effort is reduced. ## Database of excluded papers All papers that are returned in the search results but are not suitable, either at the stage of abstract screening, or after reviewing the entirety of the paper, are recorded in a database by the following information: - First author's last name - Unique identifier, ideally the DOI - Journal, pre-print server, or host website - One or several reasons for why it was deemed unsuitable - Date of recording ## References [^1]: We also include distribution parameters that are published on pre-print servers such as arxiv, bioRxiv and medRxiv. But will update the citation reference once the paper is accepted and published by a journal. [^2]: `{epiparameter}` is itself not a meta-analytic tool, but the references stored within may be useful as part of a meta-analysis.