Getting Started: OFH Synthetic Cohort Generation

Overview

This vignette shows how to generate synthetic cohort datasets for method development before using real health data.

The package-style API supports:

configurable cohort size
reproducible generation via seed
optional ICD-10 / OPCS4 / BNF code restrictions
configurable dataset coverage, record density, and field-level generation probabilities
control over whether to save CSVs and/or return R objects

1. Load the package

library(ofhsyn)

2. Generate a basic cohort

out <- generate_ofh_cohort(
  n = 1000,
  seed = 123
)

names(out)

This returns a named list of data frames and writes CSVs to an output folder in your current working directory.
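
A quick way to sanity-check the returned list is to tabulate rows, columns, and distinct participants per dataset. The sketch below uses a small stand-in list (out_demo) so that it runs without the package. The dataset names follow the examples in this vignette, but the helper summarise_datasets() and every column other than pid are hypothetical, and the presence of a pid column in each data frame is an assumption.

```r
# Stand-in for the list returned by generate_ofh_cohort(); columns are illustrative only.
out_demo <- list(
  questionnaire_data = data.frame(pid = 1:5, smoker = c(TRUE, FALSE, FALSE, TRUE, FALSE)),
  nhse_inpat_data    = data.frame(pid = c(2L, 2L, 4L), diag = c("I210", "I500", "I210"))
)

# Hypothetical helper: one summary row per data frame in the list.
summarise_datasets <- function(datasets) {
  data.frame(
    dataset = names(datasets),
    n_rows  = vapply(datasets, nrow, integer(1)),
    n_cols  = vapply(datasets, ncol, integer(1)),
    # Assumes each data frame identifies participants in a column named pid.
    n_pid   = vapply(datasets, function(d) length(unique(d$pid)), integer(1)),
    row.names = NULL
  )
}

summary_tbl <- summarise_datasets(out_demo)
print(summary_tbl)
```

With real output, pass out in place of out_demo.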

To return objects only (without writing CSV files):

out_objects_only <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  save_csv = FALSE,
  return_objects = TRUE
)

If you run this interactively, the generated data frames are also available by name in your R environment (for example, questionnaire_data, clinic_measurements_data, and nhse_inpat_data).

3. Restrict to specific code lists

out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10 = c(
    I210 = "STEMI of anterolateral wall",
    I500 = "Congestive heart failure"
  ),
  opcs4 = c(
    K491 = "Percutaneous transluminal balloon angioplasty of coronary artery"
  ),
  bnf_codes = data.frame(
    BNFCode = c("0212000B0", "0601022B0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)

You can also provide code files:

ICD-10 and OPCS4 files must contain a code and a description for each entry, either as CSV (code,description) or as tab-separated TXT (code<TAB>description).
BNF files must be CSV with the columns BNFCode, BNFName, and Formulation; Strength is optional.

out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10_file = "icd10_codes.txt",
  opcs4_file = "opcs4_codes.txt",
  bnf_codes_file = "bnf_medications.csv"
)
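
As a rough illustration of the file layouts described above, the following base R sketch writes a tab-separated ICD-10 file and a BNF CSV to a temporary directory and reads them back. Whether the package expects a header row in the TXT file is not stated here, so writing it without one is an assumption; the temporary paths are illustrative.

```r
# Write example code files to a temporary directory (paths are illustrative).
code_dir <- tempdir()

icd10_tbl <- data.frame(
  code = c("I210", "I500"),
  description = c("STEMI of anterolateral wall", "Congestive heart failure"),
  stringsAsFactors = FALSE
)
icd10_path <- file.path(code_dir, "icd10_codes.txt")
# Tab-separated, one code per line. Omitting the header row is an assumption.
write.table(icd10_tbl, icd10_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

bnf_tbl <- data.frame(
  BNFCode = "0212000B0",
  BNFName = "Atorvastatin 20 mg tablets",
  Formulation = "tablets",
  Strength = "20 mg",
  stringsAsFactors = FALSE
)
bnf_path <- file.path(code_dir, "bnf_medications.csv")
write.csv(bnf_tbl, bnf_path, row.names = FALSE)

# Read the files back as character so leading zeros in codes are preserved.
icd10_check <- read.delim(icd10_path, header = FALSE,
                          col.names = c("code", "description"),
                          colClasses = "character")
bnf_check <- read.csv(bnf_path, colClasses = "character")
```

The resulting icd10_path and bnf_path could then be supplied as icd10_file and bnf_codes_file.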

4. Configure dataset generation probabilities

The proportions, record_multipliers, and code_config arguments correspond to the dataset coverage, record density, and field-level generation probabilities listed in the overview.

out_custom <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  proportions = list(
    nhse_outpat = 0.25,
    nhse_inpat = 0.20,
    nhse_ed = 0.30,
    nhse_primcare_meds = 0.75
  ),
  record_multipliers = list(
    nhse_outpat = 1.2,
    nhse_inpat = 1.1,
    nhse_ed = 1.3
  ),
  code_config = list(
    nhse_outpat_data = list(diag_4_02_missing_prob = 0.70),
    nhse_inpat_data = list(single_diag_prob = 0.85)
  )
)
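
Before passing configuration lists like these to the generator, it can help to check them for out-of-range values. The helper below, check_generation_config(), is hypothetical and not part of the package. It assumes that proportions are shares of the cohort between 0 and 1 and that record multipliers must be positive; neither rule is confirmed by the package documentation shown here.

```r
# Hypothetical pre-flight check; not part of the package API.
check_generation_config <- function(proportions = list(), record_multipliers = list()) {
  props <- unlist(proportions)
  mults <- unlist(record_multipliers)
  problems <- character(0)

  # Assumption: a proportion is a share of the cohort, so it must lie in [0, 1].
  bad_props <- names(props)[is.na(props) | props < 0 | props > 1]
  if (length(bad_props) > 0) {
    problems <- c(problems, paste0("proportion outside [0, 1]: ", bad_props))
  }

  # Assumption: a record multiplier scales record counts, so it must be positive.
  bad_mults <- names(mults)[is.na(mults) | mults <= 0]
  if (length(bad_mults) > 0) {
    problems <- c(problems, paste0("non-positive multiplier: ", bad_mults))
  }

  problems
}

# One deliberately invalid proportion (1.3) to show what is reported.
problems <- check_generation_config(
  proportions = list(nhse_outpat = 0.25, nhse_ed = 1.3),
  record_multipliers = list(nhse_inpat = 1.1)
)
print(problems)
```

An empty character vector means no problems were found under these assumed rules.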

5. Use the OOP interface directly

syn <- OFHCohortSynthesizer$new(project_root = ".", seed = 123)

syn$set_code_pools(
  icd10 = c(I210 = "STEMI of anterolateral wall"),
  opcs4 = c(K491 = "Percutaneous transluminal balloon angioplasty of coronary artery"),
  bnf_meds = data.frame(
    BNFCode = c("0212000B0", "0601022B0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)

out <- syn$run_all(n = 800)

6. Practical tips for researchers

Start with small n (for example, 200 to 1000) while developing.
Fix seed for reproducibility during method testing.
Check row counts and pid linkage assumptions in your analysis scripts.
Expand code lists as your phenotype definitions evolve.
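
The pid linkage check mentioned above could look something like the following sketch. It uses a stand-in list so that it runs without the package. The helper find_orphan_pids() and the dataset name nhse_ed_data are hypothetical, and treating the questionnaire data as the reference set of participants is an assumption you should adapt to your own analysis.

```r
# Hypothetical helper: for each dataset, return pid values absent from the reference set.
find_orphan_pids <- function(datasets, reference_pids) {
  lapply(datasets, function(d) setdiff(unique(d$pid), reference_pids))
}

# Stand-in data; with real output, pass the list returned by the generator.
cohort_demo <- list(
  questionnaire_data = data.frame(pid = 1:6),
  nhse_inpat_data    = data.frame(pid = c(2L, 2L, 5L)),
  nhse_ed_data       = data.frame(pid = c(1L, 9L))  # pid 9 is deliberately unlinked
)

orphans <- find_orphan_pids(cohort_demo, cohort_demo$questionnaire_data$pid)

# Flag datasets that contain at least one pid missing from the reference set.
has_orphans <- vapply(orphans, length, integer(1)) > 0
print(has_orphans)
```

Any dataset flagged here contains participants that would be dropped by an inner join on pid.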

7. Notes

Some datasets are intentional subsets of the full cohort.
Questionnaire output includes a small proportion of v1 rows by design.
Primary care meds include prescribed-but-not-dispensed rows.
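
Because some datasets cover only part of the cohort, it may be useful to measure that coverage explicitly rather than assume it. The sketch below computes the share of cohort participants that appear at least once in each dataset. The helper coverage_share(), the stand-in data, and the dataset name nhse_ed_data are all hypothetical.

```r
# Hypothetical helper: share of cohort pids present at least once in each dataset.
coverage_share <- function(datasets, cohort_pids) {
  vapply(datasets, function(d) mean(cohort_pids %in% d$pid), numeric(1))
}

# Stand-in cohort of 10 participants and two partial datasets.
cohort_pids_demo <- 1:10
subsets_demo <- list(
  nhse_inpat_data = data.frame(pid = c(1L, 1L, 2L)),      # 2 distinct participants
  nhse_ed_data    = data.frame(pid = c(3L, 4L, 5L, 6L))   # 4 distinct participants
)

shares <- coverage_share(subsets_demo, cohort_pids_demo)
print(round(shares, 2))
```

Comparing these shares with the values passed in proportions is one way to confirm that a subset is intentional rather than the result of a linkage problem.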