Getting Started: OFH Synthetic Cohort Generation

Overview

This vignette shows how to generate synthetic cohort datasets for method development before using real health data.

The package-style API supports:

  • configurable cohort size
  • reproducible generation via seed
  • optional ICD-10 / OPCS4 / BNF code restrictions
  • configurable dataset coverage, record density, and field-level generation probabilities
  • control over whether to save CSVs and/or return R objects

1. Load the package

library(ofhsyn)

2. Generate a basic cohort

out <- generate_ofh_cohort(
  n = 1000,
  seed = 123
)

names(out)

By default, this returns a named list of data frames and also writes CSVs to an output folder in your current working directory.

To return objects only (without writing CSV files):

out_objects_only <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  save_csv = FALSE,
  return_objects = TRUE
)

If you run this interactively, the generated data frames are also available in your R environment (for example, questionnaire_data, clinic_measurements_data, and nhse_inpat_data).

3. Restrict to specific code lists

out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10 = c(
    I210 = "STEMI of anterolateral wall",
    I500 = "Congestive heart failure"
  ),
  opcs4 = c(
    K491 = "Percutaneous transluminal balloon angioplasty of one coronary artery"
  ),
  bnf_codes = data.frame(
    BNFCode = c("0212000B0", "0601022B0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)
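
If your phenotype code list already lives in a data frame, it can be reshaped into the named character vector form used above. This is a minimal base R sketch; the data frame phenotype_codes and its column names code and description are hypothetical and not part of ofhsyn.

```r
# Hypothetical phenotype code list; the column names are assumptions for this sketch.
phenotype_codes <- data.frame(
  code = c("I210", "I500"),
  description = c("STEMI of anterolateral wall", "Congestive heart failure"),
  stringsAsFactors = FALSE
)

# Names carry the codes and values carry the descriptions,
# matching the c(I210 = "...") form shown above.
icd10_codes <- setNames(phenotype_codes$description, phenotype_codes$code)

# Basic sanity checks: no duplicated codes, no missing descriptions.
stopifnot(!anyDuplicated(names(icd10_codes)), !anyNA(icd10_codes))

icd10_codes
```

The resulting vector could then be supplied as icd10 = icd10_codes.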

You can also provide code files:

  • ICD-10 and OPCS4 files must include both a code and a description, either as CSV (code,description) or as tab-separated TXT (code<TAB>description).
  • BNF files must be CSV with the columns BNFCode, BNFName, and Formulation; Strength is optional.

out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10_file = "icd10_codes.txt",
  opcs4_file = "opcs4_codes.txt",
  bnf_codes_file = "bnf_medications.csv"
)
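
Input files in the layouts listed above can be prepared with base R. The sketch below writes a tab-separated ICD-10 file and a BNF CSV to a temporary directory and reads them back. Writing the TXT file without a header row is an assumption of this example; check the package documentation for what the reader expects.

```r
# Write to the session's temporary directory rather than the working directory.
out_dir <- tempdir()

# ICD-10 codes as tab-separated TXT: code<TAB>description, one pair per line.
# Assumption: no header row in the TXT file.
icd10 <- c(
  I210 = "STEMI of anterolateral wall",
  I500 = "Congestive heart failure"
)
icd10_path <- file.path(out_dir, "icd10_codes.txt")
writeLines(paste(names(icd10), icd10, sep = "\t"), icd10_path)

# BNF medications as CSV, with the required column names in the header row.
bnf <- data.frame(
  BNFCode = "0212000B0",
  BNFName = "Atorvastatin 20 mg tablets",
  Formulation = "tablets",
  Strength = "20 mg",
  stringsAsFactors = FALSE
)
bnf_path <- file.path(out_dir, "bnf_medications.csv")
write.csv(bnf, bnf_path, row.names = FALSE)

# Read both files back to confirm the layout.
icd10_check <- read.delim(icd10_path, header = FALSE,
                          col.names = c("code", "description"),
                          colClasses = "character")
bnf_check <- read.csv(bnf_path, colClasses = "character")
```

The paths icd10_path and bnf_path could then be passed as icd10_file and bnf_codes_file.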

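4. Configure dataset generation probabilities

The three list arguments correspond to the options listed in the overview: proportions (dataset coverage), record_multipliers (record density), and code_config (field-level generation probabilities).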

out_custom <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  proportions = list(
    nhse_outpat = 0.25,
    nhse_inpat = 0.20,
    nhse_ed = 0.30,
    nhse_primcare_meds = 0.75
  ),
  record_multipliers = list(
    nhse_outpat = 1.2,
    nhse_inpat = 1.1,
    nhse_ed = 1.3
  ),
  code_config = list(
    nhse_outpat_data = list(diag_4_02_missing_prob = 0.70),
    nhse_inpat_data = list(single_diag_prob = 0.85)
  )
)
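
Before a long run, it can help to sanity-check the numeric configuration. This is a minimal base R sketch that assumes each entry of proportions is a probability in [0, 1] and each record multiplier is strictly positive. Both ranges and the helper name check_generation_config are assumptions of this example, not documented ofhsyn behavior.

```r
# Hypothetical helper: validates the two numeric configuration lists.
check_generation_config <- function(proportions, record_multipliers) {
  p <- unlist(proportions)
  m <- unlist(record_multipliers)
  # Assumed constraint: proportions are probabilities in [0, 1].
  bad_p <- names(p)[is.na(p) | p < 0 | p > 1]
  # Assumed constraint: multipliers are strictly positive.
  bad_m <- names(m)[is.na(m) | m <= 0]
  if (length(bad_p) > 0) {
    stop("proportions outside [0, 1]: ", paste(bad_p, collapse = ", "))
  }
  if (length(bad_m) > 0) {
    stop("non-positive record_multipliers: ", paste(bad_m, collapse = ", "))
  }
  invisible(TRUE)
}

proportions <- list(nhse_outpat = 0.25, nhse_inpat = 0.20, nhse_ed = 0.30)
record_multipliers <- list(nhse_outpat = 1.2, nhse_inpat = 1.1, nhse_ed = 1.3)

# Stops with an informative error if any value is out of range.
check_generation_config(proportions, record_multipliers)
```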

5. Use the OOP interface directly

syn <- OFHCohortSynthesizer$new(project_root = ".", seed = 123)

syn$set_code_pools(
  icd10 = c(I210 = "STEMI of anterolateral wall"),
  opcs4 = c(K491 = "Percutaneous transluminal balloon angioplasty of one coronary artery"),
  bnf_meds = data.frame(
    BNFCode = c("0212000B0", "0601022B0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)

out <- syn$run_all(n = 800)

6. Practical tips for researchers

  • Start with small n (for example, 200 to 1000) while developing.
  • Fix seed for reproducibility during method testing.
  • Check row counts and pid linkage assumptions in your analysis scripts.
  • Expand code lists as your phenotype definitions evolve.
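
The row-count and linkage checks suggested above might look like the following sketch. The toy list stands in for the output of generate_ofh_cohort(); the column name pid and the use of questionnaire_data as the reference table are assumptions of this example.

```r
# Toy stand-in for the returned list of data frames (not real ofhsyn output).
out_toy <- list(
  questionnaire_data = data.frame(pid = 1:5),
  nhse_inpat_data = data.frame(pid = c(2, 2, 4)),
  nhse_outpat_data = data.frame(pid = c(1, 5, 9))  # pid 9 is deliberately unlinked
)

# Rows per dataset.
row_counts <- vapply(out_toy, nrow, integer(1))

# Distinct pids in each dataset that are missing from the reference table.
reference_pids <- out_toy$questionnaire_data$pid
unlinked <- vapply(
  out_toy,
  function(df) length(setdiff(unique(df$pid), reference_pids)),
  integer(1)
)

# One summary row per dataset.
linkage_summary <- data.frame(
  dataset = names(out_toy),
  rows = row_counts,
  unlinked_pids = unlinked,
  row.names = NULL
)
linkage_summary
```

A non-zero unlinked_pids value flags a dataset whose identifiers do not all appear in the reference table.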

7. Notes

  • Some datasets are intentional subsets of the full cohort.
  • Questionnaire output includes a small proportion of v1 records by design.
  • Primary care medication data include rows for items that were prescribed but not dispensed.