| Title: | UK Biobank Data Processing and Survival Analysis Toolkit |
|---|---|
| Description: | Provides an integrated workflow for UK Biobank Research Analysis Platform (RAP) hosted and RAP-generated analysis tables. The package supports RAP phenotype extraction planning, predefined variable sets and disease definitions, standardized baseline preprocessing, multi-source endpoint ascertainment, prevalent and incident case classification, survival-ready cohort construction, regression, multiple imputation, propensity score analysis, mediation analysis, subgroup and sensitivity analyses, machine learning, proteomics enrichment and protein-protein interaction analysis, and publication-oriented visualization. The package workflow is described in He et al. (2026) <doi:10.64898/2026.06.19.26356057>. |
| Authors: | Nan He [aut, cre] (ORCID: <https://orcid.org/0009-0008-6932-3867>) |
| Maintainer: | Nan He <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.0 |
| Built: | 2026-06-30 13:23:36 UTC |
| Source: | https://github.com/cran/UKBAnalytica |
Assess balance of covariates between treatment groups before and after matching or weighting.
assess_balance(data, treatment, covariates, method = c("unmatched", "matched", "weighted"), weight_col = NULL, threshold = 0.1)
data |
A data.frame or data.table. |
treatment |
Character string specifying the treatment variable name. |
covariates |
Character vector of covariate names to assess. |
method |
Character string specifying the data type: "unmatched", "matched", or "weighted". |
weight_col |
Character string specifying the weight column name (for weighted method). |
threshold |
Numeric threshold for SMD to determine balance. Default 0.1. |
A data.frame with balance statistics:
Variable name
Mean in treatment group
Mean in control group
Standardized mean difference
Variance ratio (treated/control)
Whether SMD < threshold
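The balance statistics listed above can be illustrated with a minimal base R sketch. It is not the package implementation: the pooled-standard-deviation form of the SMD, the use of the absolute SMD for the balance flag, and all column names in the output are assumptions made for illustration.

```r
# Sketch only: per-covariate balance statistics for an unweighted comparison.
# Assumes a binary 0/1 treatment and numeric covariates.
balance_sketch <- function(data, treatment, covariates, threshold = 0.1) {
  treated <- data[[treatment]] == 1
  rows <- lapply(covariates, function(v) {
    x1 <- data[[v]][treated]
    x0 <- data[[v]][!treated]
    m1 <- mean(x1, na.rm = TRUE)
    m0 <- mean(x0, na.rm = TRUE)
    v1 <- var(x1, na.rm = TRUE)
    v0 <- var(x0, na.rm = TRUE)
    # Assumed SMD definition: mean difference over the pooled SD
    smd <- (m1 - m0) / sqrt((v1 + v0) / 2)
    data.frame(
      variable = v,              # hypothetical output column names
      mean_treated = m1,
      mean_control = m0,
      smd = smd,
      var_ratio = v1 / v0,       # treated / control
      balanced = abs(smd) < threshold
    )
  })
  do.call(rbind, rows)
}

toy <- data.frame(
  treat = c(1, 1, 1, 0, 0, 0),
  age   = c(60, 62, 64, 50, 52, 54),
  bmi   = c(25, 27, 29, 25, 27, 29)
)
bal <- balance_sketch(toy, "treat", c("age", "bmi"))
print(bal)
```

In this toy data, age is clearly imbalanced between groups while bmi has identical distributions, so only bmi is flagged as balanced.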
Integrates diagnosis data from multiple sources (ICD-10, ICD-9, self-report, death, OPCS4 procedures, cancer registry records, First Occurrence fields, algorithm) to generate a survival dataset. By default, returns a wide table that retains all participants and adds disease history/incident indicators plus follow-up time for a primary disease.
build_survival_dataset(dt, disease_definitions, prevalent_sources = c("ICD10", "ICD9", "Self-report", "Death"), outcome_sources = c("ICD10", "ICD9", "Death"), censor_date = as.Date("2023-10-31"), baseline_col = "p53_i0", time_skeleton = NULL, primary_disease = NULL, output = c("wide", "long"), include_all = TRUE, show_flow = TRUE, dt_threads = NULL)
dt |
A data.table or data.frame containing complete UKB data. |
disease_definitions |
Named list of disease definitions (see |
prevalent_sources |
Character vector specifying data sources for identifying
prevalent (baseline) cases. Self-report is recommended here since participants
reporting a disease at baseline clearly had it before enrollment. Default includes
all core sources: "ICD10", "ICD9", "Self-report", "Death".
"OPCS4" can be added for surgical phenotypes when |
outcome_sources |
Character vector specifying data sources for defining incident outcomes. Self-report is typically excluded here because self-reported diagnosis dates are imprecise (year only) and less reliable for prospective endpoint ascertainment. Default: "ICD10", "ICD9", "Death". "CancerRegistry" can be added for cancer outcomes; "FirstOccurrence" can be added when the extracted dataset includes UKB First Occurrence fields for the disease definition. "OPCS4" can be included when the event of interest is a surgery or procedure-based phenotype. |
censor_date |
Administrative censoring date (default: "2023-10-31"). |
baseline_col |
Column name for baseline assessment date (default: "p53_i0"). |
time_skeleton |
Optional output from |
primary_disease |
Disease key used to compute follow-up time and event
status (must be in |
output |
Output format: |
include_all |
Logical; when |
show_flow |
Logical; if |
dt_threads |
Optional integer. If provided, temporarily sets
|
This function supports separate source definitions for prevalent case exclusion and outcome ascertainment. This is important because:
Self-reported conditions at baseline clearly indicate pre-existing disease and should be used for prevalent case identification.
However, self-reported incident events during follow-up have imprecise dates (year only) and lower validity, making them unsuitable for outcome definition.
OPCS4 procedure dates are often useful for procedure-defined endpoints or surgical history, but may occur later than the true biological disease onset.
Case classification logic:
Prevalent case: Earliest diagnosis date (from prevalent_sources) <= baseline date. These participants have outcome_status = NA and outcome_surv_time = NA because they are not at risk for incident disease.
Incident case: Earliest diagnosis date (from outcome_sources) > baseline date (status = 1).
Censored: No diagnosis by end of follow-up (status = 0).
Follow-up time calculation (controlled by primary_disease):
Prevalent case (primary disease): NA (not at risk)
Incident case: (diagnosis_date - baseline_date) / 365.25
Censored: (min(death_date, censor_date) - baseline_date) / 365.25
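The classification and follow-up rules above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package code: the function name and argument names are hypothetical, and treating a diagnosis after the end of follow-up as censored is an assumption the rules above do not spell out.

```r
# Sketch only: classify participants and compute follow-up time in years.
# All inputs are Date vectors of equal length, except censor_date (length 1).
classify_followup <- function(baseline, prev_dx, out_dx, death,
                              censor_date = as.Date("2023-10-31")) {
  # Prevalent: earliest diagnosis from prevalent sources on or before baseline
  prevalent <- !is.na(prev_dx) & prev_dx <= baseline

  # End of follow-up without an event: min(death_date, censor_date)
  end_fu <- death
  end_fu[is.na(death) | death > censor_date] <- censor_date

  # Incident: earliest diagnosis from outcome sources after baseline
  # (assumption: events after end of follow-up are not counted)
  incident <- !prevalent & !is.na(out_dx) & out_dx > baseline & out_dx <= end_fu

  stop_date <- end_fu
  stop_date[incident] <- out_dx[incident]

  status <- ifelse(prevalent, NA_real_, as.numeric(incident))
  time <- as.numeric(stop_date - baseline) / 365.25
  time[prevalent] <- NA_real_  # prevalent cases are not at risk

  data.frame(prevalent = as.integer(prevalent), status = status, time = time)
}

res <- classify_followup(
  baseline = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01")),
  prev_dx  = as.Date(c("2005-06-01", NA, NA)),
  out_dx   = as.Date(c("2005-06-01", "2015-01-01", NA)),
  death    = as.Date(c(NA, NA, "2020-01-01"))
)
print(res)
```

The three toy participants are, in order, a prevalent case (status and time NA), an incident case, and a participant censored at death.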
A data.table with columns:
Participant identifier
<Disease>_history: 1 if prevalent case (from prevalent_sources), 0 otherwise
<Disease>_incident: 1 if incident case (from outcome_sources), 0 otherwise
Event indicator for primary disease (1=event, 0=censored, NA=prevalent case)
Follow-up time in years for primary disease (NA for prevalent cases)
Computes averaged air pollution exposures from multiple time points.
calculate_air_pollution(df, pollutants = c("NO2", "PM10", "PM2.5", "NOx"))
df |
A data.table containing air pollution columns |
pollutants |
Character vector of pollutants to calculate. Available: "NO2", "PM10", "PM2.5", "NOx" |
A data.table with averaged pollution columns
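As a rough illustration of averaging an exposure over several time points, the base R sketch below takes a row-wise mean across a set of columns. The column names, the helper name, and the choice to ignore missing time points are assumptions; the package may handle missing values differently.

```r
# Sketch only: row-wise mean of one pollutant measured at several time points.
average_exposure <- function(df, cols) {
  avg <- rowMeans(df[, cols, drop = FALSE], na.rm = TRUE)
  avg[is.nan(avg)] <- NA_real_  # all time points missing -> NA
  unname(avg)
}

# Hypothetical column names for two NO2 measurement years
toy <- data.frame(
  no2_2005 = c(30, NA, NA),
  no2_2010 = c(20, 25, NA)
)
toy$no2_avg <- average_exposure(toy, c("no2_2005", "no2_2010"))
print(toy)
```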
Combines automated and manual BP readings using UK Biobank collection logic: automated readings are primary and manual readings are used as fallback when automated measurements are unavailable. Returns the mean of the two available readings.
calculate_blood_pressure(df, type = c("sbp", "dbp"), prefer = c("auto", "manual"))
df |
A data.table containing BP columns |
type |
Character: "sbp" or "dbp" |
prefer |
Character: |
A data.table with calculated sbp or dbp column added
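The fallback logic described for blood pressure can be sketched in base R. The sketch assumes the fallback is applied per reading slot and that a single available reading is returned as-is; both are assumptions, as are the function and variable names.

```r
# Sketch only: combine two automated and two manual readings per participant.
# `auto` and `manual` are numeric matrices with one column per reading.
combine_bp <- function(auto, manual, prefer = c("auto", "manual")) {
  prefer <- match.arg(prefer)
  primary  <- if (prefer == "auto") auto else manual
  fallback <- if (prefer == "auto") manual else auto
  # Use the preferred reading; fall back when it is missing
  filled <- ifelse(is.na(primary), fallback, primary)
  out <- rowMeans(filled, na.rm = TRUE)
  out[is.nan(out)] <- NA_real_  # no readings at all -> NA
  out
}

auto   <- cbind(c(120, NA, NA), c(124, NA, 130))
manual <- cbind(c(NA, 140, 128), c(NA, 144, NA))
sbp <- combine_bp(auto, manual)
print(sbp)
```

The second participant has only manual readings and the third mixes one manual and one automated reading, which shows the fallback in action.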
Computes a simplified healthy diet score based on food frequency questionnaire.
calculate_diet_score(df, components = c("fruit", "vegetable", "fish", "meat", "cereal", "milk"), na_handling = c("strict", "partial"))
df |
A data.table containing diet-related columns |
components |
Character vector of diet components to include. Available: "fruit", "vegetable", "fish", "meat", "cereal", "milk" |
na_handling |
Character: "strict" (NA if any component missing) or "partial" (calculate from available components, NA only if insufficient data) |
A data.table with diet_score column (0-7 scale)
Calculate inverse probability of treatment weights (IPTW) for causal inference.
calculate_weights(data, ps_col = "ps", treatment, weight_type = c("ATE", "ATT", "ATC"), stabilized = TRUE, truncate = c(0.01, 0.99))
data |
A data.table containing propensity scores. |
ps_col |
Character string specifying the propensity score column name. Default "ps". |
treatment |
Character string specifying the treatment variable name. |
weight_type |
Character string specifying weight type: "ATE", "ATT", or "ATC". |
stabilized |
Logical; whether to use stabilized weights. Default TRUE. |
truncate |
Numeric vector of length 2 specifying quantiles for weight truncation. Default c(0.01, 0.99). |
Weight formulas:
ATE: T/PS + (1-T)/(1-PS)
ATT: T + (1-T) * PS/(1-PS)
ATC: T * (1-PS)/PS + (1-T)
Stabilized weights multiply by the marginal probability of treatment.
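The weight formulas above translate directly into base R. The sketch below is illustrative rather than the package implementation: applying stabilization only to ATE weights (treated multiplied by P(T = 1), controls by P(T = 0)) and clipping weights at the truncation quantiles are assumptions about behavior the text does not fully specify.

```r
# Sketch only: IPTW weights from a 0/1 treatment vector and propensity scores.
iptw_sketch <- function(t, ps, weight_type = c("ATE", "ATT", "ATC"),
                        stabilized = FALSE, truncate = NULL) {
  weight_type <- match.arg(weight_type)
  w <- switch(weight_type,
    ATE = t / ps + (1 - t) / (1 - ps),
    ATT = t + (1 - t) * ps / (1 - ps),
    ATC = t * (1 - ps) / ps + (1 - t)
  )
  if (stabilized && weight_type == "ATE") {
    # Assumed stabilization: multiply by the marginal probability of the
    # treatment level actually received
    p_treat <- mean(t)
    w <- w * ifelse(t == 1, p_treat, 1 - p_treat)
  }
  if (!is.null(truncate)) {
    # Assumed truncation: clip weights at the given quantiles
    lims <- quantile(w, probs = truncate, names = FALSE)
    w <- pmin(pmax(w, lims[1]), lims[2])
  }
  w
}

t  <- c(1, 1, 0, 0)
ps <- c(0.8, 0.5, 0.5, 0.2)
w_ate  <- iptw_sketch(t, ps, "ATE")
w_att  <- iptw_sketch(t, ps, "ATT")
w_atc  <- iptw_sketch(t, ps, "ATC")
w_sate <- iptw_sketch(t, ps, "ATE", stabilized = TRUE)
print(rbind(w_ate, w_att, w_atc, w_sate))
```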
A data.table with the original data plus:
IPTW weight
Classify metabolite-like names into broad groups used by the metabolomics ORA workflow. Small molecules can be mapped to MetaboAnalyst-compatible names, whereas lipoprotein subclass measures, lipid aggregate measures, and proteins are retained in the mapping table but are not passed to small-molecule ORA by default.
classify_metabolites(metabolites)
metabolites |
Character vector of metabolite names. |
A data.frame with metabolite, category, and metaboanalyst_name.
classify_metabolites(c("Alanine", "LDL Cholesterol", "Apolipoprotein B"))
Extract effect estimates from mediation analysis results.
## S3 method for class 'mediation_result'
coef(object, ...)
object |
An object of class "mediation_result". |
... |
Additional arguments (unused). |
A data.frame with effect estimates.
Merges multiple disease definitions into a single composite endpoint definition. Useful for creating MACE (Major Adverse Cardiovascular Events) or similar composite outcomes.
combine_disease_definitions(..., name = "Combined")
... |
Disease definition objects to combine. |
name |
Name for the composite outcome. |
A combined disease definition object.
Generates a summary table comparing case counts from different data sources. Useful for methods sections and sensitivity analysis planning.
compare_data_sources(dt, disease_definitions, baseline_col = "p53_i0")
dt |
A data.table containing complete UKB data. |
disease_definitions |
Named list of disease definitions. |
baseline_col |
Column name for baseline date. |
A data.table with case counts by source and combination.
A thin wrapper around TCMDATA::compute_nodeinfo().
compute_protein_ppi_metrics(ppi, weight_attr = "score", normalize = FALSE, seed = 42)
ppi |
An |
weight_attr |
Character. Edge attribute used as weight. Default is
|
normalize |
Logical. Whether to normalize betweenness and closeness.
Default is |
seed |
Numeric random seed used by the EPC calculation. Default is
|
An igraph object with additional vertex attributes.
Extract confidence intervals from mediation analysis results.
## S3 method for class 'mediation_result'
confint(object, parm = NULL, level = 0.95, ...)
object |
An object of class "mediation_result". |
parm |
Character vector of effect names. If NULL, returns all effects. |
level |
Confidence level. Default 0.95. |
... |
Additional arguments (unused). |
A matrix with lower and upper confidence limits.
Create a baseline table comparing cases and controls under different conditions.
create_baseline_table(data, case_col, factor_cols = NULL, continuous_cols = NULL, test = FALSE)
data |
A data.table containing the baseline characteristics and case/control status. |
case_col |
The name of the column indicating case/control status (1 for cases, 0 for controls). |
factor_cols |
A vector of column names that are factors (categorical variables). |
continuous_cols |
A vector of column names that are continuous variables. |
test |
Whether to perform statistical tests comparing cases and controls for each variable (default: FALSE). |
A list containing table one information.
https://github.com/kaz-yos/tableone
Helper function to create a standardized disease definition object containing ICD-10/ICD-9 patterns, self-report codes, UK Biobank First Occurrence fields, and optionally a UK Biobank algorithmically-defined outcome date field.
create_disease_definition(name = NULL, icd10_pattern = NULL, icd9_pattern = NULL, sr_codes = NULL, death_icd10 = NULL, opcs4_pattern = NULL, first_occurrence_fields = NULL, first_occurrence_source_fields = NULL, cancer_icd10_pattern = NULL, cancer_histology = NULL, cancer_behaviour = NULL, algo_date_field = NULL, algo_source_field = NULL, icd10 = NULL, icd9 = NULL, self_report = NULL)
name |
Full disease name (e.g., "Aortic Aneurysm"). If NULL, defaults to "Custom disease". |
icd10_pattern |
Regular expression pattern for ICD-10 codes (optional). |
icd9_pattern |
Regular expression pattern for ICD-9 codes (optional). |
sr_codes |
Integer vector of UKB self-report illness codes (optional). |
death_icd10 |
Optional regular expression pattern (or code vector) for
death-cause ICD-10 matching. If NULL, defaults to |
opcs4_pattern |
Optional regular expression pattern (or code vector) for OPCS4 operative procedure matching. If NULL, operative procedures are not used in case ascertainment. |
first_occurrence_fields |
Optional integer vector of UK Biobank First Occurrence date field IDs. These fields are generated for 3-character ICD-10 codes in Category 1712, e.g. 131298 for I21 (acute myocardial infarction) and 130708 for E11 (type 2 diabetes). The source field is normally the next field ID and is inferred automatically. |
first_occurrence_source_fields |
Optional integer vector of First
Occurrence source field IDs. If NULL, uses |
cancer_icd10_pattern |
Optional regular expression pattern for UKB cancer registry ICD-10 codes (Field 40006). |
cancer_histology |
Optional integer vector of tumour histology codes (Field 40011) to retain. |
cancer_behaviour |
Optional integer vector of tumour behaviour codes
(Field 40012) to retain. Use |
algo_date_field |
Integer. UKB field ID for the algorithmically-defined
outcome date (Category 42). For example, 42016 for COPD, 42014 for Asthma.
The corresponding data column can be |
algo_source_field |
Integer. UKB field ID for the algorithmically-defined outcome source (Category 42). For example, 42017 for COPD source and 42015 for Asthma source. Stored as metadata for source provenance. |
icd10 |
Deprecated alias of |
icd9 |
Deprecated alias of |
self_report |
Deprecated alias of |
A list containing the disease definition parameters.
Converts a list of data.frames to an imputationList object for use with mitools functions.
create_imputation_list(datasets, validate = TRUE)
datasets |
A list of data.frames (imputed datasets). |
validate |
Logical; whether to validate that all datasets have the same structure. Default TRUE. |
An imputationList object.
Helper for defining medication code sets from UK Biobank self-reported treatment/medication fields. The first implementation focuses on field 20003 arrays and intentionally stores only medication codes and classes, not copied source codelist descriptions.
create_medication_definition(name, codes, source = "Self-report 20003", field_id = 20003L, medication_class = NULL, match_type = "exact")
name |
Medication definition name. |
codes |
Character or numeric medication codes. |
source |
Source label. Defaults to |
field_id |
UK Biobank field ID. Defaults to 20003. |
medication_class |
Optional medication class label. |
match_type |
Matching mode. Defaults to |
A list describing the medication definition.
bp <- create_medication_definition("Any BP medication", c(1, 2, 3))
Calculate propensity scores using logistic regression or gradient boosting.
estimate_propensity_score(data, treatment, covariates, method = c("logistic", "gbm"), formula = NULL)
data |
A data.frame or data.table containing all variables. |
treatment |
Character string specifying the treatment variable name (binary 0/1). |
covariates |
Character vector of covariate names used to estimate propensity scores. |
method |
Character string specifying the estimation method: "logistic" (default) or "gbm". |
formula |
Optional custom formula. If NULL, formula is built from treatment and covariates. |
A data.table with the original data plus:
Propensity score (probability of treatment)
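For orientation, the logistic-regression route can be sketched with base R alone. This is a generic illustration on simulated data, not the package's internal code; the variable names and the simulated data-generating model are assumptions, and the gradient boosting option is not shown.

```r
# Sketch only: propensity scores from a logistic regression on simulated data.
set.seed(1)
n <- 200
toy <- data.frame(
  age = rnorm(n, mean = 60, sd = 8),
  sex = rbinom(n, 1, 0.5)
)
# Simulated binary treatment that depends on age (illustrative only)
toy$treat <- rbinom(n, 1, plogis(-3 + 0.05 * toy$age))

# Build the formula from the treatment and covariate names
ps_formula <- reformulate(c("age", "sex"), response = "treat")
fit <- glm(ps_formula, data = toy, family = binomial())

# Fitted probability of treatment for each participant
toy$ps <- predict(fit, type = "response")
summary(toy$ps)
```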
Flexibly extracts disease cases using user-specified data sources. Enables main analysis with strict case definitions (e.g., ICD-10 only) and sensitivity analyses with broader definitions (e.g., all sources).
extract_cases_by_source(dt, disease_definitions, sources = c("ICD10", "ICD9", "Self-report", "Death"), censor_date = as.Date("2023-10-31"), baseline_col = "p53_i0")
dt |
A data.table or data.frame containing complete UKB data. |
disease_definitions |
Named list of disease definitions. |
sources |
Character vector specifying data sources to include.
Valid options: "ICD10", "ICD9", "Self-report", "Death", "OPCS4",
"CancerRegistry", "FirstOccurrence", "Algorithm".
"OPCS4" uses hospital inpatient summary operations
( |
censor_date |
Administrative censoring date. |
baseline_col |
Column name for baseline assessment date. |
This function is designed for epidemiological studies requiring:
Main analysis with hospital-confirmed diagnoses only
Sensitivity analyses including self-reported conditions
Procedure-augmented definitions for surgical phenotypes using OPCS4
Cancer registry ascertainment for malignant neoplasm endpoints
First Occurrence fields for UKB's pre-mapped 3-character ICD-10 outcomes
Source-specific case counts for methods reporting
UK Biobank algorithmically-defined outcomes for validated case ascertainment
The "Algorithm" source reads date fields from UK Biobank Category 42
(Algorithmically-defined outcomes). These are pre-computed by the UK Biobank
outcome adjudication group, combining self-report, hospital admissions,
and death records with high positive predictive value.
Records with date 1900-01-01 are excluded (date unknown).
If a source field is available in the definition, it is propagated into
diagnosis_source as "Algorithm_<source_code>".
The "FirstOccurrence" source reads singular UKB fields such as
p131298_i0 or p131298 for I21 first reported. Values with
UKB special date coding 819 (1900-01-01, 1901-01-01,
1902-02-02, 1903-03-03, 1909-09-09, and
2037-07-07) are excluded.
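The exclusion of special-coded dates can be illustrated in base R. The sketch uses the special dates listed above; setting them to NA (rather than dropping rows) and the variable names are assumptions made for illustration.

```r
# Sketch only: blank out UKB special date codes in a First Occurrence field.
special_dates <- as.Date(c(
  "1900-01-01", "1901-01-01", "1902-02-02",
  "1903-03-03", "1909-09-09", "2037-07-07"
))

# Hypothetical First Occurrence dates for five participants
fo_date <- as.Date(c("2012-05-14", "1902-02-02", NA, "2037-07-07", "2001-11-30"))

fo_clean <- fo_date
fo_clean[fo_clean %in% special_dates] <- NA  # treat special codes as unknown
print(fo_clean)
```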
A data.table with case-level survival data from specified sources.
Extracts baseline prevalent Type 1 and Type 2 diabetes using existing source-based disease history logic, and optionally augments Type 2 classification using baseline HbA1c.
extract_diabetes_subtype_baseline(dt, disease_definitions = NULL, sources = c("ICD10", "ICD9", "Self-report"), baseline_col = "p53_i0", hba1c_col = "p30750_i0", hba1c_threshold = 48, include_hba1c = TRUE)
dt |
A data.table or data.frame containing UKB data. |
disease_definitions |
Named list of disease definitions. If NULL,
uses |
sources |
Character vector specifying sources for baseline history. Options: "ICD10", "ICD9", "Self-report", "Death", "CancerRegistry", "FirstOccurrence", "Algorithm". |
baseline_col |
Column name for baseline date. Default: |
hba1c_col |
Column name for baseline HbA1c (mmol/mol).
Default: |
hba1c_threshold |
Numeric threshold for diabetes by HbA1c.
Default: |
include_hba1c |
Logical. If TRUE (default), HbA1c is used to augment T2DM classification. |
This is a baseline classification helper and does not redefine incident event logic. Type 1 has priority when both T1 and T2 evidence are present.
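The priority rule can be sketched in base R as below. This is an illustration only: the function name and output column names are hypothetical, and comparing HbA1c with ">=" against the threshold is an assumption about how the criterion is applied.

```r
# Sketch only: baseline diabetes subtype with Type 1 priority and optional
# HbA1c augmentation of Type 2.
classify_diabetes <- function(t1_history, t2_history, hba1c,
                              threshold = 48, include_hba1c = TRUE) {
  # HbA1c flag: 1 if at or above threshold, 0 if below, NA if not measured
  hba1c_flag <- ifelse(is.na(hba1c), NA_integer_, as.integer(hba1c >= threshold))
  by_hba1c <- include_hba1c & !is.na(hba1c_flag) & hba1c_flag == 1
  t2_enhanced <- as.integer(t2_history == 1 | by_hba1c)
  any_diabetes <- as.integer(t1_history == 1 | t2_enhanced == 1)
  # Type 1 wins when both T1 and T2 evidence are present
  subtype <- ifelse(t1_history == 1, "Type1",
                    ifelse(t2_enhanced == 1, "Type2", "No_diabetes"))
  data.frame(hba1c_flag, t2_enhanced, any_diabetes, subtype,
             stringsAsFactors = FALSE)
}

out <- classify_diabetes(
  t1_history = c(1, 0, 0, 0),
  t2_history = c(1, 1, 0, 0),
  hba1c      = c(55, 40, 50, NA)
)
print(out)
```

The third toy participant has no source-based history but is classified as Type 2 through the HbA1c criterion; with include_hba1c = FALSE that participant would be "No_diabetes".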
A data.table with columns:
Participant identifier
Baseline prevalent T1DM from selected sources (0/1)
Baseline prevalent T2DM from selected sources (0/1)
Baseline HbA1c diabetes flag (0/1/NA)
T2DM from source history OR HbA1c criterion (0/1)
Any baseline diabetes (T1DM or enhanced T2DM) (0/1)
"Type1", "Type2", or "No_diabetes"
Defines whether each participant has a selected disease using one or more
UK Biobank evidence sources. This is the recommended public helper when the
goal is disease ascertainment rather than construction of a full survival
cohort. For survival-ready endpoints, use build_survival_dataset.
extract_disease_diagnosis(dt, disease, disease_definitions = NULL, sources = c("ICD10", "ICD9", "Self-report", "Death"), censor_date = as.Date("2023-10-31"), baseline_col = "p53_i0", include_all = TRUE)
dt |
A data.table or data.frame containing UKB data. |
disease |
Character vector of disease keys or disease names. |
disease_definitions |
Optional named list of disease definitions. If
|
sources |
Character vector of evidence sources. Valid options are
|
censor_date |
Administrative censoring date. |
baseline_col |
Column name for baseline assessment date. |
include_all |
Logical. If |
A data.table with participant-level diagnosis status, first diagnosis
date, diagnosis source, prevalent and incident indicators, and survival
fields returned by extract_cases_by_source where available.
Extracts prevalent case status (diagnosed before baseline) for specified diseases. Designed for use as covariates in sensitivity analyses or covariate adjustment. Returns a wide-format table with one binary column per disease.
extract_disease_history(dt, diseases, disease_definitions = NULL, sources = "ICD10", baseline_col = "p53_i0")
dt |
A data.table or data.frame containing complete UKB data. |
diseases |
Character vector of disease names to extract.
Must match keys in |
disease_definitions |
Named list of disease definitions. If NULL,
uses |
sources |
Character vector specifying data sources. Default: "ICD10". Options: "ICD10", "ICD9", "Self-report", "Death", "OPCS4", "CancerRegistry", "FirstOccurrence", "Algorithm". |
baseline_col |
Column name for baseline assessment date. |
This function is specifically designed for extracting covariate data in epidemiological studies. Common use cases:
Adjusting for baseline comorbidities in Cox regression
Sensitivity analyses with different case definitions
Creating propensity score matching variables
The function only returns history (prevalent) columns, not incident columns, to clearly separate covariate extraction from outcome definition.
A data.table with columns:
Participant identifier
1 if prevalent case, 0 otherwise (one column per disease)
Extracts prevalent case status from multiple data source combinations simultaneously for sensitivity analysis comparison. Returns a table with separate columns for each source definition.
extract_disease_history_sensitivity(dt, diseases, disease_definitions = NULL, baseline_col = "p53_i0")
dt |
A data.table or data.frame containing complete UKB data. |
diseases |
Character vector of disease names to extract. |
disease_definitions |
Named list of disease definitions. |
baseline_col |
Column name for baseline date. |
A data.table with columns:
Participant identifier
Prevalent case using ICD-10 only
Prevalent case using ICD-10 + ICD-9
Prevalent case using all sources
Processes medication fields (6177 for male, 6153 for female) to extract specific medication categories.
extract_medications(df, medications = c("cholesterol", "blood_pressure", "insulin"))
df |
A data.table containing medication columns (p6177_i0, p6153_i0) |
medications |
Character vector of medications to extract. Available: "cholesterol", "blood_pressure", "insulin" |
A data.table with binary medication columns added (1=Yes, 0=No, NA=Missing)
Matches UK Biobank treatment/medication code arrays (p20003_i*_a*) against
predefined or user-supplied medication definitions and appends binary
participant-level medication indicators.
extract_self_report_medications(data, medications = NULL, medication_definitions = get_predefined_medications(), id_col = "eid", instance = 0, prefix = "med20003", missing_as_zero = TRUE, return_long = FALSE)
data |
A data.frame or data.table containing field 20003 array columns. |
medications |
Optional medication definition names to extract. If NULL, all predefined definitions are used. |
medication_definitions |
Named list of medication definitions. Defaults
to |
id_col |
Participant identifier column. |
instance |
Optional UKB assessment instance. If NULL, all available instances are searched. |
prefix |
Prefix for output variable names. |
missing_as_zero |
Logical. If TRUE, participants with no valid 20003 entries are coded as 0; otherwise they are coded as NA. |
return_long |
Logical. If TRUE, return one row per participant and medication definition instead of appending wide columns. |
A data.table.
dat <- data.frame(
  eid = 1:3,
  p20003_i0_a0 = c("1140883066", "1140874686", NA),
  p20003_i0_a1 = c(NA, "1140851690", NA)
)
extract_self_report_medications(dat, medications = c("Insulin", "Metformin"))
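To make the matching step concrete, the base R sketch below flags participants whose field 20003 arrays contain any code from a code set. It is not the package implementation: the helper name is hypothetical, the input is assumed to be a plain data.frame, and the code set is chosen only for illustration, with no claim about which medication it represents.

```r
# Sketch only: participant-level 0/1 flag from p20003_i<instance>_a<array> columns.
flag_medication <- function(data, codes, instance = 0, missing_as_zero = TRUE) {
  pattern <- sprintf("^p20003_i%d_a[0-9]+$", instance)
  arr_cols <- grep(pattern, names(data), value = TRUE)
  arr <- as.matrix(data[arr_cols])
  # TRUE where any array entry matches one of the target codes
  hit <- rowSums(matrix(arr %in% as.character(codes), nrow = nrow(arr))) > 0
  has_entry <- rowSums(!is.na(arr)) > 0
  flag <- as.integer(hit)
  # Participants with no valid entries: 0 or NA depending on missing_as_zero
  if (!missing_as_zero) flag[!has_entry] <- NA_integer_
  flag
}

dat <- data.frame(
  eid = 1:3,
  p20003_i0_a0 = c("1140883066", "1140874686", NA),
  p20003_i0_a1 = c(NA, "1140851690", NA)
)
target_codes <- "1140883066"  # illustrative code set, not a curated definition
flag_zero <- flag_medication(dat, target_codes)
flag_na   <- flag_medication(dat, target_codes, missing_as_zero = FALSE)
print(cbind(eid = dat$eid, flag_zero, flag_na))
```

The third participant has no entries at all, so the two settings of missing_as_zero differ only for that row.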
Fits the specified regression model on each imputed dataset.
fit_mi_models(datasets, formula, model_type = c("lm", "logistic", "poisson", "cox", "negbin"), family = NULL, ...)
datasets |
A list of data.frames or an |
formula |
A formula specifying the model. |
model_type |
Character string specifying the model type. |
family |
A |
... |
Additional arguments passed to the model fitting function. |
A list of fitted model objects.
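The general pattern of fitting one model per imputed dataset can be sketched in base R. The datasets below are simulated stand-ins rather than real imputations, only the linear model case is shown, and the helper names are hypothetical.

```r
# Sketch only: fit the same linear model on each dataset in a list.
set.seed(42)
make_dataset <- function(n = 50) {
  x <- rnorm(n)
  data.frame(x = x, y = 2 * x + rnorm(n))
}
# Stand-in for a list of imputed datasets
datasets <- replicate(3, make_dataset(), simplify = FALSE)

fits <- lapply(datasets, function(d) lm(y ~ x, data = d))

# One slope estimate per dataset, ready for pooling in a later step
slopes <- vapply(fits, function(f) coef(f)[["x"]], numeric(1))
print(slopes)
```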
Returns death dates for all deceased participants, used for censoring in survival analysis.
get_death_dates(dt)
dt |
A data.table or data.frame containing UKB data. |
A data.table with columns: eid, death_date.
Returns a source-aware disease code catalog containing curated
UKBAnalytica disease definitions and Pomegranate-derived UK Biobank
phenotype coding definitions. This function returns tabular code metadata;
it does not change the default behavior of get_predefined_diseases().
get_disease_catalog(source = c("all", "curated", "pomegranate"), disease = NULL, code_system = NULL, supported_only = FALSE)
source |
Character. One of |
disease |
Optional disease name or definition ID pattern. |
code_system |
Optional code system filter, such as |
supported_only |
Logical. If TRUE, keep only catalog rows currently supported by UKBAnalytica disease parsers. |
A data.frame.
copd_codes <- get_disease_catalog(disease = "copd")
head(copd_codes)
Convenience wrapper around get_field_metadata() for a single UKB field_id.
This is the simplest way to ask "what does field 4080 correspond to?" and get
a one-row metadata table back in R.
get_field_info(field_id, ukb_data_dict = NULL, dataset = NULL, fields_df = NULL, entity = "participant", live = FALSE, timeout = 30)
field_id |
A single UKB numeric field ID. |
ukb_data_dict |
Optional path to a |
dataset |
Optional RAP |
fields_df |
Optional data.frame returned by |
entity |
RAP dataset entity. Defaults to |
live |
Logical. If |
timeout |
Timeout in seconds used for the live web request. |
A one-row data.frame when the field is found.
Returns a structured data.frame of UK Biobank field metadata. When
ukb_data_dict is supplied, the function reads a UK Biobank data dictionary
metadata file available in the current session and standardizes common
metadata columns. When fields_df or
a RAP dataset is supplied, the function also records the approved RAP field
names available in the current project.
This is intended to be a simple entry point for users who want to inspect UKB field metadata in R before planning an extraction.
get_field_metadata(field_id = NULL, query = NULL, ukb_data_dict = NULL, dataset = NULL, fields_df = NULL, entity = "participant")
field_id |
Optional UKB numeric field IDs to keep. |
query |
Optional keyword used to filter the metadata table. The keyword is matched against the title, description, category, and RAP field names. |
ukb_data_dict |
Optional path to a |
dataset |
Optional RAP |
fields_df |
Optional data.frame returned by |
entity |
RAP dataset entity. Defaults to |
A data.frame with one row per UKB field and standardized metadata columns. When RAP field metadata is available, the result also includes the matching RAP column names and the number of approved RAP columns per field.
Returns a medication code catalog containing UKBAnalytica curated medication definitions and UK Biobank official coding 4 entries for field 20003.
get_medication_catalog(medication = NULL, medication_class = NULL)
medication |
Optional medication name, ID, or code pattern. |
medication_class |
Optional medication class filter. |
A data.frame.
metformin <- get_medication_catalog("metformin")
head(metformin)
Converts the Pomegranate-derived rows in the disease catalog into
UKBAnalytica disease definition objects. Only code systems currently
supported by the package parsers are used by default; GP and medication
rows remain available through get_disease_catalog().
get_pomegranate_diseases(disease = NULL, supported_only = TRUE)
disease |
Optional disease name or definition ID pattern. |
supported_only |
Logical. If TRUE, use only rows supported by current UKBAnalytica disease parsers. |
A named list of disease definition objects.
pom <- get_pomegranate_diseases("asthma")
names(pom)
Returns source provenance for the built-in Pomegranate resources, including the GitHub YAML commit used for the canonical disease catalog and the portal CSV retained for audit.
get_pomegranate_source_manifest()
A data.frame.
get_pomegranate_source_manifest()
Returns a list of commonly used cardiovascular and metabolic disease definitions with validated ICD-10, ICD-9, and self-report code mappings.
get_predefined_diseases( source = c("curated", "pomegranate", "both"), merge_type = c("intersection", "union"), disease = NULL, supported_only = TRUE )
source |
Definition source. |
merge_type |
Merge strategy for |
disease |
Optional disease key or name pattern used to subset the returned definition list. |
supported_only |
Logical. For Pomegranate-derived definitions, keep only code systems currently supported by UKBAnalytica parsers. |
Included diseases:
Aortic Aneurysm (I71, 441)
Thoracic Aortic Aneurysm
Abdominal Aortic Aneurysm
Cardiovascular Disease
Myocardial Infarction
Heart Failure
Stroke (ischemic and hemorrhagic)
Essential and secondary hypertension
Diabetes Mellitus (all types)
Type 1 Diabetes Mellitus
Type 2 Diabetes Mellitus
Peripheral vascular disease
Broad cardiac arrhythmia endpoint including OPCS4 procedures
Atrial arrhythmia / atrial fibrillation-flutter
Ventricular arrhythmia endpoint
Atrioventricular conduction block
Intraventricular conduction block
Supraventricular tachycardia
Lung cancer using ICD-10/death and cancer registry
Common respiratory, renal, gastrointestinal, neurologic, psychiatric, eye, skin, musculoskeletal, and cancer endpoints used in UKB epidemiology workflows
A named list of disease definition objects.
Returns curated field-20003 medication code sets for common self-reported
treatment groups. These definitions are designed for baseline covariate
derivation and sensitivity analyses, and are separate from disease endpoint
definitions returned by get_predefined_diseases().
get_predefined_medications()
A named list of medication definition objects.
meds <- get_predefined_medications()
names(meds)
Convert protein identifiers to gene symbols and retrieve a protein-protein
interaction network from STRING via clusterProfiler::getPPI().
get_protein_ppi( proteins, protein_col = NULL, from_type = "SYMBOL", mapping_table = NULL, mapping_protein_col = "protein", mapping_symbol_col = "gene_symbol", organism_db = "org.Hs.eg.db", taxID = 9606, required_score = NULL, network_type = "functional", add_nodes = 0, show_query_node_labels = 0, output = c("igraph", "data.frame") )
proteins |
A character vector of protein identifiers, or a data.frame containing a protein identifier column. |
protein_col |
Optional column name when |
from_type |
Character string describing the input identifier type for
Bioconductor-based mapping. Default is |
mapping_table |
Optional data.frame containing custom protein-to-symbol mappings. |
mapping_protein_col |
Column name in |
mapping_symbol_col |
Column name in |
organism_db |
Character string naming the OrgDb package. Default is
|
taxID |
NCBI taxon identifier passed to |
required_score |
Optional STRING score cutoff passed to
|
network_type |
STRING network type. One of |
add_nodes |
Number of partner nodes to add in STRING. Default is |
show_query_node_labels |
Passed to |
output |
One of |
A list with components source, gene_symbols, mapping, and
ppi.
Returns the column names generated by ukb_demo(). This is useful for
documentation examples that need RAP-style toy column names.
get_ukb_demo_colnames()
A character vector of original demo-data column names.
get_ukb_demo_colnames()
Returns a data.frame describing all predefined variables available for preprocessing.
get_variable_info(category = "all")
category |
Character. Filter by category:
|
A data.frame with variable information
Get one curated UK Biobank variable set
get_variable_set(set, output = c("data.frame", "field_id", "ukb_col"))
set |
Set name. |
output |
Output format. |
A data.frame or character/integer vector.
get_variable_set("clinical_core")
get_variable_set("air_pollution", output = "field_id")
Returns curated UKB field groups for common analysis domains. These sets are
intended for field discovery and RAP extraction, not for automatic
preprocessing. Use preprocess_baseline() only for variables documented by
get_variable_info().
get_variable_sets(set = NULL, category = NULL)
set |
Optional set name, such as |
category |
Optional broad category filter. |
A data.frame with one row per curated variable.
vars <- get_variable_sets("air_pollution")
unique(vars$field_id)
Loads a long-form Pomegranate portal extraction from a user-supplied local
CSV or CSV.GZ file for audit and traceability. The canonical Pomegranate
disease catalog used by get_disease_catalog(source = "pomegranate") is
built into the package from the public GitHub YAML algorithms; the portal
audit table is not required for endpoint construction and is not shipped in
the CRAN build.
load_pomegranate_portal_coding(path = NULL)
path |
Path to a local Pomegranate portal CSV or CSV.GZ file. |
A data.frame.
Loads the UK Biobank coding 4 table used by field 20003
(treatment/medication code). This table is included as a lightweight
reference so users can inspect the meaning of medication codes used by
get_predefined_medications().
load_ukb_medication_coding(path = NULL)
path |
Optional path to a local coding 4 TSV file. If NULL, the package
copy in |
A data.frame with columns coding and meaning.
coding4 <- load_ukb_medication_coding()
head(coding4)
Read the metabolite metadata table bundled in inst/extdata. The current
file contains UK Biobank Nightingale non-ratio metabolite names, field IDs,
and RAP-style column names. It is mainly intended as a helper for examples,
tests, and metabolite-name checking.
load_ukb_metabolite_panel(file = NULL, file_encoding = "UTF-16LE")
file |
Optional path to a metabolite panel file. If |
file_encoding |
Character file encoding. Default is |
A data.frame with columns such as Description, UKB_ID, and
meta_ID.
panel <- load_ukb_metabolite_panel()
head(panel)
Perform propensity score matching using nearest neighbor or optimal matching.
match_propensity( data, ps_col = "ps", treatment, ratio = 1, caliper = 0.2, method = c("nearest", "optimal"), replace = FALSE, exact_match = NULL )
data |
A data.table containing propensity scores. |
ps_col |
Character string specifying the propensity score column name. Default "ps". |
treatment |
Character string specifying the treatment variable name. |
ratio |
Numeric matching ratio (1:ratio). Default 1 for 1:1 matching. |
caliper |
Numeric caliper width in standard deviations of PS. Default 0.2. |
method |
Character string specifying matching method: "nearest" or "optimal". |
replace |
Logical; whether to match with replacement. Default FALSE. |
exact_match |
Character vector of variable names for exact matching. Default NULL. |
A data.table with matched data, including:
Matched pair identifier
Distance between matched pairs
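As a rough illustration of what nearest-neighbor matching with a caliper does, the base R sketch below performs greedy 1:1 matching without replacement on toy data. It is not the package implementation: the helper `greedy_match()`, its output column names, and the toy values are hypothetical, and the caliper is taken as `caliper * sd(ps)` following the argument description above.

```r
# Toy data (made-up propensity scores); treat = 1 marks treated participants.
toy <- data.frame(
  id = 1:10,
  treat = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  ps = c(0.62, 0.48, 0.90, 0.60, 0.50, 0.30, 0.20, 0.65, 0.45, 0.10)
)

# Hypothetical helper: greedy 1:1 nearest-neighbor matching without replacement.
greedy_match <- function(data, ps_col = "ps", treatment, caliper = 0.2) {
  width <- caliper * sd(data[[ps_col]])   # caliper in SD units of the PS
  treated <- which(data[[treatment]] == 1)
  controls <- which(data[[treatment]] == 0)
  pairs <- list()
  for (i in treated) {
    if (length(controls) == 0) break
    d <- abs(data[[ps_col]][controls] - data[[ps_col]][i])
    j <- which.min(d)                      # closest remaining control
    if (d[j] <= width) {                   # accept only matches inside the caliper
      pairs[[length(pairs) + 1]] <- data.frame(
        pair_id = length(pairs) + 1,
        treated_row = i,
        control_row = controls[j],
        distance = d[j]
      )
      controls <- controls[-j]             # without replacement
    }
  }
  do.call(rbind, pairs)
}

matched <- greedy_match(toy, treatment = "treat")
print(matched)
```

In this toy example the treated participant with a score of 0.90 has no control inside the caliper and is left unmatched, which is the usual trade-off of a tighter caliper: better balance at the cost of sample size.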
Convert common UK Biobank Nightingale metabolite labels to names that are more likely to be recognized by MetaboAnalystR name cross-referencing. Users can provide a custom mapping table to override or extend the built-in map.
metabolite_to_metaboanalyst_name( metabolites, mapping_table = NULL, mapping_metabolite_col = "metabolite", mapping_name_col = "metaboanalyst_name", drop_unmapped = FALSE )
metabolites |
Character vector of metabolite names. |
mapping_table |
Optional data.frame with metabolite-to-name mappings. |
mapping_metabolite_col |
Column in |
mapping_name_col |
Column in |
drop_unmapped |
Logical. If |
A data.frame with metabolite, metaboanalyst_name, and
mapping_source.
metabolite_to_metaboanalyst_name(c("Acetate", "Alanine", "LDL Cholesterol"))
Extracts cancer registry records from UK Biobank fields 40006, 40005, 40011, and 40012 into a standardized long-format table. Field 40006 stores cancer ICD-10 type, 40005 stores diagnosis date, 40011 stores tumour histology, and 40012 stores tumour behaviour.
parse_cancer_registry(dt)
dt |
A data.table or data.frame containing UKB cancer registry columns. |
A data.table with columns: eid, cancer_icd10_code,
diag_date, cancer_histology, cancer_behaviour, and
source.
Extracts death registry data from UK Biobank linked mortality records. Parses both primary (p40001) and contributing (p40002) causes of death along with death dates (p40000). Caution: Death records only contain ICD-10 codes.
parse_death_records(dt)
dt |
A data.table or data.frame containing UKB data with columns:
|
Death causes serve as definitive diagnosis confirmation. If a participant died from a specific disease, the death date becomes the diagnosis date for that condition (if not previously diagnosed).
A data.table with columns:
Participant identifier
ICD-10 cause of death code
Date of death
Data source identifier ("Death")
"primary" or "secondary"
Extracts ICD-10 diagnosis codes from UK Biobank hospital inpatient data. Converts the mixed-format storage (Python list string in p41270 + date array in p41280_a*) into a standardized long-format data.table.
parse_icd10_diagnoses(dt)
dt |
A data.table or data.frame containing UKB data with columns:
|
The function implements the Index-Match logic specified in the UKB data dictionary: the k-th element in p41270 corresponds to date column p41280_a(k-1) (0-indexed).
Processing pipeline:
Parse Python list string format in p41270
Melt p41280_a* date columns to long format
Join codes and dates by eid and positional index
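The index-match steps above can be sketched in base R on toy data. This is an illustration only, not the package code: the helper `parse_list_string()`, the output column names `icd10_code` and `diag_date`, and the toy values are assumptions, and the sketch uses a per-row loop rather than a melt-and-join.

```r
# Toy data in the export shape described above (values are made up).
dt <- data.frame(
  eid = c(1, 2),
  p41270 = c("['I210', 'E119']", "['J459']"),
  p41280_a0 = c("2010-01-15", "2012-06-01"),
  p41280_a1 = c("2011-03-20", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split a Python-style list string into a character vector.
parse_list_string <- function(x) {
  x <- gsub("\\[|\\]|'|\"", "", x)   # drop brackets and quotes
  trimws(strsplit(x, ",")[[1]])
}

# For each participant, pair the k-th code with date column p41280_a(k-1).
rows <- lapply(seq_len(nrow(dt)), function(i) {
  codes <- parse_list_string(dt$p41270[i])
  date_cols <- paste0("p41280_a", seq_along(codes) - 1)
  data.frame(
    eid = dt$eid[i],
    icd10_code = codes,
    diag_date = as.Date(unlist(dt[i, date_cols], use.names = FALSE)),
    source = "ICD10"
  )
})
long <- do.call(rbind, rows)
print(long)
```

Participant 1 contributes two rows (I210 dated from `p41280_a0`, E119 from `p41280_a1`) and participant 2 contributes one.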
A data.table with columns:
Participant identifier
ICD-10 diagnosis code
Date of diagnosis
Data source identifier ("ICD10")
Extracts ICD-9 diagnosis codes from UK Biobank hospital inpatient data. Converts the mixed-format storage (Python list string in p41271 + date array in p41281_a*) into a standardized long-format data.table.
parse_icd9_diagnoses(dt)
dt |
A data.table or data.frame containing UKB data with columns:
|
ICD-9 codes in UKB follow the format: 3-5 digits, optionally prefixed with V or E. The function handles logical NA columns that may occur when all values are missing.
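One way to read the format rule above is as a regular expression. The check below is a hypothetical helper, not a function exported by the package, and the pattern is an interpretation of "3-5 digits, optionally prefixed with V or E".

```r
# Hypothetical validity check for ICD-9-style codes.
is_icd9_like <- function(x) grepl("^[VE]?[0-9]{3,5}$", x)

is_icd9_like(c("4109", "V1582", "E8490", "I210", "41"))
# TRUE TRUE TRUE FALSE FALSE

# A column that is entirely missing is typically read as logical NA;
# coercing to character first keeps downstream string handling safe.
all_missing <- c(NA, NA)
is_icd9_like(as.character(all_missing))
# FALSE FALSE
```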
A data.table with columns:
Participant identifier
ICD-9 diagnosis code
Date of diagnosis
Data source identifier ("ICD9")
Extracts OPCS4 operative procedure codes from UK Biobank hospital inpatient
summary operations data. Supports the common export shape where
p41272 stores a list-string of codes and p41282_a* stores the
corresponding dates, while also tolerating expanded p41272_a* columns.
parse_opcs4_procedures(dt)
dt |
A data.table or data.frame containing UKB data with columns:
|
The function implements the same index-matching logic used for UKB summary
diagnosis fields: the k-th procedure code in p41272 corresponds to the
date stored in p41282_a(k-1) (0-indexed).
A data.table with columns:
Participant identifier
OPCS4 procedure code
Date of first recorded procedure for that code/index
Data source identifier ("OPCS4")
Extracts self-reported illness data from UK Biobank touchscreen questionnaire. Converts coded illness data (p20002_i*_a*) and interpolated year of diagnosis (p20008_i*_a*) into a standardized long-format data.table.
parse_self_reported_illnesses(dt, baseline_col = "p53_i0")
dt |
A data.table or data.frame containing UKB data with columns:
|
baseline_col |
Column name for baseline date (default: "p53_i0"). |
Year-to-date conversion logic:
p20008 stores fractional years (e.g., 1983.5 = mid-1983)
Fractional part * 12 = approximate month
Special values (-1, -3) indicate "don't know" or "prefer not to answer"
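The conversion rules above can be sketched as a small base R function. The helper name `fractional_year_to_date()` is hypothetical, and two details are assumptions rather than documented behavior: the month is taken as `floor(fraction * 12) + 1`, and the day is fixed to the first of the month.

```r
# Hypothetical helper: convert p20008-style fractional years to approximate dates.
fractional_year_to_date <- function(x) {
  x[x %in% c(-1, -3)] <- NA                       # special codes carry no date
  year <- floor(x)
  month <- pmin(floor((x - year) * 12) + 1, 12)   # fractional part * 12, mapped to 1..12
  out <- rep(as.Date(NA), length(x))
  ok <- !is.na(x)
  out[ok] <- as.Date(sprintf("%04d-%02d-01",
                             as.integer(year[ok]), as.integer(month[ok])))
  out
}

fractional_year_to_date(c(1983.5, 2001.0, -1, -3, NA))
# "1983-07-01" "2001-01-01" NA NA NA
```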
A data.table with columns:
Participant identifier
Self-report illness code
Approximate date of diagnosis
Data source identifier ("Self-report")
Assessment instance (0, 1, 2, 3)
Array index within instance
Create a Love plot comparing standardized mean differences before and after matching/weighting.
plot_balance( balance_before, balance_after, threshold = 0.1, title = "Covariate Balance", xlab = "Standardized Mean Difference" )
balance_before |
A data.frame from |
balance_after |
A data.frame from |
threshold |
Numeric threshold for balance (vertical lines). Default 0.1. |
title |
Character string for plot title. Default "Covariate Balance". |
xlab |
Character string for x-axis label. Default "Standardized Mean Difference". |
A ggplot2 object.
Create a calibration plot comparing predicted probabilities to observed outcomes.
plot_calibration( data, predicted, observed, n_bins = 10, smooth = TRUE, conf_int = TRUE )
data |
A data.frame or data.table. |
predicted |
Character string specifying the column with predicted probabilities. |
observed |
Character string specifying the column with observed binary outcomes. |
n_bins |
Integer number of bins for calibration. Default 10. |
smooth |
Logical; whether to add a smooth calibration line. Default TRUE. |
conf_int |
Logical; whether to show confidence intervals. Default TRUE. |
A ggplot2 object.
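To make the binning behind a calibration plot concrete, the sketch below computes the mean predicted probability and the observed event rate per bin in base R. It uses simulated data and equal-width bins; whether plot_calibration() uses equal-width or quantile bins is not stated here, so treat the binning rule as an assumption.

```r
set.seed(42)
n <- 500
predicted <- runif(n)                    # simulated predicted probabilities
observed <- rbinom(n, 1, predicted)      # outcomes drawn so the model is well calibrated

# Ten equal-width bins on [0, 1] (assumed binning rule).
bins <- cut(predicted, breaks = seq(0, 1, length.out = 11), include.lowest = TRUE)

calib <- data.frame(
  bin = levels(bins),
  mean_predicted = as.vector(tapply(predicted, bins, mean)),
  observed_rate = as.vector(tapply(observed, bins, mean)),
  n = as.vector(table(bins))
)
print(calib)
```

For a well-calibrated model the points (`mean_predicted`, `observed_rate`) lie close to the diagonal.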
Create an annotated heatmap of a correlation matrix with customizable appearance. This helps identify patterns, multicollinearity, and variable relationships visually.
plot_correlation( corr_matrix, title = "Correlation Matrix", show_values = TRUE, digits = 2, text_size = 3, color_low = "#3B4CC0", color_mid = "white", color_high = "#B40426", upper_triangle = FALSE )
corr_matrix |
A numeric correlation matrix (from |
title |
Character string. Plot title. Default: |
show_values |
Logical. If |
digits |
Integer. Number of decimal places for correlation values. Default: 2. |
text_size |
Numeric. Size of text labels on tiles. Default: 3. |
color_low |
Character. Color for negative correlations. Default: |
color_mid |
Character. Color for zero correlation. Default: |
color_high |
Character. Color for positive correlations. Default: |
upper_triangle |
Logical. If |
A ggplot2 object. Can be further customized with ggplot2 functions.
Plot training-validation Cox log(HR) concordance
plot_cox_loghr_correlation( comparison, train_loghr_col = "train_logHR", validation_loghr_col = "validation_logHR", highlight_col = "train_significant_bonferroni", highlight_label = "Train Bonferroni significant" )
comparison |
Comparison table from |
train_loghr_col |
Training log(HR) column. |
validation_loghr_col |
Validation log(HR) column. |
highlight_col |
Optional logical column used to highlight proteins. |
highlight_label |
Highlight legend label. |
A ggplot object.
Plot sensitivity-analysis Cox log(HR) concordance
plot_cox_sensitivity_correlation( comparison, sensitivity_col = "sensitivity", main_loghr_col = "main_logHR", sensitivity_loghr_col = "sensitivity_logHR", highlight_col = "main_significant_bonferroni", highlight_label = "Main Bonferroni significant", nrow = NULL, ncol = NULL )
comparison |
Comparison table from |
sensitivity_col |
Sensitivity-analysis label column. |
main_loghr_col |
Main-analysis log(HR) column. |
sensitivity_loghr_col |
Sensitivity-analysis log(HR) column. |
highlight_col |
Optional logical column used to highlight variables. |
highlight_label |
Highlight legend label. |
nrow, ncol
|
Facet layout. |
A ggplot object.
A thin wrapper around TCMDATA::gglollipop() for enrichment results. This
function accepts either a raw enrichResult object or a list returned by one
of the proteomics ORA helpers in this package.
plot_enrichment_lollipop(x, ...)
x |
An |
... |
Additional arguments passed to |
A ggplot2 object.
Create a forest plot to visualize subgroup analysis results with effect estimates and confidence intervals.
plot_forest( results, estimate_col = "estimate", lower_col = "lower95", upper_col = "upper95", label_col = "subgroup", pvalue_col = "pvalue", p_interaction_col = "p_interaction", null_value = 1, log_scale = TRUE, colors = NULL, title = "Subgroup Analysis", xlab = "Hazard Ratio (95% CI)", show_n = TRUE, show_events = TRUE )
results |
A data.frame from |
estimate_col |
Character string specifying the column name for effect estimates. Default "estimate". |
lower_col |
Character string specifying the column for lower CI. Default "lower95". |
upper_col |
Character string specifying the column for upper CI. Default "upper95". |
label_col |
Character string specifying the column for subgroup labels. Default "subgroup". |
pvalue_col |
Character string specifying the column for p-values. Default "pvalue". |
p_interaction_col |
Character string for interaction p-value column. Default "p_interaction". |
null_value |
Numeric value for the null effect line. Default 1 (for HR/OR). |
log_scale |
Logical; whether to use log scale for x-axis. Default TRUE. |
colors |
Character vector of colors. Default NULL uses ggplot2 defaults. |
title |
Character string for plot title. Default "Subgroup Analysis". |
xlab |
Character string for x-axis label. Default "Hazard Ratio (95% CI)". |
show_n |
Logical; whether to show sample size. Default TRUE. |
show_events |
Logical; whether to show event count. Default TRUE. |
A ggplot2 object.
A thin wrapper around TCMDATA::go_barplot() for GO enrichment results.
This function accepts either a raw enrichResult object or a list returned
by run_protein_ora().
plot_go_ora_bar(x, ...)
x |
An |
... |
Additional arguments passed to |
A ggplot2 object.
Draw a compact heatmap from long-format data. The function uses string
column names and .data pronouns internally, which makes it suitable for
scripted package workflows and CRAN checks.
plot_heatmap( data, x, y, fill, label = NULL, show_values = FALSE, low = "#2F6FA3", mid = "#F7F7F7", high = "#C74732", midpoint = 0, title = NULL, xlab = NULL, ylab = NULL, fill_lab = NULL, base_size = 7 )
data |
A data.frame. |
x |
Character column name for the x axis. |
y |
Character column name for the y axis. |
fill |
Character column name for the heatmap value. |
label |
Optional character column name for tile labels. |
show_values |
Logical. If TRUE, show values or |
low |
Low-end color for the diverging scale. |
mid |
Midpoint color for the diverging scale. |
high |
High-end color for the diverging scale. |
midpoint |
Midpoint for the diverging scale. |
title |
Optional title. If NULL, no title is shown. |
xlab |
Optional x-axis label. |
ylab |
Optional y-axis label. |
fill_lab |
Optional fill legend label. |
base_size |
Base font size. |
A ggplot object.
Create a Kaplan-Meier survival curve with optional risk table and log-rank p-value.
plot_km_curve( data, time_col, status_col, group_col = NULL, conf_int = TRUE, risk_table = TRUE, censor_marks = TRUE, palette = "jco", title = NULL, xlab = "Time (years)", ylab = "Survival Probability", legend_title = "Group", median_line = TRUE, pvalue = TRUE, xlim = NULL, break_time = NULL )
data |
A data.frame or data.table containing survival data. |
time_col |
Character string specifying the time column name. |
status_col |
Character string specifying the event status column name. |
group_col |
Character string specifying the grouping variable. Default NULL for overall curve. |
conf_int |
Logical; whether to show confidence intervals. Default TRUE. |
risk_table |
Logical; whether to show number at risk table. Default TRUE. |
censor_marks |
Logical; whether to show censoring marks. Default TRUE. |
palette |
Character string specifying color palette. Default "jco". Options: "jco", "nejm", "lancet", "npg", or custom color vector. |
title |
Character string for plot title. Default NULL. |
xlab |
Character string for x-axis label. Default "Time (years)". |
ylab |
Character string for y-axis label. Default "Survival Probability". |
legend_title |
Character string for legend title. Default "Group". |
median_line |
Logical; whether to show median survival line. Default TRUE. |
pvalue |
Logical; whether to show log-rank p-value. Default TRUE. |
xlim |
Numeric vector of length 2 for x-axis limits. Default NULL. |
break_time |
Numeric value for x-axis tick interval. Default NULL. |
A ggplot2 object (or a list with plot and risk table if risk_table = TRUE).
Create visualizations for mediation analysis results, including path diagrams, effect bar charts, and decomposition plots.
plot_mediation( mediation_result, type = c("effects", "path", "decomposition"), show_ci = TRUE, show_pvalue = TRUE, exponentiate = FALSE, title = NULL, colors = NULL )
mediation_result |
An object of class "mediation_result" from |
type |
Character string specifying plot type:
|
show_ci |
Logical; whether to show confidence intervals. Default TRUE. |
show_pvalue |
Logical; whether to show p-values. Default TRUE. |
exponentiate |
Logical; whether to exponentiate estimates (for HR/OR). Default FALSE. |
title |
Character string for plot title. Default NULL (auto-generated). |
colors |
Character vector of colors. Default NULL uses package defaults. |
A ggplot2 object.
Create a forest plot to visualize results from a multiple-mediator analysis.
plot_mediation_forest( multi_mediation_result, effect_type = c("tnie", "pnde", "te", "pm"), exponentiate = FALSE, null_value = 0, title = "Mediation Analysis: Multiple Mediators" )
multi_mediation_result |
A data.frame from |
effect_type |
Character string specifying which effect to display: "tnie" (indirect effect), "pnde" (direct effect), "te" (total effect), or "pm" (proportion mediated). Default "tnie". |
exponentiate |
Logical; whether to exponentiate estimates. Default FALSE. |
null_value |
Numeric; null effect value for reference line. Default 0. |
title |
Character string for plot title. |
A ggplot2 object.
Plot metabolite ORA results as a bar plot
plot_metabolite_ora_barplot( x, top_n = 15, p_col = "pvalue", pathway_col = "pathway", fill_color = "#2F6FA3" )
x |
A data.frame returned by |
top_n |
Number of pathways to show. Default |
p_col |
P-value column used for ordering and color. Default |
pathway_col |
Column containing pathway names. Default |
fill_color |
Bar color. |
A ggplot object.
Plot metabolite ORA results as a dot plot
plot_metabolite_ora_dotplot( x, top_n = 15, p_col = "pvalue", size_col = "hits", pathway_col = "pathway", color_low = "#2F6FA3", color_high = "#C74732" )
x |
A data.frame returned by |
top_n |
Number of pathways to show. Default |
p_col |
P-value column used for ordering and color. Default |
size_col |
Column used for point size. Default |
pathway_col |
Column containing pathway names. Default |
color_low, color_high
|
Colors for the sequential p-value gradient. |
A ggplot object.
Creates diagnostic plots for multiple imputation results, including fraction of missing information (FMI), variance ratios, and degrees of freedom.
plot_mi_diagnostics( mi_result, type = c("fmi", "variance_ratio", "df"), title = NULL )
mi_result |
An object of class |
type |
Character string specifying the diagnostic plot type:
|
title |
Character string for plot title. If NULL, auto-generated. |
A ggplot2 object.
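For orientation, the quantities named above can be computed by hand with Rubin's rules. The sketch below uses made-up estimates from five imputed datasets and the standard textbook formulas; whether the package's "variance ratio" and degrees of freedom follow exactly these definitions is an assumption.

```r
# Made-up coefficient estimates and standard errors from m = 5 imputed datasets.
est <- c(0.42, 0.47, 0.39, 0.45, 0.44)
se  <- c(0.10, 0.11, 0.10, 0.12, 0.10)
m <- length(est)

q_bar <- mean(est)                      # pooled estimate
u_bar <- mean(se^2)                     # within-imputation variance
b     <- var(est)                       # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b        # total variance

riv    <- (1 + 1 / m) * b / u_bar       # relative increase in variance
df_old <- (m - 1) * (1 + 1 / riv)^2     # Rubin (1987) degrees of freedom
fmi    <- (riv + 2 / (df_old + 3)) / (1 + riv)  # fraction of missing information

round(c(estimate = q_bar, se = sqrt(t_var), riv = riv, df = df_old, fmi = fmi), 4)
```

A high FMI for a term indicates that its pooled estimate depends heavily on the imputed values, which usually argues for more imputations.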
Creates a forest plot for pooled estimates from multiple imputation analysis.
plot_mi_pooled( mi_result, terms = NULL, exponentiate = NULL, null_value = NULL, title = "Pooled Estimates (Multiple Imputation)", colors = NULL, show_fmi = TRUE )
mi_result |
An object of class |
terms |
Character vector of terms to include. If NULL, all terms except intercept are shown. |
exponentiate |
Logical; whether to exponentiate estimates. If NULL, uses the setting from the mi_result object. |
null_value |
Numeric; reference line value. If NULL, automatically set based on exponentiation (0 for linear scale, 1 for exp scale). |
title |
Character string for plot title. |
colors |
Named character vector for colors. Default uses package palette. |
show_fmi |
Logical; whether to display FMI (Fraction of Missing Information) as point size or annotation. Default TRUE. |
A ggplot2 object.
Create a calibration curve plot showing predicted vs observed probabilities.
plot_ml_calibration(object, title = "Calibration Curve", ...)
object |
A ukb_ml_calibration object from ukb_ml_calibration() |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
Create a comparison plot for multiple ML models.
plot_ml_compare( object, metric = NULL, type = c("bar", "dot"), title = "Model Comparison", ... )
object |
A ukb_ml_compare object from ukb_ml_compare() |
metric |
Metric to highlight (default first available) |
type |
Plot type: "bar" or "dot" |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
Create a heatmap visualization of a confusion matrix.
plot_ml_confusion( object, normalize = TRUE, colors = c("white", "#E34A33"), title = "Confusion Matrix", ... )
object |
A ukb_ml_confusion object from ukb_ml_confusion() |
normalize |
Whether to show percentages (default TRUE) |
colors |
Color gradient (default c("white", "#E34A33")) |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
Create a Decision Curve Analysis plot showing net benefit of the model compared to treat-all and treat-none strategies.
plot_ml_dca(object, title = "Decision Curve Analysis", ...)
object |
A ukb_ml_dca object from ukb_ml_dca() |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
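The net benefit that a decision curve displays can be computed directly. The sketch below uses the standard definition, net benefit = TP/n - (FP/n) * pt / (1 - pt) at threshold probability pt, on simulated data. The helper `net_benefit()` is hypothetical and is not the package's ukb_ml_dca() implementation.

```r
# Hypothetical helper: net benefit of treating everyone with prob >= threshold.
net_benefit <- function(prob, outcome, threshold) {
  n <- length(outcome)
  pred_pos <- prob >= threshold
  tp <- sum(pred_pos & outcome == 1)
  fp <- sum(pred_pos & outcome == 0)
  tp / n - fp / n * threshold / (1 - threshold)
}

set.seed(7)
prob <- runif(300)                       # simulated predicted risks
outcome <- rbinom(300, 1, prob)          # simulated binary outcomes

thresholds <- seq(0.05, 0.95, by = 0.05)
dca <- data.frame(
  threshold = thresholds,
  model = sapply(thresholds, function(t) net_benefit(prob, outcome, t)),
  # Treat-all: everyone is classified positive regardless of threshold.
  treat_all = sapply(thresholds, function(t) net_benefit(rep(1, 300), outcome, t)),
  treat_none = 0                         # treat-none has zero net benefit by definition
)
head(dca)
```

A model is useful at a given threshold when its net benefit exceeds both reference strategies.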
Create a Gain curve plot comparing model targeting against random selection.
plot_ml_gain(object, title = "Gain Curve", ...)
object |
A ukb_ml_gain_lift object from ukb_ml_gain_lift() |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
Create a bar plot of variable importance from a trained ML model.
plot_ml_importance( object, n_features = 20, type = c("bar", "dot"), color = "#3182BD", title = "Variable Importance", ... )
object |
A ukb_ml object from ukb_ml_model() |
n_features |
Number of top features to display (default 20) |
type |
Plot type: "bar" or "dot" |
color |
Bar color (default "#3182BD") |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
Create a KS (Kolmogorov-Smirnov) curve plot showing TPR, FPR, and their difference (KS statistic) across thresholds.
plot_ml_ks(object, title = "KS Curve", ...)
object |
A ukb_ml_ks object from ukb_ml_ks() |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
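The KS statistic shown on such a curve is the largest gap between the true positive rate and the false positive rate across thresholds. A minimal base R sketch on toy data follows; the helper `ks_curve()` is hypothetical and independent of ukb_ml_ks().

```r
# Hypothetical helper: TPR, FPR and their difference at every observed threshold.
ks_curve <- function(prob, outcome) {
  thresholds <- sort(unique(prob), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) mean(prob[outcome == 1] >= t))
  fpr <- sapply(thresholds, function(t) mean(prob[outcome == 0] >= t))
  data.frame(threshold = thresholds, tpr = tpr, fpr = fpr, diff = tpr - fpr)
}

# Toy predictions and labels (made up).
prob    <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
outcome <- c(1,   1,   0,   1,   0,   1,   0,   0)

curve <- ks_curve(prob, outcome)
ks_stat <- max(curve$diff)
ks_stat
# 0.5
```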
Create a Lift curve plot showing the ratio of model vs random targeting.
plot_ml_lift(object, title = "Lift Curve", ...)
object |
A ukb_ml_gain_lift object from ukb_ml_gain_lift() |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
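Gain and lift are two views of the same ranking: gain is the share of all events captured in the top-scored fraction of participants, and lift is that gain divided by the fraction targeted (the value expected under random selection). The sketch below computes both by decile on simulated data; `gain_lift()` is a hypothetical helper, not ukb_ml_gain_lift().

```r
# Hypothetical helper: cumulative gain and lift by equal-sized groups.
gain_lift <- function(prob, outcome, n_groups = 10) {
  y <- outcome[order(prob, decreasing = TRUE)]   # rank by predicted risk
  n <- length(y)
  cut_idx <- ceiling(seq_len(n_groups) * n / n_groups)
  frac_targeted <- cut_idx / n
  gain <- cumsum(y)[cut_idx] / sum(y)            # share of events captured
  data.frame(frac_targeted = frac_targeted, gain = gain,
             lift = gain / frac_targeted)        # ratio to random targeting
}

set.seed(11)
prob <- runif(200)                       # simulated predicted risks
outcome <- rbinom(200, 1, prob)          # simulated binary outcomes
gl <- gain_lift(prob, outcome)
print(gl)
```

By construction the last row has gain 1 and lift 1, since targeting everyone captures every event.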
Create a Precision-Recall curve plot with AUPRC annotation.
plot_ml_pr(object, title = "PR Curve", ...)
object |
A ukb_ml_pr object from ukb_ml_pr() |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
Create an ROC curve plot for one or more ML models.
plot_ml_roc(object, ci_alpha = 0.2, title = "ROC Curve", ...)
object |
A ukb_ml_roc object from ukb_ml_roc() |
ci_alpha |
Alpha for confidence interval ribbon (default 0.2) |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
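The area under a ROC curve equals the probability that a randomly chosen case receives a higher score than a randomly chosen non-case. The base-R sketch below computes it through the rank-sum (Mann-Whitney) identity on invented toy data. It is an illustration of the statistic, not the package's implementation, which may rely on a dedicated ROC library.

```r
# Toy predicted probabilities and observed binary outcomes (illustrative only)
prob <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
y    <- c(1,   1,   0,   1,   0,   0,   1,   0,   0,   0)

# Ranks of all scores (ties would receive average ranks)
r  <- rank(prob)
n1 <- sum(y == 1)  # number of cases
n0 <- sum(y == 0)  # number of non-cases

# AUC from the rank sum of the cases
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
```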
Creates a publication-ready ROC curve plot from one or more data frames
returned by ukb_ml_roc_data. AUC and 95% confidence interval
values are included in the legend when available.
plot_ml_roc_compare(
  roc_data,
  colors = NULL,
  show_auc = TRUE,
  title = NULL,
  xlab = "1 - Specificity",
  ylab = "Sensitivity",
  legend_position = "bottom",
  base_size = 7,
  ...
)
roc_data |
A data.frame returned by |
colors |
Optional named or unnamed vector of line colors. |
show_auc |
Logical. Include AUC and 95% CI in the legend labels. |
title |
Optional plot title. If |
xlab |
X-axis label. |
ylab |
Y-axis label. |
legend_position |
Legend position passed to |
base_size |
Base font size. |
... |
Additional arguments reserved for future use. |
A ggplot2 object.
Plot a participant flow table
plot_participant_flow(
  flow,
  show_removed = TRUE,
  show_events = TRUE,
  fill = "#2C7FB8"
)
flow |
A |
show_removed |
Logical. If |
show_events |
Logical. If |
fill |
Fill color for the retained-participant bars. |
A ggplot object.
dat <- data.frame(eid = 1:5, age = c(50, 60, NA, 55, 70), status = c(0, 1, 0, 1, 0))
flow <- ukb_participant_flow(dat, list("Complete age" = "age"), outcome_col = "status")
plot_participant_flow(flow)
Visualize the distribution of propensity scores by treatment group.
plot_ps_distribution(
  data,
  ps_col = "ps",
  treatment,
  type = c("histogram", "density", "mirror"),
  matched = FALSE,
  match_col = NULL
)
data |
A data.frame or data.table containing propensity scores. |
ps_col |
Character string specifying the PS column name. Default "ps". |
treatment |
Character string specifying the treatment variable name. |
type |
Character string specifying plot type: "histogram", "density", or "mirror". |
matched |
Logical; whether to show matched vs unmatched. Default FALSE. |
match_col |
Character string for the matching indicator column. Default NULL. |
A ggplot2 object.
Produces a publication-ready ggplot2 figure from a ukb_rcs object
returned by run_rcs. The main panel shows the estimated
effect curve with a 95% confidence interval ribbon. An optional distribution layer
(histogram, density, or rug) is drawn behind the curve to show exposure
density. P values and the knot count are annotated by default.
plot_rcs(x, ...)

## S3 method for class 'ukb_rcs'
plot(x, ...)

## S3 method for class 'ukb_rcs'
plot_rcs(
  x,
  show_distribution = TRUE,
  distribution = c("histogram", "density", "rug"),
  show_ref = TRUE,
  show_p = TRUE,
  show_knots = TRUE,
  curve_color = "#2166AC",
  dist_color = "#AECDE8",
  title = NULL,
  xlab = NULL,
  ylab = NULL,
  ...
)
x |
A |
... |
Additional arguments (currently unused). |
show_distribution |
Logical. Whether to overlay an exposure distribution
layer. Default |
distribution |
One of |
show_ref |
Logical. Whether to mark the reference value with a point.
Default |
show_p |
Logical. Whether to annotate P-overall and P-nonlinear.
Default |
show_knots |
Logical. Whether to annotate the knot count. Default |
curve_color |
Character. Hex color for the main curve and ribbon.
Default |
dist_color |
Character. Fill color for the distribution layer.
Default |
title |
Character. Plot title. Default |
xlab |
Character. x-axis label. Default: the exposure variable name. |
ylab |
Character. y-axis label. Default is chosen from model type. |
A ggplot2 object.
Create a volcano-style plot from regression summary results such as
runmulti_cox() or runmulti_logit(). The x-axis shows the supplied effect
estimate column (for example HR or OR), and the y-axis shows
-log10(P). Points can be highlighted by an adjusted p-value column, while
labels are selected from the largest and smallest highlighted effects.
plot_regression_volcano(
  data,
  effect_col = NULL,
  p_col = "pvalue",
  adjusted_p_col = NULL,
  label_col = NULL,
  significance_cutoff = 0.05,
  top_n_label_each = 5,
  null_effect = 1,
  x_lab = NULL,
  y_lab = NULL,
  x_limits = NULL,
  y_limits = NULL,
  point_size = 1.05,
  label_size = 2,
  colors = c(neutral = "#D8D8D8", lower = "#2F6FA3", higher = "#C74732"),
  show_cutoff = TRUE
)
data |
A data.frame containing regression results. |
effect_col |
Character. Column containing the effect estimate to plot on
the x-axis. If |
p_col |
Character. Column containing raw p-values. Default |
adjusted_p_col |
Optional character. Column used for highlighting
significant points, such as |
label_col |
Optional character. Column used for point labels. If |
significance_cutoff |
Numeric cutoff applied to |
top_n_label_each |
Integer. Number of highlighted proteins to label from
each direction. Direction is defined relative to |
null_effect |
Numeric null effect. Use |
x_lab, y_lab |
Axis labels. If |
x_limits, y_limits |
Optional numeric vectors of length 2 for axis limits. |
point_size |
Numeric point size. Default |
label_size |
Numeric label size. Default |
colors |
Named character vector for groups |
show_cutoff |
Logical. Whether to draw a horizontal significance cutoff
line. Default |
A ggplot2 object with attributes plot_data and label_data.
Draw a scatter plot with optional color grouping, linear smooth, and reference line. This is intended for compact association or validation panels.
plot_scatter(
  data,
  x,
  y,
  color = NULL,
  palette = NULL,
  add_smooth = TRUE,
  add_identity = FALSE,
  alpha = 0.72,
  point_size = 1.2,
  title = NULL,
  xlab = NULL,
  ylab = NULL,
  base_size = 7
)
data |
A data.frame. |
x |
Character. Name of the numeric column for the x axis. |
y |
Character. Name of the numeric column for the y axis. |
color |
Optional grouping column for point colors. |
palette |
Optional vector of colors. |
add_smooth |
Logical. Add a linear smooth line. |
add_identity |
Logical. Add a dashed y = x reference line. |
alpha |
Point alpha. |
point_size |
Point size. |
title |
Optional title. If NULL, no title is shown. |
xlab |
Optional x-axis label. |
ylab |
Optional y-axis label. |
base_size |
Base font size. |
A ggplot object.
Creates a SHAP beeswarm plot directly from a ukb_shap object. The plot
displays the top features ranked by mean absolute SHAP value, with point color
representing the normalized feature value.
plot_shap_beeswarm(
  object,
  max_features = 20,
  label_map = NULL,
  feature_col = "feature",
  label_col = "label",
  colors = c("#1E88E5", "#7B3294", "#FF0051"),
  point_size = 0.58,
  alpha = 0.62,
  jitter_height = 0.18,
  seed = 20260509,
  title = NULL,
  xlab = "SHAP value",
  legend_title = "Feature value",
  base_size = 7,
  return_data = FALSE,
  ...
)
object |
A |
max_features |
Maximum number of features to display. |
label_map |
Optional named vector or data.frame mapping feature names to
display labels. For a data.frame, columns are controlled by
|
feature_col |
Feature column in |
label_col |
Label column in |
colors |
Three or more colors for the low-to-high feature value scale. |
point_size |
Point size. |
alpha |
Point transparency. |
jitter_height |
Vertical jitter height. |
seed |
Optional seed for reproducible jitter. |
title |
Optional plot title. If |
xlab |
X-axis label. |
legend_title |
Legend title. |
base_size |
Base font size. |
return_data |
Logical. If |
... |
Additional arguments reserved for future use. |
A ggplot2 object, or a list with plot data when
return_data = TRUE.
Create a SHAP dependence plot showing the relationship between a feature's value and its SHAP value.
plot_shap_dependence(
  object,
  feature,
  color_feature = NULL,
  alpha = 0.5,
  smooth = TRUE,
  title = NULL,
  ...
)
object |
A ukb_shap object |
feature |
Feature name to analyze |
color_feature |
Optional feature for coloring points (interaction) |
alpha |
Point transparency (default 0.5) |
smooth |
Add smooth line (default TRUE) |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
Create a waterfall plot showing feature contributions for a single prediction.
plot_shap_force(object, row_id = 1, max_features = 10, title = NULL, ...)
object |
A ukb_shap object |
row_id |
Row index to explain (default 1) |
max_features |
Maximum features to show (default 10) |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
Create a SHAP summary plot (beeswarm or bar) for feature importance.
plot_shap_summary(
  object,
  max_features = 20,
  type = c("beeswarm", "bar"),
  color_palette = "viridis",
  title = "SHAP Summary",
  ...
)
object |
A ukb_shap object from ukb_shap() |
max_features |
Maximum features to display (default 20) |
type |
Plot type: "beeswarm" or "bar" |
color_palette |
Color palette for beeswarm (default "viridis") |
title |
Plot title |
... |
Additional arguments |
A ggplot2 object
Summarize observations by x and fill, then draw either proportional or
count-based stacked bars.
plot_stacked_bar(
  data,
  x,
  fill,
  weight = NULL,
  position = c("fill", "stack"),
  palette = NULL,
  title = NULL,
  xlab = NULL,
  ylab = NULL,
  legend_title = NULL,
  base_size = 7
)
data |
A data.frame. |
x |
Character column name for bar groups. |
fill |
Character column name for stack groups. |
weight |
Optional numeric column name for weighted summaries. |
position |
Either |
palette |
Optional vector of fill colors. |
title |
Optional title. If NULL, no title is shown. |
xlab |
Optional x-axis label. |
ylab |
Optional y-axis label. |
legend_title |
Optional legend title. |
base_size |
Base font size. |
A ggplot object.
Plot top positive and inverse Cox associations
plot_top_hr_bars(
  top_results,
  facet_col = "dataset",
  hr_col = "HR",
  lower_col = "lower95",
  upper_col = "upper95",
  label_col = "label"
)
top_results |
A data.frame from |
facet_col |
Optional column used for faceting, commonly |
hr_col |
HR column. |
lower_col |
Lower confidence-limit column. |
upper_col |
Upper confidence-limit column. |
label_col |
Label column. |
A ggplot object.
Draw grouped distributions using violin layers with optional boxplot overlay.
plot_violin(
  data,
  x,
  y,
  fill = NULL,
  palette = NULL,
  add_boxplot = TRUE,
  add_points = FALSE,
  title = NULL,
  xlab = NULL,
  ylab = NULL,
  base_size = 7
)
data |
A data.frame. |
x |
Character column name for groups. |
y |
Character. Name of the numeric column to plot. |
fill |
Optional fill grouping column. Defaults to |
palette |
Optional vector of fill colors. |
add_boxplot |
Logical. Overlay a narrow boxplot. |
add_points |
Logical. Overlay jittered observations. |
title |
Optional title. If NULL, no title is shown. |
xlab |
Optional x-axis label. |
ylab |
Optional y-axis label. |
base_size |
Base font size. |
A ggplot object.
Plot a UKB ML Flow Object
## S3 method for class 'ukb_ml_flow'
plot(x, type = c("roc", "shap_beeswarm"), ...)
x |
A |
type |
Plot type: |
... |
Additional arguments passed to the underlying plot function. |
A ggplot2 object.
Plot a UKB ML Flow Comparison Object
## S3 method for class 'ukb_ml_flow_compare'
plot(x, type = c("roc"), ...)
x |
A |
type |
Plot type. Currently |
... |
Additional arguments passed to |
A ggplot2 object.
Combines custom parameter estimates (not limited to regression coefficients) from multiply imputed datasets using Rubin's Rules.
pool_custom_estimates(
  estimates,
  variances,
  df.complete = Inf,
  conf.level = 0.95,
  labels = NULL
)
estimates |
A list of numeric vectors containing point estimates from each imputed dataset. All vectors must have the same length. |
variances |
A list of variance-covariance matrices (or single variances as 1x1 matrices) corresponding to the estimates. |
df.complete |
Complete-data degrees of freedom. Default Inf. |
conf.level |
Confidence level for intervals. Default 0.95. |
labels |
Character vector of labels for the estimates. If NULL, names are taken from the first estimate vector or generated as "est1", "est2", etc. |
An object of class mi_pooled_result.
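To make the pooling step concrete, the base-R sketch below applies Rubin's Rules to a single scalar estimate from five imputed datasets. The numbers are invented, and the degrees-of-freedom formula is the classical large-sample version that corresponds to `df.complete = Inf`. It is an illustration of the rules themselves, not the package's code path.

```r
# Invented point estimates and squared standard errors, one per imputed dataset
est        <- c(0.50, 0.55, 0.45, 0.52, 0.48)
var_within <- c(0.010, 0.012, 0.011, 0.009, 0.010)
m <- length(est)

# Rubin's Rules: pooled estimate, within- and between-imputation variance
q_bar <- mean(est)
u_bar <- mean(var_within)
b     <- var(est)
t_var <- u_bar + (1 + 1 / m) * b  # total variance

# Classical degrees of freedom (assumes infinite complete-data df)
r  <- (1 + 1 / m) * b / u_bar
df <- (m - 1) * (1 + 1 / r)^2

# 95% confidence interval based on the t distribution
ci <- q_bar + c(-1, 1) * qt(0.975, df) * sqrt(t_var)
```

The between-imputation term `b` is what distinguishes the pooled variance from a simple average of the per-dataset variances.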
Combines results of regression analyses performed on multiply imputed datasets using Rubin's Rules via the mitools package.
pool_mi_models(
  models = NULL,
  datasets = NULL,
  formula = NULL,
  model_type = c("lm", "logistic", "poisson", "cox", "negbin"),
  family = NULL,
  df.complete = Inf,
  conf.level = 0.95,
  exponentiate = NULL
)
models |
A list of fitted model objects (one per imputed dataset).
If NULL, models will be fitted using |
datasets |
A list of data.frames or an |
formula |
A formula specifying the model. Required if |
model_type |
Character string specifying the model type:
|
family |
A |
df.complete |
Complete-data degrees of freedom for small-sample correction. Default is Inf (large sample approximation). |
conf.level |
Confidence level for intervals. Default 0.95. |
exponentiate |
Logical; whether to exponentiate coefficients (for OR/HR/RR). If NULL, automatically determined based on model type. |
An object of class mi_pooled_result containing:
Data frame with pooled estimates, standard errors, CIs, p-values, and FMI
The raw MIresult object from mitools
Number of imputed datasets
The model type used
The model formula
Whether estimates are exponentiated
The function call
A unified function to preprocess UKB baseline characteristics with automatic field mapping and standardized transformations.
preprocess_baseline(
  df,
  variables,
  custom_mapping = NULL,
  missing_action = c("keep", "drop"),
  invalid_codes = c(-1, -3)
)
df |
A data.table or data.frame containing UKB data from a RAP platform export. |
variables |
Character vector of variable names to process.
Use |
custom_mapping |
Optional named list for user-defined variable mappings.
Each element should have: |
missing_action |
Character. How to handle missing values:
|
invalid_codes |
Numeric vector of UKB codes to treat as missing. Default: c(-1, -3), which are "Do not know" and "Prefer not to answer", respectively. |
A data.table with original data plus processed variable columns
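The handling of `invalid_codes` can be pictured with a small base-R sketch: special codes are turned into `NA`, and rows are then either kept or dropped. The column names `smoking_raw` and `smoking` are hypothetical, and the two branches are only an assumption about what `missing_action = "keep"` and `"drop"` mean in practice.

```r
# Toy export with a coded variable (hypothetical column names)
raw <- data.frame(eid = 1:5, smoking_raw = c(0, 1, -3, 2, -1))
invalid_codes <- c(-1, -3)

# Keep the original column and add a processed one with special codes set to NA
clean <- raw
clean$smoking <- ifelse(clean$smoking_raw %in% invalid_codes, NA, clean$smoking_raw)

# "keep": retain all rows; "drop": remove rows with a missing processed value
kept    <- clean
dropped <- clean[!is.na(clean$smoking), ]
```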
Print mediation analysis results.
## S3 method for class 'mediation_result'
print(x, ...)
x |
An object of class "mediation_result". |
... |
Additional arguments passed to summary. |
Invisibly returns x, the original mediation result object.
Convert a vector of protein identifiers into HGNC gene symbols for downstream
enrichment analysis. When a custom mapping table is supplied, it is used
first. Remaining unmatched identifiers can then be mapped with
clusterProfiler::bitr(). Inputs in UK Biobank Olink coding 143 format,
such as "IL6;Interleukin-6", and RAP-exported Olink column names such as
"olink_instance_0.eno2" are parsed automatically. Multi-target Olink
symbols such as "IL12A_IL12B" are expanded into one row per gene symbol.
protein_to_gene_symbol(
  proteins,
  protein_col = NULL,
  from_type = "SYMBOL",
  mapping_table = NULL,
  mapping_protein_col = "protein",
  mapping_symbol_col = "gene_symbol",
  organism_db = "org.Hs.eg.db",
  drop_unmapped = TRUE
)
proteins |
A character vector of protein identifiers, or a data.frame containing a protein identifier column. |
protein_col |
Optional column name when |
from_type |
Character string. Identifier type used by Bioconductor when
|
mapping_table |
Optional data.frame containing custom protein-to-symbol mappings. |
mapping_protein_col |
Column name in |
mapping_symbol_col |
Column name in |
organism_db |
Character string naming the OrgDb package. Default is
|
drop_unmapped |
Logical. If |
A data.frame with columns protein, gene_symbol, and
mapping_source.
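The three input formats mentioned in the description can be parsed with a few regular expressions. The base-R sketch below is a simplified guess at that parsing step. It does not call `clusterProfiler::bitr()`, and splitting every underscore is an assumption that would need refinement for symbols that legitimately contain one.

```r
# Example identifiers taken from the description above
ids <- c("IL6;Interleukin-6", "olink_instance_0.eno2", "IL12A_IL12B")

# Coding 143 style: keep the text before the first semicolon
sym <- sub(";.*$", "", ids)

# RAP column names: drop the "olink_instance_<n>." prefix and upper-case the rest
sym <- toupper(sub("^olink_instance_[0-9]+\\.", "", sym))

# Multi-target symbols: one row per gene symbol (simplified: split on "_")
parts <- strsplit(sym, "_", fixed = TRUE)
mapping <- data.frame(
  protein     = rep(ids, lengths(parts)),
  gene_symbol = unlist(parts),
  stringsAsFactors = FALSE
)
```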
A thin wrapper around TCMDATA::rank_ppi_nodes().
rank_protein_ppi_nodes(
  ppi,
  metrics = c("degree", "betweenness", "closeness", "eccentricity", "radiality",
    "Stress", "MCC", "MNC", "DMNC", "BN", "EPC"),
  weights = NULL,
  use_weight = TRUE,
  na_rm = TRUE
)
ppi |
An |
metrics |
Character vector of node metrics used for integrated ranking. |
weights |
Optional numeric weights for |
use_weight |
Logical. Whether to prefer weighted betweenness and
closeness metrics. Default is |
na_rm |
Logical. Whether to ignore missing values during normalization.
Default is |
A list with components graph and table.
Uses dx extract_dataset --fields-file and reads the RAP-generated
result back into R within the active RAP session. This is intended for small
to medium extractions. For large phenotype pulls, use
rap_submit_extract().
rap_extract_pheno(
  field_id = NULL,
  field_names = NULL,
  variables = NULL,
  dataset = NULL,
  output = NULL,
  read = TRUE,
  strip_entity_prefix = FALSE,
  dry_run = FALSE,
  timeout = 300,
  ...
)
field_id |
UKB numeric field IDs to extract. |
field_names |
Exact RAP dataset column names to extract. |
variables |
Optional predefined variable names from
|
dataset |
Dataset file name. If NULL, |
output |
Optional CSV output path in the current RAP session. If NULL, a temporary file is used. |
read |
Logical. If TRUE, read the CSV into R and return a data.table. If FALSE, return the output path. |
strip_entity_prefix |
Logical. If TRUE, remove |
dry_run |
Logical. If TRUE, return the extraction plan without running
|
timeout |
Timeout in seconds for the extraction. |
... |
Additional arguments passed to |
A data.table when read = TRUE; otherwise the output CSV path.
In dry-run mode, returns a rap_extract_plan.
Find the RAP Dataset File in the Current Project
rap_find_dataset(refresh = FALSE, timeout = 30)
refresh |
Logical. If TRUE, ignore the cached dataset name and call
|
timeout |
Timeout in seconds for the |
A character scalar naming the detected .dataset file.
List Approved RAP Dataset Fields
rap_list_fields(
  dataset = NULL,
  pattern = NULL,
  entity = "participant",
  refresh = FALSE,
  timeout = 120
)
dataset |
Dataset file name. If NULL, |
pattern |
Optional regular expression applied to field names and titles. |
entity |
Dataset entity. Defaults to |
refresh |
Logical. If TRUE, bypass the session cache. |
timeout |
Timeout in seconds for |
A data.frame with columns field_name and title.
Plan a RAP Phenotype Extraction
rap_plan_extract(
  field_id = NULL,
  field_names = NULL,
  variables = NULL,
  dataset = NULL,
  fields_df = NULL,
  entity = "participant",
  include_eid = TRUE,
  table_exporter = FALSE,
  manifest = NULL
)
field_id |
UKB numeric field IDs to extract. All instances and arrays are included. |
field_names |
Exact RAP dataset column names, such as
|
variables |
Optional predefined variable names from
|
dataset |
Dataset file name. If NULL, |
fields_df |
Optional cached field listing from |
entity |
Dataset entity. Defaults to |
include_eid |
Logical. Include participant ID automatically. |
table_exporter |
Logical. If TRUE, return field names in the format expected by the RAP table-exporter app. |
manifest |
Optional manifest CSV path in the current RAP session. |
A list containing extraction field names, matched requests, unmatched requests, dataset, entity, and column counts.
Submits an asynchronous RAP table-exporter job. This is the preferred
interface for large extraction jobs because the work runs on RAP rather than
inside the current R session.
rap_submit_extract(
  field_id = NULL,
  field_names = NULL,
  variables = NULL,
  dataset = NULL,
  file = NULL,
  instance_type = NULL,
  priority = c("low", "high"),
  dry_run = FALSE,
  manifest = NULL,
  ...
)
field_id |
UKB numeric field IDs to extract. |
field_names |
Exact RAP dataset column names to extract. |
variables |
Optional predefined variable names from
|
dataset |
Dataset file name. If NULL, |
file |
Output file stem on RAP. Defaults to
|
instance_type |
DNAnexus instance type. If NULL, selected from the number of columns. |
priority |
Job priority: |
dry_run |
Logical. If TRUE, return the planned fields and command metadata without uploading or submitting. |
manifest |
Optional manifest CSV path in the current RAP session. |
... |
Additional arguments passed to |
A list with class rap_extract_job containing job metadata.
In dry-run mode, returns a rap_extract_plan.
Before the formal regression analysis, it can be useful to check the correlation between variables. This function calculates the correlation matrix for a set of specified variables, which can help identify potential multicollinearity issues or inform variable selection.
run_correlation(df, vars, method = "pearson", threshold = 0.7)
df |
A data.frame or data.table containing the variables of interest. |
vars |
A character vector of column names for which to calculate the correlation matrix. |
method |
The method to use for calculating correlation. Options are "pearson", "spearman", or "kendall". Default is "pearson". |
threshold |
Numeric value between 0 and 1. Variable pairs whose absolute correlation exceeds this threshold are highlighted in the output. Default is 0.7. |
A correlation matrix of the specified variables.
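The same kind of check can be sketched in base R with `cor()`: compute the matrix, then list the pairs above the threshold. The simulated variables `x1` to `x3` and the `flagged` table are invented for this illustration. How the package itself highlights pairs is not shown in this manual.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
df <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.3),  # strongly correlated with x1 by construction
  x3 = rnorm(n)                  # independent noise
)

# Pearson correlation matrix, using pairwise complete observations
cor_mat <- cor(df[, c("x1", "x2", "x3")], method = "pearson",
               use = "pairwise.complete.obs")

# Flag each variable pair once (upper triangle) when |r| exceeds the threshold
threshold <- 0.7
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
```

Pairs that appear in `flagged` are candidates for dropping or combining before they enter the same regression model.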
Run multiple imputation with the CRAN package mice on a subset of variables, then merge the imputed columns back to the original dataset by an ID column.
run_imputation(
  data,
  id_col = "eid",
  vars,
  factor_vars = NULL,
  method = "pmm",
  m = 5,
  maxit = 10,
  seed = 1234,
  print = TRUE,
  additional_data = NULL,
  additional_join = c("inner", "left")
)
data |
A data.frame/data.table containing the cohort. |
id_col |
Name of the ID column. Default is |
vars |
Character vector of column names to impute. |
factor_vars |
Optional character vector of variables (subset of
|
method |
Imputation method passed to |
m |
Number of multiple imputations. Default is 5. |
maxit |
Maximum number of iterations. Default is 10. |
seed |
Random seed for reproducibility. |
print |
Logical. If TRUE, show mice iteration logs. |
additional_data |
Optional named list of extra datasets to merge after
imputation. Each element must contain |
additional_join |
Join type for additional datasets. One of
|
This function is designed for workflows where you want to keep a set of "static" columns (exposures, outcomes, follow-up time, etc.) untouched while imputing a selected set of covariates.
The function:
Subsets the input data to the requested variables.
Runs mice().
Creates m completed datasets and merges imputed columns back.
Optionally merges additional datasets (e.g., omics) by ID.
Factor handling: for variables listed in factor_vars, the function will
coerce them to factors before imputation. All other variables in vars
are coerced to numeric.
A list with:
imp: the mice mids object
data_list: a list of length m containing completed and
merged datasets
https://github.com/amices/mice
Perform regression-based causal mediation analysis using the regmedint package. Supports linear, logistic, and Cox proportional hazards models for the outcome, and linear or logistic models for the mediator.
run_mediation(
  data,
  exposure,
  mediator,
  outcome,
  covariates = NULL,
  exposure_levels = c(0, 1),
  mediator_value = 0,
  covariate_values = NULL,
  mediator_type = c("continuous", "binary"),
  outcome_type = c("linear", "logistic", "cox"),
  endpoint = NULL,
  interaction = TRUE,
  boot = FALSE,
  boot_n = 1000,
  conf_level = 0.95
)
data |
A data.frame or data.table containing all variables. |
exposure |
Character string specifying the exposure (treatment) variable name. |
mediator |
Character string specifying the mediator variable name. |
outcome |
Character string specifying the outcome variable name. For Cox models, this should be the time variable. |
covariates |
Character vector of covariate names. Default NULL. |
exposure_levels |
Numeric vector of length 2: c(a0, a1) where a0 is the reference level and a1 is the comparison level. Default c(0, 1). |
mediator_value |
Numeric value at which to evaluate the controlled direct effect (CDE). Default 0. |
covariate_values |
Numeric vector of covariate values at which to evaluate conditional effects. If NULL, uses mean (continuous) or mode (categorical). |
mediator_type |
Character string: "continuous" or "binary". Default "continuous". |
outcome_type |
Character string: "linear", "logistic", or "cox". Default "linear". |
endpoint |
Character vector of length 2 for Cox models: c("time_col", "status_col"). Required when outcome_type = "cox". |
interaction |
Logical; whether to include exposure-mediator interaction in the outcome model. Default TRUE. |
boot |
Logical; whether to use bootstrap for confidence intervals. Default FALSE. |
boot_n |
Integer; number of bootstrap replicates. Default 1000. |
conf_level |
Numeric; confidence level. Default 0.95. |
This function wraps the regmedint package to provide a user-friendly interface for causal mediation analysis. It implements the methods described in Valeri & VanderWeele (2013, 2015).
Effect definitions:
cde: Controlled Direct Effect - effect of exposure with mediator fixed
pnde: Pure Natural Direct Effect - direct effect (traditional NDE)
tnie: Total Natural Indirect Effect - indirect effect (traditional NIE)
tnde: Total Natural Direct Effect
pnie: Pure Natural Indirect Effect
te: Total Effect = NDE + NIE
pm: Proportion Mediated = NIE / TE
An object of class "mediation_result" containing:
data.frame with effect estimates, SE, CI, and p-values
Fitted mediator model object
Fitted outcome model object
Original regmedint object (if available)
The matched call
List of analysis parameters
Valeri L, VanderWeele TJ. Mediation analysis allowing for exposure-mediator interactions and causal interpretation. Psychological Methods. 2013;18(2):137-150.
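For the simplest case (continuous mediator, linear outcome, exposure-mediator interaction), the effect definitions above have closed forms given by Valeri and VanderWeele. The base-R sketch below evaluates them from two `lm()` fits on simulated data. It bypasses regmedint entirely, all variable names and coefficients are invented, and no standard errors are computed, so it is an illustration of the formulas only.

```r
set.seed(42)
n  <- 2000
a  <- rbinom(n, 1, 0.5)                     # binary exposure
c1 <- rnorm(n)                              # one covariate
m  <- 0.5 * a + 0.3 * c1 + rnorm(n)         # continuous mediator
y  <- 0.4 * a + 0.6 * m + 0.2 * a * m + 0.1 * c1 + rnorm(n)
dat <- data.frame(a, m, y, c1)

# Mediator model and outcome model with exposure-mediator interaction
med_fit <- lm(m ~ a + c1, data = dat)
out_fit <- lm(y ~ a * m + c1, data = dat)
b  <- coef(med_fit)
th <- coef(out_fit)

# Evaluation settings: exposure levels, mediator value for the CDE, covariate value
a0 <- 0; a1 <- 1; m_cde <- 0; c_val <- mean(dat$c1)

# Closed-form effects on the difference scale
cde  <- unname((th["a"] + th["a:m"] * m_cde) * (a1 - a0))
pnde <- unname((th["a"] + th["a:m"] *
                  (b["(Intercept)"] + b["a"] * a0 + b["c1"] * c_val)) * (a1 - a0))
tnie <- unname((th["m"] + th["a:m"] * a1) * b["a"] * (a1 - a0))
te   <- pnde + tnie
pm   <- tnie / te
```

In this simulation the data-generating values imply a direct and an indirect effect of about 0.4 each, so `pm` should land near one half.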
Run over-representation analysis (ORA) for a metabolite list. The recommended first backend is
backend = "custom", where users provide a two-column metabolite pathway
library. A backend = "metaboanalyst" interface is also provided for users
who have installed MetaboAnalystR and want to use its metabolite-set
libraries, such as "smpdb_pathway".
run_metabolite_ora(
  metabolites,
  pathway_db = NULL,
  universe = NULL,
  backend = c("custom", "metaboanalyst"),
  id_type = "name",
  library = "smpdb_pathway",
  mapping_table = NULL,
  pathway_col = "pathway",
  metabolite_col = "metabolite",
  min_metabolites = 3,
  p_adjust_method = "BH",
  run_subprocess = TRUE
)
metabolites |
Character vector of metabolite names. |
pathway_db |
Optional data.frame for custom ORA. Must contain pathway and metabolite columns. |
universe |
Optional background metabolite vector. If |
backend |
One of |
id_type |
Metabolite identifier type for MetaboAnalystR
cross-referencing. Default |
library |
MetaboAnalystR metabolite-set library. Default
|
mapping_table |
Optional custom mapping table passed to
|
pathway_col |
Column name in |
metabolite_col |
Column name in |
min_metabolites |
Minimum mapped metabolites required for ORA.
Default |
p_adjust_method |
Multiple-testing method used by |
run_subprocess |
Logical. For |
A list of class ukb_metabolite_ora with components input,
mapping, matched, unmatched, ora_result, backend, and library.
panel <- load_ukb_metabolite_panel()
hits <- c("Alanine", "Glutamine", "Glycine", "Lactate", "Pyruvate")
pathway_db <- data.frame(
  pathway = c(rep("Amino acid metabolism", 3), rep("Energy metabolism", 2)),
  metabolite = c("L-Alanine", "L-Glutamine", "Glycine", "Lactic acid", "Pyruvic acid")
)
run_metabolite_ora(hits, pathway_db = pathway_db, backend = "custom")
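Over-representation analysis with a custom two-column library usually reduces to a hypergeometric test per pathway. The base-R sketch below shows that test on a tiny library. It assumes the hit names are already harmonized to the library's names and uses the library's own metabolites as the universe. Both are simplifications, and the package's mapping step is not reproduced.

```r
# Two-column pathway library, as in the example above
pathway_db <- data.frame(
  pathway    = c(rep("Amino acid metabolism", 3), rep("Energy metabolism", 2)),
  metabolite = c("L-Alanine", "L-Glutamine", "Glycine", "Lactic acid", "Pyruvic acid")
)

# Hits already expressed in the library's naming (assumption for this sketch)
hits     <- c("L-Alanine", "L-Glutamine", "Glycine")
universe <- unique(pathway_db$metabolite)

# One hypergeometric upper-tail test per pathway
sets <- split(pathway_db$metabolite, pathway_db$pathway)
ora <- do.call(rbind, lapply(sets, function(set) {
  k <- sum(hits %in% set)  # hits inside the pathway
  data.frame(
    set_size = length(set),
    overlap  = k,
    pvalue   = phyper(k - 1, length(set), length(universe) - length(set),
                      length(hits), lower.tail = FALSE)
  )
}))

# Multiple-testing adjustment (Benjamini-Hochberg)
ora$p_adjust <- p.adjust(ora$pvalue, method = "BH")
```

With a universe this small the p-values are not meaningful. The point is only the structure of the test.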
Perform mediation analysis for multiple potential mediators, testing each one separately.
run_multi_mediator( data, exposure, mediators, outcome, covariates = NULL, mediator_type = "continuous", outcome_type = "linear", endpoint = NULL, ... )
data |
A data.frame or data.table containing all variables. |
exposure |
Character string specifying the exposure (treatment) variable name. |
mediators |
Character vector of mediator variable names. |
outcome |
Character string specifying the outcome variable name. For Cox models, this should be the time variable. |
covariates |
Character vector of covariate names. Default NULL. |
mediator_type |
Character string: "continuous" or "binary". Default "continuous". |
outcome_type |
Character string: "linear", "logistic", or "cox". Default "linear". |
endpoint |
Character vector of length 2 for Cox models: c("time_col", "status_col"). Required when outcome_type = "cox". |
... |
Additional arguments passed to |
A data.frame with mediation results for each mediator, including:
Mediator variable name
Total natural indirect effect estimate
Standard error of TNIE
P-value for TNIE
Pure natural direct effect estimate
Total effect estimate
Proportion mediated
Standard error of proportion mediated
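For a continuous mediator and a linear outcome, the quantities listed above can be illustrated with the classical product-of-coefficients decomposition. The following is a minimal sketch on simulated data using base R only; it is not the package's implementation, it assumes no exposure-mediator interaction, and the variable names (`x`, `m`, `y`, `age`) are hypothetical.

```r
set.seed(1)
n   <- 500
age <- rnorm(n, 55, 8)
x   <- rbinom(n, 1, 0.5)                          # exposure
m   <- 0.5 * x + 0.02 * age + rnorm(n)            # mediator
y   <- 0.3 * x + 0.4 * m + 0.01 * age + rnorm(n)  # outcome
dat <- data.frame(x, m, y, age)

fit_m   <- lm(m ~ x + age, data = dat)      # mediator model
fit_y   <- lm(y ~ x + m + age, data = dat)  # outcome model, adjusted for mediator
fit_tot <- lm(y ~ x + age, data = dat)      # total-effect model

a        <- coef(fit_m)[["x"]]    # exposure -> mediator
b        <- coef(fit_y)[["m"]]    # mediator -> outcome
direct   <- coef(fit_y)[["x"]]    # direct effect
indirect <- a * b                 # indirect effect
total    <- coef(fit_tot)[["x"]]  # total effect

prop_mediated <- indirect / total
```

With ordinary least squares and identical covariates in all three models, the total effect equals the sum of the direct and indirect effects, which gives a quick consistency check on any reported proportion mediated.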
Perform subgroup analyses across multiple subgroup variables.
run_multi_subgroup( data, exposure, outcome = NULL, subgroup_vars, covariates = NULL, model_type = c("cox", "logistic", "linear", "glm", "negbin"), family = "poisson", endpoint = NULL )
data |
A data.frame or data.table containing all variables. |
exposure |
Character string specifying the exposure variable name. |
outcome |
Character string specifying the outcome variable name. For Cox models, this can be NULL if endpoint is specified. |
subgroup_vars |
Character vector of subgroup variable names. |
covariates |
Character vector of covariate names to adjust for. Default NULL. |
model_type |
Character string specifying model type: "cox", "logistic", "linear", "glm", or "negbin". |
family |
For |
endpoint |
Character vector of length 2 for Cox models: c("time", "status"). Required when model_type = "cox". |
A data.frame with results from all subgroup analyses combined.
Convert protein identifiers to gene symbols, then to Entrez IDs, and run
over-representation analysis (ORA) with clusterProfiler::enrichKEGG().
run_protein_kegg_ora( proteins, protein_col = NULL, from_type = "SYMBOL", mapping_table = NULL, mapping_protein_col = "protein", mapping_symbol_col = "gene_symbol", universe = NULL, organism_db = "org.Hs.eg.db", organism = "hsa", pvalueCutoff = 0.05, qvalueCutoff = 0.2, pAdjustMethod = "BH", minGSSize = 10, maxGSSize = 500, readable = TRUE, use_internal_data = FALSE )
proteins |
A character vector of protein identifiers, or a data.frame containing a protein identifier column. |
protein_col |
Optional column name when |
from_type |
Character string describing the input identifier type for
Bioconductor-based mapping. Default is "SYMBOL". |
mapping_table |
Optional data.frame containing custom protein-to-symbol mappings. |
mapping_protein_col |
Column name in |
mapping_symbol_col |
Column name in |
universe |
Optional character vector of background protein identifiers.
These identifiers are converted with the same rules as |
organism_db |
Character string naming the OrgDb package. Default is
"org.Hs.eg.db". |
organism |
Character string for KEGG organism code. Default is "hsa". |
pvalueCutoff |
Numeric p-value cutoff for ORA. Default is 0.05. |
qvalueCutoff |
Numeric q-value cutoff for ORA. Default is 0.2. |
pAdjustMethod |
Character string for multiple-testing correction.
Default is "BH". |
minGSSize |
Minimum gene set size. Default is 10. |
maxGSSize |
Maximum gene set size. Default is 500. |
readable |
Logical. Passed to |
use_internal_data |
Logical. Passed to |
A list with components gene_symbols, entrez_ids, mapping,
universe_symbols, universe_entrez_ids, and ora_result.
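Over-representation analysis for a single gene set reduces to a one-sided hypergeometric test followed by multiple-testing correction. The sketch below shows that calculation in base R with made-up identifiers; it is an illustration of the statistic only, not a reproduction of what clusterProfiler does internally, and the extra p-values passed to `p.adjust()` are hypothetical.

```r
# Hypothetical inputs: background universe, one gene set, and a hit list
universe <- paste0("G", 1:1000)
gene_set <- paste0("G", 1:50)
hits     <- c(paste0("G", 1:10), paste0("G", 501:540))

k <- length(intersect(hits, gene_set))  # hits that fall inside the set
M <- length(gene_set)                   # gene set size
N <- length(universe)                   # background size
n <- length(hits)                       # number of hits

# P(X >= k) under the hypergeometric null
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# Benjamini-Hochberg adjustment across this and two hypothetical other sets
p_adj <- p.adjust(c(p_value, 0.20, 0.04), method = "BH")
```

The choice of `universe` changes `N` and therefore the p-value, which is why the background should match the set of proteins that could have been detected in the assay.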
Convert protein identifiers to gene symbols and run over-representation
analysis (ORA) with clusterProfiler::enrichGO(). This is the GO-specific
interface for proteomics hits extracted from UK Biobank RAP Olink data.
run_protein_ora( proteins, protein_col = NULL, from_type = "SYMBOL", mapping_table = NULL, mapping_protein_col = "protein", mapping_symbol_col = "gene_symbol", universe = NULL, organism_db = "org.Hs.eg.db", ont = "BP", pvalueCutoff = 0.05, qvalueCutoff = 0.2, pAdjustMethod = "BH", minGSSize = 10, maxGSSize = 500, readable = TRUE )
proteins |
A character vector of protein identifiers, or a data.frame containing a protein identifier column. |
protein_col |
Optional column name when |
from_type |
Character string describing the input identifier type for
Bioconductor-based mapping. Default is "SYMBOL". |
mapping_table |
Optional data.frame containing custom protein-to-symbol mappings. |
mapping_protein_col |
Column name in |
mapping_symbol_col |
Column name in |
universe |
Optional character vector of background protein identifiers.
These identifiers are converted with the same rules as |
organism_db |
Character string naming the OrgDb package. Default is
"org.Hs.eg.db". |
ont |
One of |
pvalueCutoff |
Numeric p-value cutoff for ORA. Default is 0.05. |
qvalueCutoff |
Numeric q-value cutoff for ORA. Default is 0.2. |
pAdjustMethod |
Character string for multiple-testing correction.
Default is "BH". |
minGSSize |
Minimum gene set size. Default is 10. |
maxGSSize |
Maximum gene set size. Default is 500. |
readable |
Logical. Passed to |
A list with components gene_symbols, mapping, universe_symbols,
and ora_result.
Unified interface for community detection in STRING-derived PPI networks.
New analyses should call this function and choose the algorithm with
method. Method-specific helper functions are retained internally.
run_protein_ppi_clustering( ppi, method = c("fastgreedy", "louvain", "mcode", "mcl"), ... )
ppi |
An |
method |
Clustering algorithm. Options are "fastgreedy", "louvain", "mcode", and "mcl". |
... |
Method-specific arguments passed to the selected clustering
helper, such as |
An igraph object with method-specific cluster attributes.
Convert target protein identifiers to gene symbols and run STRING-network
robustness analysis via TCMDATA::ppi_knock().
run_protein_ppi_robustness( ppi, targets, target_col = NULL, from_type = "SYMBOL", mapping_table = NULL, mapping_protein_col = "protein", mapping_symbol_col = "gene_symbol", organism_db = "org.Hs.eg.db", n_perm = 100L, weight_attr = "score", rewire_niter = 10L, seed = 42L )
ppi |
An |
targets |
A character vector of target protein identifiers, or a data.frame containing a target identifier column. |
target_col |
Optional column name when |
from_type |
Character string describing the input identifier type for
Bioconductor-based mapping. Default is "SYMBOL". |
mapping_table |
Optional data.frame containing custom protein-to-symbol mappings. |
mapping_protein_col |
Column name in |
mapping_symbol_col |
Column name in |
organism_db |
Character string naming the OrgDb package. Default is
"org.Hs.eg.db". |
n_perm |
Integer. Number of permutation iterations. Default is 100. |
weight_attr |
Character. Edge attribute containing the confidence score.
Default is "score". |
rewire_niter |
Integer. Rewiring multiplier used in the null model.
Default is 10. |
seed |
Integer random seed. Default is 42. |
A list with components targets, mapping, and robustness.
Fits a restricted cubic spline (RCS) model to characterise nonlinear
exposure-response relationships. Supports Cox, logistic, and linear
regression. Returns prediction curves, confidence intervals, overall and
nonlinear P values, and the AIC-selected knot count. The returned object
is passed directly to plot_rcs() for publication-ready figures.
run_rcs( data, exposure, covariates = NULL, model_type = c("cox", "logistic", "linear"), endpoint = NULL, outcome = NULL, knots = NULL, knot_range = 3:7, ref = NULL, ref_quantile = 0.5, conf_level = 0.95, trim_quantiles = c(0.01, 0.99), grid_size = 200L, backend = c("rms", "ns") )
data |
A data.frame containing all required columns. |
exposure |
Character. Name of the continuous exposure variable. |
covariates |
Character vector of covariate names, or |
model_type |
One of "cox", "logistic", or "linear". |
endpoint |
Character vector of length 2 giving |
outcome |
Character. Outcome column name. Required for
|
knots |
Integer. Number of knots (3-7). If |
knot_range |
Integer vector of candidate knot counts for AIC selection.
Default 3:7. |
ref |
Numeric. Reference value for the exposure. If |
ref_quantile |
Numeric (0-1). Quantile of the exposure used as the
reference when |
conf_level |
Numeric. Confidence level for intervals. Default 0.95. |
trim_quantiles |
Numeric vector of length 2. Exposure values outside
these quantiles are excluded before fitting. Default c(0.01, 0.99). |
grid_size |
Integer. Number of points in the prediction grid. Default 200. |
backend |
One of "rms" or "ns". |
An object of class c("ukb_rcs", "list") with elements:
The fitted model object.
Character. One of "cox", "logistic", "linear".
Character. "rms" or "ns".
Character. Name of the exposure variable.
Character vector of covariate names.
Character vector. Cox endpoint columns.
Character. Outcome column name.
Integer. Number of knots used.
Numeric. Reference exposure value.
Integer. Number of observations in the fitted model.
Integer. Number of events (Cox only, else NA).
Numeric. Overall P value for the exposure term.
Numeric. P value for the nonlinear component.
data.frame with columns x, estimate,
lower95, upper95.
data.frame with column x (untrimmed exposure values).
data.frame with columns knots and AIC.
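The AIC-based knot selection and the overall and nonlinear tests can be sketched for a linear outcome with natural cubic splines from the base `splines` package. This is an illustration on simulated data, not the package's code. It assumes that k knots (boundary knots included) correspond to `ns(df = k - 1)`, and the names `bmi`, `y`, `aic_tab`, and `curve_df` are hypothetical.

```r
library(splines)

set.seed(42)
n   <- 400
bmi <- rnorm(n, 27, 4)
y   <- 0.05 * (bmi - 27)^2 + rnorm(n)  # U-shaped association
dat <- data.frame(bmi, y)

# Candidate knot counts; assumption: k knots <-> ns(df = k - 1)
knot_range <- 3:7
aic_tab <- data.frame(
  knots = knot_range,
  AIC   = sapply(knot_range, function(k) AIC(lm(y ~ ns(bmi, df = k - 1), data = dat)))
)
best_k <- aic_tab$knots[which.min(aic_tab$AIC)]

fit_rcs  <- lm(y ~ ns(bmi, df = best_k - 1), data = dat)
fit_lin  <- lm(y ~ bmi, data = dat)
fit_null <- lm(y ~ 1, data = dat)

# Overall test: spline vs. no exposure; nonlinear test: spline vs. linear term
p_overall   <- anova(fit_null, fit_rcs)[2, "Pr(>F)"]
p_nonlinear <- anova(fit_lin, fit_rcs)[2, "Pr(>F)"]

# Prediction curve relative to the median exposure, on a trimmed grid
ref  <- unname(quantile(dat$bmi, 0.5))
grid <- data.frame(bmi = seq(unname(quantile(dat$bmi, 0.01)),
                             unname(quantile(dat$bmi, 0.99)),
                             length.out = 200))
curve_df <- data.frame(
  x        = grid$bmi,
  estimate = as.numeric(predict(fit_rcs, grid)) -
             as.numeric(predict(fit_rcs, data.frame(bmi = ref)))
)
```

For Cox or logistic models the same structure applies with likelihood-ratio tests in place of the F tests, and the curve is usually exponentiated to the hazard-ratio or odds-ratio scale.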
A unified wrapper around runmulti_cox, runmulti_lm,
runmulti_logit, runmulti_glm, runmulti_negbin, and
runmulti_gam. Select the model family with type.
run_regression( data, main_var, type = c("cox", "lm", "logit", "glm", "negbin", "gam"), outcome = NULL, endpoint = c("time", "status"), covariates = NULL, covariate_sets = NULL, family = NULL, smooth = TRUE, ... )
data |
A data.frame or data.table containing all variables. |
main_var |
A character vector of main variable names to test. |
type |
One of "cox", "lm", "logit", "glm", "negbin", or "gam". |
outcome |
For all types except |
endpoint |
For |
covariates |
A character vector of covariate names. Default NULL. |
covariate_sets |
Optional named list of covariate sets for nested
epidemiological models. Each element must be |
family |
For |
smooth |
For |
... |
Additional arguments forwarded to the underlying fitting function. |
A data.frame whose columns depend on type:
"cox": variable, coef, se, z, HR, lower95, upper95, pvalue, n, n_event
"lm": variable, beta, lower95, upper95, pvalue
"logit": variable, OR, lower95, upper95, pvalue
"glm": variable, family, link, beta, lower95, upper95, pvalue, n
"negbin": variable, IRR, lower95, upper95, pvalue, theta, n
"gam" with smooth = TRUE: variable, edf, ref_df, F, pvalue, family, link, n
"gam" with smooth = FALSE: variable, beta, lower95, upper95, pvalue, family, link, n
Perform sensitivity analysis to assess the impact of unmeasured confounding on mediation effect estimates.
run_sensitivity_mediation( mediation_result, rho_values = seq(-0.9, 0.9, by = 0.1) )
mediation_result |
An object of class "mediation_result" from |
rho_values |
Numeric vector of sensitivity parameter values (correlation between unmeasured confounder and mediator/outcome residuals). Default seq(-0.9, 0.9, by = 0.1). |
This function evaluates how the indirect effect would change if there were unmeasured confounding of the mediator-outcome relationship. The rho parameter represents the correlation between residuals that would be induced by an unmeasured confounder.
A robust mediation effect should remain significant across a range of plausible rho values.
A data.frame with effect estimates under different rho values.
Perform subgroup analysis by fitting regression models within each level of a subgroup variable and calculating interaction p-values.
run_subgroup_analysis( data, exposure, outcome = NULL, subgroup_var, covariates = NULL, model_type = c("cox", "logistic", "linear", "glm", "negbin"), family = "poisson", endpoint = NULL, ref_level = NULL )
data |
A data.frame or data.table containing all variables. |
exposure |
Character string specifying the exposure variable name. |
outcome |
Character string specifying the outcome variable name. For Cox models, this can be NULL if endpoint is specified. |
subgroup_var |
Character string specifying the subgroup variable name. |
covariates |
Character vector of covariate names to adjust for. Default NULL. |
model_type |
Character string specifying model type: "cox", "logistic", "linear", "glm", or "negbin". |
family |
For |
endpoint |
Character vector of length 2 for Cox models: c("time", "status"). Required when model_type = "cox". |
ref_level |
Character string specifying the reference level for the subgroup variable. If NULL, the first level is used as reference. |
A data.frame with columns:
Name of the subgroup variable
Subgroup level
Sample size in subgroup
Number of events (for Cox/logistic models)
Effect estimate (HR for Cox, OR for logistic, Beta for linear)
Lower 95% CI
Upper 95% CI
P-value for the exposure effect
P-value for interaction between exposure and subgroup
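The two ingredients of a subgroup analysis, level-specific estimates and a single interaction p-value, can be sketched for a linear outcome in base R. The data are simulated, the column names in `by_level` are hypothetical stand-ins for the columns described above, and the interaction test shown (a nested-model F test) is one common choice rather than a statement of what the package uses.

```r
set.seed(7)
n <- 600
dat <- data.frame(
  exposure = rnorm(n),
  sex      = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  age      = rnorm(n, 57, 8)
)
# Simulated outcome with a stronger exposure effect in males
dat$y <- 0.2 * dat$exposure + 0.4 * dat$exposure * (dat$sex == "Male") +
  0.01 * dat$age + rnorm(n)

# Exposure effect within each subgroup level
by_level <- do.call(rbind, lapply(levels(dat$sex), function(lv) {
  fit <- lm(y ~ exposure + age, data = dat[dat$sex == lv, ])
  ci  <- confint(fit)["exposure", ]
  data.frame(
    subgroup_var = "sex", level = lv, n = nobs(fit),
    estimate = coef(fit)[["exposure"]],
    lower95  = ci[[1]], upper95 = ci[[2]],
    pvalue   = summary(fit)$coefficients["exposure", "Pr(>|t|)"]
  )
}))

# Interaction p-value: models with and without the exposure-by-subgroup term
fit_main <- lm(y ~ exposure + sex + age, data = dat)
fit_int  <- lm(y ~ exposure * sex + age, data = dat)
by_level$p_interaction <- anova(fit_main, fit_int)[2, "Pr(>F)"]
```

The interaction p-value is computed once on the full data and repeated on every row, so it should be read as a property of the subgroup variable rather than of any single level.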
Fit regression models using IPTW weights with robust standard errors.
run_weighted_analysis( data, exposure, outcome = NULL, covariates = NULL, weight_col = "weight", model_type = c("cox", "logistic", "linear"), endpoint = NULL, robust_se = TRUE )
data |
A data.frame or data.table containing all variables and weights. |
exposure |
Character string specifying the exposure variable name. |
outcome |
Character string specifying the outcome variable name (for logistic/linear). |
covariates |
Character vector of covariate names. Default NULL. |
weight_col |
Character string specifying the weight column name. Default "weight". |
model_type |
Character string specifying model type: "cox", "logistic", or "linear". |
endpoint |
Character vector of length 2 for Cox models: c("time", "status"). |
robust_se |
Logical; whether to use robust standard errors. Default TRUE. |
A data.frame with effect estimates and confidence intervals.
Run Multiple Fine-Gray Competing-Risk Models
runmulti_competing( data, main_var, covariates = NULL, time_col, event_col, compete_col = NULL, event_value = 1, compete_value = 2, conf_level = 0.95, ... )
data |
A data.frame or data.table. |
main_var |
Character vector of exposure variable names. |
covariates |
Optional character vector of covariates. |
time_col |
Follow-up time column. |
event_col |
Event-status column, or the primary-event column in dual-column mode. |
compete_col |
Optional competing-event column in dual-column mode. |
event_value |
Event code used in single-column mode. |
compete_value |
Competing-event code used in single-column mode. |
conf_level |
Confidence level, reserved for future use. |
... |
Additional arguments passed to the weighted Cox fit. |
A data.frame with subdistribution hazard ratios.
Fit Cox proportional hazards models for each main variable separately.
When covariates is NULL, univariate models are fitted.
Otherwise, multivariate models adjusting for the specified covariates are fitted.
runmulti_cox( data, main_var, covariates = NULL, endpoint = c("time", "status"), ... )
data |
A data.frame or data.table containing all variables. |
main_var |
A character vector of main variable names to test. |
covariates |
A character vector of covariate names to adjust for. Default NULL. |
endpoint |
A character vector of length 2: |
... |
Additional arguments passed to |
A data.frame with columns: variable, coef,
se, z, HR, lower95, upper95,
pvalue, n, and n_event.
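The "one model per main variable" pattern can be sketched directly with the survival package. The example below loops over two exposures in the `lung` dataset that ships with survival and collects the documented columns; it is a hedged illustration of the pattern, not the package's source, and the choice of variables is arbitrary.

```r
library(survival)

main_var   <- c("age", "ph.ecog")
covariates <- "sex"

res <- do.call(rbind, lapply(main_var, function(v) {
  # Build Surv(time, status) ~ <main variable> + covariates
  f   <- as.formula(paste("Surv(time, status) ~", paste(c(v, covariates), collapse = " + ")))
  fit <- coxph(f, data = lung)
  s   <- summary(fit)
  data.frame(
    variable = v,
    coef     = s$coefficients[v, "coef"],
    se       = s$coefficients[v, "se(coef)"],
    z        = s$coefficients[v, "z"],
    HR       = s$conf.int[v, "exp(coef)"],
    lower95  = s$conf.int[v, "lower .95"],
    upper95  = s$conf.int[v, "upper .95"],
    pvalue   = s$coefficients[v, "Pr(>|z|)"],
    n        = fit$n,        # rows used after dropping missing values
    n_event  = fit$nevent
  )
}))
```

Because each model drops its own incomplete rows, `n` and `n_event` can differ between exposures, which is worth checking before comparing hazard ratios across rows.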
Run Lagged Cox Sensitivity Analyses
runmulti_cox_lag( data, main_var, covariates = NULL, endpoint = c("time", "status"), lag_years = c(0, 1, 2, 5), verbose = TRUE, ... )
data |
A data.frame or data.table. |
main_var |
Character vector of exposure variable names. |
covariates |
Optional character vector of covariates. |
endpoint |
Character vector of length 2: |
lag_years |
Numeric vector of lag windows in years. |
verbose |
Logical; print progress messages. |
... |
Additional arguments passed to |
A data.frame containing lag-specific hazard-ratio estimates.
Run Multiple Cox Models with PH Diagnostics
runmulti_cox_zph( data, main_var, covariates = NULL, endpoint = c("time", "status"), transform = c("km", "rank", "identity"), alpha = 0.05, keep_models = FALSE, ... )
data |
A data.frame or data.table. |
main_var |
Character vector of exposure variable names. |
covariates |
Optional character vector of covariate names. |
endpoint |
Character vector of length 2: |
transform |
Character scalar passed to |
alpha |
Numeric threshold for flagging PH violations. |
keep_models |
Logical; if TRUE, attach fitted models as an attribute. |
... |
Additional arguments passed to |
A data.frame with effect estimates and PH-diagnostic columns.
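A proportional-hazards diagnostic of this kind is typically built on `survival::cox.zph()`, which tests for a trend in scaled Schoenfeld residuals over transformed time. The sketch below uses the `lung` dataset and flags terms whose test p-value falls below `alpha`; the column names in `ph_tab` are hypothetical and need not match the package's output.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Schoenfeld-residual test on the Kaplan-Meier time scale
zph <- cox.zph(fit, transform = "km")

alpha  <- 0.05
ph_tab <- data.frame(
  term         = rownames(zph$table),
  chisq        = zph$table[, "chisq"],
  p_zph        = zph$table[, "p"],
  ph_violation = zph$table[, "p"] < alpha
)
```

The `GLOBAL` row tests all terms jointly; a flagged term is a prompt to inspect `plot(zph)` rather than proof that the hazard ratio is unusable.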
Fit GAMs (mgcv::gam) for each main variable separately. By default
each main variable enters the model as a penalised thin-plate regression
spline s(var), allowing non-linear dose-response relationships to be
detected.
When smooth = TRUE (default) the returned table reports the smooth
term's estimated degrees of freedom (edf), F-statistic, and p-value -
useful for screening whether an association exists and whether it is
non-linear (edf > 1). When smooth = FALSE the main variable
enters as a parametric linear term and the output mirrors runmulti_glm
(beta, Wald CI, p-value).
runmulti_gam( data, main_var, outcome, covariates = NULL, smooth = TRUE, family = "gaussian", k = -1, ... )
data |
A data.frame or data.table containing all variables. |
main_var |
A character vector of main variable names to test. |
outcome |
A character string specifying the outcome column. |
covariates |
A character vector of covariate names added as parametric
linear terms. Default NULL. |
smooth |
Logical. If |
family |
A GLM family controlling the response distribution. Accepts
the same forms as |
k |
Integer. Basis dimension for each smooth term. |
... |
Additional arguments passed to |
When smooth = TRUE: a data.frame with columns
variable, edf, ref_df, F, pvalue,
family, link, n.
When smooth = FALSE: variable, beta, lower95,
upper95, pvalue, family, link, n.
Fit GLMs for each main variable separately using any stats family.
When covariates is NULL, univariate models are fitted.
Otherwise, multivariate models are fitted.
Quasi-families (quasipoisson, quasibinomial) use Wald
confidence intervals because profile-likelihood CIs are not available for
quasi-likelihood models. All other families use profile-likelihood CIs via
stats::confint.
runmulti_glm( data, main_var, family = "poisson", outcome, covariates = NULL, ... )
data |
A data.frame or data.table containing all variables. |
main_var |
A character vector of main variable names to test. |
family |
A GLM family. Accepted forms:
|
outcome |
A character string specifying the outcome column. |
covariates |
A character vector of covariate names. Default NULL. |
... |
Additional arguments passed to |
A data.frame with columns: variable, family,
link, beta, lower95, upper95, pvalue,
n. For log- or logit-link families exp(beta) gives the
ratio-scale effect (IRR, rate ratio, etc.).
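The distinction between profile-likelihood and Wald intervals, and the back-transformation to the ratio scale, can be shown with base R alone. The sketch below fits a Poisson and a quasi-Poisson model to simulated counts; variable names are hypothetical and the code illustrates the calculation rather than the package's internals.

```r
set.seed(3)
n   <- 300
dat <- data.frame(x = rnorm(n), age = rnorm(n, 57, 8))
dat$count <- rpois(n, exp(0.2 + 0.3 * dat$x))

fit_pois  <- glm(count ~ x + age, family = poisson, data = dat)
fit_quasi <- glm(count ~ x + age, family = quasipoisson, data = dat)

# Profile-likelihood CI for the likelihood-based family
ci_profile <- suppressMessages(confint(fit_pois))["x", ]

# Wald CI (estimate +/- z * SE) for the quasi-family
ci_wald <- confint.default(fit_quasi)["x", ]

# Incidence rate ratio and its interval on the ratio scale
irr <- exp(c(beta = coef(fit_pois)[["x"]], ci_profile))
```

The two families share the same point estimates; only the standard errors, and hence the intervals and p-values, change when the dispersion is estimated.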
Fit linear regression models (lm) for each main variable separately.
When covariates is NULL, univariate models are fitted.
Otherwise, multivariate models adjusting for the specified covariates are fitted.
runmulti_lm(data, main_var, covariates = NULL, outcome, ...)
data |
A data.frame or data.table containing all variables. |
main_var |
A character vector of main variable names to test. |
covariates |
A character vector of covariate names to adjust for. Default NULL. |
outcome |
A character string specifying the outcome (dependent) variable name. |
... |
Additional arguments passed to |
A data.frame with columns: variable, beta, lower95, upper95, pvalue.
Fit logistic regression models (glm with family = binomial) for each main variable separately.
When covariates is NULL, univariate models are fitted.
Otherwise, multivariate models adjusting for the specified covariates are fitted.
runmulti_logit(data, main_var, covariates = NULL, outcome, ...)
data |
A data.frame or data.table containing all variables. |
main_var |
A character vector of main variable names to test. |
covariates |
A character vector of covariate names to adjust for. Default NULL. |
outcome |
A character string specifying the binary outcome (dependent) variable name (0/1). |
... |
Additional arguments passed to |
A data.frame with columns: variable, OR, lower95, upper95, pvalue.
Fit negative-binomial GLMs (MASS::glm.nb) for each main variable
separately. This is the standard approach for overdispersed count outcomes
where the Poisson variance assumption is violated.
The overdispersion parameter theta is estimated per model and
reported alongside the effect estimate.
runmulti_negbin(data, main_var, outcome, covariates = NULL, ...)
data |
A data.frame or data.table containing all variables. |
main_var |
A character vector of main variable names to test. |
outcome |
A character string specifying the count outcome column. |
covariates |
A character vector of covariate names. Default NULL. |
... |
Additional arguments passed to |
A data.frame with columns: variable, IRR,
lower95, upper95, pvalue, theta, n.
IRR is the incidence rate ratio (exp(beta)).
theta is the estimated negative-binomial dispersion parameter
(larger values indicate less overdispersion).
Run Grouped-Exposure Trend Tests
runmulti_trend( data, main_var, outcome = NULL, covariates = NULL, model_type = c("cox", "logistic", "linear"), endpoint = NULL, ref_level = NULL, score_method = c("integer", "median", "custom"), custom_scores = NULL, include_level_estimates = TRUE, ... )
data |
A data.frame or data.table. |
main_var |
Character vector of grouped exposure variable names. |
outcome |
Outcome column for logistic or linear models. |
covariates |
Optional character vector of covariates. |
model_type |
One of "cox", "logistic", or "linear". |
endpoint |
Character vector of length 2 for Cox models. |
ref_level |
Optional reference level applied to every grouped exposure. |
score_method |
One of "integer", "median", or "custom". |
custom_scores |
Optional named list of custom score mappings. |
include_level_estimates |
Logical; if TRUE, include category-specific estimates. |
... |
Additional arguments passed to the fitted model. |
A data.frame containing grouped-effect estimates and a repeated
p_trend column for each exposure.
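A trend test replaces the grouped exposure with a numeric score and tests that score as a single linear term. The base R sketch below shows integer scoring and one common reading of median scoring (the within-category median of an underlying continuous variable) for a linear outcome. The data are simulated, all names are hypothetical, and how the package derives median scores is an assumption here.

```r
set.seed(11)
n   <- 800
bmi <- rnorm(n, 27, 4)
y   <- 0.1 * bmi + rnorm(n)
dat <- data.frame(bmi, y)

# Grouped exposure: quartiles of BMI
dat$bmi_q <- cut(dat$bmi, quantile(dat$bmi, 0:4 / 4), include.lowest = TRUE,
                 labels = paste0("Q", 1:4))

# Integer scores 1, 2, 3, 4
dat$score_int <- as.integer(dat$bmi_q)
# Median scores: within-category median of the underlying exposure (assumption)
dat$score_med <- ave(dat$bmi, dat$bmi_q, FUN = median)

p_trend_int <- summary(lm(y ~ score_int, data = dat))$coefficients["score_int", "Pr(>|t|)"]
p_trend_med <- summary(lm(y ~ score_med, data = dat))$coefficients["score_med", "Pr(>|t|)"]

# Category-specific estimates against the reference level Q1
level_est <- coef(lm(y ~ bmi_q, data = dat))[-1]
```

Integer scores assume equal spacing between categories, whereas median scores respect the actual spacing of the exposure, so the two p-values can diverge when categories are unevenly spread.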
A thin wrapper around TCMDATA::add_cluster_score().
score_protein_ppi_clusters(ppi, cluster_attr = "louvain_cluster", min_size = 3)
ppi |
An |
cluster_attr |
Character. Vertex attribute containing cluster labels.
Default is "louvain_cluster". |
min_size |
Integer. Minimum cluster size to retain. Default is 3. |
A data.frame containing cluster scores.
Keep only participants with incident events and classify them as occurring
within n_years or after n_years since enrollment. By default, the
function uses outcome_surv_time and outcome_status generated by
build_survival_dataset(). If the follow-up time column is not available,
the function can compute it from enrollment and event dates.
select_incident_by_years( df, n_years = 5, time_col = "outcome_surv_time", status_col = "outcome_status", baseline_col = "p53_i0", event_date_col = "earliest_date", group_col = "incident_timing", output = c("combined", "split"), copy = TRUE, verbose = TRUE )
df |
A data.frame or data.table. |
n_years |
Numeric scalar. Cutoff in years for classifying incident
events. Default is 5. |
time_col |
Column name for follow-up time in years. Default is
"outcome_surv_time". |
status_col |
Column name for event status where |
baseline_col |
Column name for enrollment date. Used only when
|
event_date_col |
Column name for event date. Used only when |
group_col |
Name of the output grouping column. Default is
"incident_timing". |
output |
Output format. |
copy |
Logical scalar. If |
verbose |
Logical scalar. If |
If output = "combined", a filtered object with the same class as
df, containing only participants with incident events. The output adds
group_col. If time_col is missing but can be derived from dates, the
function also adds time_col. If output = "split", a named list with
within_n_years and after_n_years, each preserving the same class as
df.
df <- data.frame(
  id = 1:5,
  outcome_surv_time = c(1.2, 4.9, 5.0, 8.1, 3.0),
  outcome_status = c(1, 1, 1, 1, 0)
)
result <- select_incident_by_years(df, n_years = 5)
table(result$incident_timing)
split_result <- select_incident_by_years(df, n_years = 5, output = "split")
names(split_result)
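The classification logic itself can be sketched in base R on the same toy data. This is not the package's implementation: the group labels are borrowed from the documented split-list names, and treating an event exactly at `n_years` as "within" is an assumption, since the boundary rule is not stated above.

```r
df <- data.frame(
  id = 1:5,
  outcome_surv_time = c(1.2, 4.9, 5.0, 8.1, 3.0),
  outcome_status    = c(1, 1, 1, 1, 0)
)
n_years <- 5

# Keep participants with an incident event only
inc <- df[df$outcome_status == 1, ]

# Assumption: an event exactly at n_years counts as "within"
inc$incident_timing <- ifelse(inc$outcome_surv_time <= n_years,
                              "within_n_years", "after_n_years")

# Split form: one data.frame per timing group
split_res <- split(inc, inc$incident_timing)
```

The censored participant (id 5) is dropped before classification, so the result describes timing among cases only and is not a survival dataset.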
Remove participants who experienced the event within the first n_years of
follow-up. The returned dataset keeps the same columns and class as the input
so it can be passed directly to the standard regression functions.
sensitivity_exclude_early_events( data, endpoint = c("outcome_surv_time", "outcome_status"), n_years, copy = TRUE, verbose = TRUE )
data |
A data.frame or data.table. |
endpoint |
Character vector of length 2 giving the time and status
columns, e.g. |
n_years |
Numeric scalar. Events with follow-up time less than or equal to this value will be excluded. |
copy |
Logical scalar. If |
verbose |
Logical scalar. If |
An object with the same class and columns as data, with filtered
rows removed. A sensitivity_info attribute is added for auditability.
dt_sens <- sensitivity_exclude_early_events(
  data = mtcars,
  endpoint = c("wt", "vs"),
  n_years = 3
)
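On survival-shaped data the exclusion rule is easier to see. The base R sketch below removes participants whose event occurred at or before `n_years` and keeps everyone else, including participants censored early. It is an illustration only, and the contents of the `sensitivity_info` attribute shown here are hypothetical.

```r
dat <- data.frame(
  outcome_surv_time = c(0.5, 2.0, 3.5, 6.0, 1.0),
  outcome_status    = c(1, 1, 1, 0, 0)
)
n_years <- 2

# Early events: status 1 with follow-up time <= n_years
early    <- dat$outcome_status == 1 & dat$outcome_surv_time <= n_years
dat_sens <- dat[!early, ]

# Hypothetical audit record attached to the filtered data
attr(dat_sens, "sensitivity_info") <- list(
  n_years    = n_years,
  n_before   = nrow(dat),
  n_excluded = sum(early)
)
```

Comparing estimates from `dat` and `dat_sens` is a standard check for reverse causation, where undiagnosed disease at baseline influences the measured exposure.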
Remove participants with missing values in any of the specified covariates. The returned dataset keeps the same columns and class as the input so it can be passed directly to the standard regression functions.
sensitivity_exclude_missing_covariates( data, covariates, copy = TRUE, stepwise = FALSE, verbose = TRUE )
data |
A data.frame or data.table. |
covariates |
Character vector of covariate names to check. |
copy |
Logical scalar. If |
stepwise |
Logical scalar. If |
verbose |
Logical scalar. If |
An object with the same class and columns as data, with filtered
rows removed. A sensitivity_info attribute is added for auditability.
dt_sens <- sensitivity_exclude_missing_covariates(
  data = mtcars,
  covariates = c("hp", "wt")
)
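Since `mtcars` has no missing values, a small table with gaps shows the filter more clearly. The base R sketch below performs a complete-case restriction over the listed covariates and also counts the rows remaining as each covariate is applied in turn. That count is one plausible reading of a stepwise report and is an assumption, not a description of the package's `stepwise` option.

```r
dat <- data.frame(
  id  = 1:6,
  age = c(50, NA, 61, 58, 47, 66),
  bmi = c(24.1, 27.3, NA, 30.2, NA, 22.8),
  sex = c("F", "M", "F", NA, "M", "F")
)
covariates <- c("age", "bmi", "sex")

# Complete-case filter over the listed covariates
keep   <- complete.cases(dat[, covariates, drop = FALSE])
dat_cc <- dat[keep, ]

# Rows remaining after each covariate is applied in turn
remaining <- sapply(seq_along(covariates), function(i) {
  sum(complete.cases(dat[, covariates[seq_len(i)], drop = FALSE]))
})
names(remaining) <- covariates
```

The per-covariate counts make it easy to see which variable drives most of the sample loss before deciding between complete-case analysis and multiple imputation.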
A thin wrapper around TCMDATA::ppi_subset() for STRING-derived PPI
networks.
subset_protein_ppi( ppi, n = NULL, score_cutoff = 0.7, edge_attr = "score", rm_isolates = TRUE )
ppi |
An |
n |
Integer. Number of top-degree nodes to keep. If |
score_cutoff |
Numeric. Minimum STRING confidence score to retain.
Default is 0.7. |
edge_attr |
Character. Edge attribute containing the confidence score.
Default is "score". |
rm_isolates |
Logical. Remove isolated nodes after filtering? Default is
TRUE. |
An igraph object.
Print a summary of mediation analysis results.
## S3 method for class 'mediation_result' summary(object, exponentiate = FALSE, ...)
object |
An object of class "mediation_result". |
exponentiate |
Logical; whether to exponentiate estimates (for HR/OR). Default FALSE. |
... |
Additional arguments (unused). |
Invisibly returns the object.
Returns a tidy data frame of pooled estimates, compatible with broom package style.
tidy.mi_pooled_result( x, conf.int = TRUE, conf.level = 0.95, exponentiate = FALSE, ... )
x |
An |
conf.int |
Logical; include confidence intervals? Default TRUE. |
conf.level |
Confidence level. Default 0.95. |
exponentiate |
Logical; exponentiate estimates? Default FALSE. |
... |
Additional arguments (ignored). |
A data frame with columns: term, estimate, std.error, statistic, p.value, and optionally conf.low, conf.high, fmi.
Inspect whether the current R session is running inside a UK Biobank
Research Analysis Platform (RAP)-like environment and return reproducible
diagnostics for RAP-aware workflows. The function only checks environment
variables, local paths, and the availability of the dx command-line tool
unless check_auth = TRUE; it does not read or export participant-level
data.
ukb_check_rap_env( output_dir = NULL, require_rap = FALSE, require_dx = FALSE, check_auth = FALSE, check_write = FALSE, verbose = TRUE )
output_dir |
Optional output directory to assess. |
require_rap |
Logical. If |
require_dx |
Logical. If |
check_auth |
Logical. If |
check_write |
Logical. If |
verbose |
Logical. If |
A list with class ukb_rap_env containing RAP environment metadata
and a check table.
env <- ukb_check_rap_env(verbose = FALSE)
Converts common UK Biobank non-response labels and numeric missing codes into
analysis-ready missing values. Empty strings are always converted to
NA. Informative non-response labels can either be converted to
NA or retained as "Unknown" for modelling.
ukb_clean_missing( data, cols = NULL, action = c("na", "unknown"), extra_labels = NULL, numeric_codes = c(-1, -3), trim = TRUE, in_place = FALSE, verbose = TRUE )
data |
A data.frame or data.table. |
cols |
Optional character vector of columns to clean. If NULL, all columns are considered. |
action |
How to handle informative character labels:
|
extra_labels |
Additional character labels to treat as informative missing. |
numeric_codes |
Numeric values to treat as missing. Defaults to common
UKB values |
trim |
Logical. Trim leading/trailing whitespace in character columns. |
in_place |
Logical. If TRUE and |
verbose |
Logical. Print a concise cleaning summary. |
A data.table.
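The cleaning rules described above can be illustrated in base R. This is a minimal sketch of the idea only: the helper clean_column(), the toy data, and the specific labels are assumptions made for illustration, and the sketch returns a plain data.frame rather than a data.table.

```r
# Toy data; the column names, labels, and codes are illustrative assumptions
df <- data.frame(
  smoking = c("Never", "Prefer not to answer", "", " Current "),
  income  = c(3, -1, -3, 2),
  stringsAsFactors = FALSE
)
labels <- c("Prefer not to answer", "Do not know")  # assumed non-response labels
codes  <- c(-1, -3)                                 # mirrors the numeric_codes default

# Hypothetical helper, not part of the package
clean_column <- function(x, action = c("na", "unknown")) {
  action <- match.arg(action)
  if (is.character(x)) {
    x <- trimws(x)                     # trim leading/trailing whitespace
    x[x %in% ""] <- NA_character_      # empty strings always become NA
    x[x %in% labels] <- if (action == "na") NA_character_ else "Unknown"
  } else if (is.numeric(x)) {
    x[x %in% codes] <- NA              # numeric missing codes become NA
  }
  x
}

cleaned_na <- as.data.frame(
  lapply(df, clean_column, action = "na"),
  stringsAsFactors = FALSE
)
cleaned_unknown <- as.data.frame(
  lapply(df, clean_column, action = "unknown"),
  stringsAsFactors = FALSE
)
```

Keeping "Unknown" as its own category can be useful when non-response is itself informative for modelling, whereas converting to NA suits complete-case or imputation workflows.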
Merge two Cox result tables by variable, summarize replication of training-set significant variables in validation, and compute log(HR) correlations.
ukb_compare_cox_results( train_results, validation_results, variable_col = "variable", hr_col = "HR", p_col = "pvalue", train_prefix = "train", validation_prefix = "validation", p_adjust_methods = c("BH", "bonferroni"), alpha = 0.05 )
train_results |
Cox result table for the training set. |
validation_results |
Cox result table for the validation set. |
variable_col |
Variable column name. |
hr_col |
Hazard-ratio column name. |
p_col |
Raw p-value column name. |
train_prefix |
Prefix for training-set columns in the comparison table. |
validation_prefix |
Prefix for validation-set columns. |
p_adjust_methods |
Multiple-testing correction methods to add when adjusted p-value columns are absent. Defaults to BH and Bonferroni. |
alpha |
Significance threshold. |
A list with train_results, validation_results, comparison,
replication_summary, and correlation_summary.
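The merge-and-compare logic can be sketched in base R as follows. The two toy result tables are invented, and the replication rule used here (adjusted significance in both sets plus the same direction of effect) is an assumption made for illustration; the package may define replication differently.

```r
# Invented Cox result tables with the default column names
train <- data.frame(
  variable = c("age", "bmi", "sbp", "ldl"),
  HR       = c(1.50, 1.20, 1.05, 0.80),
  pvalue   = c(0.001, 0.010, 0.400, 0.020)
)
validation <- data.frame(
  variable = c("age", "bmi", "sbp", "ldl"),
  HR       = c(1.40, 1.10, 0.98, 0.85),
  pvalue   = c(0.004, 0.200, 0.700, 0.030)
)

# Add Benjamini-Hochberg adjusted p-values to each table
train$p_BH      <- p.adjust(train$pvalue, method = "BH")
validation$p_BH <- p.adjust(validation$pvalue, method = "BH")

# Merge by variable; suffixes distinguish the two analyses
comparison <- merge(train, validation, by = "variable",
                    suffixes = c("_train", "_validation"))

alpha <- 0.05
sig_train      <- comparison$p_BH_train < alpha
same_direction <- sign(log(comparison$HR_train)) ==
  sign(log(comparison$HR_validation))
replicated     <- sig_train & comparison$p_BH_validation < alpha & same_direction

replication_summary <- data.frame(
  n_train_significant = sum(sig_train),
  n_replicated        = sum(replicated)
)

# Correlation of effect sizes on the log(HR) scale
log_hr_cor <- cor(log(comparison$HR_train), log(comparison$HR_validation))
```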
Merge one or more sensitivity-analysis Cox result tables with a main Cox result table, then summarize concordance by sensitivity analysis.
ukb_compare_sensitivity_cox( main_results, sensitivity_results, sensitivity_col = "sensitivity", variable_col = "variable", hr_col = "HR", p_col = "pvalue", main_prefix = "main", sensitivity_prefix = "sensitivity", p_adjust_methods = c("BH", "bonferroni"), alpha = 0.05 )
main_results |
Main Cox result table. |
sensitivity_results |
Sensitivity Cox result table containing one row per variable and sensitivity analysis. |
sensitivity_col |
Column identifying the sensitivity analysis. |
variable_col |
Variable column. |
hr_col |
Hazard-ratio column. |
p_col |
Raw p-value column. |
main_prefix |
Prefix for main-analysis columns. |
sensitivity_prefix |
Prefix for sensitivity-analysis columns. |
p_adjust_methods |
Multiple-testing correction methods to add if absent. |
alpha |
Significance threshold. |
A list with standardized result tables, comparison table, and correlation summary.
Diagnose Proportional Hazards Assumptions for a Cox Model
ukb_cox_diagnostics( model, transform = c("km", "rank", "identity"), terms = TRUE, global = TRUE, alpha = 0.05, return_object = TRUE )
model |
A fitted |
transform |
Character scalar passed to |
terms |
Logical; keep term-level rows. |
global |
Logical; keep the GLOBAL row. |
alpha |
Numeric threshold for flagging PH violations. |
return_object |
Logical; if TRUE, include the raw |
A list containing a tidy diagnostics table, the global p-value, and
optionally the raw cox.zph object.
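The kind of output described above can be approximated directly with survival::cox.zph(). The sketch below assumes the survival package is installed and uses its bundled lung data; the tidy table layout and the ph_violation flag are illustrative choices, not the package's exact return structure.

```r
library(survival)

# Fit a small Cox model on the lung data shipped with survival
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# Schoenfeld-residual test of proportional hazards
zph <- cox.zph(fit, transform = "km", terms = TRUE, global = TRUE)

# Tidy the test table: one row per term plus the GLOBAL row
diag_table <- data.frame(
  term    = rownames(zph$table),
  chisq   = zph$table[, "chisq"],
  p.value = zph$table[, "p"],
  row.names = NULL
)

# Flag possible PH violations at the chosen alpha
alpha <- 0.05
diag_table$ph_violation <- diag_table$p.value < alpha

global_p <- diag_table$p.value[diag_table$term == "GLOBAL"]
```

A small p-value for a term suggests its effect changes over follow-up time, in which case a time-varying coefficient or stratification may be worth considering.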
Build a compact manifest describing the UKB fields intended for RAP
extraction. This is designed as an auditable planning object that can be
stored with analysis scripts before running rap_plan_extract() or
rap_extract_pheno().
ukb_create_extraction_manifest( field_id = NULL, variable_set = NULL, variables = NULL, dataset = NULL, entity = "participant", output = NULL, include_eid = TRUE, purpose = NULL, notes = NULL )
field_id |
Optional numeric or character vector of UKB field IDs. |
variable_set |
Optional curated variable-set name from
|
variables |
Optional predefined variable names from
|
dataset |
Optional RAP dataset name. |
entity |
RAP entity name, usually |
output |
Optional planned extraction output path. |
include_eid |
Logical. Whether participant ID is expected in the extraction. |
purpose |
Optional short description of the analysis purpose. |
notes |
Optional free-text notes. |
A list with class ukb_extraction_manifest.
manifest <- ukb_create_extraction_manifest( field_id = c(31, 21022), variable_set = "clinical_core", purpose = "demo" )
Decode UK Biobank RAP exports
ukb_decode( data, metadata = NULL, decode_names = TRUE, decode_values = TRUE, keep_raw = TRUE, suffix = "_label", ... )
data |
A data.frame or data.table. |
metadata |
Optional object from |
decode_names |
Logical. If |
decode_values |
Logical. If |
keep_raw |
Logical. If |
suffix |
Suffix for decoded label columns when |
... |
Arguments passed to |
A data.frame or data.table matching the input class.
Decode UK Biobank column names
ukb_decode_column_names( data, metadata = NULL, style = c("snake", "title", "field_id_title"), keep_instance = TRUE, keep_array = TRUE, max_nchar = 80, ... )
data |
A data.frame or data.table. |
metadata |
Optional object from |
style |
Name style. |
keep_instance |
Logical. Keep UKB instance suffixes such as |
keep_array |
Logical. Keep UKB array suffixes such as |
max_nchar |
Optional maximum column-name length. |
... |
Arguments passed to |
A data.frame or data.table matching the input class.
Decode UK Biobank coded values
ukb_decode_values( data, metadata = NULL, keep_raw = TRUE, suffix = "_label", missing_to_na = TRUE, ... )
data |
A data.frame or data.table. |
metadata |
Optional object from |
keep_raw |
Logical. If |
suffix |
Suffix for label columns when |
missing_to_na |
Logical. If |
... |
Arguments passed to |
A data.frame or data.table matching the input class.
Generates a small fully synthetic toy dataset for documentation and smoke-test workflows. The data are created at runtime from parametric and categorical toy distributions and are not stored in the package as participant-level records.
ukb_demo(n = NULL, seed = 20260618L)
n |
Optional number of rows to return. If |
seed |
Integer random seed used to generate the toy data. The default
provides reproducible examples. Use |
A data.frame of synthetic cohort variables with missing values retained.
demo <- ukb_demo(100)
demo2 <- ukb_demo(100, seed = 1)
dim(demo)
names(demo)
A field-path dictionary used by ukb_query_dictionary to support
Chinese-language lookup of UK Biobank variables. The table stores a six-level
translated category hierarchy and variable label. It does not contain
participant-level records and does not include official RAP data values.
Official UKB field IDs and RAP column names should still be resolved against
a project-specific RAP data dictionary generated inside RAP.
ukb_dictionary_zh
A data frame with 34,953 rows and 6 translated hierarchy columns.
The original UTF-8 column names are preserved in the data object and can be
inspected with names(ukb_dictionary_zh) after loading the dataset.
Curated UKBAnalytica Chinese field-path dictionary for metadata lookup. This dataset contains metadata labels only.
Runs dx extract_dataset -ddd inside UK Biobank RAP and returns the
generated official data dictionary CSV path. This function checks that it is
being executed in a RAP-like environment before calling dx.
ukb_download_rap_dictionary( dataset = NULL, output_dir = ".", delimiter = ",", timeout = 600, require_rap = TRUE )
dataset |
RAP |
output_dir |
Directory where the dictionary files should be written. |
delimiter |
Output delimiter passed to |
timeout |
Timeout in seconds for the |
require_rap |
Logical. If |
Path to the generated *.data_dictionary.csv file.
Extract UK Biobank fields from a search result or field list
ukb_extract_fields( x = NULL, field_id = NULL, metadata = NULL, mode = c("plan", "sync", "job"), top_n = NULL, dataset = NULL, entity = "participant", ... )
x |
Optional object returned by |
field_id |
Optional UKB field IDs. Ignored when |
metadata |
Optional object from |
mode |
|
top_n |
Optional number of top search-result fields to extract. |
dataset |
Optional RAP |
entity |
RAP entity. Defaults to |
... |
Additional arguments passed to the selected RAP extraction function. |
A RAP extraction plan, a data.table, an output path, or a RAP job
submission result depending on mode.
Inspect one UK Biobank field
ukb_field_info( x, by = c("auto", "field_id", "title", "rap_column", "variable"), metadata = NULL, live = FALSE, ... )
x |
A UKB field ID, RAP column name, predefined UKBAnalytica variable name, or field title keyword. |
by |
Lookup mode. |
metadata |
Optional object from |
live |
Logical. If |
... |
Arguments passed to |
An object of class ukb_field_info.
Builds a lightweight metadata object from any combination of RAP-approved fields, a UK Biobank data dictionary, and coding/encoding tables. This is the recommended first step before searching fields, inspecting field definitions, extracting phenotype columns, or decoding RAP exports.
ukb_metadata_setup( source = c("auto", "files", "rap"), data_dict = NULL, codings = NULL, fields_df = NULL, dataset = NULL, entity = "participant", cache = FALSE, cache_dir = NULL, refresh = FALSE, quiet = FALSE )
source |
Metadata source strategy. |
data_dict |
Optional UKB data dictionary file, such as a RAP
|
codings |
Optional UKB coding/encoding table, such as an older
|
fields_df |
Optional cached output from |
dataset |
Optional RAP |
entity |
RAP dataset entity. Defaults to |
cache |
Logical. If |
cache_dir |
Optional cache directory. Defaults to
|
refresh |
Logical. Passed to |
quiet |
Logical. If |
An object of class ukb_metadata.
Converts user-provided train/test (and optional validation) data frames into a
standardized ukb_ml_split object. This object is consumed by the high-level
ML workflow and keeps the test set frozen until final evaluation.
ukb_ml_as_split( train_data, test_data, validation_data = NULL, id_col = NULL, check_overlap = TRUE, outcome = NULL, outcome_type = c("auto", "binary", "multiclass", "continuous") )
train_data |
Training/development data. |
test_data |
Frozen test data. |
validation_data |
Optional validation data. |
id_col |
Optional participant ID column used to check overlap. |
check_overlap |
Logical. Check duplicated and overlapping IDs. |
outcome |
Outcome column name. |
outcome_type |
One of |
A ukb_ml_split object.
Generate a calibration curve to assess how closely predicted probabilities match observed event rates.
ukb_ml_calibration( object, newdata = NULL, n_bins = 10, method = c("none", "loess", "isotonic"), plot = TRUE, ... )
object |
A ukb_ml object |
newdata |
Optional new data |
n_bins |
Number of bins for calibration (default 10) |
method |
Smoothing method: "none" (the default), "loess", or "isotonic" |
plot |
Whether to create calibration plot |
... |
Additional arguments |
ukb_ml_calibration object
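The binned comparison behind a calibration curve can be sketched in base R. The example simulates predictions that are calibrated by construction and uses equal-width bins; the binning rule and the column names are assumptions for illustration and may differ from what the package does.

```r
set.seed(1)
n     <- 2000
prob  <- runif(n)            # simulated predicted probabilities
truth <- rbinom(n, 1, prob)  # outcomes drawn from those probabilities

# Assign each prediction to one of n_bins equal-width bins on [0, 1]
n_bins <- 10
bin <- cut(prob, breaks = seq(0, 1, length.out = n_bins + 1),
           include.lowest = TRUE)

# Per-bin mean prediction versus observed event rate
calib <- data.frame(
  mean_predicted = as.numeric(tapply(prob, bin, mean)),
  observed_rate  = as.numeric(tapply(truth, bin, mean)),
  n              = as.integer(table(bin))
)
```

For a well-calibrated model the points (mean_predicted, observed_rate) lie close to the diagonal; systematic departures indicate over- or under-prediction in that probability range.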
Compare performance of multiple trained ML models.
ukb_ml_compare( ..., models = list(), metrics = NULL, test_data = NULL, plot = TRUE )
... |
ukb_ml objects to compare |
models |
Alternative: list of ukb_ml objects |
metrics |
Metrics to compare |
test_data |
Optional common test data |
plot |
Whether to create comparison plot |
ukb_ml_compare object with comparison results
Runs the same machine-learning workflow across multiple feature sets using a
shared ukb_ml_split. For binary outcomes, the function can tune models
by cross-validation, learn a threshold on training-development predictions,
refit the final model, evaluate the frozen test set, and return unified
metrics, prediction, threshold, and ROC tables.
ukb_ml_compare_feature_sets( split, feature_sets, outcome = NULL, model = "xgboost", outcome_type = c("auto", "binary"), model_labels = NULL, param_grid = NULL, tune_params = list(), threshold_method = c("none", "fixed", "youden"), threshold_params = list(), metrics = c("auc", "accuracy", "sensitivity", "specificity", "ppv", "npv", "f1", "brier"), positive_class = NULL, use_validation_in_refit = FALSE, seed = NULL, verbose = TRUE )
split |
A |
feature_sets |
Named list of character vectors. Each vector contains the feature names used by one model. |
outcome |
Optional outcome column. Defaults to |
model |
Model type passed to |
outcome_type |
Outcome type. Currently this helper is intended for binary classification. |
model_labels |
Optional labels for feature sets. Can be a named vector or
a vector in the same order as |
param_grid |
Optional parameter grid. Can be a single grid shared by all models or a named list keyed by feature-set name. |
tune_params |
Additional arguments passed to |
threshold_method |
|
threshold_params |
Additional arguments passed to
|
metrics |
Optional metric names passed to |
positive_class |
Optional positive class label for binary outcomes. |
use_validation_in_refit |
Logical passed to |
seed |
Optional random seed. |
verbose |
Logical. |
A ukb_ml_feature_set_compare object containing per-feature-set
models and unified result tables.
Batch-runs ukb_ml_flow across feature-set and model
combinations. The same frozen train/test split is reused for every
combination, making the output suitable for comparing different feature
groups, different machine-learning algorithms, or the full
feature-set-by-model grid.
ukb_ml_compare_flows( formula = NULL, data = NULL, split = NULL, train_data = NULL, test_data = NULL, validation_data = NULL, id_col = NULL, outcome = NULL, feature_sets = NULL, features = NULL, models = "xgboost", compare = c("auto", "feature_sets", "models", "both"), outcome_type = c("auto", "binary", "multiclass", "continuous"), feature_set_labels = NULL, model_labels = NULL, param_grid = NULL, tune_params = list(), threshold_params = list(), ... )
formula |
Optional base formula. The response is used as the outcome.
Predictors are used as the default feature set when |
data, split, train_data, test_data, validation_data, id_col
|
Passed to
|
outcome |
Optional outcome column. Required when |
feature_sets |
Optional named list of feature vectors. If |
features |
Optional feature names used when |
models |
Character vector of models supported by
|
compare |
Comparison mode: |
outcome_type |
Outcome type passed to |
feature_set_labels |
Optional labels for feature sets. |
model_labels |
Optional labels for models. |
param_grid |
Optional hyperparameter grid. Can be a single grid shared
by all combinations, a named list keyed by model, feature set, or
|
tune_params |
Optional list passed to |
threshold_params |
Optional list passed to |
... |
Additional arguments passed to |
A ukb_ml_flow_compare object containing flows,
metrics, comparison, predictions, roc, and
thresholds.
Generate a confusion matrix for a classification model.
ukb_ml_confusion(object, newdata = NULL, threshold = 0.5, plot = TRUE, ...)
object |
A ukb_ml object |
newdata |
Optional new data |
threshold |
Classification threshold (default 0.5) |
plot |
Whether to create confusion matrix plot |
... |
Additional arguments |
ukb_ml_confusion object
Perform k-fold cross-validation for ML models.
ukb_ml_cv( formula, data, model = "rf", task = "classification", folds = 5, repeats = 1, stratify = TRUE, metrics = NULL, params = list(), seed = NULL, verbose = TRUE, ... )
formula |
Model formula |
data |
Data frame |
model |
Model type |
task |
Task type |
folds |
Number of folds (default 5) |
repeats |
Number of repeats (default 1) |
stratify |
Use stratified folds for classification |
metrics |
Metrics to compute |
params |
Model parameters |
seed |
Random seed |
verbose |
Print progress |
... |
Additional arguments |
ukb_ml_cv object with cross-validation results
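To make the effect of stratified folds concrete, the base-R sketch below assigns fold labels within each outcome class so that every fold keeps roughly the same case proportion. The outcome vector is invented and the assignment scheme is one common approach, assumed here for illustration rather than taken from the package.

```r
set.seed(42)

# Imbalanced toy outcome: 30 cases and 170 controls
y     <- factor(rep(c("case", "control"), times = c(30, 170)))
folds <- 5

# Assign fold labels separately within each class, then shuffle them
fold_id <- integer(length(y))
for (lev in levels(y)) {
  idx <- which(y == lev)
  fold_id[idx] <- sample(rep_len(seq_len(folds), length(idx)))
}

# Cross-tabulate folds by class to check the balance
fold_table <- table(fold_id, y)
```

Here every fold receives 6 cases and 34 controls, so each held-out fold has the same 15% case rate as the full data.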
Compute Decision Curve Analysis (DCA) net benefit across a range of threshold probabilities for a binary classification model.
ukb_ml_dca( object, newdata = NULL, plot = TRUE, thresholds = seq(0.01, 0.99, by = 0.01), harm = 0, ... )
object |
A ukb_ml object |
newdata |
Optional new data |
plot |
Whether to create the DCA plot (default TRUE) |
thresholds |
Numeric vector of threshold probabilities (default seq(0.01, 0.99, by = 0.01)) |
harm |
Additional harm parameter subtracted from net benefit (default 0) |
... |
Additional arguments |
A ukb_ml_dca object with field data containing: threshold, net_benefit_model, net_benefit_all, net_benefit_none
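The net-benefit quantities in that table follow the standard decision-curve definition, NB = TP/n - (FP/n) * pt / (1 - pt), where pt is the threshold probability. A base-R sketch on invented data follows; the helper net_benefit() is hypothetical, and subtracting harm from the model curve only is an assumption about how that argument is applied.

```r
# Hypothetical helper: net benefit of treating when prob >= threshold
net_benefit <- function(truth, prob, threshold, harm = 0) {
  n        <- length(truth)
  pred_pos <- prob >= threshold
  tp <- sum(pred_pos & truth == 1)
  fp <- sum(pred_pos & truth == 0)
  tp / n - fp / n * threshold / (1 - threshold) - harm
}

# Invented outcomes and predicted probabilities
truth <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
prob  <- c(0.9, 0.8, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05)

thresholds <- c(0.1, 0.25, 0.5)
prevalence <- mean(truth)

dca <- data.frame(
  threshold         = thresholds,
  net_benefit_model = vapply(thresholds,
                             function(t) net_benefit(truth, prob, t),
                             numeric(1)),
  # Treat-all strategy: everyone is classified positive
  net_benefit_all   = prevalence -
    (1 - prevalence) * thresholds / (1 - thresholds),
  # Treat-none strategy: net benefit is zero by definition
  net_benefit_none  = 0
)
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all and treat-none reference strategies.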
Applies the final model, selected features, tuned hyperparameters, and fixed threshold to the frozen test set exactly once.
ukb_ml_evaluate_test( object, split, metrics = NULL, threshold = NULL, positive_class = NULL, verbose = TRUE )
object |
A |
split |
A |
metrics |
Optional metric names to return. |
threshold |
Optional threshold override for binary classification. |
positive_class |
Optional positive class label. |
verbose |
Logical. |
A ukb_ml_test_eval object.
Performs optional feature selection using only the training/development data
in a ukb_ml_split. The test set is never used for feature selection.
ukb_ml_feature_select( split, formula, method = c("none", "boruta", "filter", "glmnet"), outcome_type = c("auto", "binary", "multiclass", "continuous"), max_features = NULL, boruta_params = list(), keep_tentative = TRUE, seed = NULL, verbose = TRUE )
split |
A |
formula |
Model formula. |
method |
|
outcome_type |
Outcome type. Defaults to the split outcome type. |
max_features |
Optional maximum number of selected features. |
boruta_params |
Parameters passed to |
keep_tentative |
Logical. Keep Boruta tentative features. |
seed |
Optional random seed. |
verbose |
Logical. |
A ukb_ml_feature object.
Fits the final model with selected features and tuned parameters using train or train plus validation data. The frozen test set is not used.
ukb_ml_fit_final( split, formula, model, best_params = list(), outcome_type = c("auto", "binary", "multiclass", "continuous"), feature_spec = NULL, threshold = NULL, use_validation_in_refit = TRUE, seed = NULL, verbose = TRUE, ... )
split |
A |
formula |
Model formula. |
model |
Model type. |
best_params |
Best hyperparameters. |
outcome_type |
Outcome type. |
feature_spec |
Optional |
threshold |
Optional |
use_validation_in_refit |
Logical. If TRUE, refit on train + validation. |
seed |
Optional random seed. |
verbose |
Logical. |
... |
Additional arguments. |
A ukb_ml_final object.
High-level single-model interface for common UK Biobank machine-learning analyses. The function can create or consume a frozen train/test split, tune model hyperparameters, learn a binary threshold, fit the final model, evaluate the frozen test set, prepare ROC data, and optionally compute SHAP values.
ukb_ml_flow( formula = NULL, data = NULL, split = NULL, train_data = NULL, test_data = NULL, validation_data = NULL, id_col = NULL, outcome = NULL, features = NULL, model = "xgboost", model_id = "model", model_label = NULL, outcome_type = c("auto", "binary", "multiclass", "continuous"), split_params = list(), param_grid = NULL, tune = TRUE, tune_params = list(), best_params = NULL, threshold_method = c("none", "fixed", "youden"), threshold_params = list(), metrics = NULL, positive_class = NULL, use_validation_in_refit = FALSE, compute_shap = FALSE, shap_data = NULL, shap_params = list(), seed = NULL, verbose = TRUE )
formula |
Model formula. Required unless both |
data |
Optional full dataset. Used to create a split when |
split |
Optional |
train_data, test_data, validation_data
|
Optional pre-split datasets used
when |
id_col |
Optional participant ID column for overlap checks and output predictions. |
outcome |
Optional outcome column. Defaults to the response in
|
features |
Optional feature names. Used when |
model |
Model type passed to |
model_id, model_label
|
Optional model identifier and display label. |
outcome_type |
Outcome type. |
split_params |
List passed to |
param_grid |
Optional hyperparameter grid. |
tune |
Logical. Run hyperparameter tuning. |
tune_params |
Additional arguments passed to |
best_params |
Optional final model parameters when |
threshold_method |
|
threshold_params |
Additional arguments passed to
|
metrics |
Optional metric names passed to |
positive_class |
Optional positive class label for binary outcomes. |
use_validation_in_refit |
Logical passed to |
compute_shap |
Logical. Compute SHAP values for the final model. |
shap_data |
Optional data used for SHAP. Defaults to the frozen test set. |
shap_params |
Additional arguments passed to |
seed |
Optional random seed. |
verbose |
Logical. |
A ukb_ml_flow object with standardized components:
split, formula, features, tune,
threshold, final_model, test_eval,
metrics, predictions, roc, and optional shap.
Compute gain and lift curves for a binary classification model by ranking predictions into n_bins bins (deciles by default).
ukb_ml_gain_lift(object, newdata = NULL, plot = TRUE, n_bins = 10, ...)
object |
A ukb_ml object |
newdata |
Optional new data |
plot |
Whether to create gain and lift plots (default TRUE) |
n_bins |
Number of bins / deciles (default 10) |
... |
Additional arguments |
A ukb_ml_gain_lift object with field data containing: decile, population_pct, positive_capture_pct, gain, lift
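The gain and lift columns can be reproduced by hand on a small example. The base-R sketch below ranks invented predictions, splits them into ten equal-sized bins, and computes cumulative gain and lift; the data are made up and the exact binning and tie handling in the package may differ.

```r
# 20 invented observations, already scored; 5 are true positives
truth <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
           0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
prob  <- seq(0.95, 0, length.out = 20)

n_bins <- 10

# Rank by predicted probability (highest first) and cut into equal bins
ord <- order(prob, decreasing = TRUE)
bin <- ceiling(seq_along(ord) / (length(ord) / n_bins))

# Positives captured per bin, then accumulated from the top of the ranking
pos_per_bin    <- as.numeric(tapply(truth[ord], bin, sum))
gain           <- cumsum(pos_per_bin) / sum(truth)
population_pct <- seq_len(n_bins) / n_bins
lift           <- gain / population_pct

gain_lift <- data.frame(decile = seq_len(n_bins),
                        population_pct = population_pct,
                        gain = gain,
                        lift = lift)
```

In this toy example the top 10% of ranked observations contains 40% of all positives, giving a lift of 4 in the first bin; lift falls to 1 once the whole population is included.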
Extract variable importance from a trained ML model.
ukb_ml_importance(object, type = NULL, ...)
object |
A ukb_ml object |
type |
Importance type (model-specific) |
... |
Additional arguments |
Data frame with variable importance scores
Compute the Kolmogorov-Smirnov (KS) curve (TPR - FPR versus threshold) for a binary classification model.
ukb_ml_ks(object, newdata = NULL, plot = TRUE, n_thresholds = 200, ...)
object |
A ukb_ml object |
newdata |
Optional new data for evaluation |
plot |
Whether to create the KS plot (default TRUE) |
n_thresholds |
Number of threshold points (default 200) |
... |
Additional arguments |
A ukb_ml_ks object with fields: data (threshold/tpr/fpr/ks), ks_stat (max KS), ks_threshold (threshold at max KS)
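The KS statistic is the largest gap between the true-positive and false-positive rates across thresholds. A base-R sketch on invented data, using a 200-point threshold grid to mirror the n_thresholds default (the grid placement itself is an assumption):

```r
# Invented outcomes and predicted probabilities
truth <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
prob  <- c(0.9, 0.8, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05)

thresholds <- seq(0, 1, length.out = 200)

# Rates of predicting positive among true positives and true negatives
tpr <- vapply(thresholds, function(t) mean(prob[truth == 1] >= t), numeric(1))
fpr <- vapply(thresholds, function(t) mean(prob[truth == 0] >= t), numeric(1))

ks_curve <- data.frame(threshold = thresholds, tpr = tpr, fpr = fpr,
                       ks = tpr - fpr)

ks_stat      <- max(ks_curve$ks)                          # maximum separation
ks_threshold <- ks_curve$threshold[which.max(ks_curve$ks)]
```

Here the maximum separation is 5/7, reached at thresholds just above 0.2, where all three positives are still captured but only two of the seven negatives are.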
Compute performance metrics for a trained ML model.
ukb_ml_metrics( object, newdata = NULL, metrics = NULL, ci = FALSE, ci_method = c("bootstrap", "delong"), n_boot = 1000, verbose = TRUE, ... )
object |
A ukb_ml object |
newdata |
Optional new data for evaluation |
metrics |
Specific metrics to compute (NULL for defaults) |
ci |
Logical; compute confidence intervals (default FALSE) |
ci_method |
Method for CI: "bootstrap" or "delong" (for AUC) |
n_boot |
Number of bootstrap samples |
verbose |
Print results |
... |
Additional arguments |
Named vector or list with metrics and optional CIs
Unified interface for training machine learning models on UK Biobank data. Supports random forest, XGBoost, elastic net, SVM, neural networks, and logistic regression.
ukb_ml_model( formula, data, model = c("rf", "xgboost", "glmnet", "svm", "nnet", "logistic"), task = c("classification", "regression"), split_ratio = 0.8, stratify = TRUE, seed = NULL, sample_n = NULL, params = list(), cv = FALSE, cv_folds = 5, verbose = TRUE, ... )
formula |
Model formula (e.g., outcome ~ var1 + var2) |
data |
Data frame containing variables |
model |
Model type: "rf", "xgboost", "glmnet", "svm", "nnet", "logistic" |
task |
Task type: "classification" or "regression" |
split_ratio |
Train/test split ratio (default 0.8) |
stratify |
Logical; use stratified sampling for classification (default TRUE) |
seed |
Random seed for reproducibility |
sample_n |
Optional; subsample data for large datasets |
params |
List of model-specific parameters |
cv |
Logical; perform cross-validation (default FALSE) |
cv_folds |
Number of CV folds (default 5) |
verbose |
Logical; print progress messages |
... |
Additional arguments passed to model function |
An object of class "ukb_ml" containing:
model: The fitted model object
model_type: Type of model used
task: Task type (classification/regression)
predictors: Names of predictor variables
outcome: Name of outcome variable
train_data: Training data
test_data: Test data
metrics: Model performance metrics
Compute the precision-recall (PR) curve and the area under it (AUPRC) for a binary classification model.
ukb_ml_pr(object, newdata = NULL, plot = TRUE, n_thresholds = 200, ...)
object |
A ukb_ml object |
newdata |
Optional new data |
plot |
Whether to create the PR plot (default TRUE) |
n_thresholds |
Number of threshold points (default 200) |
... |
Additional arguments |
A ukb_ml_pr object with fields: data (threshold/precision/recall), auprc, prevalence
Generate predictions from a trained ukb_ml model.
ukb_ml_predict( object, newdata = NULL, type = c("response", "prob", "class", "link"), ... )
object |
A ukb_ml object from ukb_ml_model() |
newdata |
Optional new data for prediction. If NULL, uses test data. |
type |
Prediction type: "response", "prob", "class", "link" |
... |
Additional arguments |
Predictions as vector or matrix
Generate a ROC curve and calculate the AUC with an optional confidence interval.
ukb_ml_roc( object, newdata = NULL, plot = TRUE, ci = TRUE, ci_method = c("delong", "bootstrap"), ... )
object |
A ukb_ml object or list of objects |
newdata |
Optional new data |
plot |
Whether to create ROC plot (default TRUE) |
ci |
Compute confidence interval for AUC |
ci_method |
Method: "delong" (default) or "bootstrap" |
... |
Additional arguments |
A ukb_ml_roc object with ROC curve data.
Converts binary outcome predictions into a tidy ROC curve table with AUC and optional 95% confidence interval. This helper is useful for plotting one or more model ROC curves without re-running model evaluation.
ukb_ml_roc_data(truth, prob, model_id = NULL, model_label = NULL, positive_class = NULL, ci = TRUE, ci_method = c("delong", "bootstrap"), quiet = TRUE)
truth |
True binary outcome values. |
prob |
Predicted probability for the positive class. |
model_id |
Optional model identifier. |
model_label |
Optional model label used in plots. |
positive_class |
Optional positive class label. Defaults to the second
level after converting |
ci |
Logical. Calculate AUC 95% confidence interval. |
ci_method |
Method passed to |
quiet |
Logical passed to |
A data.frame with specificity, sensitivity, false-positive rate, threshold, AUC, and optional confidence interval columns.
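To make the shape of such a table concrete, here is a minimal base-R sketch that builds sensitivity, specificity, and false-positive rate per threshold and computes a rank-based AUC. It assumes simulated data and hypothetical column names, omits the confidence interval, and does not reproduce the function's own implementation.

```r
# Sketch only: tidy ROC table and rank-based AUC in base R.
set.seed(7)
truth <- rbinom(150, 1, 0.4)
prob  <- plogis(-0.5 + 1.2 * truth + rnorm(150))

# Rank-based AUC (equivalent to the Mann-Whitney U statistic).
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(rank(prob)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# One row per candidate threshold, from strictest to most lenient.
cuts <- sort(unique(prob), decreasing = TRUE)
roc_tab <- data.frame(
  threshold   = cuts,
  sensitivity = vapply(cuts, function(t) mean(prob[truth == 1] >= t), numeric(1)),
  specificity = vapply(cuts, function(t) mean(prob[truth == 0] <  t), numeric(1))
)
roc_tab$fpr <- 1 - roc_tab$specificity
roc_tab$auc <- auc
```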
Creates a standardized ukb_ml_split object for the high-level ML
workflow. Supports train/test and train/validation/test splits. The older
split_ratio/stratify_by = <column> calling style is still
accepted for compatibility.
ukb_ml_split_data(df, outcome = NULL, outcome_type = c("auto", "binary", "multiclass", "continuous"), split = c("train_test", "train_valid_test"), train_ratio = 0.7, validation_ratio = 0.1, test_ratio = 0.2, split_ratio = NULL, stratify_by = c("auto", "outcome", "custom", "none"), stratify_col = NULL, regression_bins = 5, seed = NULL, verbose = TRUE)
df |
A data.frame or data.table. |
outcome |
Outcome column name. If NULL, a legacy random split is
returned with |
outcome_type |
One of |
split |
Either |
train_ratio |
Training proportion. |
validation_ratio |
Validation proportion for train/validation/test. |
test_ratio |
Test proportion. |
split_ratio |
Deprecated compatibility alias for |
stratify_by |
|
stratify_col |
Column used when |
regression_bins |
Number of quantile bins for continuous outcome stratification. |
seed |
Optional random seed. |
verbose |
Logical. Print split summary. |
A ukb_ml_split object.
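The idea behind outcome-stratified splitting can be sketched in base R as follows. This is an illustration under simple assumptions (a binary outcome, a two-way split, hypothetical object names) and not the function's actual implementation.

```r
# Sketch only: outcome-stratified train/test split in base R.
set.seed(123)
df <- data.frame(id = 1:1000, y = rbinom(1000, 1, 0.1))

train_ratio <- 0.7
# Sample within each outcome level so both parts keep a similar event rate.
train_idx <- unlist(lapply(split(seq_len(nrow(df)), df$y), function(idx) {
  idx[sample.int(length(idx), size = round(train_ratio * length(idx)))]
}))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

c(train = mean(train$y), test = mean(test$y))  # event rates should be close
```

Setting the seed before sampling is what makes the split reproducible, which is the role the `seed` argument plays in the signature above.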
Returns the machine-learning algorithms supported by the UKBAnalytica ML workflow, including eligible outcome types, required R package, and default tuning parameters.
ukb_ml_supported_models(outcome_type = c("all", "binary", "multiclass", "continuous"))
outcome_type |
Optional outcome type filter: |
A data.frame describing supported models.
ukb_ml_supported_models("binary")
Deprecated legacy interface for training machine learning models for
survival analysis. New analyses should use
ukb_ml_survival_workflow, which freezes the test set before
feature selection, tuning, final refit, and evaluation.
ukb_ml_survival(formula, data, model = c("rsf", "gbm_surv", "coxnet"), split_ratio = 0.8, seed = NULL, params = list(), verbose = TRUE, ...)
formula |
Survival formula (e.g., Surv(time, event) ~ x1 + x2) |
data |
Data frame |
model |
Model type: "rsf" (random survival forest), "gbm_surv" (gradient boosting), "coxnet" (regularized Cox) |
split_ratio |
Train/test split ratio (default 0.8) |
seed |
Random seed |
params |
List of model-specific parameters |
verbose |
Print progress |
... |
Additional arguments |
A ukb_ml_surv object containing:
model: Fitted survival model
c_index: Harrell's C-index on test data
train_data, test_data: Split datasets
Register user-provided survival train/test datasets as a frozen split object.
This is the survival analogue of ukb_ml_as_split.
ukb_ml_survival_as_split(train_data, test_data, validation_data = NULL, time, event, id_col = NULL, check_overlap = TRUE)
train_data |
Training/development data. |
test_data |
Frozen test data. |
validation_data |
Optional validation data. |
time |
Survival time column. |
event |
Event indicator column coded 0/1. |
id_col |
Optional participant ID column used to check overlap. |
check_overlap |
Logical. Check duplicated and overlapping IDs. |
A ukb_ml_survival_split object.
Computes final survival ML metrics on the frozen test set. The primary metric is Harrell's C-index. Naive time-specific Brier scores are also reported for requested prediction times without IPCW adjustment.
ukb_ml_survival_evaluate_test(object, split, times = c(1, 3, 5, 10), verbose = TRUE, ...)
object |
A |
split |
A |
times |
Time points for survival probability prediction. |
verbose |
Logical. |
... |
Additional arguments. |
A ukb_ml_survival_test_eval object.
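For readers unfamiliar with the primary metric, the following base-R sketch computes Harrell's C-index by direct pairwise comparison on toy data. The helper `harrell_c()` is hypothetical and written for illustration only; it is quadratic in sample size and is not how the package computes the metric.

```r
# Sketch only: pairwise Harrell's C-index in base R (O(n^2), illustration only).
harrell_c <- function(time, event, risk) {
  concordant <- 0
  tied <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    if (event[i] != 1) next
    for (j in seq_len(n)) {
      # A pair is comparable when subject i has an event before subject j's time.
      if (time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1
        } else if (risk[i] == risk[j]) {
          tied <- tied + 1
        }
      }
    }
  }
  (concordant + 0.5 * tied) / comparable
}

time  <- c(2, 4, 6, 8, 10)
event <- c(1, 1, 0, 1, 0)
risk  <- c(0.9, 0.4, 0.5, 0.6, 0.1)   # higher value = higher predicted risk
c_index <- harrell_c(time, event, risk)  # 6 of 8 comparable pairs are concordant
```

A value of 0.5 corresponds to random ranking and 1 to perfect ranking of event times by predicted risk.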
Performs optional feature selection using only training data. The test set is never used.
ukb_ml_survival_feature_select(split, formula, method = c("none", "filter", "glmnet"), max_features = NULL, seed = NULL, verbose = TRUE)
split |
A |
formula |
Survival formula. |
method |
|
max_features |
Optional maximum number of selected features. |
seed |
Optional random seed. |
verbose |
Logical. |
A ukb_ml_survival_feature object.
Refits a survival ML model on training plus validation data when available, leaving the frozen test set untouched.
ukb_ml_survival_fit_final(split, formula, model, best_params = list(), feature_spec = NULL, seed = NULL, verbose = TRUE, ...)
split |
A |
formula |
Survival formula. |
model |
Survival model type. |
best_params |
Model parameters. |
feature_spec |
Optional feature-selection result. |
seed |
Optional random seed. |
verbose |
Logical. |
... |
Additional arguments passed to the fitter. |
A ukb_ml_survival_final object.
Get Variable Importance for Survival Model
ukb_ml_survival_importance(object, ...)
object |
A ukb_ml_surv object |
... |
Additional arguments |
Data frame with variable importance
Generate predictions from a survival ML model or survival ML workflow.
ukb_ml_survival_predict(object, newdata = NULL, times = c(1, 3, 5, 10), type = c("survival", "risk", "chf"), ...)
object |
A |
newdata |
Optional new data |
times |
Time points for survival prediction |
type |
Prediction type: "risk", "survival", "chf" (cumulative hazard) |
... |
Additional arguments |
Matrix of predictions (observations x time points)
Compute SHAP values for survival ML models at a specific time point.
ukb_ml_survival_shap(object, data = NULL, time_point = 5, nsim = 50, sample_n = NULL, seed = NULL, verbose = TRUE, ...)
object |
A ukb_ml_surv object |
data |
Data for SHAP computation |
time_point |
Time point for SHAP calculation |
nsim |
Number of Monte Carlo samples |
sample_n |
Subsample size |
seed |
Random seed |
verbose |
Print progress |
... |
Additional arguments |
A ukb_shap object
Creates a frozen train/test or train/validation/test split for time-to-event machine learning. Event status is used for stratification by default.
ukb_ml_survival_split_data(df, time, event, split = c("train_test", "train_valid_test"), train_ratio = 0.7, validation_ratio = 0.1, test_ratio = 0.2, stratify_by = c("event", "custom", "none"), stratify_col = NULL, seed = NULL, verbose = TRUE)
df |
A data.frame or data.table. |
time |
Survival time column. |
event |
Event indicator column coded 0/1. |
split |
Either |
train_ratio |
Training proportion. |
validation_ratio |
Validation proportion for train/validation/test. |
test_ratio |
Test proportion. |
stratify_by |
|
stratify_col |
Column used when |
seed |
Optional random seed. |
verbose |
Logical. Print split summary. |
A ukb_ml_survival_split object.
Tunes survival ML models using validation data or cross-validation inside the training set. The frozen test set is never used.
ukb_ml_survival_tune(split, formula, model, search = c("grid", "random"), param_grid = NULL, param_space = NULL, n_iter = NULL, resampling = c("cv", "validation"), folds = 5, metric = "c_index", maximize = TRUE, seed = NULL, verbose = TRUE, ...)
split |
A |
formula |
Survival formula. |
model |
|
search |
|
param_grid |
List or data.frame of candidate parameters. |
param_space |
Parameter space for random search. |
n_iter |
Number of random-search iterations. |
resampling |
|
folds |
Number of CV folds. |
metric |
Currently |
maximize |
Logical. Whether higher metric values are better. |
seed |
Optional random seed. |
verbose |
Logical. |
... |
Reserved for future extensions. |
A ukb_ml_survival_tune object.
High-level survival ML workflow for time-to-event prediction. The test set is frozen before feature selection, hyperparameter tuning, final refit, and final evaluation.
ukb_ml_survival_workflow(formula, data = NULL, split = NULL, model = c("cox", "rsf", "gbm_surv", "coxnet"), split_params = list(), feature_select = c("none", "filter", "glmnet"), feature_params = list(), tune = TRUE, tune_params = list(), evaluation_times = c(1, 3, 5, 10), fit_final = TRUE, evaluate_test = TRUE, seed = NULL, verbose = TRUE, ...)
formula |
Survival formula, for example |
data |
Optional full dataset. Required when |
split |
Optional |
model |
|
split_params |
List passed to |
feature_select |
|
feature_params |
List passed to
|
tune |
Logical. Run hyperparameter tuning. |
tune_params |
List passed to |
evaluation_times |
Time points for survival probability prediction. |
fit_final |
Logical. Refit final model. |
evaluate_test |
Logical. Evaluate once on frozen test set. |
seed |
Optional random seed. |
verbose |
Logical. |
... |
Additional arguments. |
A ukb_ml_survival_workflow object.
Selects a binary classification threshold using a fixed value or Youden index on training-development predictions. The test set should never be supplied to this function.
ukb_ml_threshold(truth, prob, method = c("fixed", "youden"), fixed_threshold = 0.5, positive_class = NULL)
truth |
True binary outcome values. |
prob |
Predicted probability for the positive class. |
method |
|
fixed_threshold |
Threshold used when |
positive_class |
Optional positive class label. |
A ukb_ml_threshold object.
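The Youden index approach can be sketched in base R as below: for each candidate cut-off, compute sensitivity + specificity - 1 and keep the cut-off that maximizes it. The simulated predictions stand in for training-development predictions; the object names are hypothetical and the tie-breaking rule (first maximum) is an assumption.

```r
# Sketch only: Youden-index threshold selection in base R.
set.seed(11)
truth <- rbinom(300, 1, 0.25)
prob  <- plogis(-1.5 + 2 * truth + rnorm(300))

cuts <- sort(unique(prob))
youden <- vapply(cuts, function(t) {
  sens <- mean(prob[truth == 1] >= t)
  spec <- mean(prob[truth == 0] <  t)
  sens + spec - 1
}, numeric(1))

# which.max() returns the first maximum if several cut-offs tie.
best_threshold <- cuts[which.max(youden)]
predicted_class <- as.integer(prob >= best_threshold)
```

Because the threshold is itself learned from data, applying it to the same predictions gives optimistic performance; this is why the test set is kept out of this step.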
Searches model hyperparameters using only the training/development portion of
a ukb_ml_split. The frozen test set is never used.
ukb_ml_tune(split, formula, model, outcome_type = c("auto", "binary", "multiclass", "continuous"), search = c("grid", "random", "bayes"), param_grid = NULL, param_space = NULL, n_iter = NULL, resampling = c("cv", "validation"), folds = 5, metric = NULL, maximize = NULL, seed = NULL, verbose = TRUE, ...)
split |
A |
formula |
Model formula. |
model |
Model type. |
outcome_type |
Outcome type. |
search |
|
param_grid |
List or data.frame of candidate parameters. |
param_space |
Parameter space for random or Bayesian search. |
n_iter |
Number of random/Bayesian iterations. |
resampling |
|
folds |
Number of CV folds. |
metric |
Metric to optimize. |
maximize |
Logical. Whether higher metric values are better. |
seed |
Optional random seed. |
verbose |
Logical. |
... |
Reserved for future extensions. |
A ukb_ml_tune object.
High-level, publication-oriented ML workflow for binary, multiclass, and continuous outcomes. The test set is frozen before feature selection, hyperparameter tuning, threshold learning, and final refit.
ukb_ml_workflow(formula, data = NULL, split = NULL, model, outcome_type = c("auto", "binary", "multiclass", "continuous"), split_params = list(), feature_select = c("none", "boruta", "filter", "glmnet"), feature_params = list(), tune = TRUE, tune_params = list(), threshold_method = c("none", "fixed", "youden"), threshold_params = list(), fit_final = TRUE, evaluate_test = TRUE, seed = NULL, verbose = TRUE, ...)
formula |
Model formula. |
data |
Optional full dataset. Required when |
split |
Optional |
model |
Model type: |
outcome_type |
|
split_params |
List passed to |
feature_select |
|
feature_params |
List passed to |
tune |
Logical. Run hyperparameter tuning. |
tune_params |
List passed to |
threshold_method |
|
threshold_params |
List passed to |
fit_final |
Logical. Refit final model. |
evaluate_test |
Logical. Evaluate once on frozen test set. |
seed |
Optional random seed. |
verbose |
Logical. |
... |
Additional arguments. |
A ukb_ml_workflow object.
Apply sequential inclusion or exclusion rules and record the number of participants retained and removed at each step. Rules can be supplied as one-sided formulas, functions, logical vectors, or character vectors of variables requiring complete-case data.
ukb_participant_flow(data, steps, id_col = NULL, outcome_col = NULL, event_value = 1, start_label = "Initial population")
data |
A data.frame or data.table. |
steps |
A named list of rules. Each rule can be:
a one-sided formula such as |
id_col |
Optional participant identifier column. If supplied, duplicate non-missing IDs are reported as an error. |
outcome_col |
Optional 0/1 outcome column used to count events after each step. |
event_value |
Value in |
start_label |
Label for the first row. |
A data.frame with class ukb_participant_flow. The kept row index is
stored in attr(result, "kept_index").
dat <- data.frame(eid = 1:5, age = c(50, 60, NA, 55, 70), status = c(0, 1, 0, 1, 0))
flow <- ukb_participant_flow(dat, steps = list("Complete age" = "age"), id_col = "eid", outcome_col = "status")
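The bookkeeping behind a participant flow table can be sketched in base R. The version below applies function-style rules in sequence and records how many rows are kept, removed, and how many events remain after each step. The column names of `flow_tab` are hypothetical and the output layout of the package function may differ.

```r
# Sketch only: sequential exclusion with per-step counts in base R.
dat <- data.frame(
  eid    = 1:6,
  age    = c(50, 60, NA, 55, 70, 38),
  status = c(0, 1, 0, 1, 0, 1)
)
# Each rule returns a logical "keep" vector for the rows still in the cohort.
steps <- list(
  "Complete age" = function(d) !is.na(d$age),
  "Age >= 40"    = function(d) d$age >= 40
)

flow_tab <- data.frame(step = "Initial population", n_kept = nrow(dat),
                       n_removed = 0L, n_events = sum(dat$status == 1))
current <- dat
for (label in names(steps)) {
  keep <- steps[[label]](current)
  keep[is.na(keep)] <- FALSE          # treat undecidable rows as excluded
  flow_tab <- rbind(flow_tab, data.frame(
    step = label, n_kept = sum(keep), n_removed = sum(!keep),
    n_events = sum(current$status[keep] == 1)
  ))
  current <- current[keep, , drop = FALSE]
}
flow_tab
```

Because each rule is evaluated only on rows that survived the previous step, the order of the rules changes the per-step counts but not the final cohort.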
Annotate Olink-style protein variables
ukb_protein_annotation(variables, protein_prefix = "^olink_instance_0[.]", drop_unmapped = FALSE)
variables |
Protein variable names. |
protein_prefix |
Regular expression prefix removed from variables. |
drop_unmapped |
Passed to |
A data.frame with variable, protein_clean, gene_symbol, and mapping_source.
Searches UK Biobank variable metadata using a RAP-generated official data dictionary and the UKBAnalytica Chinese dictionary. Chinese queries are first matched against the built-in Chinese dictionary and translated into English candidate terms before matching the official dictionary. English queries, field IDs, and RAP/UKB column names are searched directly in the official dictionary.
This function is intended for RAP use. By default it requires a RAP-like
environment; set require_rap = FALSE only for package development or
tests using simulated dictionaries.
ukb_query_dictionary(query, official_dict = NULL, zh_dict = NULL, dataset = NULL, output_dir = tempdir(), language = c("auto", "zh", "en", "field_id", "column"), translation_map = NULL, max_results = 20, min_score = 0.35, require_rap = TRUE, timeout = 600)
query |
Character vector of query terms, Chinese variable names, English names, UKB field IDs, or UKB/RAP column names. |
official_dict |
Optional official RAP data dictionary CSV. If
|
zh_dict |
Optional Chinese dictionary CSV. Defaults to the UKBAnalytica built-in Chinese dictionary. |
dataset |
RAP |
output_dir |
Directory used when downloading the official dictionary. |
language |
Query language. |
translation_map |
Optional data.frame with columns |
max_results |
Maximum official dictionary matches returned per query. |
min_score |
Minimum internal matching score for official dictionary matches. |
require_rap |
Logical. Require a RAP-like environment before querying. |
timeout |
Timeout in seconds when downloading the official dictionary. |
A list of class ukb_dictionary_query with official matches,
Chinese matches, query metadata, and source paths.
Apply previously estimated centering and scaling parameters to a data set.
The parameter table can use either the native output from
ukb_standardize_by_train() (variable, center, scale) or the legacy
long format used by early case-study scripts (protein, statistic,
value).
ukb_scale_with_parameters(data, parameters, variables = NULL)
data |
Data frame to transform. |
parameters |
Scaling parameter table. |
variables |
Optional variables to transform. Defaults to all variables available in the parameter table. |
A data.table with standardized variables.
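As an illustration of what applying stored parameters means, the base-R sketch below estimates center and scale on a training set and applies them unchanged to a second data set. It assumes the (variable, center, scale) layout mentioned above; the data and object names are hypothetical, and the result is a plain data.frame rather than the data.table the function returns.

```r
# Sketch only: apply training-set centering/scaling parameters to new data.
train <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 3, 5, 7, 11))
valid <- data.frame(x = c(3, 6), y = c(5, 13))

# Parameter table in the (variable, center, scale) layout.
parameters <- data.frame(
  variable = c("x", "y"),
  center   = vapply(train[c("x", "y")], mean, numeric(1)),
  scale    = vapply(train[c("x", "y")], sd, numeric(1)),
  stringsAsFactors = FALSE
)

valid_scaled <- valid
for (i in seq_len(nrow(parameters))) {
  v <- parameters$variable[i]
  # Use the training mean and SD, never statistics recomputed on `valid`.
  valid_scaled[[v]] <- (valid[[v]] - parameters$center[i]) / parameters$scale[i]
}
valid_scaled
```

Reusing the training parameters keeps effect estimates in both data sets on the same "per one training-set SD" scale.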
Search UK Biobank fields
ukb_search_fields(query = NULL, field_id = NULL, metadata = NULL, max_results = 50, search_in = c("title", "description", "category", "field_name", "rap_field_names", "coding_id"), ...)
query |
Optional keyword matched against field title, description, category, coding ID, and RAP column names. |
field_id |
Optional UKB field IDs for exact lookup. |
metadata |
Optional object from |
max_results |
Maximum number of rows to return. |
search_in |
Columns to search. |
... |
Arguments passed to |
A data.frame of class ukb_search_result.
Fit a primary Cox model and common sensitivity models using the same endpoint, exposure, and covariate structure. The suite currently supports complete-case filtering, exclusion of early events, and additional covariate adjustment sets.
ukb_sensitivity_suite(data, exposure, covariates = NULL, endpoint = c("outcome_surv_time", "outcome_status"), early_event_years = c(2, 4, 6), complete_case_covariates = NULL, additional_covariate_sets = NULL, conf_level = 0.95, verbose = TRUE)
data |
A data.frame or data.table. |
exposure |
Character vector of exposure variables. |
covariates |
Optional character vector of primary adjustment covariates. |
endpoint |
Character vector of length 2 giving survival time and status. |
early_event_years |
Optional numeric vector of lag periods used to exclude events occurring at or before each cut point. |
complete_case_covariates |
Optional covariates for a complete-case sensitivity dataset. |
additional_covariate_sets |
Optional named list of extra covariate vectors. Each set is added to the primary covariates and refitted. |
conf_level |
Confidence level for hazard-ratio intervals. |
verbose |
Logical. If |
A list with class ukb_sensitivity_suite containing model objects,
flow metadata, and a tidy summary table.
set.seed(1)
dat <- data.frame(time = rexp(100, 0.1), status = rbinom(100, 1, 0.3), exposure = rnorm(100), age = rnorm(100, 60, 5), sex = rbinom(100, 1, 0.5))
res <- ukb_sensitivity_suite(dat, exposure = "exposure", covariates = c("age", "sex"), endpoint = c("time", "status"), early_event_years = 1, verbose = FALSE)
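The early-event exclusion used in lagged sensitivity analyses amounts to a row filter applied before refitting the model. A minimal base-R sketch follows; it assumes that only participants with an event at or before the cut point are dropped and that censored participants are retained, which may differ from the package's exact rule.

```r
# Sketch only: build a lagged analysis set by excluding early events.
set.seed(1)
dat <- data.frame(time = rexp(100, 0.1), status = rbinom(100, 1, 0.3))

lag_years <- 2
# Assumption: drop events occurring at or before the cut point; keep censored rows.
early_event <- dat$status == 1 & dat$time <= lag_years
lagged <- dat[!early_event, ]

c(original = nrow(dat), excluded = sum(early_event), remaining = nrow(lagged))
```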
Calculate SHAP values for model interpretation. SHAP values explain each feature's contribution to individual predictions.
ukb_shap(object, data = NULL, nsim = 100, sample_n = NULL, seed = NULL, verbose = TRUE, class_level = NULL, method = c("auto", "permutation", "xgboost"), ...)
object |
A |
data |
Data for SHAP computation. If |
nsim |
Number of Monte Carlo samples for SHAP estimation (default 100).
Ignored when |
sample_n |
Optional; subsample observations for large datasets |
seed |
Random seed |
verbose |
Print progress |
class_level |
Optional class to explain for multiclass
|
method |
SHAP backend. |
... |
Additional arguments |
A ukb_shap object containing:
shap_values: Matrix of SHAP values (n x p)
baseline: Model baseline (expected) value
feature_names: Names of features
feature_values: Original feature values
Get SHAP dependence data for a specific feature.
ukb_shap_dependence(object, feature, color_feature = NULL, ...)
object |
A ukb_shap object |
feature |
Feature name to analyze |
color_feature |
Optional feature for coloring (interaction analysis) |
... |
Additional arguments |
Data frame with feature values and SHAP values
Get SHAP contribution data for a single observation (force plot).
ukb_shap_force(object, row_id = 1, max_features = 10, ...)
object |
A ukb_shap object |
row_id |
Row index to explain |
max_features |
Maximum features to show |
... |
Additional arguments |
Data frame with feature contributions for the observation
Calculate summary statistics from SHAP values.
ukb_shap_summary(object, n = 20, ...)
object |
A ukb_shap object |
n |
Number of top features to show (default 20) |
... |
Additional arguments |
Data frame with feature importance based on SHAP
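A common SHAP-based importance measure is the mean absolute SHAP value per feature. The sketch below computes it in base R from a simulated n x p matrix; whether the package uses exactly this statistic is an assumption, and the column names are hypothetical.

```r
# Sketch only: mean absolute SHAP value per feature, ranked.
set.seed(3)
shap_values <- cbind(
  age = rnorm(50, sd = 2),
  bmi = rnorm(50, sd = 1),
  sex = rnorm(50, sd = 0.2)
)

importance <- sort(colMeans(abs(shap_values)), decreasing = TRUE)
summary_tab <- data.frame(
  feature       = names(importance),
  mean_abs_shap = unname(importance),
  stringsAsFactors = FALSE
)
head(summary_tab, 2)  # analogous to keeping the top n features
```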
Records lightweight cohort checkpoints during an analysis pipeline. Each
snapshot stores row count, column count, number of columns containing missing
values, complete row count, object size, and deltas from the previous
snapshot. Calling ukb_snapshot() without data returns the
current snapshot history.
ukb_snapshot(data = NULL, label = NULL, id = "default", reset = FALSE, verbose = TRUE)
data |
Optional data.frame or data.table. If supplied, records a new snapshot. |
label |
Snapshot label. Required when recording a new snapshot. |
id |
Snapshot stream identifier. Use separate IDs for independent pipelines in the same R session. |
reset |
Logical. If TRUE, clears the snapshot history for |
verbose |
Logical. Print a concise snapshot summary. |
A data.table snapshot history.
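The per-snapshot metrics described above can be reproduced for a single data set with a few base-R calls. The helper `snapshot_row()` and its column names are hypothetical; the sketch does not keep any session-level history the way the package function does.

```r
# Sketch only: compute snapshot-style summary metrics for a data.frame.
snapshot_row <- function(data, label) {
  data.frame(
    label          = label,
    n_rows         = nrow(data),
    n_cols         = ncol(data),
    n_cols_missing = sum(vapply(data, anyNA, logical(1))),
    n_complete     = sum(complete.cases(data)),
    size_bytes     = as.numeric(object.size(data))
  )
}

cohort <- data.frame(
  eid = 1:5,
  age = c(50, 60, NA, 55, 70),
  bmi = c(22, NA, NA, 30, 27)
)
history <- rbind(
  snapshot_row(cohort, "raw"),
  snapshot_row(cohort[!is.na(cohort$age), ], "complete age")
)
# Delta in row count relative to the previous snapshot.
history$delta_rows <- c(NA, diff(history$n_rows))
history
```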
Standardize a set of variables in the training data and optionally apply the same centering and scaling parameters to a validation data set. This is useful for omics analyses where all downstream association estimates should be expressed per one training-set standard deviation.
ukb_standardize_by_train(train_data, validation_data = NULL, variables, center = TRUE, scale = TRUE)
train_data |
Training data. |
validation_data |
Optional validation data. |
variables |
Character vector of variables to standardize. |
center |
Logical. If TRUE, subtract the training-set mean. |
scale |
Logical. If TRUE, divide by the training-set standard deviation. |
A list with train, validation, and parameters.
dat <- data.frame(x = 1:5, y = c(2, 3, 5, 7, 11))
ukb_standardize_by_train(dat, variables = c("x", "y"))$parameters
Creates a reusable participant-level time skeleton for prospective UK
Biobank analyses. The function standardizes baseline date, approximate birth
date, age at baseline, death date, loss-to-follow-up date, administrative
censoring date, follow-up end date, and follow-up time. It does not define
disease outcomes; instead, it provides a common time basis that can be reused
by endpoint-specific functions such as build_survival_dataset.
ukb_time_skeleton(data, id_col = "eid", baseline_col = "p53_i0", birth_year_col = "p34", birth_month_col = "p52", age_col = "p21022", death_date_cols = "^(participant\\.)?p40000_i[0-9]+$", lost_to_followup_col = "p191", admin_censor_date = as.Date("2023-10-31"), keep_source_dates = TRUE)
data |
A data.frame or data.table containing UK Biobank columns. |
id_col |
Participant identifier column. Default |
baseline_col |
Baseline assessment date column. Default |
birth_year_col |
Year-of-birth column. Default |
birth_month_col |
Month-of-birth column. Default |
age_col |
Age-at-baseline column. Default |
death_date_cols |
Death date columns or a regular expression used to
identify them. Default |
lost_to_followup_col |
Optional date lost to follow-up column. Default
|
admin_censor_date |
Administrative censoring date. |
keep_source_dates |
Logical. If |
A data.table with one row per participant and standardized follow-up time fields.
demo <- data.frame(eid = 1:3, p53_i0 = as.Date(c("2010-01-01", "2011-01-01", "2012-01-01")), p21022 = c(50, 60, 70), p40000_i0 = as.Date(c(NA, "2015-01-01", NA)))
ukb_time_skeleton(demo, admin_censor_date = as.Date("2020-12-31"))
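The core of a follow-up time skeleton is taking the earliest of death, loss to follow-up, and administrative censoring as the follow-up end date. A base-R sketch of that rule is shown below; the simplified column names and the 365.25 days-per-year conversion are assumptions, not the function's documented behaviour.

```r
# Sketch only: follow-up end date and follow-up time in base R.
demo <- data.frame(
  eid      = 1:3,
  baseline = as.Date(c("2010-01-01", "2011-01-01", "2012-01-01")),
  death    = as.Date(c(NA, "2015-01-01", NA)),
  lost     = as.Date(c(NA, NA, "2014-06-30"))
)
admin_censor <- as.Date("2020-12-31")

# Follow-up ends at the earliest available date; missing dates are ignored.
demo$followup_end <- pmin(demo$death, demo$lost, admin_censor, na.rm = TRUE)
# Assumed conversion: 365.25 days per year.
demo$followup_years <- as.numeric(demo$followup_end - demo$baseline) / 365.25
demo
```

Endpoint-specific code can then cap event dates at `followup_end` without recomputing the censoring logic for every outcome.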
Select top Cox associations by hazard ratio
ukb_top_hr_results(results, n_each_direction = 10, p_col = "p_bonferroni", alpha = 0.05, hr_col = "HR", label_cols = c("gene_symbol", "protein_clean", "variable"), dataset = NULL)
results |
Cox result table. |
n_each_direction |
Number of HR > 1 and HR < 1 rows to keep. |
p_col |
Adjusted p-value column used for filtering. |
alpha |
Significance threshold. |
hr_col |
Hazard-ratio column. |
label_cols |
Candidate label columns. |
dataset |
Optional dataset label added to output. |
A data.frame.
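The selection logic can be pictured with a short base-R sketch: filter on the adjusted p-value, then keep the most extreme hazard ratios in each direction. The strict `<` comparison against `alpha` and the ranking by HR magnitude are assumptions; the toy result table is hypothetical.

```r
# Sketch only: pick top risk-increasing and risk-decreasing associations.
results <- data.frame(
  variable     = paste0("prot", 1:6),
  HR           = c(1.8, 0.6, 1.2, 0.9, 2.5, 0.4),
  p_bonferroni = c(0.001, 0.01, 0.20, 0.03, 0.004, 0.60),
  stringsAsFactors = FALSE
)
n_each_direction <- 1
alpha <- 0.05

sig <- results[results$p_bonferroni < alpha, ]

up <- sig[sig$HR > 1, ]
up <- head(up[order(up$HR, decreasing = TRUE), ], n_each_direction)   # largest HR first

down <- sig[sig$HR < 1, ]
down <- head(down[order(down$HR), ], n_each_direction)                # smallest HR first

top <- rbind(up, down)
top
```

Note that prot6 has the smallest HR but is not selected, because it fails the significance filter first.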
Fit the same multivariable Cox model series in a training set and validation set, optionally standardizing the main variables using training-set parameters, then summarize replication and log(HR) concordance.
ukb_train_validation_cox(train_data, validation_data, main_vars, covariates, endpoint, standardize_main_vars = TRUE, add_protein_annotation = FALSE, protein_prefix = "^olink_instance_0[.]", train_label = "train", validation_label = "validation", comparison_train_prefix = "train", comparison_validation_prefix = "valid", p_adjust_methods = c("BH", "bonferroni"), alpha = 0.05, ...)
train_data |
Training data. |
validation_data |
Validation data. |
main_vars |
Main variables to evaluate. |
covariates |
Adjustment covariates. |
endpoint |
Two-column endpoint passed to |
standardize_main_vars |
Logical. If TRUE, standardize |
add_protein_annotation |
Logical. If TRUE, add parsed protein names and gene symbols for Olink-style protein columns. |
protein_prefix |
Regular expression prefix removed from protein columns. |
train_label |
Training-set label. |
validation_label |
Validation-set label. |
comparison_train_prefix |
Prefix for training columns in the comparison table. |
comparison_validation_prefix |
Prefix for validation columns in the comparison table. |
p_adjust_methods |
P-value adjustment methods. |
alpha |
Significance threshold. |
... |
Additional arguments passed to |
A list containing scaled data, scaling parameters, Cox results, and comparison summaries.
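The replication and concordance summary can be sketched with base R once per-variable Cox estimates exist for both data sets. The example below starts from hypothetical log(HR) and p-value tables, adjusts p-values with `p.adjust()`, and flags replication. The replication rule used here (Bonferroni-significant in training, nominally significant in validation, same direction) is an assumption chosen for illustration.

```r
# Sketch only: compare training and validation Cox estimates.
train <- data.frame(variable = paste0("v", 1:5),
                    logHR = c(0.40, -0.30, 0.10, 0.25, -0.05),
                    p     = c(0.001, 0.004, 0.300, 0.020, 0.700),
                    stringsAsFactors = FALSE)
valid <- data.frame(variable = paste0("v", 1:5),
                    logHR = c(0.35, -0.20, -0.02, 0.30, 0.01),
                    p     = c(0.003, 0.030, 0.800, 0.010, 0.900),
                    stringsAsFactors = FALSE)

train$p_bh         <- p.adjust(train$p, method = "BH")
train$p_bonferroni <- p.adjust(train$p, method = "bonferroni")

cmp <- merge(train, valid, by = "variable", suffixes = c("_train", "_valid"))
cmp$same_direction <- sign(cmp$logHR_train) == sign(cmp$logHR_valid)
# Assumed replication rule, for illustration only.
cmp$replicated <- cmp$p_bonferroni < 0.05 & cmp$p_valid < 0.05 & cmp$same_direction

# Pearson correlation of log(HR) as a simple concordance measure.
concordance <- cor(cmp$logHR_train, cmp$logHR_valid)
```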
Checks whether requested UKB/RAP columns are present in a data.frame or a
character vector of available column names. The function can optionally treat
participant.p31 and p31 as equivalent.
ukb_validate_columns(data, columns, ignore_entity_prefix = TRUE, error = FALSE)
data |
A data.frame/data.table or a character vector of available column names. |
columns |
Character vector of requested column names. |
ignore_entity_prefix |
Logical. If |
error |
Logical. If |
A data.frame of class ukb_column_validation.
dat <- data.frame(eid = 1:3, p31 = c(0, 1, 0))
ukb_validate_columns(dat, c("eid", "p31", "p21022"))
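Treating participant.p31 and p31 as equivalent comes down to stripping the entity prefix from both sides before matching. A base-R sketch of that comparison follows; the helper `strip_entity()` and the output column names are hypothetical.

```r
# Sketch only: prefix-insensitive column presence check in base R.
available <- c("participant.eid", "participant.p31", "p21022_i0")
requested <- c("eid", "p31", "p21022")

strip_entity <- function(x) sub("^participant\\.", "", x)

validation <- data.frame(
  column  = requested,
  present = strip_entity(requested) %in% strip_entity(available),
  stringsAsFactors = FALSE
)
validation
```

In this toy case p21022 is reported as missing because only the instance-suffixed name p21022_i0 is available; exact name matching is assumed after the prefix is removed.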
Write a RAP extraction manifest
ukb_write_extraction_manifest(manifest, path, format = c("csv", "rds"))
manifest |
A |
path |
Output path. |
format |
Output format: |
The output path, invisibly.
manifest <- ukb_create_extraction_manifest(field_id = c(31, 21022))
tmp <- tempfile(fileext = ".csv")
ukb_write_extraction_manifest(manifest, tmp)
A high-performance R package for processing UK Biobank (UKB) Research Analysis Platform (RAP) data exports. It is designed for epidemiological studies that require efficient extraction of diagnosis records and generation of survival analysis datasets.
Core Capabilities:
Parse ICD-10/ICD-9 diagnosis codes from mixed-format data
Parse OPCS4 operative procedure codes from hospital summary operations
Process self-reported illness data with fractional year conversion
Integrate death registry data as diagnosis sources
Generate Cox regression-ready survival datasets
Support flexible data source selection for sensitivity analyses
Key Functions:
parse_icd10_diagnoses: Extract ICD-10 hospital diagnoses
parse_icd9_diagnoses: Extract ICD-9 hospital diagnoses
parse_opcs4_procedures: Extract OPCS4 hospital procedures
parse_self_reported_illnesses: Extract self-reported conditions
parse_death_records: Extract death registry data
build_survival_dataset: Generate survival analysis data
extract_cases_by_source: Flexible source-specific extraction
UKB Data Fields:
ICD-10: p41270 (codes) + p41280_a* (dates)
ICD-9: p41271 (codes) + p41281_a* (dates)
OPCS4: p41272 (codes) + p41282_a* (dates)
Self-report: p20002_i*_a* (codes) + p20008_i*_a* (years)
Death: p40001/p40002 (causes) + p40000 (dates)
Maintainer: Nan He [email protected] (ORCID)
UK Biobank Data Showcase: https://biobank.ndph.ox.ac.uk/showcase/
Useful links:
Report bugs at https://github.com/Hinna0818/UKBAnalytica/issues