Changes in version 1.0.0 (2026-06-30)

Official release

- Released UKBAnalytica 1.0.0 as an integrated RAP-oriented workflow package for UK Biobank epidemiological analyses.
- Expanded the package description to reflect the current end-to-end scope, including RAP extraction planning, predefined variables and diseases, cohort construction, statistical modeling, machine learning, proteomics analysis, and publication-oriented visualization.

Workflow coverage

- Consolidated predefined disease definitions and baseline variable mappings for rapid phenotype and covariate setup.
- Standardized the analysis flow from RAP data extraction and preprocessing to multi-source endpoint definition, prevalent/incident classification, survival-ready cohort generation, and downstream analyses.
- Added high-level machine-learning interfaces for single-model workflows and comparison across feature sets or algorithms, with ROC, threshold selection, calibration, SHAP interpretation, and validation-set evaluation helpers.
- Extended proteomics utilities for volcano plots, Gene Ontology enrichment, STRING-based PPI retrieval, topology metrics, fast-greedy community detection, and cluster-level enrichment analysis.
- Refined visualization helpers for forest plots, volcano plots, calibration and decision-curve analysis, ROC comparison, SHAP summaries, enrichment plots, and manuscript-ready figure export.

Recent updates

- Added ukb_time_skeleton() to create a reusable follow-up time skeleton with baseline date, death date, loss-to-follow-up date, administrative censoring, follow-up end reason, and valid follow-up indicators.
- Added optional time_skeleton support to build_survival_dataset() while preserving the default survival workflow.
- Added RAP-protected dictionary utilities, including ukb_download_rap_dictionary(), ukb_query_dictionary(), and ukb_validate_columns(), for official RAP dictionary lookup, Chinese/English field search, and column validation.
- Added the built-in ukb_dictionary_zh metadata dataset for Chinese UKB field-path lookup.
- Extended run_regression() with covariate_sets for nested epidemiological models such as crude, partially adjusted, and fully adjusted analyses.
- Added lightweight publication-style plotting helpers for heatmaps, stacked bars, violin plots, and scatter plots.
- Excluded local tests/ from package builds and remote tracking.

Changes in version 0.6.2.2

Disease phenotyping

- Added UK Biobank cancer registry support as a new CancerRegistry disease source using fields 40006, 40005, 40011, and 40012.
- Added cancer_icd10_pattern, cancer_histology, and cancer_behaviour to create_disease_definition().
- Added predefined Lung_Cancer with cancer registry, ICD-10, and death-registry ascertainment.
- Added UK Biobank First Occurrence support as a new FirstOccurrence disease source.
- Added first_occurrence_fields and first_occurrence_source_fields to create_disease_definition().
- Added internal parsing for singular p13xxxx First Occurrence date/source fields, including UKB special date coding 819 handling.
- Added common First Occurrence field IDs to predefined diseases where the disease definition matches 3-character ICD-10 field granularity.

Workflow helpers

- Added ukb_clean_missing() for converting common UKB non-response labels and numeric missing codes into analysis-ready values.
- Added ukb_snapshot() to record row/column counts, missingness, complete rows, object size, and deltas across analysis pipeline checkpoints.

Machine learning

- Added the high-level ukb_ml_workflow() API for binary, multiclass, and continuous non-survival ML with a frozen final test set.
- Added standardized ML split, feature selection, tuning, threshold-learning, final-refit, and frozen-test evaluation helpers: ukb_ml_as_split(), enhanced ukb_ml_split_data(), ukb_ml_feature_select(), ukb_ml_tune(), ukb_ml_threshold(), ukb_ml_fit_final(), and ukb_ml_evaluate_test().
- Preserved the legacy split_ratio style in ukb_ml_split_data() by keeping $internal_validation as an alias for the held-out split.
- Extended ukb_shap() to support ukb_ml_workflow and ukb_ml_final objects, defaulting to the frozen test set for workflow objects.
- Added lightweight rpart decision tree and naive_bayes model backends to ukb_ml_workflow().
- Added optional lazy ML dependency installation via options(UKBAnalytica.auto_install_ml = TRUE); by default, optional model packages are checked only when the selected model needs them and are not installed automatically.
- Marked legacy ML model, CV, prediction, importance, comparison, and evaluation helpers as deprecated with pointers to the new workflow APIs.
- Added ukb_ml_survival_workflow() and survival-specific split, feature-selection, tuning, final-refit, and frozen-test evaluation helpers for time-to-event ML.
- Added model = "cox" as the lightweight default survival ML backend and aligned survival prediction output with the new workflow object structure.
- Marked legacy ukb_ml_survival() as deprecated in favor of ukb_ml_survival_workflow().
- Added simulated-data tests for binary classification, multiclass classification, continuous regression, manual split validation, and threshold behavior.
- Added simulated-data tests for survival ML splitting, manual split validation, Cox workflow fitting, prediction, and frozen-test evaluation.

RAP phenotype extraction

- Added R-native RAP phenotype extraction helpers that wrap DNAnexus dx extract_dataset and RAP table-exporter.
- Added rap_find_dataset(), rap_list_fields(), rap_plan_extract(), rap_extract_pheno(), and rap_submit_extract().
- Added dry-run extraction planning so users can inspect matched fields and table-exporter command metadata before launching jobs.
- Added support for variables = ... using UKBAnalytica predefined baseline mappings, while preserving field_id = ... for all instances and arrays of a UKB field.
- Kept Python extraction helper scripts under inst/python/ as legacy/helper entry points.

Documentation and tests

- Expanded the RAP extraction chapter with a complete RAP JupyterLab -> Terminal -> R workflow, including output paths such as /mnt/project.
- Updated README examples to recommend R-native RAP extraction for new users.
- Split machine learning into a dedicated documentation chapter focused on the simplest standard ukb_ml_workflow() path.
- Added offline tests for RAP extraction field parsing, field planning, predefined-variable mapping, and table-exporter dry-run formatting.

Changes in version 0.6.2.1

Disease definition and phenotyping updates

- Added OPCS4 operative procedure support for hospital summary operations via p41272 + p41282_a*.
- Added opcs4_pattern to create_disease_definition() so procedure evidence is opt-in and ignored by default when unspecified.
- Extended case extraction and survival workflows to accept OPCS4 in sources, prevalent_sources, and outcome_sources.
- Added predefined arrhythmia endpoints combining ICD-10 and OPCS4 where appropriate: Arrhythmia, Ventricular_Arrhythmia, AV_Block, Intraventricular_Block, and SVT.
- Extended predefined Atrial_Fibrillation with OPCS4 support for procedure-augmented atrial arrhythmia ascertainment.

Documentation

- Updated disease definition chapter and main vignette examples to document opcs4_pattern and arrhythmia phenotyping with ICD10 + OPCS4.
- Updated README.md with an ICD-10 + OPCS4 phenotyping example and clarified the default opt-in behavior for procedure data.

Changes in version 0.6.2

Survival workflow and participant flow reporting

- Enhanced build_survival_dataset() with show_flow to print step-by-step participant attrition in terminal for wide output.
- Added flow summary fields for each step (n_before, n_after, excluded, retention rates from previous/raw cohort).
- Attached the underlying flow table to returned wide-format result via attr(result, "participant_flow").
- Added optional dt_threads in build_survival_dataset() to let users temporarily configure data.table thread count for large runs.

Robust date parsing and stability improvements

- Added internal .safe_as_date() utility (R/date_utils.R) to parse mixed date formats safely and convert malformed values to NA with warnings instead of stopping execution.
- Replaced direct as.Date() calls in key pipelines with .safe_as_date() (ICD, death, baseline, incident-time utilities, and case extraction paths).
- Fixed self-report date conversion in parse_self_reported_illnesses() to handle malformed year values (Inf, -Inf, NaN, non-numeric strings) without charToDate crashes.

Algorithm source compatibility

- Updated algorithm field lookup to support both p{field}_i0 and p{field} naming conventions for date/source fields.
- Improved diagnostic messages when algorithm columns are unavailable.
- Updated disease definition documentation for algorithm field naming compatibility.

Diabetes diagnosis method refinement

- Refined diabetes phenotyping workflow for prospective analyses by clarifying multi-source case ascertainment logic (ICD-10, ICD-9, and self-report) under unified earliest-date rules.
- Improved practical support for diabetes endpoint setup across predefined disease definitions (Diabetes, T1DM, T2DM) in cohort construction workflows.

Machine learning

- Added new exported helper ukb_ml_split_data() for train/internal-validation splitting.
- Supports optional stratified sampling by a categorical variable and reproducible splitting with seed.
- Added manual page man/ukb_ml_split_data.Rd and NAMESPACE export.

Changes in version 0.6.1

- Added a sensitivity analysis module and refined the documentation.
- Added the select_incident_by_years() utility to split incident cases occurring within or after a year cutoff from enrollment.
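
The idea behind splitting incident cases by a year cutoff from enrollment can be sketched in base R. This is only an illustration under stated assumptions: split_incident_by_cutoff(), its arguments, and the "within_cutoff"/"after_cutoff" labels are hypothetical and are not the signature or output of select_incident_by_years(). The sketch assumes a year is 365.25 days and that events on or before baseline are not incident.

```r
# Hypothetical helper for illustration only; not the package implementation.
split_incident_by_cutoff <- function(baseline_date, event_date, cutoff_years = 2) {
  # Years from enrollment to the event; NA when no event date is recorded.
  years_to_event <- as.numeric(event_date - baseline_date) / 365.25

  # Only events strictly after baseline count as incident (assumption).
  incident <- !is.na(years_to_event) & years_to_event > 0

  out <- rep(NA_character_, length(years_to_event))
  out[incident & years_to_event <= cutoff_years] <- "within_cutoff"
  out[incident & years_to_event > cutoff_years] <- "after_cutoff"
  out
}

# Simulated dates: early event, late event, no event, event before baseline.
baseline <- as.Date(rep("2008-01-01", 4))
event <- as.Date(c("2009-01-01", "2015-06-30", NA, "2007-05-01"))
incident_group <- split_incident_by_cutoff(baseline, event, cutoff_years = 2)
```

In a sensitivity analysis, rows labelled "within_cutoff" would typically be the ones excluded to probe reverse causation, but which group to drop remains an analysis decision.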
Changes in version 0.6.0

New Module: Machine Learning

Core ML Functions (ml_model.R)

- ukb_ml_model(): Unified interface for training ML models
  - Random Forest (ranger)
  - XGBoost (xgboost)
  - Elastic Net (glmnet)
  - SVM (e1071)
  - Neural Network (nnet)
  - Logistic/Linear regression
- ukb_ml_predict(): Generate predictions
- ukb_ml_cv(): K-fold cross-validation with optional repeats
- ukb_ml_compare(): Compare multiple models
- ukb_ml_importance(): Extract variable importance

Model Evaluation (ml_evaluate.R)

- ukb_ml_metrics(): Compute performance metrics (AUC, accuracy, etc.)
- ukb_ml_roc(): ROC curve analysis with CI
- ukb_ml_calibration(): Calibration curve with Brier score and ECE
- ukb_ml_confusion(): Confusion matrix

SHAP Interpretation (ml_shap.R)

- ukb_shap(): Compute SHAP values for model interpretation
- ukb_shap_summary(): Feature importance from SHAP
- ukb_shap_dependence(): Single feature SHAP analysis
- ukb_shap_force(): Single observation explanation

Survival ML (ml_survival.R)

- ukb_ml_survival(): Survival machine learning models
  - Random Survival Forest (randomForestSRC)
  - GBM Survival (gbm)
  - CoxNet regularized Cox (glmnet)
- ukb_ml_survival_predict(): Survival probability prediction
- ukb_ml_survival_importance(): Variable importance
- ukb_ml_survival_shap(): SHAP for survival models

Visualization

- plot_ml_importance(): Variable importance bar/dot plot
- plot_ml_roc(): ROC curve plot
- plot_ml_calibration(): Calibration curve plot
- plot_ml_confusion(): Confusion matrix heatmap
- plot_ml_compare(): Model comparison plot
- plot_shap_summary(): SHAP beeswarm/bar plot
- plot_shap_dependence(): SHAP dependence plot
- plot_shap_force(): SHAP waterfall plot

Dependencies (Suggests)

- Added: ranger, xgboost, glmnet, e1071, nnet, fastshap, pROC, randomForestSRC

Documentation

- Updated Advanced Analysis chapter with ML module
- Updated README with ML examples

Changes in version 0.5.0

New Modules: Advanced Statistical Analysis

Subgroup Analysis (subgroup.R)

- run_subgroup_analysis(): Stratified analysis with interaction p-values
- run_multi_subgroup(): Batch analysis across multiple subgroup variables
- Supports Cox, logistic, and linear regression models

Propensity Score Methods (propensity.R)

- estimate_propensity_score(): PS estimation via logistic regression or GBM
- match_propensity(): 1:k nearest neighbor matching with caliper
- calculate_weights(): IPTW weights (ATE, ATT, ATC)
- assess_balance(): Covariate balance assessment with SMD
- run_weighted_analysis(): Weighted regression analysis

Mediation Analysis (mediation.R)

- run_mediation(): Causal mediation analysis (wrapping regmedint)
- run_multi_mediator(): Test multiple mediators
- run_sensitivity_mediation(): Sensitivity analysis for unmeasured confounding
- Supports linear, logistic, and Cox outcome models
- Effects: CDE, PNDE, TNIE, TE, Proportion Mediated

Multiple Imputation Pooling (mi_pool.R)

- pool_mi_models(): Combine regression results using Rubin's Rules
- fit_mi_models(): Fit models across imputed datasets
- create_imputation_list(): Convert to mitools imputationList
- pool_custom_estimates(): Pool custom statistics
- Supports lm, logistic, poisson, cox, negbin models
- Reports FMI (Fraction of Missing Information)

Visualization (visualization.R)

- plot_forest(): Forest plots for subgroup/regression results
- plot_km_curve(): Kaplan-Meier survival curves
- plot_ps_distribution(): Propensity score distribution (histogram/density)
- plot_balance(): Covariate balance before/after matching
- plot_calibration(): Calibration plots
- plot_mediation(): Mediation effect plots (bar, decomposition, path diagram)
- plot_mediation_forest(): Multi-mediator forest plot
- plot_mi_pooled(): MI pooled results forest plot
- plot_mi_diagnostics(): FMI and variance diagnostics

Documentation

- New chapter: Advanced Analysis Modules (docs/08-advanced-analysis.Rmd)
- Updated technical design document with all module specifications
- Updated README with advanced analysis examples

Dependencies (Suggests)

- Added: MatchIt, gbm, regmedint, mitools, MASS, cobalt

Changes in version 0.4.0

- Fixed a bug in survival.R: participants whose primary disease occurs before the initial time now have their survival time set to NA, distinguishing them from participants whose primary disease occurs after the initial time (who have a non-NA survival time).

Changes in version 0.3.0

- Added the variable_preprocess.R module for preprocessing baseline variables.

Changes in version 0.2.0

Major changes

- Refactored the primary analysis interface to return a cohort-retaining wide-format dataset by default.
- Added a primary_disease argument to compute outcome_status and outcome_surv_time for a single primary endpoint.
- Added prevalent_sources and outcome_sources arguments to build_survival_dataset() to manage self-report bias.

New features

- Multi-source phenotyping support with configurable sources (ICD-10, ICD-9, self-report, death).
- Cohort-level follow-up time computation with administrative censoring and death censoring.
- Expanded predefined disease definitions to cover common conditions for rapid prototyping.

Data acquisition (RAP)

- Added Python utilities under inst/python/ to extract:
  - Demographic fields (user-specified UKB field IDs; optional ID file input).
  - Metabolomics (all fields; plus non-ratio subset driven by inst/extdata/metabolites_non_ratio.txt).
  - Proteomics (batch extract with optional merge).

Documentation

- Updated README to prioritize data acquisition (RAP) before survival endpoint construction.
- Added a package overview figure in man/figures/.

Changes in version 0.1.0

- Initial release: parsing UKB RAP exports and generating survival-analysis-ready datasets.
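
As a rough illustration of what "survival-analysis-ready" means across the entries above, the base R sketch below combines follow-up with death and administrative censoring (version 0.2.0) and the NA survival time for prevalent cases (version 0.4.0). It is a sketch under stated assumptions, not the internals of build_survival_dataset(): make_survival_record() and its column names are hypothetical, a year is taken as 365.25 days, and an event on or before baseline is treated as prevalent.

```r
# Hypothetical helper for illustration only; not the package implementation.
make_survival_record <- function(baseline, event, death, admin_censor) {
  # Follow-up ends at the earliest of event, death, or administrative censoring.
  end_date <- pmin(event, death, admin_censor, na.rm = TRUE)

  # Status is 1 only when the event itself ends follow-up.
  status <- as.integer(!is.na(event) & event <= end_date)

  # Follow-up time in years from baseline.
  surv_time <- as.numeric(end_date - baseline) / 365.25

  # Prevalent cases (event on or before baseline) get NA survival time.
  prevalent <- !is.na(event) & event <= baseline
  surv_time[prevalent] <- NA_real_

  data.frame(status = status, surv_time = surv_time, prevalent = prevalent)
}

# Simulated participants: incident case, death, admin censoring, prevalent case.
baseline <- as.Date(rep("2010-01-01", 4))
event <- as.Date(c("2012-01-01", NA, NA, "2008-06-01"))
death <- as.Date(c(NA, "2015-01-01", NA, NA))
admin <- as.Date(rep("2020-01-01", 4))
rec <- make_survival_record(baseline, event, death, admin)
```

Keeping the prevalent rows with NA time, rather than dropping them, is what lets a cohort-retaining wide table still distinguish prevalent from incident cases downstream.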