## If you use oscar-methodology, please consider citing the following paper:
## Halkola AS, Joki K, Mirtti T, Mäkelä MM, Aittokallio T, Laajala TD (2023).
## OSCAR: Optimal subset cardinality regression using the L0-pseudonorm with applications to prognostic modelling of prostate cancer.
## PLoS Comput Biol 19(3): e1010333.
## https://doi.org/10.1371/journal.pcbi.1010333
OSCAR (Optimal Subset CArdinality Regression) is an R-package delivering L0-quasinorm penalized regression models for Cox, logistic and gaussian model families. It comes with two different optimizers for the computationally demanding optimization task (DBDC and LMBM), and these have been implemented in Fortran. The R package handles all interaction with Fortran and provides many convenience and visualization functions to assist with L0-penalized regression modelling. This includes for example cross-validation, bootstrapping, and other diagnostics used to inspect an OSCAR model and assessing importance of features.
Please see Halkola et al. [1] for citing OSCAR. The primary solver for the DC-decomposition transforming the L0-pseudonorm optimization problem from discrete to continuous non-convex space is done via DBDC (‘Double Bundle method for nonsmooth DC optimization’), which is described in detail on Joki et al. [2], although an alternate optimization method for larger problems is offered via LMBM (‘Limited Memory Bundle Method for large-scale nonsmooth optimization’) as described in Haarala et al. [3].
A simulated real-world hospital cohort data with patient survival with prostate cancer (PCa) is provided with:
## BMI HEIGHTBL WEIGHTBL SYSTOLICBP DIASTOLICBP PULSE HB
## MEDISIMU1 29.39 175 90 142 77 68 10.9
## MEDISIMU2 19.37 176 60 107 77 58 13.3
## MEDISIMU3 23.88 165 65 142 77 71 11.8
## MEDISIMU4 34.54 176 107 126 90 71 13.1
## MEDISIMU5 20.65 188 73 142 77 71 15.3
## MEDISIMU6 28.41 174 86 188 77 88 12.8
## MEDISIMU7 29.63 180 96 142 77 81 14.1
## Time Event
## MEDISIMU1 89 0
## MEDISIMU2 754 1
## MEDISIMU3 783 1
## MEDISIMU4 159 0
## MEDISIMU5 1322 0
## MEDISIMU6 200 1
For further details regarding this dataset and its generation, see reference Laajala et at. [4]
One key feature in OSCAR in addition to its L0 norm penalization is its capability to handle variables as groups, or ‘kits’. This grouping of variables causes them to be selected (or omitted) together. In many clinical applications this is often reality; for example, curating initial data for variables together can occur in cases such as:
In the example data from Turku University Hospital (TYKS), a series of clinical variables are run together. For example a standard blood panel includes red blood cell count, hematocrite, white blood cell count, etc. Additionally, single variables can be run separately; a very common single marker for PCa is Prostate-Specific Antigen (PSA).
The kit structure for such variables that can be obtained together are modelled using a kit indicator matrix:
## BMI HEIGHTBL WEIGHTBL SYSTOLICBP DIASTOLICBP PULSE HB
## BaseReport 1 1 1 1 1 1 0
## B-PVKT 0 0 0 0 0 0 1
## TKD 0 0 0 0 0 0 0
## cB-Het-Ion 0 0 0 0 0 0 0
## Pt-GFReEPI 0 0 0 0 0 0 0
## P-LD 0 0 0 0 0 0 0
## P-ASAT 0 0 0 0 0 0 0
## BaseReport B-PVKT TKD cB-Het-Ion Pt-GFReEPI P-LD P-ASAT
## 6 5 8 1 1 1 1
## P-Alb P-AFOS P-PSA P-Krea Pt-Krea-Cl B-Lymf P-Pi
## 1 1 1 1 1 1 1
## P-Ca P-ALAT S -Prot B-Eos S-Testo P-Gluk P-Mg
## 1 1 1 1 1 1 1
## P-Bil
## 1
## BMI HEIGHTBL WEIGHTBL SYSTOLICBP DIASTOLICBP PULSE
## 1 1 1 1 1 1
## HB HEMAT RBC WBC PLT LYMperLEU
## 1 1 1 1 1 1
## MONOperLEU NEU POT MONO BASOperLEU EOSperLEU
## 1 1 1 1 1 1
## NEUperLEU NA. CCRC LDH AST ALB
## 1 1 1 1 1 1
## ALP PSA CREAT CREACL LYM PHOS
## 1 1 1 1 1 1
## CA ALT TPRO EOS TESTO GLU
## 1 1 1 1 1 1
## MG TBILI
## 1 1
If the kit matrix is a diagonal indicator matrix with dimensions equal to the number of variables, each variable is L0-penalized without any grouping structure.
Further, these kits can be assigned a ‘price’. This may be a price literally, or a descriptive value indicating how hard the variable is to obtain. With this information, a clinically feasible pareto-front can be constructed, that simultaneously captures modelling generalization as well as clinical applicability.
A cost vector can be provided for the fitting procedure, and a standardized arbitrary unit cost vector is provided in:
## BaseReport B-PVKT TKD cB-Het-Ion Pt-GFReEPI P-LD
## 0.0 2.4 4.8 5.7 1.0 1.1
OSCAR comes with two different optimizers:
These are provided to the fitting procedure with
solver=1
and solver=2
or with their
abbreviations. They have been implemented in Fortran for computational
efficiency.
A quick example run of OSCAR with LMBM together with the default kit structure for Cox regression:
A pareto-optimal front provides information on potential optimal saddle points, where model generalization and clinical applicability pair nicely:
Model convergence in target function is important, while an another axis can be plotted simultaneously to provide supporting information:
Fit model object can be inspected based on visual diagnostics such as cross-validation and bootstrapping:
# Perform 5-fold cross-validation to find out optimal k
cv <- oscar.cv(fit, fold=5, seed=123)
# Visualize model generalization performance as a function of k
oscar.cv.visu(cv)
It appears that k-values around 5 or above present a shoulder-point saturate the model, with no further coefficients contributing to the model’s generalization capability.
Further, there is some uncertainty as to what order the coefficients should be non-zero, based on bootstrapping:
# Bootstrap original data 20 times (sampling with replacement and refitting)
bs <- oscar.bs(fit, bootstrap=20, seed=234)
# Visualize bootstrapped models
oscar.bs.plot(fit=fit, bs=bs, nbins=20)
Based on the bootstrapping, PSA is a clear winner. Close second candidate for survival prediction is Alkaline Phosphatase (ALP). After these single markers the bootstrapping was quite prone to choosing separate kits between immunomarkers and standard bloodwork, with no clear consistent winner.
Useful commands for understanding the model and the oscar S4-object:
## BMI HEIGHTBL WEIGHTBL SYSTOLICBP DIASTOLICBP PULSE
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## HB HEMAT RBC WBC PLT LYMperLEU
## -0.17302102 -5.04043911 -0.07065200 0.53173174 -0.00119069 0.00000000
## MONOperLEU NEU POT MONO BASOperLEU EOSperLEU
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## NEUperLEU NA. CCRC LDH AST ALB
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## ALP PSA CREAT CREACL LYM PHOS
## 0.66640203 0.40782914 0.00000000 0.00000000 0.00000000 0.00000000
## CA ALT TPRO EOS TESTO GLU
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## MG TBILI
## 0.00000000 0.00000000
## HB HEMAT RBC WBC PLT ALP
## -0.17302102 -5.04043911 -0.07065200 0.53173174 -0.00119069 0.66640203
## PSA
## 0.40782914
## [1] 9.1
While oscar
-package was originally designed with
biomedical survival data in mind, it naturally supports modelling of
non-survival data. Currently, the family
parameter supports
logistic regression for binary outcomes, and mse/gaussian/normal for
modelling responses that are normally distributed.
# Use example swiss-data for quickness
data(swiss)
set.seed(2)
fit_swiss <- oscar(x=swiss[,-1], y=swiss[,1], family="mse", print=0, solver=2)
# Plot model coefficients as a function of cardinality k
plot(fit_swiss)
Similarly bootstrapping and cross-validation:
# Bootstrap original data 50 times (sampling with replacement and refitting)
bs_swiss <- oscar.bs(fit_swiss, bootstrap=50, seed=234)
# Visualize trajectories of bootstrapped coefficients
oscar.bs.plot(fit=fit_swiss, bs=bs_swiss, nbins=50)
# Perform 5-fold cross-validation to find out optimal k
cv_swiss <- oscar.cv(fit_swiss, fold=10, seed=345)
# Visualize model generalization performance as a function of k
oscar.cv.visu(cv_swiss)
Halkola AS, Joki K, Mirtti T, M{\"a}kel{\"a} MM, Aittokallio T, Laajala TD (2023) OSCAR: Optimal subset cardinality regression using the L0-pseudonorm with applications to prognostic modelling of prostate cancer. PLoS Comput Biol 19(3): e1010333. https://doi.org/10.1371/journal.pcbi.1010333
Joki K, Bagirov AM, Karmitsa N, Makela MM, Taheri S. Double Bundle Method for finding Clarke Stationary Points in Nonsmooth DC Programming. SIAM Journal on Optimization. 2018;28(2):1892-1919.
Haarala M, Miettinen K, Makela MM. New limited memory bundle method for large-scale nonsmooth optimization. Optimization Methods and Software. 2004;19(6):673-692.
Laajala TD, Murtoj{\"a}rvi M, Virkki A, Aittokallio T. ePCR: an R-package for survival and time-to-event prediction in advanced prostate cancer, applied to real-world patient cohorts. Bioinformatics. 2018;34(22):3957-3959.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] oscar_1.2.1 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.3 knitr_1.49 rlang_1.1.4 xfun_0.49
## [5] jsonlite_1.8.9 buildtools_1.0.0 plyr_1.8.9 htmltools_0.5.8.1
## [9] maketools_1.3.1 sys_3.4.3 sass_0.4.9 grid_4.4.2
## [13] evaluate_1.0.1 jquerylib_0.1.4 fastmap_1.2.0 yaml_2.3.10
## [17] lifecycle_1.0.4 compiler_4.4.2 Rcpp_1.0.13-1 lattice_0.22-6
## [21] digest_0.6.37 R6_2.5.1 pROC_1.18.5 splines_4.4.2
## [25] bslib_0.8.0 Matrix_1.7-1 tools_4.4.2 hamlet_0.9.6
## [29] survival_3.7-0 cachem_1.1.0