--- title: "Linear Biomarker Combination: Empirical Performance Optimization" author: "Yijian Huang (yhuang5@emory.edu)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Linear Biomarker Combination: Empirical Performance Optimization} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` `lincom` implements linear combination methods for biomarkers via empirical performance optimization with respect to two performance metrics: (1) specificity at controlled sensitivity (or sensitivity at controlled specificity) (Huang and Sanda, 2022), and (2) weighted average of false positive rate and false negative rate. The second method is a variant of the maximum score estimator (Manski, 1975, 1985). In both cases, the algorithm of Huang and Sanda (2022) is used to provide a solution that balances between computational efficiency and quality. ## Installation `lincom` is available on CRAN: ```{r install, eval=FALSE, message=FALSE, warning=FALSE} install.packages("lincom") ``` 'MOSEK' solver is used and needs to be installed; an academic license for 'MOSEK' is free. ## Simulated dataset for illustration ```{r simulation, eval=TRUE, message=FALSE, warning=FALSE} library(lincom) ## simulate 3 biomarkers for 100 cases and 100 controls n1 <- 100 n0 <- 100 mk <- rbind(matrix(rnorm(3*n1),ncol=3),matrix(rnorm(3*n0),ncol=3)) mk[1:n1,1] <- mk[1:n1,1]/sqrt(2)+1 mk[1:n1,2] <- mk[1:n1,2]*sqrt(2)+1 ``` `mk[1:n1,]` and `mk[(n1+1):(n1+n0),]` contain the case and control data, respectively. ## Empirical maximization of specificity at controlled sensitivity (or sensitivity at controlled specificity) The following code performs empirical maximization of specificity at 95% sensitivity. ```{r eum, message=FALSE, warning=FALSE} ## The following two lines are commented out - require installation of 'MOSEK' to run #lcom1 <- eum(mk, n1=n1, s0=0.95, grdpt=0) #lcom1 ``` Above, `n1` is the case size, `s0` is the control level, and `grdpt` specifies how initial value of the optimization is obtained (logistic regression if `grdpt=0`, and coarse grid search with `grdpt` grid points otherwise). Additional arguments include `fixsens` (fixing sensitivity if `TRUE` and specificity otherwise), and `lbmdis` (larger biomarker values is more associated with cases if `TRUE` and controls otherwise). The outputs include the resulting combination coefficient (`coef`), maximum empirical value of the performance metric (`hs`), and the resulting threshold (`threshold`), along with their initial value counterparts (from logistic regression or coarse grid search). ## Empirical minimization of weighted average of false positive rate and false negative rate ```{r wmse, message=FALSE, warning=FALSE} ## default relative weight r=1. ## Require installation of 'MOSEK' to run ## The following two lines are commented out - require installation of 'MOSEK' to run #lcom2 <- wmse(mk, n1=n1) #lcom2 ``` The inputs and outputs are similar to those of `eum`. However, the initial value here is obtained through logistic regression only. With cohort design, setting `r=n0/n1` leads to Manski's original estimator. ## References Huang, Y. and Sanda, M. G. (2022). Linear biomarker combination for constrained classification. _The Annals of Statistics_ 50, 2793--2815. Manski, C. F. (1975). Maximum score estimation of the stochastic utility model of choice. _Journal of Econometrics_ 3, 205--228. Manski, C. F. (1985). Semiparametric analysis of discrete response. Asymptotic properties of the maximum score estimator. _Journal of Econometrics_ 27, 313--333.