The R package BioPred offers a suite of tools for subgroup and biomarker analysis in precision medicine. Leveraging Extreme Gradient Boosting (XGBoost) along with propensity score weighting and A-learning methods, BioPred facilitates the optimization of individualized treatment rules (ITR) to streamline subgroup identification. BioPred also supports the identification of predictive biomarkers and the ranking of their importance. Moreover, the package provides graphical plots tailored for biomarker analysis. It is intended for clinical researchers seeking to deepen their understanding of biomarkers and patient populations in drug development.
# Install the development version from GitHub (requires the devtools package)
devtools::install_github("deeplearner0731/BioPred")
# Or install the released version from CRAN
install.packages("BioPred")
The 'tutorial_data' dataset is included with the package. Let's load and inspect it:
library(BioPred)
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
kable_styling(kable(x=head(tutorial_data), format="html", caption = "The first 6 subjects"),
              bootstrap_options="striped", font_size=16)
x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | y.con | y.bin | y.time | y.event | treatment | subgroup_label | risk_category | treatment_categorical |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-2.3744988 | 0.0167914 | -2.5412961 | -2.0901502 | -2.6896868 | -1.3455564 | -2.4931001 | -0.3483110 | -1.5673804 | 0.4836952 | 0.0992838 | 0 | 0.0874949 | 0 | 0 | 1 | High Risk | Placebo |
-0.1838847 | 1.8222234 | 0.4601912 | 0.5497404 | 1.6449981 | 0.5433581 | 0.2364084 | -2.7868360 | -0.4006381 | 0.6874968 | 0.0711574 | 0 | 0.9534603 | 0 | 0 | 0 | Intermediate Risk | Placebo |
0.0539828 | -1.9142628 | 1.3112035 | 0.6543154 | -0.0653439 | 0.5491687 | 1.8201585 | 2.5079087 | -0.4686039 | 1.1696815 | -1.2151804 | 0 | 0.0005331 | 0 | 1 | 1 | High Risk | Treatment |
1.1048102 | 0.8346022 | -1.9808589 | -0.5632687 | 0.3344664 | 1.4275303 | 0.1324287 | 2.1463655 | 1.3107915 | 0.4411784 | 0.2773950 | 0 | 0.0435684 | 0 | 1 | 0 | High Risk | Treatment |
-0.4179985 | -2.2484687 | 3.3396929 | 3.1896045 | -0.5779621 | -1.4963500 | 0.4454772 | 0.1323151 | 1.4211205 | -1.5240744 | -2.0069643 | 0 | 0.1050067 | 0 | 1 | 1 | High Risk | Treatment |
1.0177850 | 0.1045029 | 1.4632138 | 1.0294900 | 2.0912418 | 2.2324576 | 0.5073059 | -1.4397678 | 1.5456243 | 0.4183628 | 1.4521578 | 1 | 0.1018182 | 0 | 0 | 0 | High Risk | Placebo |
The tutorial_data dataset contains the following columns:

- x1-x10: candidate biomarkers (continuous covariates)
- y.con: continuous outcome
- y.bin: binary outcome
- y.time: time-to-event outcome
- y.event: event indicator (1 = event, 0 = censored)
- treatment: treatment indicator (1 = treatment, 0 = control)
- subgroup_label: true subgroup membership (available here because the data are simulated)
- risk_category: predefined risk group (High Risk, Intermediate Risk, or Low Risk)
- treatment_categorical: treatment arm label (Treatment or Placebo)
We will begin by training the XGBoostSub_con model for a continuous outcome. This involves using the A-learning or weight-learning loss function. Note that the treatment variable must be coded as (1, -1). If your treatment variable is coded as (1, 0), you will need to convert it accordingly.
X_feature=tutorial_data[,c("x1", "x2" ,"x3" ,"x4","x5","x6","x7","x8","x9","x10")]
y_label=tutorial_data$y.con
# Convert the treatment variable to (1, -1) format (1 for treatment, -1 for control)
trt=ifelse(tutorial_data$treatment==1, 1, -1)
true_label=tutorial_data$subgroup_label
# Estimate the propensity score using logistic regression
data_logit=tutorial_data[,c("treatment","x1", "x2" ,"x3" ,"x4" , "x5", "x6" , "x7","x8","x9","x10")]
logit_model <- glm(treatment~ ., data = data_logit, family = binomial)
pi <- predict(logit_model, type = "response")
# Train the XGBoostSub_con model using the specified parameters
model <- XGBoostSub_con(X_feature, y_label, trt, pi, Loss_type = "A_learning",
                        params = list(learning_rate = 0.005, max_depth = 1,
                                      lambda = 5, tree_method = 'hist'),
                        nrounds = 200, disable_default_eval_metric = 0,
                        verbose = FALSE)
After training the model, you can print the loss value at each epoch using the following code:
cat("Evaluation loss:\n")
#> Evaluation loss:
print(model$evaluation_log)
#> iter train_A_loss
#> <num> <num>
#> 1: 1 1.375787
#> 2: 2 1.375424
#> 3: 3 1.375065
#> 4: 4 1.374708
#> 5: 5 1.374355
#> ---
#> 196: 196 1.339571
#> 197: 197 1.339486
#> 198: 198 1.339402
#> 199: 199 1.339319
#> 200: 200 1.339236
You can also evaluate the loss on independent datasets using the eval_metric_con function. In this example, we'll use the training data as test data for illustration purposes; in practice, you can supply a new dataset to compute the loss under the trained model.
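A minimal sketch of the call is shown below. The exact signature is an assumption here: we assume eval_metric_con mirrors the training interface (model, features, outcome, propensity score, treatment, and loss type); check ?eval_metric_con for the authoritative argument names and order.
# A sketch: evaluating the trained model, here on the training data itself.
# Assumption: eval_metric_con mirrors the XGBoostSub_con training interface;
# see ?eval_metric_con for the exact signature.
eval_result <- eval_metric_con(model, X_feature, y_label, pi, trt,
                               Loss_type = "A_learning")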
Next, we evaluate the importance of predictive biomarkers using the predictive_biomarker_imp function.
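The call that produces the biomarker_imp object used below is sketched here; we assume predictive_biomarker_imp takes the fitted model as its argument (see ?predictive_biomarker_imp).
# A sketch: compute importance rankings from the trained XGBoostSub_con model.
# Assumption: the fitted model is the only required argument.
biomarker_imp <- predictive_biomarker_imp(model)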
kable_styling(kable(x=biomarker_imp, format="html", caption = "Biomarker importance table"),
bootstrap_options="striped",font_size=16)
Biomarker | Importance |
---|---|
x2 | 0.4904407 |
x1 | 0.4793201 |
x10 | 0.0302392 |
From the results, we observe that the most important predictive biomarker is x2.
You can obtain the subgroup results using the get_subgroup_results function.
subgroup_results=get_subgroup_results(model, X_feature, subgroup_label=true_label, cutoff = 0.5)
kable_styling(kable(x=head(subgroup_results$assignment), format="html", caption = "The first 6 subjects' subgroup assignment"),
bootstrap_options="striped",font_size=16)
Patient_ID | Treatment_Assignment |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 0 |
5 | 1 |
6 | 1 |
cat("Prediction accuracy of true subgroup label:\n")
#> Prediction accuracy of true subgroup label:
cat(subgroup_results$acc)
#> 0.634
Note that when using the get_subgroup_results function with real data, the true subgroup label is typically unknown. In such cases, set subgroup_label = NULL. This will return only the subgroup assignment, without the prediction accuracy of the subgroup.
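For example:
# With real data, request assignments only; no accuracy is computed.
subgroup_results <- get_subgroup_results(model, X_feature,
                                         subgroup_label = NULL, cutoff = 0.5)
head(subgroup_results$assignment)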
Given that x2 was identified as the most important predictive biomarker by XGBoostSub_con, we will use the cdf_plot function to evaluate the distribution of this biomarker by assessing the percentage of subjects falling within different cutoff values.
cdf_plot(xvar="x2", data=tutorial_data, y.int=5, xlim=NULL, xvar.display="Biomarker_X2", group=NULL)
We will also evaluate the prognostic capability of x2. To determine whether this biomarker is prognostic, we can assess the association between the response variable y.con and the biomarker x2 using the scat_cont_plot function.
scat_cont_plot(
yvar='y.con',
xvar='x2',
data=tutorial_data,
ybreaks = NULL,
xbreaks = NULL,
yvar.display = 'y',
xvar.display = "Biomarker_X2"
)
#> $Cor.Pearson
#> [1] -0.12
#>
#> $Cor.Spearman
#> [1] -0.12
#>
#> $fig
#>
#> $slope
#> [1] -0.1013384
#>
#> $intercept
#> [1] 0.6769244
Selecting the cutoff value for a predictive biomarker is important in clinical practice, as it can, for example, help enrich responders. We select the optimal cutoff for the identified biomarker x2 from a list of candidate cutoff values using the fixcut_con function.
cutoff_result_con = fixcut_con(yvar='y.con', xvar="x2", dir=">",
                               cutoffs=c(-0.1,-0.3,-0.5,0.1,0.3,0.5),
                               data=tutorial_data, method="t.test",
                               yvar.display='y.con', xvar.display='biomarker x2',
                               vert.x=F)
cutoff_result_con$fig
From the results, the optimal cutoff value is 0.5. Next, we define the biomarker positive group using a cutoff of 0.5: the biomarker positive group includes subjects with x2 values less than or equal to 0.5, while the biomarker negative group includes subjects with x2 values greater than 0.5. We first use the cut_perf() function to evaluate the performance between the biomarker positive and negative groups.
res = cut_perf(yvar="y.con", censorvar=NULL, xvar="x2", cutoff=c(0.5), dir="<=",
               xvars.adj=NULL, data=tutorial_data, type='c',
               yvar.display='y.con', xvar.display="biomarker x2")
kable_styling(kable(x=res$comp.res.display[,-2:-4], format="html", caption = "Performance at optimal cutoff"),
bootstrap_options="striped",font_size=16)
Response | Median.Diff | Mean.Diff.CI | t.pvalue | WRS.pvalue |
---|---|---|---|---|
y.con | -0.3 | -0.38 (NA , NA) | NA | NA |
Next, we will further assess the predictive model performance by examining the differences between the treatment and control groups within both the biomarker positive and negative groups.
tutorial_data$biogroup=ifelse(tutorial_data$x2<=0.5,'biomarker_positive','biomarker_negative')
res = subgrp_perf_pred(yvar="y.time", censorvar="y.event", grpvar="biogroup",
                       grpname=c("biomarker_positive","biomarker_negative"),
                       trtvar="treatment_categorical", trtname=c("Placebo","Treatment"),
                       xvars.adj=NULL, data=tutorial_data, type="s")
kable_styling(kable(x=res$group.res.display, format="html", caption = "BioSubgroup Stat"),
bootstrap_options="striped",font_size=16)
Group | Response | N | Events | Mean.Surv.CI | Med.Surv.CI |
---|---|---|---|---|---|
treatment_categorical=Placebo,biogroup=biomarker_positive | y.time | 321 | 127 | 3.7 (2.49 , 4.92) | 1.99 (1.65 , 2.7) |
treatment_categorical=Placebo,biogroup=biomarker_negative | y.time | 192 | 76 | 4.49 (2.37 , 6.61) | 1.83 (1.58 , 2.27) |
treatment_categorical=Treatment,biogroup=biomarker_positive | y.time | 321 | 133 | 4.84 (3.66 , 6.02) | 2.85 (2.11 , 3.52) |
treatment_categorical=Treatment,biogroup=biomarker_negative | y.time | 166 | 75 | 2.63 (1.93 , 3.32) | 1.51 (0.84 , 2.14) |
From the Kaplan-Meier (KM) curve, a difference between the treatment and control groups is observed in the biomarker positive group. This difference is not observed in the biomarker negative group, validating that biomarker x2, identified by XGBoostSub_con, is predictive.
Similar to training the XGBoostSub_con model for continuous outcomes, we can also train the XGBoostSub_bin model for binary outcomes; the training call is sketched below.
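We assume XGBoostSub_bin shares the interface of XGBoostSub_con; the hyperparameter values are illustrative only and are not necessarily the settings that produced the results that follow.
# A sketch, assuming XGBoostSub_bin mirrors the XGBoostSub_con interface;
# the hyperparameters below are illustrative, not tuned.
y_label <- tutorial_data$y.bin
model <- XGBoostSub_bin(X_feature, y_label, trt, pi, Loss_type = "A_learning",
                        params = list(learning_rate = 0.005, max_depth = 1,
                                      lambda = 5, tree_method = 'hist'),
                        nrounds = 200, disable_default_eval_metric = 0,
                        verbose = FALSE)
# Importance rankings for the binary-outcome model
biomarker_imp <- predictive_biomarker_imp(model)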
kable_styling(kable(x=biomarker_imp, format="html", caption = "Biomarker importance table"),
bootstrap_options="striped",font_size=16)
Biomarker | Importance |
---|---|
x10 | 0.6100485 |
x2 | 0.1428244 |
x8 | 0.1083907 |
x5 | 0.0805473 |
x1 | 0.0581890 |
Based on the results, biomarker x10 emerges as the most important biomarker.
subgroup_results=get_subgroup_results(model, X_feature, subgroup_label=true_label, cutoff = 0.5)
kable_styling(kable(x=head(subgroup_results$assignment), format="html", caption = "The first 6 subjects' subgroup assignment"),
bootstrap_options="striped",font_size=16)
Patient_ID | Treatment_Assignment |
---|---|
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 1 |
6 | 0 |
cat("Prediction accuracy of true subgroup label:\n")
#> Prediction accuracy of true subgroup label:
cat(subgroup_results$acc)
#> 0.535
Users can freely tune the parameters of the XGBoostSub-based models to improve identification of the true subgroups. In this example, we used an arbitrary parameter setting for illustration purposes. Note, however, that when applying these models to real data, the ground-truth labels are unknown, so you cannot tune the parameters against prediction accuracy; a sketch of a simple grid screen in this simulated setting follows.
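The sketch below screens a small hyperparameter grid by subgroup accuracy, which is only possible here because true_label is available in the simulated data; it is not a recommended tuning protocol.
# A sketch: screen a small hyperparameter grid by subgroup accuracy.
# Only feasible in simulation, where the true subgroup labels are known.
y_bin <- tutorial_data$y.bin
for (lr in c(0.005, 0.01, 0.05)) {
  for (depth in c(1, 2, 3)) {
    m <- XGBoostSub_bin(X_feature, y_bin, trt, pi, Loss_type = "A_learning",
                        params = list(learning_rate = lr, max_depth = depth,
                                      lambda = 5, tree_method = 'hist'),
                        nrounds = 200, disable_default_eval_metric = 0,
                        verbose = FALSE)
    res <- get_subgroup_results(m, X_feature, subgroup_label = true_label,
                                cutoff = 0.5)
    cat("learning_rate =", lr, ", max_depth =", depth, ", accuracy =", res$acc, "\n")
  }
}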
Continuing our assessment, we aim to determine whether the identified biomarker is prognostic. To do so, we examine the association between the binary response variable y.bin and biomarker x10.
roc_bin_plot (yvar='y.bin', xvars="x10", dirs="auto", data=tutorial_data, yvar.display='y.bin', xvars.display="Biomarker x10")
The association between the binary response variable y.bin and biomarker x2, identified as the second most important biomarker by the XGBoostSub_bin model, is shown below.
roc_bin_plot (yvar='y.bin', xvars="x2", dirs="auto", data=tutorial_data, yvar.display='y.bin', xvars.display="Biomarker x2")
We select the optimal cutoff for the identified biomarker x10 from a candidate list of cutoff values. This capability is particularly valuable for companion diagnostics (CDx) development when working with a limited set of candidate cutoffs (e.g., IHC).
cutoff_result_bin = fixcut_bin(yvar='y.bin', xvar="x10", dir=">",
                               cutoffs=c(-0.1,-0.3,-0.5,0.1,0.3,0.5),
                               data=tutorial_data, method="Fisher",
                               yvar.display='Response_Y', xvar.display='Biomarker_X10',
                               vert.x=F)
cutoff_result_bin$fig
Here, users can select the optimal cutoff based on various metrics such as sensitivity, specificity, PPV (Positive Predictive Value), and NPV (Negative Predictive Value). The choice of metric depends on the specific research context. For instance, if the goal is to enrich responders, focusing on PPV might be more appropriate. Based on the highest PPV value, users can then select the optimal cutoff.
If accuracy is our primary concern, the optimal cutoff value is 0.5. In this case, we define the biomarker positive group using a cutoff of 0.5: this group comprises subjects with x10 values greater than 0.5, while the biomarker negative group consists of subjects with x10 values less than or equal to 0.5. To assess the performance between the biomarker positive and negative groups, we first employ the cut_perf function.
res = cut_perf(yvar="y.con", censorvar=NULL, xvar="x10", cutoff=c(0.5), dir=">",
               xvars.adj=NULL, data=tutorial_data, type='c',
               yvar.display='y.con', xvar.display="biomarker x10")
kable_styling(kable(x=res$comp.res.display[,-2:-4], format="html", caption = "Performance at optimal cutoff"),
bootstrap_options="striped",font_size=16)
Response | Median.Diff | Mean.Diff.CI | t.pvalue | WRS.pvalue |
---|---|---|---|---|
y.con | 0.22 | 0.2 (NA , NA) | NA | NA |
Next, the subgrp_perf_pred function will provide the predictive model performance based on the defined subgroup.
tutorial_data$biogroup=ifelse(tutorial_data$x10>0.5,'biomarker_positive','biomarker_negative')
res = subgrp_perf_pred(yvar="y.time", censorvar="y.event", grpvar="biogroup",
                       grpname=c("biomarker_positive","biomarker_negative"),
                       trtvar="treatment_categorical", trtname=c("Placebo","Treatment"),
                       xvars.adj=NULL, data=tutorial_data, type="s")
kable_styling(kable(x=res$group.res.display, format="html", caption = "BioSubgroup Stat"),
bootstrap_options="striped",font_size=16)
Group | Response | N | Events | Mean.Surv.CI | Med.Surv.CI |
---|---|---|---|---|---|
treatment_categorical=Placebo,biogroup=biomarker_positive | y.time | 191 | 73 | 3.94 (2.27 , 5.6) | 1.83 (1.46 , 2.47) |
treatment_categorical=Placebo,biogroup=biomarker_negative | y.time | 322 | 130 | 3.98 (2.52 , 5.43) | 2.03 (1.74 , 2.44) |
treatment_categorical=Treatment,biogroup=biomarker_positive | y.time | 181 | 79 | 3.19 (2.35 , 4.04) | 1.71 (1.16 , 2.14) |
treatment_categorical=Treatment,biogroup=biomarker_negative | y.time | 306 | 129 | 4.68 (3.46 , 5.89) | 2.96 (2.12 , 3.73) |
From the KM curve, it appears that 0.5 may not be the optimal cutoff for x10. Further evaluation using the fixcut_bin function with different candidate cutoffs is suggested.
# Prepare the data for training the XGBoostSub_sur model
y_label=tutorial_data$y.time
delta=tutorial_data$y.event
# Train the XGBoostSub_sur model using the specified parameters
model <- XGBoostSub_sur(X_feature, y_label, trt, pi, delta, Loss_type = "A_learning",
                        params = list(learning_rate = 0.005, max_depth = 1, lambda = 9,
                                      gamma = 1, min_child_weight = 1, max_bin = 128,
                                      tree_method = 'exact', subsample = 0.8),
                        nrounds = 250, disable_default_eval_metric = 1, verbose = FALSE)
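As with the other models, the importance table below is produced by predictive_biomarker_imp; we assume the fitted XGBoostSub_sur model can be passed directly.
# A sketch: importance rankings from the trained XGBoostSub_sur model.
biomarker_imp <- predictive_biomarker_imp(model)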
kable_styling(kable(x=biomarker_imp, format="html", caption = "Biomarker importance table"),
bootstrap_options="striped",font_size=16)
Biomarker | Importance |
---|---|
x6 | 0.6514610 |
x2 | 0.2944645 |
x1 | 0.0166948 |
x4 | 0.0127312 |
x10 | 0.0077685 |
x7 | 0.0070446 |
x8 | 0.0051927 |
x9 | 0.0024906 |
x5 | 0.0021522 |
subgroup_results=get_subgroup_results(model, X_feature, subgroup_label=true_label, cutoff = 0.5)
kable_styling(kable(x=head(subgroup_results$assignment), format="html", caption = "The first 6 subjects' subgroup assignment"),
bootstrap_options="striped",font_size=16)
Patient_ID | Treatment_Assignment |
---|---|
1 | 1 |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 1 |
6 | 0 |
Subgroup analysis often involves dividing subjects into different groups based on categorical variables, such as baseline characteristics or genetic mutation variables. In this package, we provide the cat_summary function to summarize these categorical variables. For example, the tutorial dataset includes a variable called risk_category, which takes the values High Risk, Intermediate Risk, and Low Risk. First, we use this function to determine the distribution of subjects in each subgroup.
res = cat_summary(yvar="risk_category", yname=c("High Risk","Intermediate Risk","Low Risk"),
                  xvars="treatment_categorical", xname.list=list(c("Placebo","Treatment")),
                  data=tutorial_data)
kable_styling(kable(x=res$cont.display, format="html", caption = "Contingency Table"),
bootstrap_options="striped",font_size=16)
treatment_categorical | risk_category = High Risk | risk_category = Intermediate Risk | risk_category = Low Risk |
---|---|---|---|
Placebo | 128 (25%) | 263 (51.3%) | 122 (23.8%) |
Treatment | 122 (25.1%) | 237 (48.7%) | 128 (26.3%) |
Then, we can also visualize the KM curves based on the predefined risk groups and treatment/control groups using the subgrp_perf_pred function.
res = subgrp_perf_pred(yvar="y.time", censorvar="y.event", grpvar="risk_category",
                       grpname=c("High Risk","Intermediate Risk","Low Risk"),
                       trtvar="treatment_categorical", trtname=c("Placebo","Treatment"),
                       xvars.adj=NULL, data=tutorial_data, type="s")
kable_styling(kable(x=res$group.res.display, format="html", caption = "Subgroup Stat"),
bootstrap_options="striped",font_size=16)
Group | Response | N | Events | Mean.Surv.CI | Med.Surv.CI |
---|---|---|---|---|---|
treatment_categorical=Placebo,risk_category=High Risk | y.time | 128 | 4 | 22.57 (3.96 , 41.17) | NA (0.13 , NA) |
treatment_categorical=Placebo,risk_category=Intermediate Risk | y.time | 263 | 110 | 2.95 (-0.03 , 5.92) | 1.12 (0.99 , 1.21) |
treatment_categorical=Placebo,risk_category=Low Risk | y.time | 122 | 89 | 6.18 (4.38 , 7.98) | 3.31 (2.87 , 4.33) |
treatment_categorical=Treatment,risk_category=High Risk | y.time | 122 | 4 | 0.12 (0.11 , 0.13) | 0.13 (0.12 , NA) |
treatment_categorical=Treatment,risk_category=Intermediate Risk | y.time | 237 | 108 | 0.96 (0.89 , 1.04) | 0.99 (0.82 , 1.11) |
treatment_categorical=Treatment,risk_category=Low Risk | y.time | 128 | 96 | 6.49 (5.19 , 7.79) | 4.41 (3.86 , 5.2) |