BioPred Package Tutorial

Introduction

The R package BioPred offers a suite of tools for subgroup and biomarker analysis in precision medicine. Leveraging Extreme Gradient Boosting (XGBoost) along with propensity score weighting and A-learning methods, BioPred facilitates the optimization of individualized treatment rules (ITR) to streamline sub-group identification. BioPred also enables the identification of predictive biomarkers and obtaining their importance rankings. Moreover, the package provides graphical plots tailored for biomarker analysis. This tool enables clinical researchers seeking to enhance their understanding of biomarkers and patient population in drug development.

Install from GitHub

install_github(“deeplearner0731/BioPred”)

Install from CRAN

install.packages(“BioPred”)

Loading the Data

The ‘tutorial_data’ dataset is included with the package. Let’s load and inspect it:

library(BioPred)

kable_styling(kable(x=head(tutorial_data), format="html", caption = "The first 6 subjects"),
              bootstrap_options="striped",font_size=16)

The first 6 subjects
x1	x2	x3	x4	x5	x6	x7	x8	x9	x10	y.con	y.bin	y.time	treatment	subgroup_label	risk_category	treatment_categorical
-2.3744988	0.0167914	-2.5412961	-2.0901502	-2.6896868	-1.3455564	-2.4931001	-0.3483110	-1.5673804	0.4836952	0.0992838	0	0.0874949	0	1	High Risk	Placebo
-0.1838847	1.8222234	0.4601912	0.5497404	1.6449981	0.5433581	0.2364084	-2.7868360	-0.4006381	0.6874968	0.0711574	0	0.9534603	0	0	Intermediate Risk	Placebo
0.0539828	-1.9142628	1.3112035	0.6543154	-0.0653439	0.5491687	1.8201585	2.5079087	-0.4686039	1.1696815	-1.2151804	0	0.0005331	1	1	High Risk	Treatment
1.1048102	0.8346022	-1.9808589	-0.5632687	0.3344664	1.4275303	0.1324287	2.1463655	1.3107915	0.4411784	0.2773950	0	0.0435684	1	0	High Risk	Treatment
-0.4179985	-2.2484687	3.3396929	3.1896045	-0.5779621	-1.4963500	0.4454772	0.1323151	1.4211205	-1.5240744	-2.0069643	0	0.1050067	1	1	High Risk	Treatment
1.0177850	0.1045029	1.4632138	1.0294900	2.0912418	2.2324576	0.5073059	-1.4397678	1.5456243	0.4183628	1.4521578	1	0.1018182	0	0	High Risk	Placebo


rownames(tutorial_data)=NULL

Data Description

The tutorial_data dataset contains the following columns:

x1 to x10: Biomarker variables.
y.con: A continuous outcome variable.
y.bin: A binary outcome variable (0 or 1).
y.time: Time in months, used for survival analysis.
y.event: Event indicator (0 for censoring, 1 for event occurrence).
subgroup_label: Ground truth of subgroup label. In real-world scenarios, this information is typically unavailable.
treatment: Treatment indicator (0 for control, 1 for treatment).
treatment_categorical: Categorical treatment variable with levels “Placebo” and “Treatment”.
risk_category: Risk category

1 Example Analysis for Continuous Outcomes

1.1 Training the XGBoostSub_con Model

We will begin by training the XGBoostSub_con model for a continuous outcome. This involves using the A-learning or Weight-learning loss function. Note that the treatment variable must be converted to (1, -1). If your treatment variable is (1, 0), you will need to convert it accordingly.

X_feature=tutorial_data[,c("x1", "x2" ,"x3" ,"x4","x5","x6","x7","x8","x9","x10")]
y_label=tutorial_data$y.con
# Convert treatment variable to (1 -1) format (1 for treatment, -1 for control)
trt=ifelse(tutorial_data$treatment==1, 1, -1)
true_label=tutorial_data$subgroup_label
# Estimate the propensity score using logistic regression
data_logit=tutorial_data[,c("treatment","x1", "x2" ,"x3" ,"x4" , "x5", "x6" , "x7","x8","x9","x10")]
logit_model <- glm(treatment~ ., data = data_logit, family = binomial)
pi <- predict(logit_model, type = "response")
# Train the XGBoostSub_con model using the specified parameters
model <- XGBoostSub_con(X_feature, y_label, trt ,pi,Loss_type = "A_learning", params = list(learning_rate = 0.005, max_depth = 1, lambda = 5, tree_method = 'hist'), nrounds = 200, disable_default_eval_metric = 0, verbose = FALSE)

After training the model, you can print the loss value at each epoch using the following code:

cat("Evaluation loss:\n")
#> Evaluation loss:
print(model$evaluation_log)
#>       iter train_A_loss
#>      <num>        <num>
#>   1:     1     1.375787
#>   2:     2     1.375424
#>   3:     3     1.375065
#>   4:     4     1.374708
#>   5:     5     1.374355
#>  ---                   
#> 196:   196     1.339571
#> 197:   197     1.339486
#> 198:   198     1.339402
#> 199:   199     1.339319
#> 200:   200     1.339236

You can also evaluate the loss on independent datasets using the eval_metric_con function. In this example, we’ll use the training data as test data for illustration purposes. However, you may use a new dataset to see the loss based on the trained model.

eval_metric_test <- eval_metric_con(model, X_feature, y_label, pi, trt, Loss_type = "A_learning")
cat("Testing Evaluation Result:\n")
#> Testing Evaluation Result:
print(eval_metric_test$value)
#> [1] 1.339236

1.2 Evaluating Predictive Biomarker Importance with XGBoostSub_con

Next, we evaluate the importance of predictive biomarkers using the predictive_biomarker_imp function.

biomarker_imp=predictive_biomarker_imp(model)

kable_styling(kable(x=biomarker_imp, format="html", caption = "Biomarker importance table"),
              bootstrap_options="striped",font_size=16)

Biomarker importance table
Biomarker	Importance
x2	0.4904407
x1	0.4793201
x10	0.0302392

From the results, we observe that the most important predictive biomarker is x2.

1.3 Obtaining subgroup results with XGBoostSub_con

You can obtain the subgroup results using the get_subgroup_results function.

subgroup_results=get_subgroup_results(model, X_feature, subgroup_label=true_label, cutoff = 0.5)
kable_styling(kable(x=head(subgroup_results$assignment), format="html", caption = "The first 6 subjects subgroup assigentment"),
              bootstrap_options="striped",font_size=16)

The first 6 subjects subgroup assigentment
Patient_ID	Treatment_Assignment
1	1
2	1
3	1
4	0
5	1
6	1

cat("Prediction accuracy of true subgroup label:\n")
#> Prediction accuracy of true subgroup label:
cat(subgroup_results$acc)
#> 0.634

Note that when using the get_subgroup_results function with real data, the true subgroup label is typically unknown. In such cases, set subgroup_label = NULL. This will return only the subgroup assignment without the prediction accuracy of the subgroup.

1.4 Commonly used subgroup/biomarker analysis tools

Given that x2 was identified as the most important predictive biomarker by XGBoostSub_con, we will use the cdf_plot function to evaluate the distribution of this biomarker by assessing the percentage of subjects falling within different cutoff values.

cdf_plot (xvar="x2", data=tutorial_data, y.int=5, xlim=NULL, xvar.display="Biomarker_X2", group=NULL)

We will also evaluate x2 prognostic capability. To determine if this biomarker is prognostic, we can assess the association between the response variable y.con and the biomarker x2 using the scat_cont_plot function.

scat_cont_plot(
  yvar='y.con',
  xvar='x2',
  data=tutorial_data,
  ybreaks = NULL,
  xbreaks = NULL,
  yvar.display = 'y',
  xvar.display = "Biomarker_X2"
)
#> $Cor.Pearson
#> [1] -0.12
#> 
#> $Cor.Spearman
#> [1] -0.12
#> 
#> $fig

#> 
#> $slope
#> [1] -0.1013384
#> 
#> $intercept
#> [1] 0.6769244

Selecting the cutoff value for a predictive biomarker is important in clinical practice, as it can, for example, help enrich responders. We select the optimal cutoff for the identified biomarker x2 from a list of candidate cutoff values using the fixcut_con function.

cutoff_result_con=fixcut_con(yvar='y.con', xvar="x2", dir=">", cutoffs=c(-0.1,-0.3,-0.5,0.1,0.3,0.5), data=tutorial_data, method="t.test", yvar.display='y.con', xvar.display='biomarker x2', vert.x=F)

cutoff_result_con$fig

From the results, the optimal cutoff value is 0.5. Next, we define the biomarker positive group using a cutoff of 0.5. The biomarker positive group includes subjects with x2 values less than or equal to 0.5, while the biomarker negative group includes subjects with x2 values greater than 0.5. We first use the cut_perf() function to evaluate the performance between the biomarker positive and negative groups.

res=cut_perf (yvar="y.con", censorvar=NULL, xvar="x2", cutoff=c(0.5), dir="<=", xvars.adj=NULL, data=tutorial_data, type='c', yvar.display='y.con', xvar.display="biomarker x2")
kable_styling(kable(x=res$comp.res.display[,-2:-4], format="html", caption = "Performace at optimal cutoff"),
              bootstrap_options="striped",font_size=16)

Performace at optimal cutoff
Response	Median.Diff	Mean.Diff.CI	t.pvalue	WRS.pvalue
y.con	-0.3	-0.38 (NA , NA)	NA	NA

Next, we will further assess the predictive model performance by examining the differences between the treatment and control groups within both the biomarker positive and negative groups.

tutorial_data$biogroup=ifelse(tutorial_data$x2<=0.5,'biomarker_positive','biomarker_negative')

res = subgrp_perf_pred(yvar="y.time", censorvar="y.event", grpvar="biogroup", grpname=c("biomarker_positive",'biomarker_negative'),trtvar="treatment_categorical", trtname=c("Placebo" , "Treatment"), xvars.adj=NULL,data=tutorial_data, type="s")
kable_styling(kable(x=res$group.res.display, format="html", caption = "BioSubgroup Stat"),
              bootstrap_options="striped",font_size=16)

BioSubgroup Stat
	Response	N	Events	Mean.Surv.CI	Med.Surv.CI
treatment_categorical=Placebo,biogroup=biomarker_positive	y.time	321	127	3.7 (2.49 , 4.92)	1.99 (1.65 , 2.7)
treatment_categorical=Placebo,biogroup=biomarker_negative	y.time	192	76	4.49 (2.37 , 6.61)	1.83 (1.58 , 2.27)
treatment_categorical=Treatment,biogroup=biomarker_positive	y.time	321	133	4.84 (3.66 , 6.02)	2.85 (2.11 , 3.52)
treatment_categorical=Treatment,biogroup=biomarker_negative	y.time	166	75	2.63 (1.93 , 3.32)	1.51 (0.84 , 2.14)

res$fig

From the Kaplan-Meier (KM) curve, a difference between the treatment and control groups is observed in the biomarker positive group. This difference is not observed in the biomarker negative group, validating that biomarker x2 identified by XGBoostSub_con is predictive.

2 Example Analysis for Binary Outcomes

2.1 Train the XGBoostSub_bin Model

Similar to training the XGBoostSub_con model for continuous outcomes, we can also train the XGBoostSub_bin model for binary outcomes.

y_label=tutorial_data$y.bin
model <- XGBoostSub_bin(X_feature, y_label, trt ,pi,Loss_type = "A_learning", params = list(learning_rate = 0.01, max_depth = 1, lambda = 5, tree_method = 'hist'), nrounds = 300, disable_default_eval_metric = 0, verbose = FALSE)

2.2 Predictive biomarker importance

biomarker_imp=predictive_biomarker_imp(model)

kable_styling(kable(x=biomarker_imp, format="html", caption = "Biomarker importance table"),
              bootstrap_options="striped",font_size=16)

Biomarker importance table
Biomarker	Importance
x10	0.6100485
x2	0.1428244
x8	0.1083907
x5	0.0805473
x1	0.0581890

Based on the results, biomarker x10 emerges as the most important biomarker.

2.3 Get subgroup results

subgroup_results=get_subgroup_results(model, X_feature, subgroup_label=true_label, cutoff = 0.5)
kable_styling(kable(x=head(subgroup_results$assignment), format="html", caption = "The first 6 subjects subgroup assigentment"),
              bootstrap_options="striped",font_size=16)

The first 6 subjects subgroup assigentment
Patient_ID	Treatment_Assignment
1	0
2	0
3	0
4	0
5	1
6	0

cat("Prediction accuracy of true subgroup label:\n")
#> Prediction accuracy of true subgroup label:
cat(subgroup_results$acc)
#> 0.535

Users can freely tune the parameters in the XGBoostSub-based models to enhance the performance of identifying true subgroups. In this example, we provide a random parameter setting for illustration purposes. However, when applying this model to real data, the ground truth labels are unknown, which means you may not be able to tune the parameters based on prediction accuracy.

2.4 Post-Hoc analysis based on subgroup and biomarker results

Continuing our assessment, we aim to determine whether the identified biomarker is prognostic or not. To do so, we examine the association between the binary response variable y.bin and biomarker x10.

roc_bin_plot (yvar='y.bin', xvars="x10", dirs="auto", data=tutorial_data, yvar.display='y.bin', xvars.display="Biomarker x10")

The association between the binary response variable y.bin and biomarker x2, identified as the second most important biomarker by the XGBoostSub_bin model, is shown below.

roc_bin_plot (yvar='y.bin', xvars="x2", dirs="auto", data=tutorial_data, yvar.display='y.bin', xvars.display="Biomarker x2")

We select the optimal cutoff of identified biomarker x10 from a candidate list of cutoff values This capability is particularly valuable for companion diagnostics (CDx) development when working with a limited set of candidate cutoffs (e.g., IHC).

cutoff_result_bin=fixcut_bin (yvar='y.bin', xvar="x10", dir=">", cutoffs=c(-0.1,-0.3,-0.5,0.1,0.3,0.5), data=tutorial_data, method="Fisher", yvar.display='Response_Y', xvar.display='Biomarker_X10', vert.x=F)
cutoff_result_bin$fig

Here, users can select the optimal cutoff based on various metrics such as sensitivity, specificity, PPV (Positive Predictive Value), and NPV (Negative Predictive Value). The choice of metric depends on the specific research context. For instance, if the goal is to enrich responders, focusing on PPV might be more appropriate. Based on the highest PPV value, users can then select the optimal cutoff.

If accuracy is our primary concern, the optimal cutoff value is 0.5. In this case, we define the biomarker positive group using a cutoff of 0.5. This group comprises subjects with x10 values greater than or equal to 0.5, while the biomarker negative group consists of subjects with x10 values less than 0.5. To assess the performance between the biomarker positive and negative groups, we initially employ the cut_perf function.

res=cut_perf (yvar="y.con", censorvar=NULL, xvar="x10", cutoff=c(0.5), dir=">", xvars.adj=NULL, data=tutorial_data, type='c', yvar.display='y.con', xvar.display="biomarker x2")
kable_styling(kable(x=res$comp.res.display[,-2:-4], format="html", caption = "caption"),
              bootstrap_options="striped",font_size=16)

caption
Response	Median.Diff	Mean.Diff.CI	t.pvalue	WRS.pvalue
y.con	0.22	0.2 (NA , NA)	NA	NA

Next, the subgrp_perf_pred function will provide the predictive model performance based on the defined subgroup.

tutorial_data$biogroup=ifelse(tutorial_data$x10>0.5,'biomarker_positive','biomarker_negative')

res = subgrp_perf_pred(yvar="y.time", censorvar="y.event", grpvar="biogroup", grpname=c("biomarker_positive",'biomarker_negative'),trtvar="treatment_categorical", trtname=c("Placebo" , "Treatment"), xvars.adj=NULL,data=tutorial_data, type="s")
kable_styling(kable(x=res$group.res.display, format="html", caption = "BioSubgroup Stat"),
              bootstrap_options="striped",font_size=16)

BioSubgroup Stat
	Response	N	Events	Mean.Surv.CI	Med.Surv.CI
treatment_categorical=Placebo,biogroup=biomarker_positive	y.time	191	73	3.94 (2.27 , 5.6)	1.83 (1.46 , 2.47)
treatment_categorical=Placebo,biogroup=biomarker_negative	y.time	322	130	3.98 (2.52 , 5.43)	2.03 (1.74 , 2.44)
treatment_categorical=Treatment,biogroup=biomarker_positive	y.time	181	79	3.19 (2.35 , 4.04)	1.71 (1.16 , 2.14)
treatment_categorical=Treatment,biogroup=biomarker_negative	y.time	306	129	4.68 (3.46 , 5.89)	2.96 (2.12 , 3.73)

res$fig

From the KM curve, it appears that 0.5 for X10 may not be the optimal cutoff. Further evaluation using the fixcut_bin function with different candidate optimal cutoffs is suggested.

3 Example Analysis for time-to-event outcome

3.1 Train the XGBoostSub_sur model

# Prepare the data for training the XGBoostSub_sur
y_label=tutorial_data$y.time
delta=tutorial_data$y.event

# Train the XGBoostSub_sur model using the specified parameters
model <- XGBoostSub_sur(X_feature, y_label, trt ,pi,delta,Loss_type = "A_learning", params=list(learning_rate = 0.005, max_depth = 1,lambda = 9, gamma=1, min_child_weight=1,
max_bin=128, tree_method = 'exact',subsample=0.8), nrounds = 250, disable_default_eval_metric = 1, verbose = FALSE)

3.1 Predictive biomarker importance using XGBoostSub_sur

biomarker_imp=predictive_biomarker_imp(model)

kable_styling(kable(x=biomarker_imp, format="html", caption = "Biomarker importance table"),
              bootstrap_options="striped",font_size=16)

Biomarker importance table
Biomarker	Importance
x6	0.6514610
x2	0.2944645
x1	0.0166948
x4	0.0127312
x10	0.0077685
x7	0.0070446
x8	0.0051927
x9	0.0024906
x5	0.0021522

3.2 Get subgroup results based on XGBoostSub_sur

subgroup_results=get_subgroup_results(model, X_feature, subgroup_label=true_label, cutoff = 0.5)
kable_styling(kable(x=head(subgroup_results$assignment), format="html", caption = "The first 6 subjects subgroup assigentment"),
              bootstrap_options="striped",font_size=16)

The first 6 subjects subgroup assigentment
Patient_ID	Treatment_Assignment
1	1
2	0
3	0
4	0
5	1
6	0

cat("Prediction accuracy of true subgroup label:\n")
#> Prediction accuracy of true subgroup label:
cat(subgroup_results$acc)
#> 0.527

4. Other functions for biomarker analysis in clinical practice

Subgroup analysis often involves dividing subjects into different groups based on categorical variables, such as baseline characteristics or genetic mutation variables. In this package, we provide the cat_summary function to summarize these categorical variables. For example, in the tutorial dataset, there is a variable called risk_category, which takes values like High Risk, Intermediate Risk, and Low Risk. First, we use this function to determine the distribution of subjects in each subgroup.

res = cat_summary(yvar="risk_category", yname=c("High Risk","Intermediate Risk" ,"Low Risk"), xvars="treatment_categorical", xname.list=list(c("Placebo" , "Treatment")), data=tutorial_data)

kable_styling(kable(x=res$cont.display, format="html", caption = "Contingency Table"),
              bootstrap_options="striped",font_size=16)

Contingency Table
treatment_categorical	risk_category = High Risk	risk_category = Intermediate Risk	risk_category = Low Risk
Placebo	128 (25%)	263 (51.3%)	122 (23.8%)
Treatment	122 (25.1%)	237 (48.7%)	128 (26.3%)

Then, we can also visualize the KM curves based on predefined risk groups and treatment/control groups using the subgrp_perf_pred function.

res = subgrp_perf_pred(yvar="y.time", censorvar="y.event", grpvar="risk_category", grpname=c("High Risk","Intermediate Risk" ,"Low Risk"),trtvar="treatment_categorical", trtname=c("Placebo" , "Treatment"), xvars.adj=NULL,data=tutorial_data, type="s")
kable_styling(kable(x=res$group.res.display, format="html", caption = "Subgroup Stat"),
              bootstrap_options="striped",font_size=16)

Subgroup Stat
	Response	N	Events	Mean.Surv.CI	Med.Surv.CI
treatment_categorical=Placebo,risk_category=High Risk	y.time	128	4	22.57 (3.96 , 41.17)	NA (0.13 , NA)
treatment_categorical=Placebo,risk_category=Intermediate Risk	y.time	263	110	2.95 (-0.03 , 5.92)	1.12 (0.99 , 1.21)
treatment_categorical=Placebo,risk_category=Low Risk	y.time	122	89	6.18 (4.38 , 7.98)	3.31 (2.87 , 4.33)
treatment_categorical=Treatment,risk_category=High Risk	y.time	122	4	0.12 (0.11 , 0.13)	0.13 (0.12 , NA)
treatment_categorical=Treatment,risk_category=Intermediate Risk	y.time	237	108	0.96 (0.89 , 1.04)	0.99 (0.82 , 1.11)
treatment_categorical=Treatment,risk_category=Low Risk	y.time	128	96	6.49 (5.19 , 7.79)	4.41 (3.86 , 5.2)

res$fig