--- title: "Bayesian calibration of two-arm one-stage Bayes factor designs with binary endpoints" output: rmarkdown::html_vignette bibliography: references.bib vignette: > %\VignetteIndexEntry{Bayesian calibration of two-arm one-stage Bayes factor designs with binary endpoints} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} author: | | Riko Kelter | Institute of Medical Statistics and Computational Biology | Faculty of Medicine | University of Cologne | Cologne, Germany date: "`r format(Sys.Date(), '%d %B %Y')`" --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, echo = TRUE, eval = TRUE, # default, but heavy chunks override with eval = FALSE message = FALSE, warning = FALSE ) options(bfbin2arm.ncores = 1L) library(bfbin2arm) ``` # Introduction and Overview In this vignette, we illustrate how to calibrate a two-arm one stage phase II design with binary endpoints from a Bayesian perspective. Details on the methodology can be found in [@kelter_power_2026]. Our main assumption here is that the observed data in both groups are from two random variables $Y_1,Y_2$ which both follow a binomial distribution with parameters $n_1$ and $n_2$ and $p_1$ respectively $p_2$, $$Y_1\sim \mathrm{Bin}(n_1,p_1), \hspace{1cm} Y_2\sim \mathrm{Bin}(n_2,p_2)$$ ## Hypothesis tests In its current form, the package implements four different hypothesis tests for such trials: $$H_0:p_1=p_2 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:p_1\neq p_2$$ Alternatively, a well-known parameterization of this test introduces a difference parameter $\eta=p_2-p_1$ and the grand mean $\zeta=\frac{1}{2}(p_1+p_2)$. 
Using this parameterization, we have
$$p_1=\zeta-\frac{\eta}{2}, \hspace{1cm} p_2=\zeta+\frac{\eta}{2}$$
and the hypotheses can be rewritten as:
$$H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta \neq 0$$

In addition to this two-sided test, three directional tests are available in the package:

- $$H_0:\eta \leq 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta > 0$$
- $$H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta > 0$$
- $$H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta < 0$$

For each of the four tests, a separate Bayes factor exists and can be used. For the two-sided test, we denote the Bayes factor as $BF_{01}$, and for the three directional tests above we denote the Bayes factors as $BF_{+-}$, $BF_{+0}$ and $BF_{-0}$. The notation $BF_{+-}$ reflects that the test of $H_0:\eta \leq 0$ versus $H_1:\eta >0$ can also be written as $H_-:p_2 \leq p_1$ versus $H_+:p_2 > p_1$.

## Design and analysis priors

A natural choice for the priors is the Beta distribution: the $\mathrm{Beta}(a_0,b_0)$ distribution is a conjugate prior for the binomial likelihood, so when it is chosen as the prior, the posterior $P_{p \mid Y}$ is also Beta-distributed. Under $H_0$, we assume a Beta design prior on the common success probability:
$$p_1 =p_2 = p\mid H_0 \sim \mathrm{Beta}(a_0^d,b_0^d)$$
Thus, under $H_0:\eta = 0$, both probabilities are identical, $p_1=p_2$, and take some value $p\in [0,1]$, which has a Beta design prior.
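The conjugacy property mentioned above can be checked with a few lines of base R. This is a minimal sketch, not package code: the counts `y = 7` and `n = 20` are made up for illustration, and the variable names are hypothetical. It writes down the closed-form Beta posterior and compares it with a numerically normalized posterior.

```{r}
# Flat Beta(1, 1) prior and made-up data (illustration only, not trial data)
a0 <- 1; b0 <- 1
y <- 7; n <- 20

# Closed-form conjugate update: Beta(a0 + y, b0 + n - y)
a_post <- a0 + y
b_post <- b0 + n - y
post_mean <- a_post / (a_post + b_post)

# Numerical check: normalize likelihood x prior and compare with dbeta()
unnorm <- function(p) dbinom(y, n, p) * dbeta(p, a0, b0)
norm_const <- integrate(unnorm, lower = 0, upper = 1)$value
p_grid <- seq(0.05, 0.95, by = 0.05)
max_diff <- max(abs(unnorm(p_grid) / norm_const - dbeta(p_grid, a_post, b_post)))
```

With these inputs the posterior is $\mathrm{Beta}(8, 14)$ with mean $8/22 \approx 0.36$, and `max_diff` is numerically negligible.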
Under $H_1:\eta \neq 0$, we likewise pick independent Beta design priors:
$$p_1 \mid H_1 \sim \mathrm{Beta}(a_1^d,b_1^d), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^d,b_2^d)$$
For the analysis priors $P_{p_1}^a$, $P_{p_2}^a$ under $H_1$, we also choose independent Beta priors, with possibly different values $a_i^a$ and $b_i^a$ for $i=1,2$, where the superscript $a$ signals that the hyperparameters belong to the analysis prior instead of the design prior:
$$p_1 \mid H_1 \sim \mathrm{Beta}(a_1^a,b_1^a), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^a,b_2^a)$$
Lastly, for the analysis prior $P_{p}^a$ under $H_0:\eta=0$, we choose a Dirac prior with all probability mass on $\eta=p_2-p_1=0$, combined with a uniform prior on $\zeta$, that is
$$\eta \mid H_0 \sim \delta_0, \hspace{1cm} \zeta \mid H_0 \sim U(0,1)$$
so that $p_1=p_2=p=\zeta$ under $H_0$ in the analysis with the Bayes factor.

# Using the package

First, we load the package after installation:

```{r}
library(bfbin2arm)
```

Next, we illustrate the main calibration function for a two-arm one-stage trial by re-analyzing a phase II trial in oncology. While no Bayesian approach was used in the original statistical analysis of the trial, the step-by-step walkthrough below showcases what a structured approach to designing and calibrating a Bayesian two-arm one-stage phase II trial with the `bfbin2arm` package looks like. Importantly, the trial must have two trial arms (treatment and control) and a binary endpoint. We further assume that one of the four tests detailed above is carried out using Bayes factors as the test criterion.

## ICT-107 Phase II Trial Overview

The ICT-107 trial [@wenRandomizedDoubleBlindPlaceboControlled2019] was a randomized phase II study in newly diagnosed glioblastoma patients (n=124, 2:1 randomization). The primary binary endpoint is progression status at 6 months (PFS6), and the secondary binary endpoint is immunologic response status. Here, we focus on the secondary endpoint for illustration purposes.
**Reported results** (ITT population): - ICT-107 (n=82): 49/82 responders= **59.7% response rate** - Control (n=42): 12/42 responders = **35.7% response rate** ## 1. Bayes Factor Analysis We start by calculating the Bayes factor(s) for the ICT-107 trial data: ```{r} ## ------------------------------------------------------------- ## 2. ICT-107 trial (immunologic response) ## Placebo (control): 12 responders, 31 non-responders ## ICT-107 (treatment): 49 responders, 32 non-responders ## ------------------------------------------------------------- y1_ict <- 12 # control successes n1_ict <- 12 + 31 y2_ict <- 49 # treatment successes n2_ict <- 49 + 32 cat("\n=== ICT-107 Trial (n1 =", n1_ict, ", n2 =", n2_ict, ") ===\n") # BF01 BF01_ict = twoarmbinbf01(y1_ict, y2_ict, n1_ict, n2_ict, a_0_a = 1, b_0_a = 1, a_1_a = 1, b_1_a = 1, a_2_a = 1, b_2_a = 1) # BF+1 BFp1_ict = BFplus1(y1_ict, y2_ict, n1_ict, n2_ict, a_1_a = 1, b_1_a = 1, a_2_a = 1, b_2_a = 1) # BF-1 BFm1_ict = BFminus1(y1_ict, y2_ict, n1_ict, n2_ict, a_1_a = 1, b_1_a = 1, a_2_a = 1, b_2_a = 1) # BF+0 cat("=== ICT-107 Trial === Bayes factor BF+0 results in ", BFplus0(BFp1_ict, BF01_ict)) # BF+- cat("=== ICT-107 Trial === Bayes factor BF+- results in ", BFplusMinus(BFp1_ict, BFm1_ict)) ``` The most relevant Bayes factor here is $BF_{+-}$, because it is directional and leaves open the possibility of the placebo group having a larger response rate than the treatment group. Note that the hyperparameters of the beta analysis priors are specified in `twoarmbinbf01` via `a_0_a = 1, b_0_a = 1` et cetera. ## 2. Operating characteristics for actual sample sizes Now, a key question is which operating characteristics can be expected based on the actual sample sizes used in the trial. 
The `powertwoarmbinbf01` function can provide the answer:

```{r, eval = FALSE}
ict_results <- powertwoarmbinbf01(
  n1 = n1_ict, n2 = n2_ict,
  k = 1/3, k_f = 3,
  test = "BF+-",  # H+: p2 > p1 vs H-: p2 <= p1
  a_0_d = 1, b_0_d = 1, a_0_a = 1, b_0_a = 1,
  a_1_d = 1, b_1_d = 1, a_2_d = 1, b_2_d = 1,
  a_1_a = 1, b_1_a = 1, a_2_a = 1, b_2_a = 1,
  output = "numeric",
  compute_freq_t1e = TRUE
)
print(ict_results)
```

```
                  Power             Type1_Error
              0.8788106               0.0214111
                  CE_H0 Frequentist_Type1_Error
              0.8788106               0.2871811
attr(,"hypothesis")
[1] "H[+]:~p[2] > p[1] ~~ vs ~~ H[-]:~p[2] <= p[1]"
attr(,"compute_freq_t1e")
[1] TRUE
```

We see that, based on the actual sample sizes and a moderate evidence threshold $k=1/3$, the Bayesian power of $87.9\%$ is sufficiently large. However, the frequentist type-I-error rate of $28.7\%$ is far too high, so we increase the evidence threshold to $k=1/10$ (strong evidence) and next use the `ntwoarmbinbf01` function to calibrate the design based on our requirements.

## 3. Power and sample size planning

The core working function to design a Bayesian two-arm one-stage trial with the package is the `design_twoarm_onestage_bf()` function. It searches over a grid of total sample sizes and returns a **design object** that contains

- the selected sample sizes in each arm (`n1`, `n2`) and their sum (`n_total`)
- Bayesian and frequentist operating characteristics at the chosen design
- the full search grid with pointwise and sustained feasibility indicators
- the calibration targets and input priors used in the search.

Internally, the function uses the same numerical engine as the legacy `ntwoarmbinbf01()` function, but exposes a richer, object-based interface and S3 methods for printing, summarizing, and plotting. The old function `ntwoarmbinbf01()` remains available as a compatibility wrapper that now returns the same design object.
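The difference between pointwise and sustained feasibility in the list above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's internal code: the helper `first_sustained()` and the operating characteristic values are hypothetical, and the forward window is assumed to be truncated at the end of the search grid.

```{r}
# Hypothetical helper (not part of bfbin2arm): index of the first grid point
# at which `ok` is TRUE and remains TRUE for the next `sustain_n` grid points.
# Assumption: a window that runs past the end of the grid is truncated.
first_sustained <- function(ok, sustain_n) {
  for (i in seq_along(ok)) {
    j <- min(length(ok), i + sustain_n)
    if (isTRUE(all(ok[i:j]))) return(i)
  }
  NA_integer_
}

# Made-up operating characteristic values on a grid of total sample sizes
n_grid <- 70:80
oc <- c(0.78, 0.81, 0.79, 0.79, 0.80, 0.79, 0.81, 0.82, 0.82, 0.83, 0.84)
ok <- oc >= 0.80  # pointwise feasibility for a target of 80%

n_pointwise <- n_grid[which(ok)[1]]
n_sustained <- n_grid[first_sustained(ok, sustain_n = 4L)]
```

In this made-up grid the target is first met at $n=71$, but it only holds over a full forward window of `sustain_n + 1 = 5` sample sizes from $n=76$ onwards, so `n_pointwise` is 71 and `n_sustained` is 76.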
First, we perform a sample size search for an ICT-107-type trial (balanced arms) under flat design priors and substantial evidence thresholds, using the directional Bayes factor \(BF_{+-}\). Note that evidence in favour of $H_-$ happens when $BF_{+-} 0L && all(!is.na(win)) && all(win)) { return(i) } } NA_integer_ } ok_pointwise <- oc_vals >= target i_first_pointwise <- which(ok_pointwise)[1L] i_first_sustained <- first_sustained_index(ok_pointwise, sustain_n) n_first_pointwise <- ns[i_first_pointwise] n_first_sustained <- ns[i_first_sustained] n_window_max <- ns[min(length(ns), i_first_sustained + sustain_n)] df <- data.frame( n_total = ns, oc = oc_vals, pointwise_ok = ok_pointwise ) ggplot(df, aes(x = n_total, y = oc)) + annotate( "rect", xmin = n_first_sustained - 0.5, xmax = n_window_max + 0.5, ymin = -Inf, ymax = Inf, fill = "steelblue", alpha = 0.08 ) + geom_hline( yintercept = target, linetype = "dashed", colour = "grey35", linewidth = 0.8 ) + geom_line(colour = "grey35", linewidth = 0.9) + geom_point(aes(colour = pointwise_ok), size = 2.4) + geom_vline( xintercept = n_first_pointwise, colour = "#d95f02", linetype = "dotted", linewidth = 0.9 ) + geom_vline( xintercept = n_first_sustained, colour = "#1b3a57", linewidth = 1.1 ) + annotate( "text", x = n_first_pointwise, y = 0.715, label = paste0("first pointwise crossing\nn = ", n_first_pointwise), colour = "#d95f02", size = 3.2, hjust = -0.02, vjust = 1 ) + annotate( "text", x = n_first_sustained, y = 0.905, label = paste0("first sustained crossing\nn = ", n_first_sustained), colour = "#1b3a57", size = 3.2, hjust = -0.02, vjust = 1 ) + annotate( "text", x = (n_first_sustained + n_window_max) / 2, y = 0.735, label = paste0("forward window: sustain_n + 1 = ", window_len), colour = "#1b3a57", size = 3.1 ) + scale_colour_manual( values = c(`TRUE` = "#2c7fb8", `FALSE` = "#d7301f"), labels = c(`FALSE` = "Below target", `TRUE` = "At or above target"), name = NULL ) + scale_x_continuous(breaks = ns) + 
  coord_cartesian(ylim = c(0.70, 0.92)) +
  labs(
    x = "Total sample size",
    y = "Operating characteristic",
    title = "Illustration of sustained feasibility"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "bottom",
    panel.grid.minor = element_blank()
  )
```

The above figure illustrates the sustained feasibility logic that is currently implemented in the calibration algorithm. In this toy example with `sustain_n + 1 = 5`, the threshold of 80% is achieved at $n=71$, but the sample size eventually selected is $n=76$. For $n=76$, all five sample sizes in the forward window from $n=76$ to $n=80$ satisfy the operating characteristic threshold of at least 80%, which is not the case for the window starting at $n=71$.

Now, back to our calibrated design. We plot the results:

```{r, out.width="80%", eval = FALSE}
plot(des_informative)
```

```{r echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 3: Visualization of the calibrated Bayesian two-arm one-stage phase II design with a binary endpoint, using informative design priors under the alternative hypothesis"}
knitr::include_graphics("figures/twoarm_onestage_fig3.png")
```

We see that the Bayesian power is now calibrated for $n=74$ patients in total (37 per trial arm) and does not drop below the required 80% for at least the next ten sample sizes (in fact, it does not drop below 80% for any sample size up to $n=100$, as can be verified in the plot). Frequentist power is calibrated for $n=81$ patients in total. The Bayesian type-I-error is already calibrated for $n=10$, requiring only $5$ patients per trial arm. Importantly, the frequentist type-I-error is also calibrated and is $0.034<0.05$, as can be inspected by

```{r, eval = FALSE}
print(des_informative)
```

```
One-stage two-arm Bayes factor design
------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Optional freq. Type-I reporting: on
Design: n_total = 74, n1 = 37, n2 = 37
Operating characteristics
  Power        = 0.8004
  Type-I error = 0.0021
  CE(H0)       = 0.6697
  Freq. Type-I = 0.0340
  Freq. Power  = 0.7778
```

The probability of compelling evidence for $H_-$ is shown in the bottom plot. It reaches its target at $n=49$, so with $n=74$ patients in total ($n_1=37$ in the control and $n_2=37$ in the treatment group) the probability of compelling evidence is calibrated as well, and the trial design is fully calibrated from a Bayesian perspective. The plot also shows that the probability of compelling evidence does not reach 80% in the sample size range up to $n=100$ patients.

However, suppose we want a trial design which achieves such a high probability of compelling evidence for $H_0$, but we cannot afford to recruit more than $n=100$ patients in total. A possible solution is to modify the design priors under $H_-$ to express more information about our expectation of the effect of the novel drug or treatment. Thus, we perform a sample size search for a new ICT-107-type trial (balanced arms) under informative design priors with very strong evidence thresholds, and change the design prior under $H_-$ to achieve the target probability of compelling evidence, PCE(H0), at even smaller sample sizes. Note that now, additionally, the hyperparameters of the Beta design priors for $p_1$ and $p_2$ under $H_-$ are specified via `a_1_d_Hminus = 2, b_1_d_Hminus = 1` and `a_2_d_Hminus = 1, b_2_d_Hminus = 2`.
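To get a feeling for what these hyperparameters express, the following base R sketch computes the prior means and the prior probability of $p_2 \leq p_1$ for independent $\mathrm{Beta}(2,1)$ and $\mathrm{Beta}(1,2)$ priors. It is an illustration only: the helper `prob_p2_le_p1()` is hypothetical and not a package function, and the sketch assumes untruncated, independent priors, which may differ from how the package handles the design priors internally.

```{r}
# Design priors under H_- as specified in the text: p1 ~ Beta(2, 1), p2 ~ Beta(1, 2)
a1 <- 2; b1 <- 1
a2 <- 1; b2 <- 2

prior_mean_p1 <- a1 / (a1 + b1)  # control arm
prior_mean_p2 <- a2 / (a2 + b2)  # treatment arm

# Hypothetical helper: P(p2 <= p1) for independent Beta priors, obtained by
# integrating the density of p1 times the CDF of p2 evaluated at p1.
prob_p2_le_p1 <- function(a1, b1, a2, b2) {
  integrand <- function(p) dbeta(p, a1, b1) * pbeta(p, a2, b2)
  integrate(integrand, lower = 0, upper = 1)$value
}

mass_informative <- prob_p2_le_p1(a1, b1, a2, b2)
mass_flat <- prob_p2_le_p1(1, 1, 1, 1)
```

Under these assumptions the informative priors have means $2/3$ for $p_1$ and $1/3$ for $p_2$ and place a probability of $5/6 \approx 0.83$ on $p_2 \leq p_1$, compared with $0.5$ under flat $\mathrm{Beta}(1,1)$ priors.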
Note that we also increased `target_ce_h0 = 0.60` to `target_ce_h0 = 0.80`:

```{r, out.width='80%', cache = TRUE, eval = FALSE}
des_informative_higher_ce <- design_twoarm_onestage_bf(
  n_min = 10, n_max = 100,
  k = 1/30, k_f = 30,
  test = "BF+-",
  calibration = "Bayesian",
  target_power = 0.80,
  target_type1 = 0.05,
  target_ce_h0 = 0.80,
  # flat Beta(1,1) design prior under H0 and flat analysis priors;
  # informative design priors under H+: p1 ~ Beta(1,2), p2 ~ Beta(2,1)
  a_0_d = 1, b_0_d = 1, a_0_a = 1, b_0_a = 1,
  a_1_d = 1, b_1_d = 2, a_2_d = 2, b_2_d = 1,
  a_1_a = 1, b_1_a = 1, a_2_a = 1, b_2_a = 1,
  # design prior parameters under H_-
  a_1_d_Hminus = 2, b_1_d_Hminus = 1,
  a_2_d_Hminus = 1, b_2_d_Hminus = 2,
  # assumed true proportions for frequentist power (optional here)
  p1_power = 0.3, p2_power = 0.6,
  # report frequentist type-I-error? (optional here)
  report_freq_type1 = TRUE,
  # equal randomisation
  alloc1 = 0.5, alloc2 = 0.5,
  # require sustained feasibility over the next 10 larger n
  sustain_n = 10L,
  progress = FALSE
)
```

We check the results:

```{r, eval = FALSE}
summary(des_informative_higher_ce)
```

```
Summary: One-stage two-arm Bayes factor design
---------------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Feasible: yes
Search overview
  n evaluated = 91
  pointwise feasible = 28
  sustained feasible = 27
  first pointwise n = 72
  first sustained n = 74
Selected design
  n_total = 74, n1 = 37, n2 = 37
```

The design has not changed. Why is that?
We plot the results:

```{r, eval = FALSE}
plot(des_informative_higher_ce)
```

```{r echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 4: Visualization of the calibrated Bayesian two-arm one-stage phase II design with a binary endpoint, using informative design priors under both hypotheses and a stronger requirement on the probability of compelling evidence (80% instead of only 60%)"}
knitr::include_graphics("figures/twoarm_onestage_fig4.png")
```

The plot shows that the calibration sample sizes for Bayesian power, type-I-error and frequentist power remain identical to the previous function call. The only things that changed are the design priors under $H_-$ in the top panel and the bottom panel for the probability of compelling evidence.

First, the design priors under $H_-:p_2 \leq p_1$ now put more prior probability mass on small success probabilities in the treatment group (parameter $p_2$) and more prior probability mass on large success probabilities in the control group (parameter $p_1$). This is precisely what $H_-:p_2 \leq p_1$ expresses, and thus, under $H_0$ (here $H_-$), we can expect that evidence for $H_0$ accumulates faster. This is reflected in the bottom panel for the probability of compelling evidence, as now $n=74$ patients suffice to reach 80% probability of compelling evidence for $H_0$.

The result is a fully calibrated Bayesian design which meets Bayesian power demands of 80%, Bayesian type-I-error rate requirements of less than 5%, and our requirement of 80% on the probability of compelling evidence for $H_0$ (that is, $H_-$ in this case).

What about the frequentist operating characteristics of this design? We see that $n=81$ patients in total suffice to additionally calibrate the design in terms of frequentist power.
```{r, eval = FALSE}
print(des_informative_higher_ce)
```

```
One-stage two-arm Bayes factor design
------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Optional freq. Type-I reporting: on
Design: n_total = 74, n1 = 37, n2 = 37
Operating characteristics
  Power        = 0.8004
  Type-I error = 0.0011
  CE(H0)       = 0.8004
  Freq. Type-I = 0.0340
  Freq. Power  = 0.7778
```

The type-I-error is still calibrated, so choosing $n=81$ patients in total yields a design that is fully calibrated from both a Bayesian and a frequentist perspective.

The calibration function `design_twoarm_onestage_bf` reveals several aspects. If a balanced design with equal randomization probabilities is desired, then:

- **n=81 patients in total** (40 to 41 patients per trial arm) are needed for 80% frequentist power at ICT-107 effect size when evidence threshold $k=1/30$ is used. Here, the assumption is that the true proportions are $p_1=0.3$ and $p_2=0.6$, which can easily be modified if a more optimistic or pessimistic assumption is warranted.
- **n=74 patients in total** (37 patients per trial arm) are needed for 80% Bayesian power at ICT-107 effect size when evidence threshold $k=1/30$ is used, and slightly informative Beta design priors are assumed under $H_+$.
- **Type-I error control** holds both from a frequentist perspective (≤5% across designs when $k=1/30$ is used) and from a Bayesian perspective, where for the latter only **$n=10$ patients in total** (5 patients per trial arm) are required.
- **High P(CE|H-)** guarantees that under $H_-$ there is an 80% probability of finding a Bayes factor of at least $k_f=30$ in favour of $H_-$. **n=74 patients in total** (37 patients per trial arm) are required to ensure this probability of compelling evidence for $H_-$.

## 5. Unequal randomization probabilities

In the original ICT-107 trial, $2/3$ of the patients were randomized into the treatment group, while $1/3$ of the patients were randomized into the control group. We can use the parameters `alloc1` and `alloc2` to specify randomization probabilities for the control and treatment arms and carry out the Bayesian sample size calculations based on these randomization probabilities. As an example, we rerun the last calibration, but use the randomization probabilities of the ICT-107 trial:

```{r, out.width='80%', cache = TRUE, eval = FALSE}
des_informative_higher_ce_uneq_alloc <- design_twoarm_onestage_bf(
  n_min = 10, n_max = 100,
  k = 1/30, k_f = 30,
  test = "BF+-",
  calibration = "Bayesian",
  target_power = 0.80,
  target_type1 = 0.05,
  target_ce_h0 = 0.80,
  # flat Beta(1,1) design prior under H0 and flat analysis priors;
  # informative design priors under H+: p1 ~ Beta(1,2), p2 ~ Beta(2,1)
  a_0_d = 1, b_0_d = 1, a_0_a = 1, b_0_a = 1,
  a_1_d = 1, b_1_d = 2, a_2_d = 2, b_2_d = 1,
  a_1_a = 1, b_1_a = 1, a_2_a = 1, b_2_a = 1,
  # design prior parameters under H_-
  a_1_d_Hminus = 2, b_1_d_Hminus = 1,
  a_2_d_Hminus = 1, b_2_d_Hminus = 2,
  # assumed true proportions for frequentist power (optional here)
  p1_power = 0.3, p2_power = 0.6,
  # report frequentist type-I-error? (optional here)
  report_freq_type1 = TRUE,
  # unequal randomisation: 1/3 control, 2/3 treatment
  alloc1 = 1/3, alloc2 = 2/3,
  # require sustained feasibility over the next 10 larger n
  sustain_n = 10L,
  progress = FALSE
)
```

We summarize the results:

```{r, eval = FALSE}
summary(des_informative_higher_ce_uneq_alloc)
```

```
Summary: One-stage two-arm Bayes factor design
---------------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Feasible: yes
Search overview
  n evaluated = 91
  pointwise feasible = 18
  sustained feasible = 18
  first pointwise n = 83
  first sustained n = 83
Selected design
  n_total = 83, n1 = 28, n2 = 55
```

We plot the results:

```{r, eval = FALSE}
plot(des_informative_higher_ce_uneq_alloc)
```

```{r echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 5: Visualization of the calibrated Bayesian two-arm one-stage phase II design with a binary endpoint, using informative design priors under both hypotheses and a stronger requirement on the probability of compelling evidence (80% instead of only 60%). Additionally, unequal randomization probabilities are used when calibrating the design."}
knitr::include_graphics("figures/twoarm_onestage_fig5.png")
```

Remember that the sample size shown on the x-axis in the power and type-I-error rate plot, as well as in the probability of compelling evidence plot, is the total sample size in both arms. We see that we now need $n=83$ patients in total to reach Bayesian power of 80%, while $n=88$ patients in total are required to calibrate frequentist power at 80%. The probability of compelling evidence reaches 80% at $n=83$ patients in total. The frequentist type-I-error rate is still below the required 5% threshold, too:

```{r, eval = FALSE}
print(des_informative_higher_ce_uneq_alloc)
```

```
One-stage two-arm Bayes factor design
------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Optional freq. Type-I reporting: on
Design: n_total = 83, n1 = 28, n2 = 55
Operating characteristics
  Power        = 0.8018
  Type-I error = 0.0011
  CE(H0)       = 0.8018
  Freq. Type-I = 0.0369
  Freq. Power  = 0.7829
```

## 6. Design Recommendations based on the calibration

If the original 2:1 randomization of the ICT-107 trial is used and two thirds of the patients are randomized into the treatment group, then:

- **n=83 patients in total** (28 patients in the control arm and 55 in the treatment arm) are needed for 80% Bayesian power at ICT-107 effect size when evidence threshold $k=1/30$ is used, and slightly informative Beta design priors are assumed under $H_+$.
- **Type-I error control** holds both from a frequentist perspective (≤5% across designs when $k=1/30$ is used) and from a Bayesian perspective, where for the latter only **$n=10$ patients in total** (both arms) are required.
- **High P(CE|H-)** guarantees that under $H_-$ there is an 80% probability of finding a Bayes factor of at least $k_f=30$ in favour of $H_-$. **n=83 patients in total** (28 in the control arm and 55 in the treatment arm) are required to ensure this probability of compelling evidence for $H_-$.
- **n=88 patients in total** (29 patients in the control arm and 59 in the treatment arm) are needed for 80% frequentist power at ICT-107 effect size when evidence threshold $k=1/30$ is used. Here, the assumption is that the true proportions are $p_1=0.3$ and $p_2=0.6$, which can easily be modified if a more optimistic or pessimistic assumption is warranted.

To fulfill all four requirements, it thus suffices to enroll $n_1=29$ patients in the control arm and $n_2=59$ in the treatment arm, and to use the Bayes factor thresholds $k=1/30$ and $k_f=30$ for decision making about the hypotheses $H_+$ and $H_-$ under consideration. For a Bayesian calibration only, it suffices to enroll $n_1=28$ patients in the control arm and $n_2=55$ in the treatment arm.

## Summary

This vignette has illustrated how to design and calibrate two-arm one-stage Bayes factor trials with binary endpoints using the `bfbin2arm` package.
The core workflow starts from specifying a Bayes factor test (two-sided or directional), choosing coherent design and analysis priors under the competing hypotheses, and then mapping clinical requirements onto calibration targets for Bayesian power, Bayesian type-I error, and the Bayesian probability of compelling evidence for the null (or $H_-$ in directional tests). The central calibration function `design_twoarm_onestage_bf()` searches over a user-defined grid of total sample sizes and returns a design object that contains the selected allocation $(n_1, n_2)$, the corresponding total sample size $n_{\text{total}}$, and both Bayesian and frequentist operating characteristics at the chosen design.

A key innovation is the use of a sustained feasibility constraint, controlled by the argument `sustain_n`, which guards against oscillatory behaviour of operating characteristics driven by the discreteness of the binomial model. Instead of treating a sample size as feasible as soon as it meets its calibration thresholds pointwise, the algorithm only accepts a candidate $n_{\text{total}}$ if all relevant targets hold at that $n_{\text{total}}$ and continue to hold for at least the next `sustain_n` larger total sample sizes within the search range. The diagnostic plots reflect this logic: for each operating characteristic (Bayesian power, Bayesian type-I error, CE(H0), and optional frequentist power), the vertical reference line is drawn at the first total sample size where the corresponding metric attains its target in this sustained sense. As a result, the graphical summaries and numerical design recommendations are aligned and directly interpretable as robust to local oscillations in the operating characteristic curves.

Using the ICT-107 phase II trial as a running example, we have shown how flat design priors can be replaced by more informative priors that encode realistic expectations about treatment and control response rates. This shift often allows one (1) to achieve the desired calibration targets at substantially smaller total sample sizes compared to flat priors and (2) to meet stricter requirements on certain operating characteristics, such as the probability of compelling evidence, at identical sample sizes, especially when strong evidence thresholds (e.g. $k = 1/30$, $k_f = 30$) are required. The vignette has also demonstrated how to handle equal and unequal randomization, how to request frequentist type-I error and power alongside the Bayesian criteria, and how to interpret the resulting design recommendations in terms of total sample size $n_{\text{total}}$ and arm-specific allocations.

Overall, the `bfbin2arm` package provides a flexible, unified framework in which Bayesian, frequentist, hybrid, and fully dual calibrations can be performed and visualised in a way that is directly tied to clinically meaningful decision thresholds. Further details on the methodology can be found in [@kelter_power_2026].

## References