Title: | Confidence Bound Target Algorithm |
---|---|
Description: | The Confidence Bound Target (CBT) algorithm is designed for the infinite arms bandit problem. The CBT algorithm is shown to achieve the regret lower bound for general reward distributions. Reference: Hock Peng Chan and Shouri Hu (2018) <arXiv:1805.11793>. |
Authors: | Hock Peng Chan and Shouri Hu |
Maintainer: | Shouri Hu <[email protected]> |
License: | GPL-2 |
Version: | 1.0 |
Built: | 2024-10-31 22:26:42 UTC |
Source: | CRAN |
CBT and Emp_CBT provide simulations of the infinite arms bandit problem with Bernoulli rewards. CBT assumes the prior distribution is known, whereas Emp_CBT does not. Ana_CBT performs the analysis on real data.
CBT(n, prior, bn = log(log(n)), cn = log(log(n)))
Emp_CBT(n, prior, bn = log(log(n)), cn = log(log(n)))
Ana_CBT(n, data, bn = log(log(n)), cn = log(log(n)))
n | total number of rewards. |
prior | prior distribution on the mean of the rewards. Currently available priors: "Uniform", "Sine" and "Cosine". |
bn | bn should increase slowly to infinity with n. |
cn | cn should increase slowly to infinity with n. |
data | a matrix or data frame; each column is a population (arm). |
If bn or cn are not specified, they assume the default value of log(log(n)).
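To illustrate the data argument, here is a small sketch (not taken from the package examples) that simulates a matrix of Bernoulli rewards, one column per arm, and passes it to Ana_CBT. The number of arms, the arm means and the budget n are illustrative assumptions, and each column is given n rows so that no arm can run out of rewards.

set.seed(1)
n <- 1000                                        # total number of rewards to spend
K <- 50                                          # number of candidate arms (assumed)
p <- runif(K)                                    # arm means, drawn here from a Uniform prior
data <- sapply(p, function(q) rbinom(n, 1, q))   # n Bernoulli rewards per arm, one column each
res <- Ana_CBT(n = n, data = data)               # bn and cn left at their default log(log(n))
res$regret                                       # cumulative regret from the n rewards
res$K                                            # number of arms experimented on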
The confidence bound for an arm with t observations is L = max(xbar - cn * sigma / sqrt(t), xbar / bn), where xbar and sigma are the mean and standard deviation of the rewards from that particular arm.
CBT is a non-recalling algorithm. An arm is played until its confidence bound drops below the target mean, and it is not played after that.
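As a quick numerical illustration, the bound for a single arm can be computed directly from its observed rewards. This is a minimal sketch assuming the max-of-two-terms form given above; it is not code from the package.

conf_bound <- function(x, n, bn = log(log(n)), cn = log(log(n))) {
  t     <- length(x)       # number of observations from this arm
  xbar  <- mean(x)         # sample mean of the arm's rewards
  sigma <- sd(x)           # sample standard deviation of the arm's rewards
  max(xbar - cn * sigma / sqrt(t), xbar / bn)
}
conf_bound(rbinom(20, 1, 0.3), n = 10000)   # bound for 20 Bernoulli(0.3) rewards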
If the prior distribution is unknown, we shall apply empirical CBT, in which the target mean is replaced by S/n, with S the sum of rewards among all arms played at the current stage. Unlike CBT, however, empirical CBT is a recalling algorithm: it decides among all arms played so far which to play further, rather than considering only the current arm.
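For instance, a single run of empirical CBT under a Uniform prior (the prior is used only to generate the simulated arms, not by the algorithm) can be obtained as follows; n = 10000 is an illustrative choice.

res_emp <- Emp_CBT(n = 10000, prior = "Uniform")
res_emp$regret   # cumulative regret over the n rewards
res_emp$K        # number of arms experimented on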
A list including the following elements:
regret | cumulative regret generated by the n rewards. |
K | total number of experimented arms. |
Hock Peng Chan and Shouri Hu
H.P. Chan and S. Hu (2018) Infinite Arms Bandit: Optimality via Confidence Bounds <arXiv:1805.11793>
R <- 1000                          # number of simulation replicates
cum_regret <- numeric(R)
arms <- numeric(R)
for (i in 1:R) {
  result <- CBT(n = 10000, prior = "Sine")
  cum_regret[i] <- result$regret   # cumulative regret of this run
  arms[i] <- result$K              # number of arms experimented on
}
mean(cum_regret)                   # average cumulative regret
sd(cum_regret) / sqrt(R)           # its standard error
mean(arms)                         # average number of experimented arms
sd(arms) / sqrt(R)                 # its standard error