CORe: Capitalizing On Rewards in Bandit Exploration
Nan Wang, Branislav Kveton, Maryam Karimzadehgan
We propose a bandit algorithm that explores purely by randomizing its past observations. In particular, the sufficient optimism in the mean reward estimates is achieved by exploiting the variance in the past observed rewards. We name the algorithm Capitalizing On Rewards (CORe). The algorithm is general and can be easily applied to different bandit settings. The main benefit of CORe is that its exploration is fully data-dependent.

A multi-armed bandit can be considered as a special case of linear bandits, where the feature vector of each arm is a one-hot vector indicating the index of the arm, and the parameter vector is a vector of corresponding mean rewards. Arguably, the most popular and well-studied exploration strategies for solving bandit problems are Thompson sampling (TS) [Thompson, 1933, Agrawal and Goyal, 2013] and Optimism in the Face of Uncertainty (OFU) [Auer et al., 2002]. TS maintains a posterior distribution over each arm's
March 7, 2021
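To illustrate the general idea of exploring purely by randomizing past observations, the following is a minimal sketch of a history-randomization bandit, in the spirit of the abstract. It is an illustrative stand-in, not the exact CORe algorithm: the perturbation here is a simple bootstrap resample of each arm's observed rewards, and Gaussian rewards with unit variance are an assumption for the simulation.

```python
import numpy as np

def randomized_history_bandit(means, horizon, seed=None):
    """Sketch: explore by randomizing each arm's past observations.

    At every round, each arm's history of observed rewards is resampled
    with replacement (a bootstrap perturbation), and the arm with the
    highest perturbed mean is pulled. The variance of the perturbed mean
    scales with the empirical variance of that arm's rewards, so the
    exploration is fully data-dependent. Illustrative only, not CORe.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    histories = [[] for _ in range(k)]
    pulls = np.zeros(k, dtype=int)

    for t in range(horizon):
        if t < k:
            arm = t  # pull each arm once to initialize its history
        else:
            # Perturbed mean: bootstrap-resample each arm's past rewards.
            scores = [rng.choice(h, size=len(h), replace=True).mean()
                      for h in histories]
            arm = int(np.argmax(scores))
        reward = rng.normal(means[arm], 1.0)  # assumed Gaussian rewards
        histories[arm].append(reward)
        pulls[arm] += 1
    return pulls

pulls = randomized_history_bandit([0.2, 0.5, 1.0], horizon=500, seed=0)
```

Because the randomness comes only from resampling observed rewards, an arm with a tightly concentrated history receives little perturbation, while an arm with noisy or scarce data remains optimistic enough to be retried.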