Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms

Neural Information Processing Systems 

We study the structure of regret-minimizing policies in the {\em many-armed} Bayesian multi-armed bandit problem: in particular, with k the number of arms and T the time horizon, we consider the case where k \geq \sqrt{T} . We first show that {\em subsampling} is a critical step for designing optimal policies. In particular, the standard UCB algorithm leads to sub-optimal regret bounds in the many-armed regime. However, a subsampled UCB (SS-UCB), which samples \Theta(\sqrt{T}) arms and executes UCB only on that subset, is rate-optimal. Despite theoretically optimal regret, even SS-UCB performs poorly due to excessive exploration of suboptimal arms.