Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms
–Neural Information Processing Systems
We study the structure of regret-minimizing policies in the {\em many-armed} Bayesian multi-armed bandit problem: in particular, with $k$ the number of arms and $T$ the time horizon, we consider the case where $k \geq \sqrt{T}$. We first show that {\em subsampling} is a critical step for designing optimal policies. In particular, the standard UCB algorithm leads to sub-optimal regret bounds in the many-armed regime. However, a subsampled UCB (SS-UCB), which samples $\Theta(\sqrt{T})$ arms and executes UCB only on that subset, is rate-optimal. Despite theoretically optimal regret, even SS-UCB performs poorly due to excessive exploration of suboptimal arms.
Neural Information Processing Systems
Dec-23-2025, 18:48:22 GMT
- Technology: