Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms

Dec-23-2025, 18:48:22 GMT–Neural Information Processing Systems

We study the structure of regret-minimizing policies in the {\em many-armed} Bayesian multi-armed bandit problem: in particular, with $k$ the number of arms and $T$ the time horizon, we consider the case where $k \geq \sqrt{T}$. We first show that {\em subsampling} is a critical step for designing optimal policies. In particular, the standard UCB algorithm leads to sub-optimal regret bounds in the many-armed regime. However, a subsampled UCB (SS-UCB), which samples $\Theta(\sqrt{T})$ arms and executes UCB only on that subset, is rate-optimal. Despite theoretically optimal regret, even SS-UCB performs poorly due to excessive exploration of suboptimal arms.

greedy algorithm, multi-armed bandit, unreasonable effectiveness, (7 more...)

Neural Information Processing Systems

Dec-23-2025, 18:48:22 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology
  - Artificial Intelligence (0.65)
  - Data Science > Data Mining
    - Big Data (0.96)