A Closer Look at the Worst-case Behavior of Multi-armed Bandit Algorithms

Apr-25-2026, 18:05:24 GMT–Neural Information Processing Systems

One of the key drivers of complexity in the classical (stochastic) multi-armed bandit (MAB) problem is the difference between mean rewards in the top two arms, also known as the instance gap. The celebrated Upper Confidence Bound (UCB) policy is among the simplest optimism-based MAB algorithms that naturally adapts to this gap: for a horizon of play $n$, it achieves optimal $O(\log n)$ regret in instances with "large" gaps, and a near-optimal $O(\sqrt{n \log n})$ minimax regret when the gap can be arbitrarily "small." This paper provides new results on the arm-sampling behavior of UCB, leading to several important insights. Among these, it is shown that arm-sampling rates under UCB are asymptotically deterministic, regardless of the problem complexity.
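To make the optimism principle concrete, here is a minimal sketch of the textbook UCB1 index policy on Bernoulli arms. It is an illustration under stated assumptions, not the paper's exact algorithm: the exploration bonus sqrt(2 log t / N_i), the Bernoulli reward model, the arm means, and the function name `ucb1` are all choices made for this example. It records how many times each arm is pulled, which is the arm-sampling quantity the abstract discusses.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run textbook UCB1 on Bernoulli arms; return the pull count per arm.

    `means`, `horizon`, and `seed` are illustrative parameters (assumptions),
    not values taken from the paper.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # N_i: number of pulls of arm i so far
    totals = [0.0] * k    # cumulative reward collected from arm i

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # initialization: pull each arm once
        else:
            # Optimism: empirical mean plus an exploration bonus that
            # shrinks as the arm is sampled more often.
            arm = max(
                range(k),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


# Hypothetical instances: one with a large gap, one with zero gap.
large_gap = ucb1([0.9, 0.1], horizon=2000)
zero_gap = ucb1([0.5, 0.5], horizon=2000)
print("large gap pulls:", large_gap)
print("zero gap pulls:", zero_gap)
```

Comparing the two count vectors gives an informal feel for how the sampling split depends on the gap. A single finite run like this does not demonstrate the asymptotic result stated in the abstract.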

