Fighting Bandits with a New Kind of Smoothness

Neural Information Processing Systems 

We provide a new analysis framework for the adversarial multi-armed bandit problem. Using the notion of convex smoothing, we define a novel family of algorithms with minimax optimal regret guarantees. First, we show that regularization via the Tsallis entropy, which includes EXP3 as a special case, matches the O(√(NT)) minimax regret with a smaller constant factor. Second, we show that a wide class of perturbation methods achieve a near-optimal regret as low as O(√(NT log N)), as long as the perturbation distribution has a bounded hazard function. For example, the Gumbel, Weibull, Fréchet, Pareto, and Gamma distributions all satisfy this key property and lead to near-optimal algorithms.
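To illustrate the perturbation approach concretely, the sketch below (a minimal illustration, not the paper's full GBPA analysis; the function name `exp3_via_gumbel` and all parameter choices are ours) plays an adversarial bandit by perturbing cumulative reward estimates with Gumbel noise and pulling the argmax. By the Gumbel-max trick, this selection rule draws each arm with softmax probability, so the sampling probabilities are available in closed form for the importance-weighted reward estimates, and the resulting algorithm coincides with EXP3.

```python
import math
import random


def exp3_via_gumbel(T, n_arms, reward_fn, eta=0.05):
    """Adversarial bandit via a Gumbel-perturbed leader (illustrative sketch).

    Perturbing the reward estimates G_hat with i.i.d. Gumbel noise scaled by
    1/eta and taking the argmax selects arm i with probability proportional
    to exp(eta * G_hat[i]) -- the softmax -- so the importance weights below
    are exact and the method matches EXP3 on rewards in [0, 1].
    """
    G_hat = [0.0] * n_arms  # importance-weighted cumulative reward estimates
    total = 0.0
    for t in range(T):
        # Gumbel-max trick: sample standard Gumbel noise as -log(-log(U)).
        noise = [-math.log(-math.log(max(random.random(), 1e-12)))
                 for _ in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: G_hat[i] + noise[i] / eta)
        r = reward_fn(t, arm)  # bandit feedback: only the pulled arm's reward
        # Closed-form probability of having selected `arm` (softmax),
        # computed with a max-shift for numerical stability.
        m = max(G_hat)
        z = sum(math.exp(eta * (g - m)) for g in G_hat)
        p = math.exp(eta * (G_hat[arm] - m)) / z
        G_hat[arm] += r / p  # unbiased estimate of the arm's reward
        total += r
    return total
```

Swapping the Gumbel draw for Weibull, Fréchet, Pareto, or Gamma noise gives the other bounded-hazard instances mentioned above, though those lose the closed-form softmax probabilities and require estimating the selection probability (e.g. by resampling).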