A Best-of-Both-Worlds Proof for Tsallis-INF without Fenchel Conjugates
Wei-Cheng Lee, Francesco Orabona
The multi-armed bandit problem is a classic framework for sequential decision-making under uncertainty, capturing the fundamental trade-off between exploration and exploitation. The problem is typically studied in two settings: the stochastic setting, where the losses of each arm are drawn i.i.d. from a fixed distribution, and the adversarial setting, where the sequence of losses is arbitrary and potentially adaptive. While many algorithms are optimal for one setting or the other, a significant challenge has been to design a single algorithm that performs well in both without prior knowledge of the environment; such an algorithm is said to have a "best-of-both-worlds" guarantee. With an appropriate choice of regularizer, Follow-The-Regularized-Leader (FTRL) [Gordon, 1999a,b, Shalev-Shwartz and Singer, 2006a,b, Abernethy et al., 2008, Hazan and Kale, 2008] achieves low regret in the adversarial setting while adapting to the easier stochastic setting, where it attains near-optimal, logarithmically-scaling regret, that is, a best-of-both-worlds guarantee. Zimmert and Seldin [2021] showed that the Tsallis-INF algorithm [Audibert and Bubeck, 2009], an instance of FTRL with the Tsallis entropy as regularizer, has this property. The aim of this note is to provide a concise, self-contained proof of this guarantee for Tsallis-INF, relying only on standard and modern techniques from online convex optimization, chiefly the FTRL regret lemma and its local-norm analysis.
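To make the algorithm concrete, below is a minimal sketch of Tsallis-INF with the 1/2-Tsallis regularizer. The learning-rate schedule eta_t = 1/sqrt(t), the bisection routine for the normalization constant, and the function names (`tsallis_inf_weights`, `run_tsallis_inf`) are illustrative assumptions, not the paper's exact constants or code; the closed form of the FTRL step follows from first-order optimality on the probability simplex.

```python
import numpy as np

rng = np.random.default_rng(0)

def tsallis_inf_weights(L_hat, eta, iters=100):
    """One FTRL step with the 1/2-Tsallis regularizer -(1/eta) * sum_i sqrt(w_i).

    Minimizing <L_hat, w> - (1/eta) * sum_i sqrt(w_i) over the simplex gives
    w_i = 1 / (2 * eta * (L_hat_i - nu))^2 for a normalization constant
    nu < min_i L_hat_i, which we locate by bisection.
    """
    lo = L_hat.min() - np.sqrt(len(L_hat)) / (2.0 * eta)  # here sum(w) <= 1
    hi = L_hat.min() - 1e-12                              # here sum(w) blows up
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        w = 1.0 / (2.0 * eta * (L_hat - nu)) ** 2
        if w.sum() > 1.0:
            hi = nu
        else:
            lo = nu
    return w / w.sum()  # renormalize for numerical safety

def run_tsallis_inf(loss_matrix):
    """Play Tsallis-INF on a T x K matrix of losses in [0, 1]; return total loss."""
    T, K = loss_matrix.shape
    L_hat = np.zeros(K)          # cumulative importance-weighted loss estimates
    total = 0.0
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)   # assumed schedule; the paper's constant may differ
        w = tsallis_inf_weights(L_hat, eta)
        arm = rng.choice(K, p=w)
        loss = loss_matrix[t - 1, arm]
        total += loss
        L_hat[arm] += loss / w[arm]  # unbiased importance-weighted estimator
    return total

# Toy stochastic instance: two Bernoulli arms with mean losses 0.5 and 0.4.
T, K = 20000, 2
losses = rng.binomial(1, [0.5, 0.4], size=(T, K)).astype(float)
print("pseudo-regret vs. best arm:", run_tsallis_inf(losses) - losses.sum(axis=0).min())
```

On a stochastic instance like the Bernoulli toy above, the pseudo-regret of this scheme should grow logarithmically in T, while the same code retains the sqrt(KT)-type worst-case guarantee against adversarial loss sequences, which is the best-of-both-worlds behavior the note proves.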
Nov-17-2025