A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback

Neural Information Processing Systems 

We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multi-armed bandits with delayed feedback, which in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin [2020] simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i} \log(T) + \frac{d}{\Delta_i}\right) + d K^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. Finally, we present a lower bound that matches the regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.