Nonstochastic Multiarmed Bandits with Unrestricted Delays
Thune, Tobias Sommer, Cesa-Bianchi, Nicolò, Seldin, Yevgeny
–Neural Information Processing Systems
We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that "delayed" Exp3 achieves the $O(\sqrt{(KT D)\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$.
Neural Information Processing Systems
Jan-11-2020, 19:52:12 GMT
- Technology: