Online EXP3 Learning in Adversarial Bandits with Delayed Feedback

Neural Information Processing Systems 

Consider a player that in each of $T$ rounds chooses one of $K$ arms. An adversary chooses the cost of each arm in a bounded interval, and a sequence of feedback delays $\left\{d_{t}\right\}$ that are unknown to the player. After picking an arm at round $t$, the player receives the cost of playing this arm $d_{t}$ rounds later. In cases where $t+d_{t}>T$, this feedback is simply missing. We prove that the EXP3 algorithm (that uses the delayed feedback upon its arrival) achieves a regret of $O\left(\sqrt{\ln K\left(KT+\sum_{t} d_{t}\right)}\right)$. For the case where $\sum_{t} d_{t}$ and $T$ are unknown, we propose a novel doubling trick for online learning with delays and prove that this adaptive EXP3 achieves a regret of $O\left(\sqrt{\ln K\left(K^{2} T+\sum_{t} d_{t}\right)}\right)$. We then consider a two-player zero-sum game where players experience asynchronous delays. We show that even when the delays are large enough such that players no longer enjoy the "no-regret property" (e.g., where $d_{t}=O\left(t \log t\right)$), the ergodic average of the strategy profile still converges to the set of Nash equilibria of the game. The result is made possible by choosing an adaptive step size $\eta_{t}$ that is not summable but is square summable, and proving a "weighted regret bound" for this general case.
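To make the setting concrete, here is a minimal sketch of EXP3 that applies each importance-weighted cost estimate only when its delayed feedback arrives. The function name `exp3_delayed`, the uniform initialization, and the fixed step size `eta` are illustrative assumptions, not the paper's exact formulation (which tunes the step size to $K$, $T$, and the total delay):

```python
import math
import random

def exp3_delayed(costs, delays, K, T, eta):
    """Sketch of EXP3 with delayed bandit feedback.

    costs[t][a] is the adversary's cost of arm a at round t (in [0, 1]);
    delays[t] is the feedback delay d_t of round t.
    Returns the total cost incurred by the player.
    """
    weights = [1.0] * K          # exponential weights, uniform start
    pending = {}                 # arrival round -> [(arm, cost, prob at play time)]
    total_cost = 0.0
    for t in range(T):
        norm = sum(weights)
        probs = [w / norm for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        total_cost += costs[t][arm]
        arrival = t + delays[t]
        if arrival < T:          # feedback with t + d_t > T is simply missing
            pending.setdefault(arrival, []).append((arm, costs[t][arm], probs[arm]))
        # process all feedback arriving at this round, upon its arrival
        for (a, c, p) in pending.pop(t, []):
            # importance-weighted estimate divides by the probability
            # the arm had when it was actually played
            weights[a] *= math.exp(-eta * c / p)
    return total_cost
```

With zero or small delays and one clearly cheapest arm, the player's cumulative cost stays far below $T$; larger delays simply postpone every weight update, which is what inflates the regret bound by the $\sum_{t} d_{t}$ term.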