Online EXP3 Learning in Adversarial Bandits with Delayed Feedback
Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, Jose Blanchet
Neural Information Processing Systems
Consider a player who, in each of T rounds, chooses one of K arms. An adversary chooses the cost of each arm in a bounded interval, together with a sequence of feedback delays {d_t} that are unknown to the player. After picking arm a_t at round t, the player receives the cost of playing this arm d_t rounds later. When t + d_t > T, this feedback is simply missing.
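The feedback protocol described above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' algorithm: it assumes costs in [0, 1], a fixed step size `eta` chosen by the caller, and a standard EXP3-style importance-weighted update applied whenever a delayed cost arrives. The names `exp3_delayed`, `costs`, `delays`, and `eta` are hypothetical and used only for illustration. Rounds are 0-indexed in the code, so the condition t + d_t > T from the text becomes `t + delays[t] >= T`.

```python
import math
import random
from collections import defaultdict


def exp3_delayed(costs, delays, eta, seed=0):
    """Simulate an EXP3-style learner under delayed bandit feedback.

    costs[t][k] is the cost of arm k at round t, assumed to lie in [0, 1].
    delays[t] >= 0 is the delay of the feedback for round t (hidden from
    the learner until the feedback actually arrives).
    Returns (total incurred cost, number of feedback items never received).
    """
    rng = random.Random(seed)
    T, K = len(costs), len(costs[0])
    cum_est = [0.0] * K            # cumulative importance-weighted cost estimates
    arrivals = defaultdict(list)   # arrival round -> [(arm, prob, cost), ...]
    total_cost, n_missing = 0.0, 0

    for t in range(T):
        # Exponential weights; subtracting the minimum keeps exp() stable.
        m = min(cum_est)
        w = [math.exp(-eta * (c - m)) for c in cum_est]
        s = sum(w)
        p = [x / s for x in w]

        arm = rng.choices(range(K), weights=p)[0]
        total_cost += costs[t][arm]

        arrive = t + delays[t]
        if arrive < T:
            # Store the sampling probability used at play time.
            arrivals[arrive].append((arm, p[arm], costs[t][arm]))
        else:
            n_missing += 1         # feedback would arrive after the horizon

        # Apply every piece of feedback that arrives at the end of round t.
        for a, pa, c in arrivals.pop(t, []):
            cum_est[a] += c / pa   # importance-weighted cost estimate

    return total_cost, n_missing
```

For example, with two arms where arm 0 always costs 1 and arm 1 always costs 0, calling `exp3_delayed([[1.0, 0.0]] * 200, [3] * 200, eta=0.5)` drops the feedback from the last three rounds, since it would arrive after round T.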