Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems

Neural Information Processing Systems 

Restless bandit problems are instances of non-stationary multi-armed bandits. These problems have been well studied from the optimization perspective, where the goal is to efficiently find a near-optimal policy when the system parameters are known. However, very few papers adopt a learning perspective, in which the parameters are unknown. In this paper, we analyze the performance of Thompson sampling in episodic restless bandits with unknown parameters. We consider a general policy map to define our competitor and prove an \tilde{O}(\sqrt{T}) Bayesian regret bound.
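As background for readers unfamiliar with the method, the following is a minimal sketch of Thompson sampling in the simplest stationary Bernoulli-bandit setting, not the episodic restless setting analyzed in the paper; the arm means, Beta(1,1) prior, and horizon are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed Bernoulli bandit; the true means are illustrative only.
true_means = np.array([0.3, 0.5, 0.7])
n_arms = len(true_means)

# Beta(1, 1) priors on each arm's unknown success probability.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

T = 5000
pulls = np.zeros(n_arms, dtype=int)
for t in range(T):
    # Sample a mean for each arm from its posterior, then act greedily
    # on the sampled values (the core of Thompson sampling).
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    reward = int(rng.random() < true_means[arm])
    # Conjugate Beta-Bernoulli posterior update.
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)
```

Over time the posterior concentrates and the best arm (index 2 here) is pulled most often; in the restless setting, the sampled object is instead a full system parameter, and the agent follows the mapped policy for that sample over an episode.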