Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement Learning
–Neural Information Processing Systems
In this paper, we prove the first Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We simplify the learning problem using a discrete set of surrogate environments, and present a refined analysis of the information ratio using posterior consistency. This leads to an upper bound of order eO(H p dl1T) in the time inhomogeneous reinforcement learning problem where H is the episode length and dl1 is the Kolmogorov l1 dimension of the space of environments. We then find concrete bounds of dl1 in a variety of settings, such as tabular, linear and finite mixtures, and discuss how how our results are either the first of their kind or improve the state-of-the-art.
Neural Information Processing Systems
Apr-27-2026, 01:40:14 GMT
- Country:
- North America > United States (0.68)
- Genre:
- Research Report > New Finding (0.34)
- Industry:
- Education > Focused Education > Special Education (0.45)
- Technology: