Optimistic posterior sampling for reinforcement learning: worst-case regret bounds

Shipra Agrawal, Randy Jia

Neural Information Processing Systems 

We present an algorithm based on posterior sampling (also known as Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating and has a finite, though unknown, diameter.
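
To make the basic idea of posterior sampling concrete, here is a minimal sketch of the generic sampling step for a tabular MDP: keep transition counts for each (state, action) pair and draw a plausible transition model from a Dirichlet posterior. This is an illustration under stated assumptions, not the algorithm from the paper; in particular it does not include the optimistic modification the title refers to. The function names, the symmetric `prior` pseudo-count, and the example counts are all hypothetical.

```python
import random


def sample_transition_row(counts, rng, prior=1.0):
    """Draw one next-state distribution from a Dirichlet posterior.

    counts[j] is the number of observed transitions into state j from a
    fixed (state, action) pair. prior is an assumed symmetric Dirichlet
    pseudo-count (hypothetical choice, not taken from the paper).
    """
    # A Dirichlet sample is a vector of independent Gamma draws, normalized.
    draws = [rng.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mdp(transition_counts, rng, prior=1.0):
    """Sample a full transition model.

    transition_counts[s][a] is a list of next-state counts for taking
    action a in state s.
    """
    return [
        [sample_transition_row(row, rng, prior) for row in per_action]
        for per_action in transition_counts
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up counts for a 2-state, 2-action MDP.
    counts = [[[5, 1], [0, 0]], [[2, 2], [0, 7]]]
    model = sample_mdp(counts, rng)
    for s, per_action in enumerate(model):
        for a, row in enumerate(per_action):
            print(f"state {s}, action {a}: {row}")
```

In a full posterior sampling loop, the agent would plan in the sampled model, act on the resulting policy for a while, update the counts, and resample.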
