Optimistic posterior sampling for reinforcement learning: worst-case regret bounds
Neural Information Processing Systems
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter.
Oct-3-2024, 04:20:16 GMT
- Country:
- North America > United States
- California > Los Angeles County > Long Beach (0.04)
- Europe > United Kingdom
- England > Greater London > London (0.04)