Reviews: Optimistic posterior sampling for reinforcement learning: worst-case regret bounds
Neural Information Processing Systems
Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

This paper presents a new algorithm for efficient exploration in Markov decision processes. The algorithm is an optimistic variant of posterior sampling, similar in flavour to BOSS. The authors prove new minimax performance bounds for this approach that are state of the art in this setting. There are a lot of things to like about this paper:
- The paper is well written and clear overall. I would say that most of the key insights come from the earlier "Gaussian-Dirichlet dominance" argument of Osband et al., but there are some significant extensions and results that may be of wider interest to the community.
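For readers unfamiliar with the optimistic twist on posterior sampling the review alludes to, here is a minimal, hypothetical sketch in a Bernoulli-bandit setting (not the paper's MDP algorithm): for each arm, draw several samples from its posterior and keep the maximum, then act greedily with respect to those optimistic values. All function and variable names below are illustrative assumptions.

```python
import random

def optimistic_posterior_sample(successes, failures, num_samples=10, rng=None):
    """Toy optimistic posterior sampling for a Bernoulli bandit.

    For each arm, draw `num_samples` values from its Beta posterior and keep
    the maximum (the 'optimistic' twist, BOSS-style), then return the index
    of the arm with the highest optimistic value. Purely illustrative; the
    reviewed paper works with full MDPs, not bandits.
    """
    rng = rng or random.Random()
    values = []
    for s, f in zip(successes, failures):
        # Beta(s + 1, f + 1) posterior under a uniform prior.
        draws = [rng.betavariate(s + 1, f + 1) for _ in range(num_samples)]
        values.append(max(draws))
    return max(range(len(values)), key=lambda i: values[i])
```

Taking the maximum over several posterior draws inflates each arm's value estimate, which encourages exploration of under-sampled arms while still concentrating on good arms as the posteriors sharpen.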