Export Reviews, Discussions, Author Feedback and Meta-Reviews
Neural Information Processing Systems
First provide a summary of the paper, and then address the following criteria: quality, clarity, originality, and significance.

Summary: This paper provides a Bayesian expected regret bound for the Posterior Sampling for Reinforcement Learning (PSRL) algorithm. PSRL was introduced by [Strens2000] and can be seen as the application of Thompson sampling to RL problems: a model is sampled from the (posterior) distribution over models, the optimal policy for the sampled model is computed, that policy is followed until the end of the episode, and the distribution over models is then updated. PSRL for finite MDPs was analyzed by [OVRR2013]; the main contribution of this paper is to analyze PSRL for MDPs with general state and action spaces. In the analysis, the authors use the concept of eluder dimension introduced by [RVR2013]. The eluder dimension was previously used in the analysis of bandit problems (for both Thompson Sampling and the Optimism in the Face of Uncertainty (OFU) approaches).
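The sample-plan-act-update loop described in the summary can be sketched for finite MDPs as follows. This is an illustrative sketch, not the paper's algorithm for general state spaces: the Dirichlet prior over transitions and the sample-mean reward estimate are assumptions made here for concreteness.

```python
import numpy as np


def sample_mdp(trans_counts, rew_sums, rew_counts, rng):
    """Sample an MDP from the posterior.

    Transitions: Dirichlet posterior per (state, action) pair (assumed prior).
    Rewards: posterior summarized by the empirical mean (a simplification).
    """
    S, A, _ = trans_counts.shape
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(trans_counts[s, a])
    R = rew_sums / np.maximum(rew_counts, 1)
    return P, R


def value_iteration(P, R, horizon):
    """Finite-horizon value iteration; returns the greedy policy per step."""
    S, A, _ = P.shape
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for t in reversed(range(horizon)):
        Q = R + P @ V            # Q has shape (S, A)
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy


def psrl(true_P, true_R, horizon, episodes, seed=0):
    """Run PSRL against a known true MDP and return the total reward collected."""
    rng = np.random.default_rng(seed)
    S, A, _ = true_P.shape
    trans_counts = np.ones((S, A, S))   # Dirichlet(1, ..., 1) prior (assumed)
    rew_sums = np.zeros((S, A))
    rew_counts = np.zeros((S, A))
    total_reward = 0.0
    for _ in range(episodes):
        # 1. Sample a model from the posterior.
        P, R = sample_mdp(trans_counts, rew_sums, rew_counts, rng)
        # 2. Compute the optimal policy for the sampled model.
        policy = value_iteration(P, R, horizon)
        # 3. Follow that policy until the end of the episode.
        s = 0
        for t in range(horizon):
            a = policy[t, s]
            r = true_R[s, a]
            s_next = rng.choice(S, p=true_P[s, a])
            # 4. Update the posterior with the observed transition and reward.
            trans_counts[s, a, s_next] += 1
            rew_sums[s, a] += r
            rew_counts[s, a] += 1
            total_reward += r
            s = s_next
    return total_reward
```

Note that, unlike OFU-style algorithms, no confidence sets are constructed: the randomness of the posterior sample alone drives exploration.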