Reviews: Learning Unknown Markov Decision Processes: A Thompson Sampling Approach

Oct-7-2024, 21:32:00 GMT–Neural Information Processing Systems

The paper proposes TSDE, a posterior sampling algorithm for RL in the average reward infinite horizon setting. This algorithm uses dynamic episodes but unlike Lazy-PSRL avoids technical issues by not only terminating an episode when an observation count doubled but also terminating episodes when they become too long. This ensures that the episode length cannot grow faster than linear and ultimately a Bayesian regret bound of O(HS(AT) .5) is shown. Posterior sampling methods typically outperform UCB-type algorithms and therefore a posterior sampling algorithm for non-episodic RL with rigorous regret bounds is desirable. This paper proposes such an algorithm, which is of high interest.

algorithm, learning unknown markov decision process, thompson sampling approach, (4 more...)

Neural Information Processing Systems

Oct-7-2024, 21:32:00 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.40)