Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

Simchowitz, Max, Jamieson, Kevin

arXiv.org Machine Learning 

Reinforcement learning (RL) is a powerful paradigm for modeling a learning agent's interactions with an unknown environment, in an attempt to accumulate as much reward as possible. Because of its flexibility, RL can encode such a vast array of different problem settings - many of which are entirely intractable. Therefore, it is crucial to understand what conditions make it possible for an RL agent to effectively learn about its environment. In this paper, we consider tabular Markov decision processes (MDPs), a canonical RL setting where the agent seeks to learn a policy mapping discrete states x S to one of finitely many actions a A, in attempt to maximize cumulative reward over an episode horizon H. We shall study the regret setting, where the learner plays a policy π for a sequence of episodes k 1, . . .

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found