Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

Dong, Kefan, Wang, Yuanhao, Chen, Xiaoyu, Wang, Liwei

arXiv.org Machine Learning 

The goal of reinforcement learning (RL) is to construct algorithms that learn and plan in sequential decision-making systems when the underlying system dynamics are unknown. A typical model in RL is the Markov Decision Process (MDP). At each time step, the environment is in a state s. The agent takes an action a and obtains a reward, and the environment may then transition to another state. In reinforcement learning, the transition probability distribution is unknown. The algorithm needs to learn the transition dynamics of the MDP while aiming to maximize the cumulative reward. This poses an exploration-exploitation dilemma: whether to act to gain new information (explore) or to act consistently with past experience to maximize reward (exploit). Theoretical analyses of reinforcement learning fall into two broad categories: those assuming a simulator (a.k.a.
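
The interaction loop and the exploration-exploitation trade-off described above can be illustrated with a minimal sketch of tabular Q-learning with a count-based optimism bonus. This is not the algorithm analyzed in the paper: the two-state MDP, the function name `run_q_learning_ucb`, the bonus form `bonus_scale / sqrt(n)`, and the `1/n` learning rate are all assumptions chosen for illustration only.

```python
import math
import random


def run_q_learning_ucb(num_steps=5000, gamma=0.9, bonus_scale=1.0, seed=0):
    """Illustrative tabular Q-learning with a UCB-style bonus (assumed form)."""
    rng = random.Random(seed)
    n_states, n_actions = 2, 2

    # Hypothetical environment, hidden from the agent:
    # p_next_is_1[s][a] is the probability of moving to state 1.
    p_next_is_1 = [[0.1, 0.8], [0.3, 0.9]]
    reward = [[0.0, 0.0], [0.5, 1.0]]  # rewards lie in [0, 1]

    # Optimistic initialization: no value can exceed 1 / (1 - gamma).
    v_max = 1.0 / (1.0 - gamma)
    q = [[v_max] * n_actions for _ in range(n_states)]
    counts = [[0] * n_actions for _ in range(n_states)]

    state = 0
    total_reward = 0.0
    for _ in range(num_steps):
        # Act greedily with respect to the optimistic estimates; the bonus
        # keeps rarely tried actions attractive, which drives exploration.
        action = max(range(n_actions), key=lambda a: q[state][a])

        # The environment returns a reward and a (random) next state.
        r = reward[state][action]
        next_state = 1 if rng.random() < p_next_is_1[state][action] else 0

        counts[state][action] += 1
        n = counts[state][action]
        lr = 1.0 / n                         # assumed learning-rate schedule
        bonus = bonus_scale / math.sqrt(n)   # assumed bonus, shrinks with visits

        # Q-learning update toward the bonus-augmented one-step target,
        # clipped so the estimate never exceeds the trivial upper bound.
        target = r + bonus + gamma * max(q[next_state])
        q[state][action] = min(v_max, (1.0 - lr) * q[state][action] + lr * target)

        state = next_state
        total_reward += r

    return q, counts, total_reward


if __name__ == "__main__":
    q, counts, total_reward = run_q_learning_ucb()
    print("Q estimates:", q)
    print("Visit counts:", counts)
    print("Total reward:", total_reward)
```

The agent never reads `p_next_is_1` directly; it only observes sampled transitions, which mirrors the unknown-dynamics setting described above. Setting `bonus_scale` to 0 removes the bonus term, so the only remaining optimism comes from the initial values.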
