Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
Dong, Kefan, Wang, Yuanhao, Chen, Xiaoyu, Wang, Liwei
The goal of reinforcement learning (RL) is to construct algorithms that learn and plan in sequential decision-making systems when the underlying system dynamics are unknown. A typical model in RL is the Markov Decision Process (MDP). At each time step, the environment is in a state s. The agent may take an action a and obtain a reward, and the environment may then transition to another state. In reinforcement learning, the transition probability distribution is unknown. The algorithm must learn the transition dynamics of the MDP while aiming to maximize the cumulative reward. This creates an exploration-exploitation dilemma: whether to act to gain new information (explore) or to act consistently with past experience to maximize reward (exploit). Theoretical analyses of reinforcement learning fall into two broad categories: those assuming a simulator (a.k.a.
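The interaction loop and the exploration-exploitation trade-off described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy two-state MDP, the bonus constant `c`, the step-size schedule, and all function names are assumptions chosen only for illustration. The agent never reads the transition table directly; it only samples from it, acts greedily on optimistic Q-values, and adds a count-based bonus that shrinks as a state-action pair is visited more often.

```python
import math
import random


def make_toy_mdp():
    """Hypothetical 2-state, 2-action MDP with rewards in [0, 1]."""
    # P[s][a] is a list of (probability, next_state) pairs.
    P = {
        0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
        1: {0: [(1.0, 0)], 1: [(0.9, 1), (0.1, 0)]},
    }
    # R[s][a] is the deterministic reward for taking action a in state s.
    R = {0: {0: 0.1, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
    return P, R


def step(P, R, s, a, rng):
    """Sample one transition; this plays the role of the unknown environment."""
    u = rng.random()
    cum = 0.0
    for p, s_next in P[s][a]:
        cum += p
        if u < cum:
            return R[s][a], s_next
    # Fallback for floating-point round-off in the cumulative sum.
    return R[s][a], P[s][a][-1][1]


def q_learning_with_bonus(P, R, gamma=0.9, steps=20000, c=0.5, seed=0):
    """Tabular Q-learning with a count-based exploration bonus (illustrative)."""
    rng = random.Random(seed)
    states = sorted(P)
    actions = sorted(P[states[0]])
    h = 1.0 / (1.0 - gamma)   # effective horizon of the discounted problem
    v_max = h                 # rewards lie in [0, 1], so values lie in [0, v_max]
    # Optimistic initialization: unvisited pairs look as good as possible.
    Q = {s: {a: v_max for a in actions} for s in states}
    N = {s: {a: 0 for a in actions} for s in states}

    s = states[0]
    for _ in range(steps):
        # Exploit the current (optimistic) estimates.
        a = max(actions, key=lambda act: Q[s][act])
        r, s_next = step(P, R, s, a, rng)

        N[s][a] += 1
        k = N[s][a]
        alpha = (h + 1.0) / (h + k)   # assumed step-size schedule
        bonus = c / math.sqrt(k)      # assumed bonus; shrinks with visits

        target = r + bonus + gamma * max(Q[s_next].values())
        # Clip so the estimate never exceeds the largest possible value.
        Q[s][a] = min(v_max, (1.0 - alpha) * Q[s][a] + alpha * target)
        s = s_next
    return Q, N


if __name__ == "__main__":
    P, R = make_toy_mdp()
    Q, N = q_learning_with_bonus(P, R)
    for s in sorted(Q):
        print(s, Q[s], N[s])
```

The bonus makes rarely tried actions look attractive (explore), while the greedy choice over Q favors actions that have paid off so far (exploit). The visit counts in `N` show how the balance shifts as the bonus decays.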
Jan-26-2019
- Country:
- North America > United States (0.14)
- Asia > China (0.14)
- Europe > United Kingdom > England (0.14)
- Genre:
- Research Report (0.50)
- Industry:
- Health & Medicine > Pharmaceuticals & Biotechnology (0.41)
- Energy > Oil & Gas (0.34)