Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

Dong, Kefan, Wang, Yuanhao, Chen, Xiaoyu, Wang, Liwei

Jan-26-2019–arXiv.org Machine Learning

The goal of reinforcement learning (RL) is to construct algorithms that learn and plan in sequential decision-making systems when the underlying system dynamics are unknown. A typical model in RL is the Markov Decision Process (MDP). At each time step, the environment is in some state s. The agent takes an action a and obtains a reward, and the environment then transitions to another state. In reinforcement learning, the transition probability distribution is unknown. The algorithm must learn the transition dynamics of the MDP while aiming to maximize the cumulative reward. This creates an exploration-exploitation dilemma: whether to act to gain new information (explore) or to act consistently with past experience to maximize reward (exploit). Theoretical analysis of reinforcement learning falls into two broad categories: those assuming a simulator (a.k.a.
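The interaction loop and the explore/exploit trade-off described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm analyzed in the paper: the two-state MDP (`P`, `R`), the function names `step` and `run`, the step size `1/k`, and the bonus `bonus_scale / sqrt(k)` are all hypothetical choices made for the example. The agent never reads `P` directly; it only observes sampled transitions, and a count-based bonus on an optimistically initialized Q-table pushes it toward rarely tried actions.

```python
import math
import random

# Hypothetical 2-state, 2-action MDP, used only for illustration.
# P[s][a] lists (next_state, probability); R[s][a] is a reward in [0, 1].
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.1), (1, 0.9)]},
}
R = {0: {0: 0.0, 1: 0.1}, 1: {0: 0.2, 1: 1.0}}
ACTIONS = (0, 1)


def step(rng, s, a):
    """Environment side: sample (reward, next_state) for taking a in s."""
    u, acc = rng.random(), 0.0
    for s_next, p in P[s][a]:
        acc += p
        if u < acc:
            return R[s][a], s_next
    return R[s][a], P[s][a][-1][0]  # guard against rounding in the sum


def run(num_steps=5000, gamma=0.9, bonus_scale=0.5, seed=0):
    """Tabular Q-learning with a count-based exploration bonus (a sketch)."""
    rng = random.Random(seed)
    v_max = 1.0 / (1.0 - gamma)  # largest possible discounted return
    # Optimistic initialization: untried actions look as good as possible.
    Q = {s: {a: v_max for a in ACTIONS} for s in P}
    N = {s: {a: 0 for a in ACTIONS} for s in P}
    s = 0
    for _ in range(num_steps):
        # Act greedily with respect to the optimistic estimates.
        a = max(ACTIONS, key=lambda act: Q[s][act])
        r, s_next = step(rng, s, a)
        N[s][a] += 1
        k = N[s][a]
        alpha = 1.0 / k                     # assumed step size
        bonus = bonus_scale / math.sqrt(k)  # assumed bonus, shrinks with visits
        target = r + bonus + gamma * max(Q[s_next].values())
        Q[s][a] = min(v_max, (1.0 - alpha) * Q[s][a] + alpha * target)
        s = s_next
    return Q, N


if __name__ == "__main__":
    Q, N = run()
    print("visit counts:", N)
    print("Q estimates:", Q)
```

Because the bonus shrinks as a state-action pair is visited more often, the agent's preference shifts over time from gathering information to exploiting its current estimates.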



