Target Network and Truncation Overcome The Deadly Triad in $Q$-Learning

Chen, Zaiwei; Clarke, John Paul; Maguluri, Siva Theja

May-3-2022–arXiv.org Machine Learning

The Deep Q-Network (Mnih et al., 2015), a typical example of Q-learning with function approximation, is one of the most successful algorithms for solving the reinforcement learning (RL) problem, and hence is viewed as a milestone in the development of modern RL. On the other hand, the behavior of Q-learning with function approximation is not well understood theoretically, and was identified in Sutton (1999) as one of the four most important theoretical open problems. In fact, the infamous deadly triad (Sutton, 2015) is present in Q-learning with function approximation, and hence even in the basic setting where linear function approximation is used, the algorithm was shown to be unstable in general (Baird, 1995). Although the theory was unclear, Mnih et al. (2015) showed empirically that three ingredients together (experience replay, target network, and truncation) overcome the divergence of Q-learning with function approximation. In this work, we focus on Q-learning with linear function approximation for infinite horizon discounted Markov decision processes (MDPs), and show theoretically that a target network together with truncation is sufficient to provably stabilize Q-learning.
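As an illustration only, the following is a minimal sketch of how a target network and truncation can be combined with Q-learning under linear function approximation. It is not taken from the paper and should not be read as the authors' exact algorithm. The function name `q_learning_target_truncation`, the helper `truncate`, all parameter names, the uniform behavior policy, the constant step size, the assumption that rewards lie in [0, 1], and the truncation level 1/(1 - gamma) are assumptions made for this example.

```python
import numpy as np


def truncate(x, bound):
    """Clip x elementwise to the interval [-bound, bound]."""
    return np.clip(x, -bound, bound)


def q_learning_target_truncation(P, R, phi, gamma=0.9, outer_iters=20,
                                 inner_steps=2000, step_size=0.05, seed=0):
    """Illustrative sketch (hypothetical, not the paper's exact method).

    P:   (S, A, S) array of transition probabilities.
    R:   (S, A) array of rewards, assumed to lie in [0, 1].
    phi: (S, A, d) array of feature vectors for each state-action pair.
    Returns the final weight vector of shape (d,).
    """
    rng = np.random.default_rng(seed)
    S, A, d = phi.shape
    # Assumed truncation level: with rewards in [0, 1], the true
    # Q-values cannot exceed 1 / (1 - gamma) in magnitude.
    bound = 1.0 / (1.0 - gamma)

    theta_target = np.zeros(d)  # frozen "target network" weights
    s = rng.integers(S)         # initial state
    for _ in range(outer_iters):
        theta = theta_target.copy()
        for _ in range(inner_steps):
            a = rng.integers(A)                # uniform behavior policy
            s_next = rng.choice(S, p=P[s, a])  # sample next state
            # Bootstrap from the frozen target weights, truncated so the
            # regression target stays bounded.
            q_next = truncate(phi[s_next] @ theta_target, bound)
            target = R[s, a] + gamma * q_next.max()
            td_error = target - phi[s, a] @ theta
            theta = theta + step_size * td_error * phi[s, a]
            s = s_next
        # Synchronize the target weights only after the inner loop ends.
        theta_target = theta
    return theta_target
```

In this sketch the inner loop regresses onto a fixed, bounded target, and the target weights change only once per outer iteration. Experience replay, the third ingredient mentioned above, is deliberately left out.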

artificial intelligence, machine learning, reinforcement learning

Country:
- North America
  - Canada > Alberta (0.14)
  - United States > Texas
    - Travis County > Austin (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Reinforcement Learning (1.00)
  - Learning Graphical Models > Undirected Networks
    - Markov Models (0.48)
