Target Network and Truncation Overcome The Deadly Triad in $Q$-Learning
Chen, Zaiwei; Clarke, John Paul; Maguluri, Siva Theja
The Deep Q-Network (Mnih et al., 2015), a typical example of Q-learning with function approximation, is one of the most successful algorithms for solving the reinforcement learning (RL) problem, and hence is viewed as a milestone in the development of modern RL. On the other hand, the behavior of Q-learning with function approximation is not well understood theoretically, and was identified in Sutton (1999) as one of the four most important theoretical open problems. In fact, the infamous deadly triad (Sutton, 2015) is present in Q-learning with function approximation, and hence even in the basic setting where linear function approximation is used, the algorithm was shown to be unstable in general (Baird, 1995). Although the theory was unclear, it was empirically evident from Mnih et al. (2015) that three ingredients together (experience replay, a target network, and truncation) overcome the divergence of Q-learning with function approximation. In this work, we focus on Q-learning with linear function approximation for infinite-horizon discounted Markov decision processes (MDPs), and show theoretically that a target network together with truncation is sufficient to provably stabilize Q-learning. The main contributions of this work are summarized in the following.
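As an illustration of how a target network and truncation can be combined with linear Q-learning, the following is a minimal Python sketch. It is not the authors' algorithm as stated in the paper. The function and parameter names are hypothetical, state-action pairs are assumed to be sampled i.i.d. and uniformly from a known model, rewards are assumed to lie in [0, 1], and the truncation level 1/(1 - gamma) is an assumption based on the standard bound on the optimal Q-function under that reward range.

```python
import numpy as np


def truncate(x, bound):
    """Clip values to [-bound, bound] (the truncation step)."""
    return np.clip(x, -bound, bound)


def q_learning_target_truncation(P, R, Phi, gamma, outer_iters,
                                 inner_iters, step_size, seed=0):
    """Linear Q-learning with a target parameter and truncated targets.

    P:   (S, A, S) transition probabilities (assumed known, for sampling).
    R:   (S, A) rewards, assumed to lie in [0, 1].
    Phi: (S, A, d) feature vectors, so Q(s, a) is approximated by
         Phi[s, a] @ theta.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions, d = Phi.shape
    # Assumed truncation level: |Q*| <= 1 / (1 - gamma) for rewards in [0, 1].
    bound = 1.0 / (1.0 - gamma)

    theta_target = np.zeros(d)
    for _ in range(outer_iters):
        # The target parameter stays frozen during the whole inner loop.
        theta = theta_target.copy()
        for _ in range(inner_iters):
            # Assumption: i.i.d. uniform sampling of (s, a).
            s = rng.integers(n_states)
            a = rng.integers(n_actions)
            s_next = rng.choice(n_states, p=P[s, a])

            # Truncate the target estimate before taking the max.
            next_q = truncate(Phi[s_next] @ theta_target, bound)
            td_target = R[s, a] + gamma * next_q.max()

            # Semi-gradient step on the online parameter only.
            td_error = td_target - Phi[s, a] @ theta
            theta = theta + step_size * td_error * Phi[s, a]
        # Synchronize the target parameter with the online parameter.
        theta_target = theta
    return theta_target
```

Freezing `theta_target` in the inner loop turns each outer iteration into a regression toward a fixed target, and the clipping keeps that target bounded even when the linear estimate is not. Both choices in this sketch are one plausible reading of the ingredients named in the abstract.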
May-3-2022