Towards Characterizing Divergence in Deep Q-Learning

Joshua Achiam, Ethan Knight, Pieter Abbeel

arXiv.org Artificial Intelligence 

Deep Q-Learning (DQL), a family of temporal difference algorithms for control, employs three techniques collectively known as the 'deadly triad' in reinforcement learning: bootstrapping, off-policy learning, and function approximation. Prior work has demonstrated that together these can lead to divergence in Q-learning algorithms, but the conditions under which divergence occurs are not well-understood. In this note, we give a simple analysis based on a linear approximation to the Q-value updates, which we believe provides insight into divergence under the deadly triad. The central point in our analysis is to consider when the leading order approximation to the deep-Q update is or is not a contraction in the sup norm.

The most common failure mode is divergence, where the Q-function approximator learns to ascribe unrealistically high values to state-action pairs, in turn destroying the quality of the greedy control policy derived from Q (van Hasselt et al., 2018). Divergence in DQL is often attributed to three components common to all DQL algorithms, which are collectively considered the 'deadly triad' of reinforcement learning (Sutton, 1988; Sutton & Barto, 2018); a toy sketch of their interaction follows the list:

- function approximation, in this case the use of deep neural networks,
- off-policy learning, the use of data collected on one policy to estimate the values of another, and
- bootstrapping, the use of the Q-function's own current estimates in constructing its learning targets.
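As context for the three components above, here is a minimal, self-contained sketch (not taken from the paper) of how they can interact to produce divergence, using the classic two-state w/2w example discussed in Sutton & Barto (2018, Ch. 11); the constants and variable names below are illustrative choices only.

```python
# Toy illustration (NOT from the paper) of divergence under the deadly triad:
#   - function approximation: one shared weight w, with V(s1) = w and V(s2) = 2*w
#   - bootstrapping: the TD target uses the current estimate of V(s2)
#   - off-policy sampling: updates are only ever made from s1
# Dynamics: s1 -> s2 deterministically with reward 0, so the true values are 0.
# The semi-gradient TD(0) update  w <- w + alpha * (gamma * 2*w - w) * 1
# nevertheless grows geometrically whenever 2 * gamma > 1.

gamma = 0.99   # discount factor
alpha = 0.1    # learning rate
w = 1.0        # shared weight; V(s1) = w, V(s2) = 2 * w

for _ in range(50):
    td_error = 0.0 + gamma * (2.0 * w) - w   # bootstrapped target minus current estimate
    w += alpha * td_error * 1.0              # feature of s1 is 1, so dV(s1)/dw = 1

print(w)  # roughly (1 + alpha * (2 * gamma - 1)) ** 50 = 1.098 ** 50, about 107, far from 0
```

Each leg of the triad matters for the blow-up here: with exact tabular values, or with unbootstrapped (Monte Carlo) targets, the same updates would instead drive the estimates toward the true value of 0.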
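The "linear approximation to the Q-value updates" mentioned in the abstract can be read as a first-order Taylor expansion of the Q-function in the parameters. The following is a hedged sketch in our own notation (the symbols rho, k_theta, T*, and U below are our labels, not necessarily the paper's), intended only to show what "the leading order approximation to the deep-Q update is or is not a contraction in the sup norm" refers to.

```latex
% Semi-gradient DQL parameter update, averaged over a state-action distribution \rho:
\[
\theta' \;=\; \theta + \alpha\,\mathbb{E}_{(s,a)\sim\rho}\!\left[
  \big(\mathcal{T}^{*}Q_{\theta}(s,a) - Q_{\theta}(s,a)\big)\,\nabla_{\theta}Q_{\theta}(s,a)
\right],
\qquad
\mathcal{T}^{*}Q(s,a) \;=\; \mathbb{E}\!\left[r + \gamma \max_{a'} Q(s',a')\right].
\]

% Leading-order (first-order Taylor) effect of that parameter step on the Q-value
% at an arbitrary query pair (\bar{s}, \bar{a}):
\[
Q_{\theta'}(\bar{s},\bar{a}) \;\approx\; Q_{\theta}(\bar{s},\bar{a})
 + \alpha\,\mathbb{E}_{(s,a)\sim\rho}\!\left[
   k_{\theta}\big((\bar{s},\bar{a}),(s,a)\big)\,
   \big(\mathcal{T}^{*}Q_{\theta}(s,a) - Q_{\theta}(s,a)\big)
 \right],
\]
\[
k_{\theta}\big((\bar{s},\bar{a}),(s,a)\big)
 \;=\; \nabla_{\theta}Q_{\theta}(\bar{s},\bar{a})^{\top}\,\nabla_{\theta}Q_{\theta}(s,a).
\]

% Writing U for the map that sends Q_\theta to the right-hand side above, the
% question posed in the abstract is whether U is a contraction in the sup norm:
\[
\exists\,\beta < 1 \;\;\text{such that}\;\;
\| U Q_{1} - U Q_{2} \|_{\infty} \;\le\; \beta\,\| Q_{1} - Q_{2} \|_{\infty}
\quad \text{for all } Q_{1}, Q_{2}.
\]
```

Informally, under this reading the contraction holds when each state-action pair's own Bellman-error term dominates the generalization (off-diagonal kernel) terms contributed by other pairs, and can fail when generalization and the discount factor together overwhelm it.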