Efficient Deep Reinforcement Learning Requires Regulating Overfitting
Li, Qiyang, Kumar, Aviral, Kostrikov, Ilya, Levine, Sergey
Deep reinforcement learning algorithms that learn policies by trial and error must learn from limited amounts of data collected by actively interacting with the environment. While many prior works have shown that proper regularization techniques are crucial for enabling data-efficient RL, a general understanding of the bottlenecks in data-efficient RL has remained elusive. Consequently, it has been difficult to devise a universal technique that works well across all domains. In this paper, we attempt to understand the primary bottleneck in sample-efficient deep RL by examining several potential hypotheses, such as non-stationarity, excessive action distribution shift, and overfitting. We perform a thorough, controlled, and systematic empirical analysis on state-based DeepMind Control Suite (DMC) tasks and show that high temporal-difference (TD) error on a validation set of transitions is the main culprit that severely degrades the performance of deep RL algorithms, and that prior methods that achieve good performance do, in fact, keep the validation TD error low. This observation gives us a robust principle for making deep RL efficient: we can hill-climb on the validation TD error using any regularization technique from supervised learning. We show that a simple online model selection method that targets the validation TD error is effective across state-based DMC and Gym tasks.

Reinforcement learning (RL) methods, when combined with high-capacity deep neural network function approximators, have shown promise in domains such as robot manipulation (Andrychowicz et al., 2020), chip placement (Mirhoseini et al., 2020), games (Silver et al., 2016), and data-center cooling (Lazic et al., 2018). Since every unit of active online data collection comes at an expense (e.g., running real robots, chip evaluation using simulation), it is important to develop sample-efficient deep RL algorithms that can learn effectively even with a limited amount of experience. Devising such efficient RL algorithms has been an important thread of research in recent years (Janner et al., 2019; Chen et al., 2021; Hiraoka et al., 2021). In principle, off-policy RL methods (e.g., SAC (Haarnoja et al., 2018), TD3 (Fujimoto et al., 2018), Rainbow (Hessel et al., 2018)) should provide good sample efficiency, because they make it possible to improve the policy and value functions over many gradient steps per step of data collection. However, this benefit does not appear to be realizable in practice: taking too many training steps per collected transition actually harms performance in many environments.
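The monitoring recipe implied by this principle is straightforward. Below is a minimal sketch, not the authors' implementation, of how one might estimate the validation TD error for an actor-critic agent such as SAC or TD3; the names critic, target_critic, policy, and val_batch are hypothetical placeholders for the agent's components and a held-out set of transitions.

```python
# Minimal sketch (assumed interfaces, not the paper's code): measure the critic's
# TD error on held-out transitions as a proxy for value-function overfitting.
import torch

@torch.no_grad()
def validation_td_error(critic, target_critic, policy, val_batch, gamma=0.99):
    """Mean squared TD error of the critic on a held-out validation batch."""
    obs, act, rew, next_obs, done = val_batch  # tensors from transitions never used for training

    # Bootstrapped target using actions from the current policy (SAC/TD3-style backup).
    next_act = policy(next_obs)
    target_q = rew + gamma * (1.0 - done) * target_critic(next_obs, next_act)

    # High values here, relative to training TD error, indicate overfitting.
    return ((critic(obs, act) - target_q) ** 2).mean().item()
```

In the spirit of the online model selection described above, one could train candidate agents with different regularizers (e.g., dropout, layer normalization, or different update-to-data ratios) and periodically keep the candidate with the lowest validation TD error; this particular selection loop is an illustrative assumption rather than the paper's exact procedure.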
arXiv.org Artificial Intelligence
Apr-20-2023