Towards Simplicity in Deep Reinforcement Learning: Streamlined Off-Policy Learning
Wang, Che; Wu, Yanqiu; Vuong, Quan; Ross, Keith
arXiv.org Artificial Intelligence
ABSTRACT

The field of Deep Reinforcement Learning (DRL) has recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms. Their popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks. In this paper, we seek to understand the primary contribution of the entropy term to the performance of maximum entropy algorithms. For the MuJoCo benchmark, we demonstrate that the entropy term in Soft Actor Critic (SAC) principally addresses the bounded nature of the action spaces. With this insight, we propose a simple normalization scheme that allows a streamlined algorithm without entropy maximization to match the performance of SAC. Our experimental results demonstrate a need to revisit the benefits of entropy regularization in DRL. We also propose a simple nonuniform sampling method for selecting transitions from the replay buffer during training. We further show that the streamlined algorithm with the simple nonuniform sampling scheme outperforms SAC and achieves state-of-the-art performance on challenging continuous control tasks.

1 INTRODUCTION

Off-policy deep Reinforcement Learning (RL) algorithms aim to improve sample efficiency by reusing past experience. Recently, a number of new off-policy deep RL algorithms have been proposed for control tasks with continuous state and action spaces, including Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3) (Lillicrap et al., 2015; Fujimoto et al., 2018). TD3, in particular, has been shown to be significantly more sample efficient than popular on-policy methods for a wide range of MuJoCo benchmarks.

The field of Deep Reinforcement Learning (DRL) has also recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms. Their popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks.
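The abstract names two ingredients, a normalization scheme for bounded action spaces and nonuniform sampling from the replay buffer, without giving their details. The sketch below illustrates one plausible reading of each and should not be taken as the paper's exact method. The function names, the rescaling rule (divide by the mean absolute output when it exceeds 1), and the `recent_fraction` parameter are all assumptions made for illustration.

```python
import math
import random


def squash_action(pre_activations, normalize=True):
    """Map unbounded policy outputs into the bounded action range (-1, 1).

    Assumption (not stated in the abstract): when the mean absolute value
    of the outputs exceeds 1, rescale them before applying tanh so the
    squashed actions do not saturate at the action bounds.
    """
    if normalize:
        mean_abs = sum(abs(x) for x in pre_activations) / len(pre_activations)
        if mean_abs > 1.0:
            pre_activations = [x / mean_abs for x in pre_activations]
    return [math.tanh(x) for x in pre_activations]


def sample_recent_biased(buffer, batch_size, recent_fraction, rng=random):
    """Sample a batch uniformly from only the most recent part of the buffer.

    Hypothetical scheme: `recent_fraction` in (0, 1] controls how strongly
    sampling favors new transitions; 1.0 recovers uniform sampling over
    the whole buffer.
    """
    window = max(batch_size, int(len(buffer) * recent_fraction))
    window = min(window, len(buffer))
    recent = buffer[-window:]
    # Sampling with replacement keeps the sketch short.
    return [rng.choice(recent) for _ in range(batch_size)]
```

With `normalize=False`, large outputs such as `[10.0, -10.0]` are squashed almost exactly onto the bounds, which is the saturation problem that bounded action spaces create; with normalization the same outputs map to `tanh(1)` and `tanh(-1)`. In `sample_recent_biased`, a `recent_fraction` of 0.1 restricts a 1000-transition buffer to its latest 100 entries.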
Oct-10-2019