An Investigation of Batch Normalization in Off-Policy Actor-Critic Algorithms

Li Wang, Sudun, Xingjian Zhang, Wenjun Wu, Lei Huang

arXiv.org Artificial Intelligence 

Batch Normalization (BN) has played a pivotal role in the success of deep learning by improving training stability, mitigating overfitting, and enabling more effective optimization. However, its adoption in deep reinforcement learning (DRL) has been limited due to the inherently non-i.i.d. nature of the training data. In this paper, we argue that, despite these challenges, BN retains unique advantages in DRL settings, particularly through its stochasticity and its ability to ease training. When applied appropriately, BN can adapt to evolving data distributions and enhance both convergence speed and final performance. To this end, we conduct a comprehensive empirical study of BN in off-policy actor-critic algorithms, systematically analyzing how different training and evaluation modes impact performance. We further identify failure modes that lead to instability or divergence, analyze their underlying causes, and propose the Mode-Aware Batch Normalization (MA-BN) method along with practical, actionable recommendations for robust BN integration in DRL pipelines. We also empirically validate that, in RL settings, MA-BN accelerates and stabilizes training, broadens the effective learning-rate range, enhances exploration, and reduces overall optimization difficulty.

Batch Normalization (BN) (Ioffe & Szegedy, 2015) has been a foundational technique in deep learning, playing a critical role in improving training stability (Santurkar et al., 2018), introducing stochasticity (Shekhovtsov & Flach, 2018; Huang et al., 2020), and enabling domain adaptation (Wang et al., 2020; Schneider et al., 2020). It has become a milestone in the development of deep neural networks due to its effectiveness in mitigating overfitting.
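To make the training/evaluation-mode distinction concrete, the following is a minimal NumPy sketch of 1-D batch normalization (not the paper's MA-BN implementation): in training mode the layer normalizes with the current batch's statistics, which injects the batch-dependent stochasticity discussed above, while in evaluation mode it reuses running averages accumulated during training.

```python
import numpy as np

class BatchNorm1d:
    """Minimal batch normalization over the batch axis.

    Illustrative sketch only: training mode uses per-batch statistics
    (a source of stochasticity), evaluation mode uses running averages.
    """

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)          # learnable scale
        self.beta = np.zeros(num_features)          # learnable shift
        self.running_mean = np.zeros(num_features)  # EMA of batch means
        self.running_var = np.ones(num_features)    # EMA of batch variances
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Normalize with the statistics of the current batch.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # Update running statistics via exponential moving average.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Normalize with the accumulated running statistics.
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1d(4)
batch = rng.normal(loc=3.0, scale=2.0, size=(32, 4))

train_out = bn(batch, training=True)  # zero mean, unit variance per feature
eval_out = bn(batch, training=False)  # differs: running stats lag the batch
```

In a DRL pipeline, which of the two modes is active when computing actor and critic targets is exactly the kind of choice the mode analysis above investigates: mixing them carelessly makes the same input produce different activations in different parts of the update.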