Understanding the Failure of Batch Normalization for Transformers in NLP

Jiaxi Wang 1, Ji Wu 1,2, Lei Huang 3

1 Department of Electronic Engineering, Tsinghua University

Neural Information Processing Systems 

Batch Normalization (BN) is a core and prevalent technique for accelerating the training of deep neural networks and improving generalization on Computer Vision (CV) tasks.