Understanding the Failure of Batch Normalization for Transformers in NLP
