Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers

Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, David Z. Pan
Chandra Family Department of Electrical and Computer Engineering, The University of Texas at Austin

Neural Information Processing Systems 

Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and stabilizing the training of Transformers.
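For reference, here is a brief recap of the two normalization operations named above, using their standard definitions (the notation $\mathbf{x} \in \mathbb{R}^d$ for the input vector and $\boldsymbol{\gamma}, \boldsymbol{\beta}$ for the learnable gain and bias is introduced here, not taken from the abstract):

$$\mathrm{LN}(\mathbf{x}) = \frac{\mathbf{x} - \mu(\mathbf{x})\,\mathbf{1}}{\sigma(\mathbf{x})} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}, \qquad \mu(\mathbf{x}) = \frac{1}{d}\sum_{i=1}^{d} x_i, \quad \sigma(\mathbf{x}) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} \bigl(x_i - \mu(\mathbf{x})\bigr)^2}$$

$$\mathrm{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\mathrm{RMS}(\mathbf{x})} \odot \boldsymbol{\gamma}, \qquad \mathrm{RMS}(\mathbf{x}) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}$$

Note that RMSNorm drops the mean-centering step and the bias term, which is the source of its lower computational cost relative to LayerNorm.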