Peri-LN: Revisiting Layer Normalization in the Transformer Architecture
Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, Kang Min Yoo
arXiv.org Artificial Intelligence
Designing Transformer architectures with the optimal layer normalization (LN) strategy, one that ensures large-scale training stability and expedites convergence, has remained elusive, even in this era of large language models (LLMs). To this end, we present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformer training. Pre-LN and Post-LN have long dominated standard practice despite their limitations in large-scale training. Recently, however, several open-source large-scale models have begun silently adopting a third strategy without much explanation: placing LN peripherally around sublayers, a design we term Peri-LN. While Peri-LN has demonstrated promising empirical performance, its precise mechanisms and benefits remain almost unexplored. Our in-depth analysis shows that Peri-LN strikes an ideal balance in variance growth, unlike Pre-LN and Post-LN, which are prone to vanishing gradients and "massive activations." To validate our theoretical insight, we conduct large-scale experiments on Transformers of up to 3.2B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and greater convergence stability. Our results suggest that Peri-LN warrants broader consideration for large-scale Transformer architectures, providing renewed insight into the optimal placement and application of LN.
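To make the three placements concrete, below is a minimal PyTorch sketch (not the authors' released code) contrasting Post-LN, Pre-LN, and Peri-LN within one residual block. The class and argument names are illustrative, and Peri-LN is assumed here to normalize both the input and the output of each sublayer before the residual addition, as "peripherally around sublayers" suggests.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative residual block showing the three LN placements."""
    def __init__(self, d_model, sublayer, placement="peri"):
        super().__init__()
        self.sublayer = sublayer          # e.g., attention or MLP module
        self.placement = placement
        self.ln_in = nn.LayerNorm(d_model)
        self.ln_out = nn.LayerNorm(d_model)

    def forward(self, x):
        if self.placement == "post":
            # Post-LN: normalize after the residual addition.
            return self.ln_out(x + self.sublayer(x))
        if self.placement == "pre":
            # Pre-LN: normalize only the sublayer input; the residual
            # stream itself is never normalized, so its variance can grow.
            return x + self.sublayer(self.ln_in(x))
        # Peri-LN: normalize both the sublayer input and its output,
        # bounding the update added back to the residual stream.
        return x + self.ln_out(self.sublayer(self.ln_in(x)))

# Usage: a toy sublayer stands in for attention/MLP.
block = TransformerBlock(512, nn.Linear(512, 512), placement="peri")
y = block(torch.randn(8, 16, 512))
```

Under this reading, the extra output normalization is what tempers the hidden-state variance growth that the abstract attributes to Pre-LN.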
Feb 6, 2025