The Curse of Depth in Large Language Models

Jun-14-2026, 07:10:38 GMT–Neural Information Processing Systems

In this paper, we re-introduce the Curse of Depth, a concept that re-introduces, explains, and addresses the recent observation in modern Large Language Models (LLMs) where deeper layers are much less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs, such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the variance of output of the layer normalization inversely by the square root of its depth.

artificial intelligence, large language model, natural language, (6 more...)

Neural Information Processing Systems

Jun-14-2026, 07:10:38 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)