The Curse of Depth in Large Language Models

Neural Information Processing Systems 

In this paper, we re-introduce the Curse of Depth, a concept that explains and addresses the recent observation that deeper layers in modern Large Language Models (LLMs) are much less effective than expected. We first confirm that this phenomenon is widespread across the most popular LLM families, such as Llama, Mistral, DeepSeek, and Qwen. Our theoretical and empirical analysis identifies the widespread use of Pre-Layer Normalization (Pre-LN) as the underlying reason deep layers in LLMs are ineffective. While Pre-LN stabilizes the training of Transformer LLMs, its output variance grows exponentially with model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, so these blocks barely contribute to training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the variance of the layer normalization output inversely by the square root of its depth.
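
As a rough illustration of the idea behind LayerNorm Scaling, the sketch below applies a plain layer normalization and then multiplies the result by 1/sqrt(l), where l is a 1-based layer index. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact formula, so the fixed multiplier, the absence of learned gain and bias, and the names `layer_norm`, `layer_norm_scaled`, and `layer_index` are all hypothetical choices made for illustration.

```python
import math
from statistics import fmean, pvariance


def layer_norm(x, eps=1e-5):
    """Plain layer normalization over a 1-D list (no learned gain or bias)."""
    mean = fmean(x)
    var = pvariance(x, mu=mean)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def layer_norm_scaled(x, layer_index, eps=1e-5):
    """Layer norm followed by a 1/sqrt(layer_index) factor.

    Assumption: the scaling is a fixed multiplier on the normalized output,
    and layer_index counts layers starting from 1.
    """
    if layer_index < 1:
        raise ValueError("layer_index must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]


# Toy hidden state; the values are arbitrary and only for demonstration.
hidden = [0.5, -1.2, 3.0, 0.1, -0.7, 2.2, -2.9, 1.0]

for depth in (1, 4, 16):
    out = layer_norm_scaled(hidden, depth)
    # Print the layer index and the variance of the scaled output.
    print(depth, pvariance(out))
```

Under this reading, a multiplier of 1/sqrt(l) on the output shrinks the output variance roughly in proportion to 1/l, so deeper layers emit progressively smaller-variance normalized activations.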