Closing the Curvature Gap: Full Transformer Hessians and Their Implications for Scaling Laws

Petrov, Egor, Kiselev, Nikita, Meshkov, Vladislav, Grabovoy, Andrey

Oct-21-2025–arXiv.org Artificial Intelligence

The lack of theoretical results for Layer Normalization and feedforward Hessians has left a gap in the study of Transformer optimization landscapes. We address this by deriving explicit second-order expressions for these components, thereby completing the Hessian characterization of full Transformer blocks. Our results generalize prior self-attention analyses and yield estimations for the role of each sublayer in curvature propagation. We demonstrate how these Hessian structures inform both convergence dynamics and the empirical scaling laws governing large-model performance. Further, we propose a Taylor-expansion-based framework for analyzing loss differences to quantify convergence trajectories. By extending Hessian theory to the full Transformer architecture, this work establishes a new foundation for theoretical and empirical investigations of optimization in large-scale deep learning.

layernorm, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

Oct-21-2025

arXiv.org PDF

Add feedback

Country:
- North America (0.28)

Genre:
- Research Report > New Finding (0.66)

Industry:
- Health & Medicine (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.86)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found