From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics

Jun-11-2026, 18:28:42 GMT–Neural Information Processing Systems

Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics remain poorly characterized beyond configuration-specific studies. Motivated by empirical evidence that small initialization scales improve reasoning capabilities in language models, we use the gradient flow analytical framework established in \cite{zhou2022towards} to systematically investigate the training dynamics of linearized Transformers.
