From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
Neural Information Processing Systems
Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics remain poorly characterized beyond configuration-specific studies. Inspired by empirical evidence that small initialization scales improve reasoning capabilities in language models, we employ the gradient flow analytical framework established in \cite{zhou2022towards} to systematically investigate the training dynamics of linearized transformers.
Jun-11-2026, 18:28:42 GMT