From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
Neural Information Processing Systems
Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics remain poorly characterized beyond configuration-specific studies. Motivated by empirical evidence that small initialization scales improve reasoning capabilities in language models, we use the gradient flow analytical framework established in Zhou et al. [2022] to systematically investigate the training dynamics of linearized Transformers.
Jun-16-2026, 13:18:22 GMT