Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis

Neural Information Processing Systems 

Studies of the training dynamics of transformers commonly adopt several simplifications, such as weight reparameterization, attention linearization, special initialization, and the lazy regime.