Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis

Mar-20-2026, 13:56:36 GMT–Neural Information Processing Systems

Understanding the training dynamics of transformers is important for explaining the impressive capabilities of large language models. In this work, we study the dynamics of training a shallow transformer on the task of recognizing the co-occurrence of two designated words. Prior work on the training dynamics of transformers commonly adopts simplifications such as weight reparameterization, attention linearization, special initialization, and the lazy regime. In contrast, we analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear MLP layer from random initialization, and provide a framework for analyzing such dynamics via a coupled dynamical system. We establish that training attains near-minimum loss and characterize the attention model after training.
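To make the setting concrete, the following is a minimal sketch of a co-occurrence label and a forward pass through one softmax attention layer with three matrices followed by a linear readout. The abstract does not specify the architecture details, so everything here is an assumption made for illustration: the one-hot embeddings, the mean pooling, the ±1 labels, the initialization scale, and all names (`cooccurrence_label`, `init_params`, `forward`, `W_Q`, `W_K`, `W_V`, `w_out`). It does not implement the gradient flow analysis itself.

```python
import math
import random


def cooccurrence_label(tokens, word_a, word_b):
    """Return +1.0 if both designated words appear in the sequence, else -1.0."""
    return 1.0 if (word_a in tokens and word_b in tokens) else -1.0


def matmul(A, B):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols, scale):
    """Matrix with i.i.d. Gaussian entries (assumed random initialization)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(rng, d, scale=0.1):
    """Three attention matrices and a linear readout, all randomly initialized."""
    return {
        "W_Q": random_matrix(rng, d, d, scale),
        "W_K": random_matrix(rng, d, d, scale),
        "W_V": random_matrix(rng, d, d, scale),
        "w_out": [rng.gauss(0.0, scale) for _ in range(d)],
    }


def forward(tokens, params, d):
    """Scalar score from one attention layer plus a linear readout."""
    # Assumption: one-hot token embeddings, so the vocabulary size equals d.
    X = [[1.0 if j == t else 0.0 for j in range(d)] for t in tokens]
    Q = matmul(X, params["W_Q"])
    K = matmul(X, params["W_K"])
    V = matmul(X, params["W_V"])
    K_T = [list(col) for col in zip(*K)]
    attn = [softmax(r) for r in matmul(Q, K_T)]  # each row sums to 1
    H = matmul(attn, V)
    # Assumption: mean-pool over positions before the linear layer.
    pooled = [sum(col) / len(H) for col in zip(*H)]
    return sum(p * w for p, w in zip(pooled, params["w_out"]))


rng = random.Random(0)
params = init_params(rng, d=5)
tokens = [3, 0, 2, 1]  # contains both designated words 0 and 1
print(cooccurrence_label(tokens, 0, 1), forward(tokens, params, d=5))
```

In this sketch, training would adjust all four parameter groups at once so that the sign of `forward` matches the label. Gradient flow is the continuous-time limit of doing that with vanishingly small gradient descent steps.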

artificial intelligence, machine learning, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.59)
  - Machine Learning (0.36)