Unraveling the Gradient Descent Dynamics of Transformers

Dec-26-2025, 21:03:51 GMT–Neural Information Processing Systems

While the Transformer architecture has achieved remarkable success across various domains, a thorough theoretical foundation explaining its optimization dynamics is yet to be fully developed. In this study, we aim to bridge this understanding gap by answering the following two core questions: (1) Which types of Transformer architectures allow Gradient Descent (GD) to achieve guaranteed convergence?

large language model, machine learning, natural language

Genre:
- Research Report > New Finding (0.58)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning
    - Neural Networks > Deep Learning (0.51)
    - Statistical Learning (0.47)
  - Natural Language > Large Language Model (0.51)