Unraveling the Gradient Descent Dynamics of Transformers
Neural Information Processing Systems
By analyzing the loss landscape of a single Transformer layer using Softmax and Gaussian attention kernels, our work provides concrete answers to these questions.
Feb-17-2026, 06:30:27 GMT
- Country:
  - Asia > Middle East
    - Jordan (0.04)
  - North America > United States
    - California > Santa Clara County
      - Stanford (0.04)
    - Minnesota (0.04)
- Genre:
  - Research Report
    - Experimental Study (0.93)
    - New Finding (1.00)
- Technology: