Unraveling the Gradient Descent Dynamics of Transformers

Neural Information Processing Systems 

By analyzing the loss landscape of a single Transformer layer using Softmax and Gaussian attention kernels, our work provides concrete answers to these questions.
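To make the two attention kernels named above concrete, the following is a minimal sketch, not the construction analyzed in this work. It assumes the Softmax kernel scores a query-key pair by a scaled inner product, and the Gaussian kernel scores it by squared Euclidean distance. The function names, the bandwidth parameter `sigma`, the scaling by the square root of the dimension, and the choice to leave the Gaussian weights unnormalized are all illustrative assumptions.

```python
import math

def softmax_weights(query, keys):
    """Softmax kernel (assumed form): weights proportional to exp(<q, k> / sqrt(d))."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_weights(query, keys, sigma=1.0):
    """Gaussian kernel (assumed form): unnormalized exp(-||q - k||^2 / (2 sigma^2))."""
    out = []
    for key in keys:
        sq_dist = sum((q - k) ** 2 for q, k in zip(query, key))
        out.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return out

# Hypothetical toy inputs: one query and three keys in two dimensions.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
soft = softmax_weights(query, keys)
gauss = gaussian_weights(query, keys)
```

In this toy setup both kernels give the largest weight to the key that coincides with the query. The Softmax weights depend on inner products and sum to one, while the Gaussian weights depend only on distance and each lie in (0, 1].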