Local to Global: Learning Dynamics and Effect of Initialization for Transformers

Neural Information Processing Systems

In recent years, transformer-based models have revolutionized deep learning, particularly in sequence modeling.