Local to Global: Learning Dynamics and Effect of Initialization for Transformers
