Infinite Limits of Multi-head Transformer Dynamics

Neural Information Processing Systems

In this work, we analyze various scaling limits of the training dynamics of transformer models in the feature learning regime.