JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
