Impact of Layer Norm on Memorization and Generalization in Transformers
