Transformers without Tears: Improving the Normalization of Self-Attention
