On the Role of Attention Masks and LayerNorm in Transformers

Neural Information Processing Systems

Self-attention is the key mechanism of transformers, the essential building blocks of modern foundation models.