First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training
