First Attentions Last: Better Exploiting First Attentions for Efficient Parallel Training
