Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation