The Role of Sparsity for Length Generalization in Transformers
