Delayed Attention Training Improves Length Generalization in Transformer--RNN Hybrids
