Understanding and Improving Length Generalization in Recurrent Models

Ricardo Buitrago Ruiz, Albert Gu

arXiv.org Artificial Intelligence 

In addition to matching the performance of Transformers (Vaswani et al. 2017) across many tasks, the recurrent mechanism brings two benefits: the ability to efficiently process long sequences thanks to its linear complexity, and the capacity to process tokens beyond the training context by simply rolling out the state. Nevertheless, in practice these benefits often go unrealized, as the performance of recurrent models can drop considerably when the sequence length exceeds the training context (Ben-Kish et al. 2024; Waleffe et al. 2024; Ye et al. 2025; Yuan et al. 2024). This naturally leads to two questions: (1) why do these models fail to length generalize? and (2) how can we efficiently enable length generalization across several recurrent models? Recently, some works have studied the length generalization of Mamba (Dao and Gu 2024) and have proposed solutions such as forcing the model to forget previous context (Yingfa Chen et al. 2024) or skipping tokens in the state update to reduce the effective context of the processed sequence (Ben-Kish et al. 2024; Ye et al. 2025). However, these methods require changing Mamba's internal mechanism and might not transfer easily to other architectures. Other works have linked length generalization to state capacity and overfitting (Yingfa Chen et al. 2024; S. Wang 2024), proposing training on longer sequences and with Truncated Backpropagation Through Time (TBTT) (Sutskever 2013; Williams and J. Peng 1990) as a way to enable length generalization. In this work, we reason about the distribution of states that the model is trained on to introduce a precise hypothesis that explains why recurrent models fail to length generalize. Moreover, we perform comprehensive interventions that elucidate on which distributions of states recurrent models need to be trained to enable length generalization.
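To make the two mechanisms mentioned above concrete, here is a minimal PyTorch sketch of (1) rolling out a recurrent state beyond the training context, and (2) TBTT-style segment training that carries the state across truncated segments instead of resetting it to zero, thereby exposing the model to states it never visits when every training sequence starts from h = 0. The module name SimpleLinearRNN, the toy data, and all dimensions are illustrative assumptions, not the architecture or training setup from the paper.

```python
import torch
import torch.nn as nn

class SimpleLinearRNN(nn.Module):
    """Toy diagonal linear recurrence: h_t = a * h_{t-1} + B x_t, y_t = C h_t."""
    def __init__(self, d_in, d_state, d_out):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(d_state))  # decay, mapped to (0, 1)
        self.B = nn.Linear(d_in, d_state, bias=False)
        self.C = nn.Linear(d_state, d_out, bias=False)

    def forward(self, x, h0=None):
        # x: (batch, length, d_in). The state is updated one token at a time,
        # so cost is linear in length and extrapolation needs no retraining.
        batch, length, _ = x.shape
        a = torch.sigmoid(self.log_a)
        h = torch.zeros(batch, a.shape[0], device=x.device) if h0 is None else h0
        ys = []
        for t in range(length):
            h = a * h + self.B(x[:, t])
            ys.append(self.C(h))
        return torch.stack(ys, dim=1), h

model = SimpleLinearRNN(d_in=8, d_state=16, d_out=8)

# (1) Rollout beyond the training context: the same recurrence just keeps going,
# e.g. 4096 tokens for a model trained on 512-token sequences.
x_long = torch.randn(2, 4096, 8)
y_long, h_final = model(x_long)

# (2) TBTT-style training: detach the state between segments rather than
# resetting it to zero, so later segments are trained from "mature" states.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
h = None
for _ in range(4):
    x_seg, y_seg = torch.randn(2, 128, 8), torch.randn(2, 128, 8)  # toy data
    y_pred, h = model(x_seg, h0=h)
    loss = (y_pred - y_seg).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()  # carry the state forward, truncate the gradient
```

The `h = h.detach()` line is the crux of the TBTT intervention discussed above: gradients stop at segment boundaries, but the forward state distribution the model trains on is no longer confined to states reachable within a single training-length sequence from h = 0.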