ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

Danieli, Federico, Rodriguez, Pau, Sarabia, Miguel, Suau, Xavier, Zappella, Luca

arXiv.org Artificial Intelligence 

Since its introduction by Vaswani et al. (2023), the Transformer architecture has quickly imposed itself as the de facto choice for sequence modeling, surpassing previous state-of-the-art, RNN-based models such as GRUs and LSTMs (Cho et al., 2014; Hochreiter & Schmidhuber, 1997). One key reason behind the rapid adoption of Transformers lies in the efficiency of their application at training time: their core sequence mixer, the attention mechanism, can be applied in parallel along the length of the input sequence. This effectively overcomes one main limitation of classical RNNs, whose application must be unrolled sequentially along the input sequence. More recently, however, interest in RNNs has been rekindled, largely due to their reduced memory footprint and improved efficiency at inference time. In particular, recent advancements in State Space Models (SSMs) such as Mamba (Gu & Dao, 2023; Dao & Gu, 2024) have started gaining popularity and are emerging as a potential alternative to Transformers, at least for small-to-mid-sized models (Zuo et al., 2024). To enable parallel training (and hence performance comparable to attention's), SSMs simplify the recurrence relationship at their core to one that is purely linear in the hidden state. This simplification makes the recurrence associative, so parallel reduction (scan) operations can quickly compute the output of an SSM over a whole input sequence in parallel along its length. Despite the success of modern SSMs, the linearity constraint remains a limitation which hinders their expressive power (Merrill et al., 2025; Cirone et al., 2025), one dictated by necessity rather than choice. With ParaRNN, we aim to overcome the constraint of linearity for SSMs while unlocking parallel training for nonlinear RNNs as well, thus enriching the space of viable options for sequence modeling.
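The associativity argument for linear recurrences can be made concrete with a small sketch. The linear recurrence h_t = a_t * h_{t-1} + b_t can be recast as composition of affine maps, which is associative; a prefix scan with that composition operator then recovers all hidden states, and each round of the scan is fully parallelizable. The snippet below (pure Python, illustrative only; function names are our own, not from the paper) checks a recursive-doubling scan against the sequential unrolling:

```python
def combine(e1, e2):
    # Compose two affine maps: first h -> a1*h + b1, then h -> a2*h + b2.
    # The result is h -> (a1*a2)*h + (a2*b1 + b2); this operator is associative.
    a1, b1 = e1
    a2, b2 = e2
    return (a1 * a2, a2 * b1 + b2)

def parallel_scan(elems):
    # Inclusive prefix scan by recursive doubling: O(log n) rounds, and all
    # combine() calls within a round are independent, hence parallelizable.
    elems = list(elems)
    n, step = len(elems), 1
    while step < n:
        new = elems[:]
        for i in range(step, n):
            new[i] = combine(elems[i - step], elems[i])
        elems = new
        step *= 2
    return elems

def sequential_rnn(a, b, h0=0.0):
    # Classical sequential unrolling of h_t = a_t * h_{t-1} + b_t.
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out
```

With h_0 = 0, the second component of each scanned pair equals the corresponding hidden state, so the two routines agree; a nonlinear recurrence h_t = f(h_{t-1}, x_t) admits no such associative reformulation, which is the gap ParaRNN targets.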