Enhancing Linear Attention with Residual Learning

Xunhao Lai, Jialiang Kang, Jianqiao Lu, Tong Lin, Pengyu Zhao

arXiv.org Artificial Intelligence 

Linear attention offers a linear-time alternative to self-attention but often struggles to capture long-range patterns. We revisit linear attention through a prediction-correction lens and show that prevalent variants can be written as a combination of a historical prediction and a single-token correction, which creates an expressivity bottleneck. To address this bottleneck, we introduce Residual Linear Attention (RLA), a framework that equips linear attention with an explicit residual-fitting mechanism. RLA maintains an auxiliary recurrent state that learns to accumulate residual errors over time and correct the base prediction. Our implementation leverages highly optimized linear attention kernels and preserves linear time and memory. Across language modeling and recall-intensive evaluations, RLA and RDN consistently outperform their respective baselines and other modern linear-attention methods, narrowing the gap to standard Transformers while retaining linear scaling.

The Transformer (Vaswani et al., 2017) architecture has become the standard for large language models. However, the quadratic time complexity of its self-attention mechanism remains a critical bottleneck, limiting its application to long sequences (Li et al., 2024).
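The prediction-correction view and the auxiliary residual state described in the abstract can be made concrete with a small recurrence. The sketch below is illustrative only: it assumes a single head, omits feature maps and normalization, and the residual rule it uses (accumulating the value-reconstruction error v_t − Sᵀk_t in a second state R) is a hypothetical stand-in, since the abstract does not specify RLA's exact residual-fitting mechanism.

```python
import torch

def linear_attention(q, k, v):
    """Vanilla linear attention as prediction + single-token correction.

    q, k, v: (seq_len, d) tensors (single head; feature maps omitted).
    Step t: o_t = S_{t-1}^T q_t + (k_t . q_t) v_t, with S_t = S_{t-1} + k_t v_t^T.
    """
    seq_len, d = q.shape
    S = torch.zeros(d, d)                      # recurrent state, sum of k_i v_i^T
    outs = []
    for t in range(seq_len):
        prediction = S.T @ q[t]                # history-based prediction from S_{t-1}
        correction = (k[t] @ q[t]) * v[t]      # correction contributed by the current token only
        outs.append(prediction + correction)
        S = S + torch.outer(k[t], v[t])        # rank-1 state update
    return torch.stack(outs)

def residual_linear_attention(q, k, v):
    """Hypothetical residual variant: an auxiliary state R accumulates the
    value-reconstruction error of each step and corrects the base prediction.
    RLA's actual residual-fitting rule may differ from this sketch.
    """
    seq_len, d = q.shape
    S = torch.zeros(d, d)                      # base linear-attention state
    R = torch.zeros(d, d)                      # auxiliary residual state
    outs = []
    for t in range(seq_len):
        base = S.T @ q[t] + (k[t] @ q[t]) * v[t]   # same output as vanilla linear attention
        err = v[t] - S.T @ k[t]                    # what S_{t-1} fails to reconstruct for k_t
        R = R + torch.outer(k[t], err)             # accumulate residual errors over time
        outs.append(base + R.T @ q[t])             # residual-corrected output
        S = S + torch.outer(k[t], v[t])
    return torch.stack(outs)
```

Both functions keep O(d^2) state per head and run in time linear in sequence length, matching the scaling claim above; an optimized kernel implementation of the kind the abstract mentions would typically replace the Python loop with chunked, parallel computation.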