Parallelizing Linear Transformers with the Delta Rule over Sequence Length Songlin Y ang Bailin Wang Y u Zhang Yikang Shen Y oon Kim Massachusetts Institute of Technology Soochow University

Feb-18-2026, 06:03:05 GMT–Neural Information Processing Systems

Transformers with linear attention (i.e., linear transfor mers) and state-space models have recently been suggested as a viable linear-time alt ernative to transformers with softmax attention. However, these models still underp erform transformers especially on tasks that require in-context retrieval. Whil e more expressive variants of linear transformers which replace the additive upda te in linear transformers with the delta rule [DeltaNet; 101 ] have been found to be more effective at associative recall, existing algorithms for training such mode ls do not parallelize over sequence length and are thus inefficient to train on modern ha rdware. This work describes a hardware-efficient algorithm for training line ar transformers with the delta rule, which exploits a memory-efficient representati on for computing products of Householder matrices [ 11 ]. This algorithm allows us to scale up DeltaNet to standard language modeling settings. We train a 1.3B mode l for 100B tokens and find that it outperforms recent linear-time baselines su ch as Mamba [ 31 ] and GLA [ 124 ] in terms of perplexity and zero-shot performance on downst ream tasks. We also experiment with two hybrid models which combine Delt aNet layers with (1) sliding-window attention layers every other layer or (2) two global attention layers, and find that these hybrids outperform strong transf ormer baselines.

arxiv preprint, large language model, machine learning, (17 more...)

Neural Information Processing Systems

Feb-18-2026, 06:03:05 GMT

Conferences PDF

Add feedback

Country:
- Oceania > Australia
  - New South Wales > Sydney (0.04)
  - Victoria > Melbourne (0.04)
- North America
  - Dominican Republic (0.04)
  - United States
    - Maryland > Baltimore (0.04)
    - Hawaii (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Massachusetts > Middlesex County
      - Cambridge (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - California
      - San Diego County > San Diego (0.04)
      - Monterey County > Monterey (0.04)
  - Canada > British Columbia
    - Vancouver (0.04)
- Europe
  - Austria (0.04)
  - Germany > Berlin (0.04)
  - Sweden > Stockholm
    - Stockholm (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - Singapore (0.04)
  - Indonesia > Bali (0.04)
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.04)
- Africa
  - Rwanda > Kigali
    - Kigali (0.04)
  - Ethiopia > Addis Ababa
    - Addis Ababa (0.04)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Education (0.67)
- Health & Medicine (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (0.88)
  - Machine Learning > Neural Networks
    - Deep Learning (0.67)

Duplicate Docs Excel Report

Title
Parallelizing Linear Transformers with the Delta Rule over Sequence Length Songlin Y ang Bailin Wang Y u Zhang Yikang Shen Y oon Kim Massachusetts Institute of Technology Soochow University

Similar Docs Excel Report more

Title	Similarity	Source
None found