Sub-Linear Memory: How to Make Performers SLiM

Apr-25-2026, 11:00:30 GMT–Neural Information Processing Systems

Transformer architectures have become very popular, yet the original implementation requires O(L^2) serial time and memory as a function of the input length L. Recent works proposed various linear self-attention mechanisms, scaling only as O(L) for serial computation. We conduct a thorough complexity analysis of Performers, a class which includes most recent linear Transformer mechanisms. We note a remarkable computational flexibility: the gradient computation can be performed with no approximations using sublinear memory as a function of L (in addition to negligible storage for the input sequence), at the cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only O(1) memory and still requires O(L) time. Due to complete backward compatibility, this discovered time-memory tradeoff can be used for fine-tuning on low-memory devices in a decentralized fashion without any server computations.
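The O(L) serial scaling mentioned in the abstract can be illustrated with a minimal sketch of causal linear attention. This is an illustration under stated assumptions, not the paper's algorithm: the feature map `feature_map` (elu(x) + 1) is a hypothetical stand-in for the Performer feature map, the function names are invented here, and only the forward pass is shown, not the sublinear-memory gradient computation. The point is that the running sums `S` and `z` have a size that does not depend on L, whereas the reference version revisits every earlier position for each query.

```python
import math

def feature_map(x):
    # Hypothetical positive feature map, elu(x) + 1 (an assumption for
    # illustration; not the feature map used in the paper).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def quadratic_causal_attention(Q, K, V):
    """Reference: for each query, loop over all earlier keys. O(L^2) time."""
    d_v = len(V[0])
    out = []
    for i in range(len(Q)):
        phi_q = feature_map(Q[i])
        # Unnormalized attention weights of position i over positions 0..i.
        w = [sum(a * b for a, b in zip(phi_q, feature_map(K[j])))
             for j in range(i + 1)]
        norm = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(i + 1)) / norm
                    for c in range(d_v)])
    return out

def linear_causal_attention(Q, K, V):
    """Running-state version: O(L) time, state size independent of L."""
    m = len(feature_map(K[0]))
    d_v = len(V[0])
    S = [[0.0] * d_v for _ in range(m)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * m                        # running sum of phi(k_j)
    out = []
    for q, k, v in zip(Q, K, V):
        phi_k = feature_map(k)
        for a in range(m):
            z[a] += phi_k[a]
            for c in range(d_v):
                S[a][c] += phi_k[a] * v[c]
        phi_q = feature_map(q)
        norm = sum(phi_q[a] * z[a] for a in range(m))
        out.append([sum(phi_q[a] * S[a][c] for a in range(m)) / norm
                    for c in range(d_v)])
    return out
```

Both functions compute the same outputs up to floating-point error. The second one keeps only an m-by-d_v matrix and a length-m vector between steps, which is the property that makes time-memory tradeoffs of the kind described above possible.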

artificial intelligence, machine learning, natural language

Country:
- North America > United States > Minnesota (0.28)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
