ZeroS: Zero-Sum Linear Attention for Efficient Transformers
Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang
Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations of these approaches: the restriction to convex combinations, which permits only additive blending of information, and a uniform accumulated-weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification yields mathematically stable weights that can take both positive and negative values, allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions beyond convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.
February 6, 2026
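The abstract's core construction can be sketched directly on causal softmax weights: subtract the constant zero-order term $1/t$ from each of a row's $t$ valid entries so the residuals sum to zero, then reweight them. The snippet below is a minimal illustrative sketch, not the paper's implementation: it uses a quadratic-time formulation for clarity rather than the paper's $O(N)$ linear-attention recurrence, and the function name `zeros_attention` and the scalar reweighting factor `alpha` are placeholders of ours, not the paper's notation.

```python
import torch

def zeros_attention(q, k, v, alpha=1.0):
    """Illustrative zero-sum attention weights (quadratic-time sketch)."""
    N, d = q.shape
    # Causal softmax weights: row t is a convex combination over positions 1..t.
    scores = (q @ k.T) / d ** 0.5
    causal = torch.tril(torch.ones(N, N, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    a = torch.softmax(scores, dim=-1)           # rows sum to 1, entries >= 0
    # Remove the zero-order term 1/t: row t has t valid positions, so
    # subtracting 1/t from each leaves residuals summing exactly to zero.
    t = torch.arange(1, N + 1, dtype=q.dtype).unsqueeze(-1)
    r = a - causal.to(q.dtype) / t              # zero-sum; entries may be negative
    # alpha is an assumed stand-in for the paper's residual reweighting.
    return (alpha * r) @ v                      # contrastive rather than purely blending

q, k, v = (torch.randn(8, 16) for _ in range(3))
out = zeros_attention(q, k, v)
print(out.shape)                                # torch.Size([8, 16])
```

Because each row of `r` sums to zero, the output is a signed contrast of value vectors rather than a convex average, which is the property the abstract credits for expanding the set of representable functions.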