Gated Slot Attention for Efficient Linear-Time Sequence Modeling
Neural Information Processing Systems
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short of traditional Transformers on recall-intensive tasks and demand significant resources to train from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory Control (ABC [63]) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA [96]). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining a compact recurrent state. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R [41]) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
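To make the abstract's description concrete, below is a minimal sketch of one recurrent step of a gated slot-attention layer: a bounded set of memory slots is decayed by a learned per-slot gate (adaptive forgetting) and read via a softmax over slots (context-aware memory reading). This is not the authors' code; the function name `gsa_step`, the slot count `m`, the gate parameterization, and all shapes are illustrative assumptions.

```python
# Hypothetical sketch, not the paper's implementation: one token of
# recurrent inference for a gated slot-attention layer.
import torch
import torch.nn.functional as F

def gsa_step(q_t, k_t, v_t, alpha_t, K_slots, V_slots):
    """One recurrent step (shapes are assumptions, not from the paper).

    q_t, k_t, v_t : (d,)   query / key / value of the current token
    alpha_t       : (m,)   per-slot forget gate in (0, 1)
    K_slots       : (m, d) slot keys   (recurrent state)
    V_slots       : (m, d) slot values (recurrent state)
    """
    # Adaptive forgetting: decay old slot contents by alpha_t and write the
    # new token in proportion to (1 - alpha_t).
    K_slots = alpha_t[:, None] * K_slots + (1 - alpha_t)[:, None] * k_t
    V_slots = alpha_t[:, None] * V_slots + (1 - alpha_t)[:, None] * v_t

    # Context-aware memory reading: softmax over the m slots. Retaining this
    # softmax is what the abstract credits for easing T2R-style finetuning.
    scores = K_slots @ q_t                # (m,)
    read = F.softmax(scores, dim=-1)      # (m,)
    o_t = read @ V_slots                  # (d,)
    return o_t, K_slots, V_slots
```

Because the state is a fixed number of slots rather than a growing key-value cache, memory and per-token compute stay constant during inference, which is the efficiency argument the abstract makes.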