Radar: Fast Long-Context Decoding for Any Transformer
Hao, Yongchang, Zhai, Mengyao, Hajimirsadeghi, Hossein, Hosseini, Sepidehsadat, Tung, Frederick
arXiv.org Artificial Intelligence
Transformer models have demonstrated exceptional performance across a wide range of applications. Although it forms the foundation of Transformer models, dot-product attention does not scale well to long-context data because its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with previous methods on a wide range of tasks. The results demonstrate that Radar achieves state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers. The code is publicly available at https://github.com/BorealisAI/
In particular, Transformer models take each input as a sequence of tokens and compute the embedding of each token for downstream tasks. Among all components, dot-product attention has been shown to be critical to the success of Transformer models (Choromanski et al., 2021). It not only enables parallel computation over sequences during training (Vyas et al., 2020), but also provides a high-quality method for sequence modeling (Sanford et al., 2023). Despite being at the core of Transformer models, dot-product attention is not ideal for long-context data: the time to process each token increases with context length, significantly reducing throughput on long-context data. Moreover, the maximum context length is limited during training, resulting in an inability to perform inference on long-context tasks.
Yet, many real-world applications are naturally long-context (Tay et al., 2021; Beltagy et al., 2020; Wu et al., 2024). For example, a code file could have more than 10K tokens (Lozhkov et al., 2024; Kocetkov et al., 2022).
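To make the scaling argument concrete, the following is a minimal sketch of a single decoding step: exact dot-product attention over all t context tokens, next to attention restricted to a small subset of the highest-scoring tokens. It is not the Radar algorithm. The names `attend` and `top_k_indices` are hypothetical, the data are random, and the selector here scores every key exactly, so it still costs O(t) per step; it only illustrates the idea of attending to the most important context tokens.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, indices=None):
    """Single-query dot-product attention over the selected context positions."""
    indices = list(range(len(keys))) if indices is None else list(indices)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in indices])
    out = [0.0] * len(values[0])
    for w, i in zip(weights, indices):
        for d, v in enumerate(values[i]):
            out[d] += w * v
    return out


def top_k_indices(query, keys, k):
    """Hypothetical selector (assumption): exact scoring of every key, O(t)."""
    scores = [dot(query, key) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


random.seed(0)
t, dim, k = 256, 8, 32  # context length, head dimension, tokens kept
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(t)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(t)]
query = [random.gauss(0, 1) for _ in range(dim)]

full = attend(query, keys, values)  # touches all t tokens
sparse = attend(query, keys, values, top_k_indices(query, keys, k))  # touches k tokens
gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(full, sparse)))
print(f"tokens attended: {t} vs {k}, L2 gap between outputs: {gap:.4f}")
```

Because the exact step touches all t cached tokens, generating T tokens costs 1 + 2 + ... + T = T(T+1)/2 key-query products in total, which is the quadratic growth described above. A method that finds the important tokens without scoring every key would be needed to lower that total.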
Mar-13-2025