Star Attention: Efficient LLM Inference over Long Sequences

Shantanu Acharya, Fei Jia, Boris Ginsburg

arXiv.org Artificial Intelligence 

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.

Recent Large Language Models (LLMs) can support contexts up to millions of tokens in length (Gemini-Team, 2024; Anthropic, 2024; Meta-AI, 2024), unlocking applications such as repository-level code analysis, multi-document summarization, and large corpus retrieval. However, processing such long sequences with LLMs requires substantial computational and memory resources due to the quadratic complexity of the self-attention mechanism. To address these challenges, various techniques have been proposed to reduce memory usage and increase inference speed. For example, Flash Attention introduces an efficient GPU block-wise implementation of global attention, achieving significant reductions in memory overhead and runtime (Dao et al., 2022; Dao, 2024).
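To make the two-phase structure concrete, the following is a minimal single-process sketch in NumPy. It is not the paper's implementation: the function names (`local_context_phase`, `global_query_phase`) are illustrative, K/V projections are replaced by the identity, and causal masking, anchor blocks, multi-head attention, and the distributed aggregation used in the actual system are all omitted. It only shows the shape of the idea, namely that context blocks attend locally and independently (phase 1), while query tokens attend over every cached context token (phase 2).

```python
# A toy sketch of the two-phase idea, under the simplifying assumptions above.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def local_context_phase(context_blocks):
    """Phase 1: each 'host' attends only within its own context block and
    produces a local KV cache. Blocks are independent, so in a real
    deployment this loop would run in parallel across hosts."""
    kv_caches = []
    for block in context_blocks:           # block: (block_len, d)
        k, v = block, block                # toy identity K/V projection
        _ = attention(block, k, v)         # blockwise-local attention
        kv_caches.append((k, v))
    return kv_caches

def global_query_phase(query, kv_caches):
    """Phase 2: query tokens attend to all previously cached context
    tokens (sequence-global attention)."""
    k = np.concatenate([k for k, _ in kv_caches], axis=0)
    v = np.concatenate([v for _, v in kv_caches], axis=0)
    return attention(query, k, v)

d = 64
context_blocks = [np.random.randn(128, d) for _ in range(4)]  # e.g. 4 hosts, 128 tokens each
query = np.random.randn(16, d)                                # query tokens

caches = local_context_phase(context_blocks)
out = global_query_phase(query, caches)
print(out.shape)  # (16, 64)
```

Note that physically concatenating the caches is purely a single-process convenience for this sketch; in the distributed setting described in the abstract, the KV caches stay sharded across hosts and phase 2 combines the per-host attention results with minimal communication.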