Dynamic Sparse Attention on Mobile SoCs
Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu
arXiv.org Artificial Intelligence
Running Large Language Models (LLMs) on-device is a critical enabler of user privacy. We observe that in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of its sensitivity to quantization. This fallback degrades the user experience and complicates system scheduling. To this end, this paper presents shadowAttn, a system-algorithm co-designed sparse attention module that minimizes reliance on the CPU/GPU by computing attention over only a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens behind an NPU-based pilot compute. Further, shadowAttn introduces techniques such as NPU compute-graph bucketing, a head-wise NPU-CPU/GPU pipeline, and per-head fine-grained sparsity ratios to achieve high accuracy and efficiency. shadowAttn delivers the best performance under highly limited CPU/GPU resources, and it requires far fewer CPU/GPU resources to match the performance of state-of-the-art frameworks.
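The core idea of computing attention on only the most important tokens, identified by a cheap pilot pass, can be illustrated with a minimal sketch. This is not the paper's implementation: the int8 scoring below merely stands in for the NPU-based pilot compute, and `keep_ratio` stands in for the per-head sparsity ratio; all names are illustrative.

```python
import numpy as np

def sparse_attention(q, K, V, keep_ratio=0.1):
    """Illustrative dynamic sparse attention for a single query vector.

    A coarse int8 'pilot' pass scores all cached tokens (a stand-in for
    the paper's NPU pilot compute); exact attention then runs over only
    the top-scoring fraction of tokens.
    """
    d = q.shape[-1]
    # Pilot pass: low-precision dot products approximating q . K^T.
    # (Assumes q and K are nonzero; scales map values into int8 range.)
    q8 = np.round(q / (np.abs(q).max() / 127.0)).astype(np.int32)
    K8 = np.round(K / (np.abs(K).max() / 127.0)).astype(np.int32)
    pilot_scores = K8 @ q8          # cheap per-token importance estimate
    # Keep only the most important tokens (per-head ratio in the paper).
    k = max(1, int(len(K) * keep_ratio))
    idx = np.argsort(pilot_scores)[-k:]
    # Exact scaled-dot-product attention on the selected subset.
    s = (K[idx] @ q) / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[idx]
```

With `keep_ratio=1.0` the function reduces to dense attention, so the sparsity ratio directly trades accuracy against the amount of attention compute, which is what allows most of the operator to stay on the NPU.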
Aug-26-2025