Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
Bu, Tao; Wang, Qiangang; Zeng, Bowen; Sun, Hanwen; Huang, Yunpeng; Cao, Chun; Xu, Jingwei
arXiv.org Artificial Intelligence
Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length, posing a major bottleneck for long-context training. Prior work tackles this challenge along two directions: (1) kernel-level optimizations, which accelerate dense and sparse attention operators; and (2) module-level strategies, often referred to as distributed attention or context parallel training, which scale attention across multiple devices. However, systematic evaluation remains limited: operator-level comparisons are often incomplete, while context parallel strategies are typically framework-specific, with unclear performance analysis across contexts. To address these gaps, we propose a unified benchmark that integrates representative attention kernels and context parallel mechanisms with a modular and extensible interface for evaluation. The benchmark evaluates methods along two critical dimensions: (1) attention mask patterns, which strongly affect efficiency, scalability, and usability, and (2) sequence length and distributed scale, which determine performance under extreme long-context training. Through comprehensive experiments on a cluster of up to 96 GPUs, our benchmark enables reproducible comparisons, highlights method-specific trade-offs, and provides practical guidance for designing and deploying attention mechanisms in long-context LLM training.
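To make the quadratic cost and the role of mask patterns concrete, below is a minimal, illustrative sketch of naive causal attention in PyTorch. It is not the paper's benchmark code; the function name, shapes, and causal-mask choice are assumptions for exposition only.

```python
# Illustrative sketch only (not the paper's benchmark code): naive scaled
# dot-product attention with a causal mask, showing the L x L score matrix
# responsible for the quadratic compute/memory cost in sequence length L.
import torch

def naive_causal_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    L, d = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d**0.5           # (batch, heads, L, L): O(L^2)
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))   # mask pattern applied to the full score matrix
    return torch.softmax(scores, dim=-1) @ v              # (batch, heads, L, head_dim)

# Example: a 4096-token sequence materializes a 4096 x 4096 score matrix per head.
q = k = v = torch.randn(1, 8, 4096, 64)
out = naive_causal_attention(q, k, v)
```

Roughly speaking, kernel-level methods tile and fuse this computation to avoid materializing the full score matrix, while context parallel strategies shard the sequence dimension across devices; the benchmark compares both families under different mask patterns and sequence lengths.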
Oct-22-2025