Core Context Aware Attention for Long Context Language Modeling
Yaofo Chen, Zeng You, Shuhai Zhang, Haokun Li, Yirui Li, Yaowei Wang, Mingkui Tan
–arXiv.org Artificial Intelligence
Transformer-based Large Language Models (LLMs) have exhibited remarkable success in various natural language processing tasks, primarily attributed to the self-attention mechanism, which requires each token to consider all preceding tokens as its context when computing attention scores. However, when the context length L becomes very large (e.g., 32K), more redundant context information will be included w.r.t. L, leading to two limitations: 1) the computational and memory complexity of self-attention scales quadratically w.r.t. L; 2) the redundant context information may hinder the model from capturing dependencies among crucial tokens, which may degrade the representation performance. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling, which consists of two components: 1) globality-pooling attention, which divides input tokens into groups and then dynamically merges the tokens within each group into one core token based on their significance; 2) locality-preserved attention, which incorporates neighboring tokens into the attention computation. The two complementary attentions are then fused into the final attention, maintaining comprehensive modeling ability comparable to full self-attention. In this way, the core context information w.r.t. a given token is automatically focused and strengthened, while the context information in redundant groups is diminished during learning. As a result, the computational and memory complexity is significantly reduced. More importantly, CCA-Attention can improve long-context modeling ability by diminishing redundant context information. Extensive experimental results demonstrate that our CCA-Attention significantly outperforms state-of-the-art models in terms of both computational efficiency and long-context modeling ability.

Large language models (LLMs) (Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023a) have demonstrated exceptional proficiency across various applications by effectively modeling extended contexts, particularly in tasks involving natural language understanding and generation (Ouyang et al., 2022; Chang et al., 2024). The remarkable success of LLMs is predominantly credited to the self-attention mechanism (Vaswani et al., 2017), which requires each token in the input sequence to compute attention with all preceding tokens. Nonetheless, the computational and memory requirements of self-attention grow quadratically with sequence length, posing challenges for long-context understanding tasks (Liu et al., 2024; Shaham et al., 2023). Recent studies (Beltagy et al., 2020; Zaheer et al., 2020; Xiao et al., 2024b) have demonstrated that the majority of layers within autoregressive LLMs exhibit redundant tokens across various attention heads and input tokens. This redundancy is visually exemplified in Figure 1, where we present a detailed visualization of the attention weights within LLaMA2 (Touvron et al., 2023a).
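To make the two-component design concrete, the sketch below shows one possible single-head realization in PyTorch: keys and values are pooled into per-group core tokens using significance weights (here approximated with each group's mean query, an assumption made for illustration), each query attends to the core tokens of preceding groups plus a causal local window of raw tokens, and the two score sets are fused through a shared softmax. The group_size and local_window parameters, the significance heuristic, and the fusion scheme are illustrative choices rather than the paper's exact formulation, and an efficient implementation would avoid materializing the full L-by-L local score matrix.

```python
import torch

def cca_attention_sketch(q, k, v, group_size=4, local_window=8):
    """Conceptual single-head sketch of CCA-style attention.

    q, k, v: (batch, seq_len, dim) tensors; seq_len must be divisible by
    group_size (pad beforehand otherwise). Parameter names and the
    significance heuristic are illustrative assumptions, not the paper's API.
    """
    B, L, D = q.shape
    assert L % group_size == 0, "pad the sequence so L is divisible by group_size"
    G = L // group_size
    scale = D ** -0.5
    idx = torch.arange(L, device=q.device)

    # --- Globality-pooling attention: merge each group into one core token ---
    kg = k.view(B, G, group_size, D)
    vg = v.view(B, G, group_size, D)
    # Significance proxy: each group's mean query scores its own keys (assumption).
    qg = q.view(B, G, group_size, D).mean(dim=2, keepdim=True)
    sig = torch.softmax((qg * kg).sum(-1) * scale, dim=-1)            # (B, G, group_size)
    core_k = (sig.unsqueeze(-1) * kg).sum(dim=2)                      # (B, G, D)
    core_v = (sig.unsqueeze(-1) * vg).sum(dim=2)                      # (B, G, D)
    global_scores = (q @ core_k.transpose(1, 2)) * scale              # (B, L, G)
    # Causality: a query only sees core tokens of groups that fully precede it.
    group_end = (torch.arange(G, device=q.device) + 1) * group_size - 1
    global_scores = global_scores.masked_fill(
        idx[:, None] < group_end[None, :], float("-inf"))

    # --- Locality-preserved attention: causal window of neighboring raw tokens ---
    local_scores = (q @ k.transpose(1, 2)) * scale                    # (B, L, L)
    local_mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < local_window)
    local_scores = local_scores.masked_fill(~local_mask, float("-inf"))

    # --- Fuse both attentions with a shared softmax over core + local tokens ---
    scores = torch.cat([global_scores, local_scores], dim=-1)         # (B, L, G + L)
    probs = torch.softmax(scores, dim=-1)
    values = torch.cat([core_v, v], dim=1)                            # (B, G + L, D)
    return probs @ values                                             # (B, L, D)


# Toy usage: 16 tokens, 32-dim head
x = torch.randn(1, 16, 32)
out = cca_attention_sketch(x, x, x, group_size=4, local_window=8)
print(out.shape)  # torch.Size([1, 16, 32])
```

In this sketch, each query scores G core tokens plus at most local_window raw tokens instead of all L preceding tokens, which is the source of the reduced computational and memory cost described in the abstract.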
Dec-16-2024