gqa
- Asia > Thailand > Bangkok > Bangkok (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (7 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
- Information Technology (0.67)
- Education (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
The Transformer architecture, underpinned by the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the quadratic computational complexity of MHA with respect to sequence length presents a significant barrier to scaling, particularly for applications involving long contexts. Prevailing solutions, such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), have effectively addressed the memory-bandwidth bottleneck that dominates autoregressive inference latency by sharing Key and Value projections. While highly successful, these methods do not reduce the fundamental number of floating-point operations (FLOPs) required for the attention score computation, which remains a critical bottleneck for training and full-sequence processing. This paper introduces Sparse Query Attention (SQA), a novel attention architecture that pursues an alternative and complementary optimization path. Instead of reducing Key/Value heads, SQA reduces the number of Query heads. This architectural modification directly decreases the computational complexity of the attention mechanism by a factor proportional to the reduction in query heads, thereby lowering the overall FLOPs. This work presents the theoretical foundation of SQA, its mathematical formulation, and a family of architectural variants. Empirical benchmarks on long sequences (32k-200k tokens) demonstrate that SQA can achieve throughput improvements of up to 3x in computation-bound scenarios such as model pre-training, fine-tuning, and encoder-based tasks, with only a minimal impact on model quality in preliminary small-scale experiments. SQA was discovered serendipitously during the development of the upcoming Reactive Transformer architecture, suggesting its potential as a powerful tool for building more efficient and scalable models.
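As a rough illustration of the idea, here is a minimal PyTorch sketch of an attention layer with a reduced query-head count; it is not the authors' implementation, and the head counts, the symmetric query/KV variant, and the module names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseQueryAttention(nn.Module):
    """Minimal sketch: reduce the number of query heads (and, in this
    symmetric variant, the matching KV heads) so the O(L^2) score
    computation shrinks proportionally to h_q / n_heads."""
    def __init__(self, d_model=512, n_heads=8, n_query_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.h_q = n_query_heads          # reduced query-head count
        self.d_head = d_model // n_heads  # per-head width kept from MHA
        # Projections emit only h_q heads instead of n_heads.
        self.q_proj = nn.Linear(d_model, self.h_q * self.d_head)
        self.k_proj = nn.Linear(d_model, self.h_q * self.d_head)
        self.v_proj = nn.Linear(d_model, self.h_q * self.d_head)
        self.o_proj = nn.Linear(self.h_q * self.d_head, d_model)

    def forward(self, x):
        B, L, _ = x.shape
        q = self.q_proj(x).view(B, L, self.h_q, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, L, self.h_q, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, L, self.h_q, self.d_head).transpose(1, 2)
        # Score computation now costs h_q / n_heads of the MHA FLOPs.
        out = F.scaled_dot_product_attention(q, k, v)
        return self.o_proj(out.transpose(1, 2).reshape(B, L, -1))
```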
We thank all the reviewers for their insightful questions, comments, and commendations (novelty, clarity, performance).
Common Clarifications: (CC1) Evaluation with other datasets (VQA-CP, GQA) @R1, R2, R4: The main focus and
Task 1 motivation, complexity @R3, R4: Motivation behind Task 1 is efficacy, not complexity [L46-49]. GVQA (from VQA-CP) builds on stacked attention networks (SAN). Thus, they are orthogonal to MGN in problem setting and architecture. We show results using MAC (from the GQA authors, L291) on both CLEVR and GQA.
Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization
Song, Guanghui, Liao, Dongping, Zhao, Yiren, Ye, Kejiang, Xu, Cheng-zhong, Gao, Xitong
Transformer models face scalability challenges in causal language modeling (CLM) due to inefficient memory allocation for growing key-value (KV) caches, which strains compute and storage resources. Existing methods like Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation, often discarding "low-priority" tokens or statically grouping them, failing to address the dynamic spectrum of token importance. We propose mixSGA, a novel mixture-of-experts (MoE) approach that dynamically optimizes token-wise computation and memory allocation. Unlike prior approaches, mixSGA retains all tokens while adaptively routing them to specialized experts with varying KV group sizes, balancing granularity and efficiency. Our key novelties include: (1) a token-wise expert-choice routing mechanism guided by learned importance scores, enabling proportional resource allocation without token discard; (2) weight-sharing across grouped attention projections to minimize parameter overhead; and (3) an auxiliary loss to ensure one-hot routing decisions for training-inference consistency in CLMs. Extensive evaluations across Llama3, TinyLlama, OPT, and Gemma2 model families show mixSGA's superiority over static baselines. On instruction-following and continued pretraining tasks, mixSGA achieves higher ROUGE-L and lower perplexity under the same KV budgets.
- Asia > China > Guangdong Province > Shenzhen (0.05)
- Asia > Macao (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
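To make the mixSGA routing idea above concrete, the following is a hypothetical PyTorch sketch of token-wise expert assignment with a learned importance scorer, a straight-through one-hot assignment, and an entropy-style auxiliary loss; the class name, loss form, and dimensions are assumptions, not the paper's actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenKVRouter(nn.Module):
    """Sketch: route each token to one attention expert (each expert
    would use a different KV group size). Names are illustrative."""
    def __init__(self, d_model=512, n_experts=3):
        super().__init__()
        self.scorer = nn.Linear(d_model, n_experts)  # learned importance

    def forward(self, x):
        logits = self.scorer(x)                      # (B, L, n_experts)
        probs = F.softmax(logits, dim=-1)
        # Hard one-hot assignment for train/inference consistency;
        # the straight-through trick keeps gradients flowing.
        hard = F.one_hot(probs.argmax(-1), probs.size(-1)).float()
        routed = hard + probs - probs.detach()
        # Auxiliary loss: minimizing the routing entropy pushes the
        # soft distribution toward a one-hot decision.
        aux_loss = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        return routed, aux_loss

router = TokenKVRouter()
x = torch.randn(2, 16, 512)
assign, aux = router(x)  # each token picks one KV-group-size expert
```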
Cost-Optimal Grouped-Query Attention for Long-Context LLMs
Chen, Yingfa, Wu, Yutong, Han, Xu, Liu, Zhiyuan, Sun, Maosong
Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, requiring maximizing model language capabilities and minimizing training and deployment costs. Existing efforts have primarily characterized the complex relationships among model performance, parameter size, and data size, and have searched for the optimal compute allocation for training LLMs. However, they overlook the impacts of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. Then, we extend the existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs during both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. Our findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.
- North America > United States (0.46)
- Asia > China (0.14)
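The trade-off this paper studies can be illustrated with a back-of-the-envelope cost model: attention FLOPs scale with the number of query heads and the square of sequence length, while the KV cache scales with the number of key-value heads and sequence length. The sketch below uses assumed constants and is not the paper's scaling formulation:

```python
def attention_costs(seq_len, d_head, n_q_heads, n_kv_heads,
                    bytes_per_elem=2):
    """Rough per-layer, per-sequence cost model (illustrative only)."""
    # QK^T and attention-weighted V sum: ~2 matmuls per query head.
    flops = 4 * n_q_heads * seq_len ** 2 * d_head
    # Cached K and V tensors, one pair per key-value head.
    kv_cache_bytes = 2 * n_kv_heads * seq_len * d_head * bytes_per_elem
    return flops, kv_cache_bytes

# Fewer KV heads shrink memory; fewer query heads shrink compute.
for n_kv in (32, 8, 1):
    f, m = attention_costs(seq_len=128_000, d_head=128,
                           n_q_heads=32, n_kv_heads=n_kv)
    print(f"n_kv={n_kv:2d}  FLOPs={f:.2e}  KV cache={m / 2**30:.2f} GiB")
```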
TransMLA: Multi-Head Latent Attention Is All You Need
Meng, Fanxu, Yao, Zengwei, Zhang, Muhan
Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in DeepSeek V2/V3/R1, many major model providers still rely on Grouped-Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of DeepSeek R1.
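The direction "GQA is representable as MLA" can be checked numerically: GQA's repetition of cached KV heads is exactly a fixed, block-structured up-projection applied to the same cached latent. The following toy sketch (dimensions and variable names are illustrative, not TransMLA's code) demonstrates the equivalence for the key path:

```python
import torch

# Toy dimensions: 2 KV heads, each shared by 4 query heads.
n_kv, group, d_head, L = 2, 4, 8, 16
k_latent = torch.randn(L, n_kv * d_head)  # what GQA caches per layer

# GQA view: repeat each KV head for the query heads it serves.
k_gqa = k_latent.view(L, n_kv, d_head).repeat_interleave(group, dim=1)

# MLA view: the same expansion written as a fixed up-projection W_up
# applied to the identical cached latent.
eye = torch.eye(d_head)
W_up = torch.zeros(n_kv * d_head, n_kv * group * d_head)
for i in range(n_kv):
    for j in range(group):
        r, c = i * d_head, (i * group + j) * d_head
        W_up[r:r + d_head, c:c + d_head] = eye
k_mla = (k_latent @ W_up).view(L, n_kv * group, d_head)

assert torch.allclose(k_gqa, k_mla)  # same cache, same expanded keys
```

Because MLA can then replace this block-identity W_up with a learned dense matrix, the converted model gains expressiveness without growing the cache, which is the asymmetry the abstract describes.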
Reviews: Learning by Abstraction: The Neural State Machine
As far as I can tell, the model is relatively simple and mostly operates over and recomputes probability distributions of discrete elements in the image and tokens in the sentence. It is not a surprising next step in this area, but the approach is a good step in that direction. One concern is the assumptions placed on the image content space by using a dataset like Visual Genome/GQA. Visual Genome uses a fixed ontology of properties and possible property values and (as the paper states in L129) ignores fine-grained statistics of the image (e.g., information about the background, like what color the sky is). Requiring this fixed ontology may work for a dataset like GQA, which is generated from such an ontology, but may be harder to extend to other, more realistic datasets where topics are not limited to objects included in the gold scene graph.
Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets
Eiter, Thomas, Hadl, Jan, Higuera, Nelson, Oetsch, Johannes
Visual Question Answering (VQA) is the task of answering a question about an image and requires processing multimodal input and reasoning to obtain the answer. Modular solutions that use declarative representations within the reasoning component have a clear advantage over end-to-end trained systems regarding interpretability. The downside is that crafting the rules for such a component can be an additional burden on the developer. We address this challenge by presenting an approach for declarative knowledge distillation from Large Language Models (LLMs). Our method is to prompt an LLM to extend an initial theory on VQA reasoning, given as an answer-set program, to meet the requirements of the VQA task. Examples from the VQA dataset are used to guide the LLM, validate the results, and mend incorrect rules using feedback from the ASP solver. We demonstrate that our approach works on the prominent CLEVR and GQA datasets. Our results confirm that distilling knowledge from LLMs is in fact a promising direction alongside data-driven rule-learning approaches.
- Europe > Austria > Vienna (0.14)
- Europe > Sweden > Jönköping County > Jönköping (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
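A possible shape for the distillation loop described above, written as plain Python with the LLM call and the ASP solver injected as callables; all names here are placeholders, not the paper's API:

```python
from typing import Callable, Iterable, Tuple

def distill_rules(base_theory: str,
                  examples: Iterable[Tuple[str, str, str]],
                  solve: Callable[[str], str],
                  ask_llm: Callable[[str], str],
                  max_rounds: int = 5) -> str:
    """Iteratively extend an ASP theory: solve each (facts, question,
    gold) example, collect failures, and prompt the LLM to mend the
    theory with that feedback."""
    theory = base_theory
    for _ in range(max_rounds):
        failures = [(q, got, gold)
                    for facts, q, gold in examples
                    if (got := solve(f"{theory}\n{facts}\n{q}")) != gold]
        if not failures:
            return theory  # theory answers every guiding example
        feedback = "\n".join(f"Q: {q} | got: {got} | expected: {gold}"
                             for q, got, gold in failures)
        theory = ask_llm(
            "Extend this answer-set program so the failing examples "
            f"are answered correctly:\n{theory}\nFailures:\n{feedback}")
    return theory
```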
Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
Khan, Zohaib, Khaquan, Muhammad, Tafveez, Omer, Samiwala, Burhanuddin, Raza, Agha Ali
The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads, reducing the number of overall parameters and memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case study, which introduces variability in (static) group formation by subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers for image classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically, ViT-L experiences accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and other variants. We further analyze the impact of the number of Key-Value Heads on performance, underscoring the importance of utilizing query-key affinities. Code is available on GitHub.
- Research Report > Promising Solution (0.34)
- Overview > Innovation (0.34)
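One way to picture key-driven grouping is the allocation step alone: compute a norm per key head in the current forward pass and split the query-head budget proportionally. The sketch below is an illustrative reconstruction under that reading, not the authors' released code:

```python
import torch

def key_norm_query_allocation(k, n_query_heads=8):
    """Allocate query heads to KV heads in proportion to the Frobenius
    norms of each key head (illustrative reconstruction)."""
    # k: (batch, n_kv_heads, seq_len, d_head)
    norms = k.norm(dim=(-2, -1)).mean(0)   # one norm per KV head
    share = norms / norms.sum()            # normalized norm ratios
    alloc = (share * n_query_heads).round().long()
    # Repair rounding so the allocation sums to the head budget.
    while alloc.sum() > n_query_heads:
        alloc[alloc.argmax()] -= 1
    while alloc.sum() < n_query_heads:
        alloc[alloc.argmin()] += 1
    return alloc  # queries per KV head, e.g. tensor([3, 2, 2, 1])

k = torch.randn(2, 4, 64, 32)
print(key_norm_query_allocation(k))
```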