Optimised Grouped-Query Attention Mechanism for Transformers

Chen, Yuang, Zhang, Cheng, Gao, Xitong, Mullins, Robert D., Constantinides, George A., Zhao, Yiren

Jun-21-2024–arXiv.org Artificial Intelligence

Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency.

group size, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

Jun-21-2024

arXiv.org PDF

Add feedback

Country:
- Asia > China (0.28)
- Europe
  - Austria (0.28)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.14)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.35)
  - Natural Language > Large Language Model (1.00)
  - Representation & Reasoning > Search (0.93)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found