Optimised Grouped-Query Attention Mechanism for Transformers
Chen, Yuang, Zhang, Cheng, Gao, Xitong, Mullins, Robert D., Constantinides, George A., Zhao, Yiren
arXiv.org Artificial Intelligence
Grouped-query attention (GQA) has been widely adopted in large language models (LLMs) to mitigate the complexity of multi-head attention (MHA). To transform an MHA into a GQA, neighbouring query heads in the MHA are evenly split into groups, and each group shares a single key layer and a single value layer. In this work, we propose AsymGQA, an activation-informed approach that groups an MHA into a GQA asymmetrically for better model performance. AsymGQA outperforms GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B achieves a 7.5% accuracy increase on MMLU compared to neighbour grouping. Our approach addresses GQA's trade-off between model performance and hardware efficiency.
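The neighbour grouping described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names `neighbour_groups` and `merge_heads` are hypothetical, each head's key or value layer is reduced to a flat list of floats, and the merge rule (an element-wise mean over the heads in a group) is an assumption, since the abstract does not say how the shared layers are formed.

```python
from typing import List


def neighbour_groups(num_heads: int, num_groups: int) -> List[List[int]]:
    """Evenly split head indices 0..num_heads-1 into contiguous groups."""
    if num_groups <= 0 or num_heads % num_groups != 0:
        raise ValueError("num_heads must be a positive multiple of num_groups")
    size = num_heads // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def merge_heads(head_weights: List[List[float]],
                groups: List[List[int]]) -> List[List[float]]:
    """Build one shared weight vector per group.

    Assumption: the shared key/value layer is the element-wise mean of the
    per-head weights in the group. The abstract does not specify this rule.
    """
    merged = []
    for group in groups:
        dim = len(head_weights[group[0]])
        merged.append(
            [sum(head_weights[h][i] for h in group) / len(group)
             for i in range(dim)]
        )
    return merged


# Toy key weights for four heads, two parameters each (made-up numbers).
key_weights = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0], [2.0, 2.0]]

even = neighbour_groups(num_heads=4, num_groups=2)
print(even)                            # [[0, 1], [2, 3]]
print(merge_heads(key_weights, even))  # [[2.0, 4.0], [1.0, 1.0]]

# Groups need not have equal size. This split is hand-picked for
# illustration only; it is not the activation-informed choice of AsymGQA.
uneven = [[0, 1, 2], [3]]
print(len(merge_heads(key_weights, uneven)))  # 2
```

Both groupings produce the same number of shared key vectors, so the model size budget is unchanged; only the assignment of heads to groups differs. How AsymGQA selects that assignment from activations is described in the paper, not in this sketch.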
Jun-21-2024