MoH: Multi-Head Attention as Mixture-of-Head Attention

Jin, Peng; Zhu, Bo; Yuan, Li; Yan, Shuicheng

Oct-15-2024, arXiv.org Artificial Intelligence

In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages. First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be continue-tuned into MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while using only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
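The two ideas in the abstract can be illustrated with a small sketch. The summation form rests on the identity Concat(H_1, ..., H_h) W_O = sum_i H_i W_O^i, where W_O^i is the block of rows of the output projection that belongs to head i. MoH then keeps only the routed heads and weights each one. The code below is a minimal, single-token illustration under stated assumptions, not the authors' implementation: the per-head outputs are taken as given, the router is assumed to pick the top-k heads by logit, and the gates are assumed to be a softmax over the selected logits. All function and parameter names (`multi_head_sum`, `moh_token`, `router_logits`, `k`) are hypothetical, and the routing described in the paper itself may differ in its details.

```python
import math


def matvec(vec, mat):
    """Multiply a row vector (length d_in) by a matrix (d_in x d_out)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_sum(head_outputs, w_o_blocks):
    """Standard multi-head output for one token: sum_i H_i W_O^i (every head, weight 1)."""
    out = [0.0] * len(w_o_blocks[0][0])
    for h, w in zip(head_outputs, w_o_blocks):
        out = [o + p for o, p in zip(out, matvec(h, w))]
    return out


def moh_token(head_outputs, w_o_blocks, router_logits, k):
    """MoH-style output for one token: weighted sum over the top-k routed heads only."""
    # Assumption: heads are ranked by router logit; ties go to the lower index.
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the selected logits only.
    gates = softmax([router_logits[i] for i in top])
    out = [0.0] * len(w_o_blocks[0][0])
    for g, i in zip(gates, top):
        proj = matvec(head_outputs[i], w_o_blocks[i])
        out = [o + g * p for o, p in zip(out, proj)]
    return out, top, gates
```

In this sketch, calling `moh_token` with `k=3` on four heads would correspond to activating 75% of the heads for that token; heads outside the top-k contribute nothing and, in a real implementation, would not need to be computed at all.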

large language model, machine learning, natural language

Country:
- Europe > Italy
  - Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia
  - Singapore (0.04)
  - Middle East > Jordan (0.04)
  - China > Guangdong Province
    - Shenzhen (0.04)

Genre:
- Research Report > New Finding (0.68)

Industry:
- Health & Medicine (0.67)
- Education > Educational Setting (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)