Towards Better Multi-head Attention via Channel-wise Sample Permutation

Shen Yuan, Hongteng Xu

arXiv.org Artificial Intelligence 

Transformer [48] has been widely adopted in the deep learning domain. Recent large language models like the GPT [4, 36] and LLaMA [45, 46] series are built on the Transformer and its variants, demonstrating remarkable abilities in natural language processing. In the field of computer vision, Vision Transformers (ViTs) [14], such as EfficientViT [5, 26] and SHViT [53], exhibit exceptional performance and consistently push their limits. In addition, Transformer-based models have been designed for complex structured data in various applications, including the Informer [57] for time series forecasting, the Transformer Hawkes process [58] for continuous-time event sequence prediction, the Graphormer [51] for molecular representation, the Mesh Transformer [24] for 3D mesh representation, and the Set-Transformer [22] and Point-Transformer [56] for point cloud modeling. Although some new alternatives like Mamba [15] and RWKV [33] have been proposed and have shown competitiveness in some aspects, Transformer still holds a dominant position in the development of deep learning models because of its strong performance and outstanding universality. The effectiveness of Transformer is mainly attributed to its multi-head attention (MHA) mechanism [48]. However, MHA's quadratic complexity with respect to sequence length leads to a heavy, even prohibitive, computational burden.

Hongteng Xu is the corresponding author of this work.
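For reference, the sketch below is a minimal PyTorch implementation of the standard scaled dot-product MHA described in [48] (not the channel-wise sample permutation method proposed in this paper). It makes explicit the seq_len-by-seq_len attention matrix that causes the quadratic complexity; function names, tensor shapes, and the random-weight usage example are illustrative assumptions, not code from the paper.

```python
# A minimal sketch of standard multi-head attention (MHA) [48].
# Shapes and names are illustrative assumptions.
import torch
import torch.nn.functional as F


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (batch, seq_len, d_model); w_*: (d_model, d_model) projection matrices."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Linear projections, then split the channel dimension into heads.
    def split_heads(t):
        return t.view(batch, seq_len, num_heads, d_head).transpose(1, 2)

    q = split_heads(x @ w_q)  # (batch, heads, seq_len, d_head)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # The (seq_len x seq_len) attention matrix is the source of the
    # quadratic time and memory complexity in sequence length.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5  # (batch, heads, L, L)
    attn = F.softmax(scores, dim=-1)
    out = attn @ v                                    # (batch, heads, L, d_head)

    # Merge heads and apply the output projection.
    out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
    return out @ w_o


# Usage example with random inputs and weights.
if __name__ == "__main__":
    L, d_model, heads = 128, 64, 8
    x = torch.randn(2, L, d_model)
    w = [torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4)]
    y = multi_head_attention(x, *w, num_heads=heads)
    print(y.shape)  # torch.Size([2, 128, 64])
```

Because `scores` has shape (batch, heads, L, L), both time and memory grow as O(L²) in the sequence length L, which motivates the search for more efficient attention variants.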