Towards Better Multi-head Attention via Channel-wise Sample Permutation

Shen Yuan, Hongteng Xu

arXiv.org Artificial Intelligence 

Transformer [48] has been widely adopted in the deep learning domain. Recent large language models like the GPT [4, 36] and LLaMA [45, 46] series are built on the Transformer and its variants, demonstrating remarkable abilities in natural language processing. In the field of computer vision, Vision Transformers (ViTs) [14], such as EfficientViT [5, 26] and SHViT [53], exhibit exceptional performance and consistently push their limits. In addition, Transformer-based models have been designed for complex structured data in various applications, including the Informer [57] for time series forecasting, the Transformer Hawkes process [58] for continuous-time event sequence prediction, the Graphormer [51] for molecular representation, the Mesh Transformer [24] for 3D mesh representation, and the Set-Transformer [22] and Point-Transformer [56] for point cloud modeling. Although some new alternatives like Mamba [15] and RWKV [33] have been proposed and have shown competitiveness in some aspects, Transformer still holds a dominant position in the development of deep learning models because of its strong performance and outstanding universality. The effectiveness of Transformer is mainly attributed to its multi-head attention (MHA) mechanism [48]. However, MHA's quadratic complexity with respect to sequence length leads to a heavy, even prohibitive, computational burden.

Hongteng Xu is the corresponding author of this work.
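For reference, the sketch below is a minimal PyTorch implementation of the standard scaled dot-product MHA described in [48] (not the channel-wise sample permutation method proposed in this paper). It makes explicit the seq_len-by-seq_len attention matrix that causes the quadratic complexity; function names, tensor shapes, and the random-weight usage example are illustrative assumptions, not code from the paper.

```python
# A minimal sketch of standard multi-head attention (MHA) [48].
# Shapes and names are illustrative assumptions.
import torch
import torch.nn.functional as F


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (batch, seq_len, d_model); w_*: (d_model, d_model) projection matrices."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Linear projections, then split the channel dimension into heads.
    def split_heads(t):
        return t.view(batch, seq_len, num_heads, d_head).transpose(1, 2)

    q = split_heads(x @ w_q)  # (batch, heads, seq_len, d_head)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # The (seq_len x seq_len) attention matrix is the source of the
    # quadratic time and memory complexity in sequence length.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5  # (batch, heads, L, L)
    attn = F.softmax(scores, dim=-1)
    out = attn @ v                                    # (batch, heads, L, d_head)

    # Merge heads and apply the output projection.
    out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
    return out @ w_o


# Usage example with random inputs and weights.
if __name__ == "__main__":
    L, d_model, heads = 128, 64, 8
    x = torch.randn(2, L, d_model)
    w = [torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4)]
    y = multi_head_attention(x, *w, num_heads=heads)
    print(y.shape)  # torch.Size([2, 128, 64])
```

Because `scores` has shape (batch, heads, L, L), both time and memory grow as O(L²) in the sequence length L, which motivates the search for more efficient attention variants.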