SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber

arXiv.org Artificial Intelligence 

The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. Here we present SwitchHead, a novel method that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of baseline Transformers with the same parameter budget. SwitchHead uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention can also be combined with MoE MLP layers, resulting in an efficient fully-MoE "SwitchAll" Transformer model.

Large language models (LLMs) have shown remarkable capabilities (Radford et al., 2019; Brown et al., 2020; OpenAI, 2022; 2023) and great versatility (Bubeck et al., 2023). However, training enormous Transformers (Vaswani et al., 2017; Schmidhuber, 1992) requires considerable computing power and memory, which is not accessible to most researchers, academic institutions, and even companies. Even running them in inference mode, which is much less resource-intensive, requires significant engineering effort (Gerganov, 2023). Accelerating large Transformers remains an important open research question.

However, in these works, the parameter efficiency of MoEs has not been studied; MoE models have typically been compared to dense baselines with the same number of FLOPs but far fewer parameters.
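To make the architectural idea concrete, below is a minimal sketch of an attention layer whose value and output projections are mixtures of experts, in the spirit of the description above. It assumes PyTorch; the class and parameter names (SwitchHeadAttention, n_experts, top_k) are illustrative assumptions, not the authors' released code, and expert selection details (e.g., the exact gating function) may differ from the paper.

```python
# Minimal sketch: attention with MoE value/output projections (assumed PyTorch).
# Names and hyperparameters are hypothetical, chosen only for illustration.
import math
import torch
import torch.nn as nn


class SwitchHeadAttention(nn.Module):
    """Few attention heads; value and output projections are expert mixtures."""

    def __init__(self, d_model: int, n_heads: int, d_head: int,
                 n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.n_heads, self.d_head, self.top_k = n_heads, d_head, top_k
        # Dense query/key projections, one per head, as in standard attention.
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Expert banks for value and output projections: (heads, experts, in, out).
        self.v_experts = nn.Parameter(
            torch.randn(n_heads, n_experts, d_model, d_head) * d_model ** -0.5)
        self.o_experts = nn.Parameter(
            torch.randn(n_heads, n_experts, d_head, d_model) * d_head ** -0.5)
        # Routers producing per-token, per-head expert scores.
        self.v_router = nn.Linear(d_model, n_heads * n_experts, bias=False)
        self.o_router = nn.Linear(d_model, n_heads * n_experts, bias=False)

    def _gate(self, router, x):
        # Per-token, per-head gates; keep only the top-k experts per head.
        B, T, _ = x.shape
        scores = torch.sigmoid(router(x)).view(B, T, self.n_heads, -1)
        topv, topi = scores.topk(self.top_k, dim=-1)
        return torch.zeros_like(scores).scatter_(-1, topi, topv)

    def forward(self, x, mask=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Values as an expert mixture. For clarity this evaluates every expert
        # and masks; an efficient kernel would compute only selected experts.
        v_gate = self._gate(self.v_router, x)                        # (B, T, H, E)
        v_all = torch.einsum('btd,hedf->bthef', x, self.v_experts)   # (B, T, H, E, d_head)
        v = torch.einsum('bthef,bthe->bthf', v_all, v_gate).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        if mask is not None:
            att = att.masked_fill(mask, float('-inf'))
        att = att.softmax(dim=-1)
        ctx = (att @ v).transpose(1, 2)                               # (B, T, H, d_head)
        # Output projection is also an expert mixture, summed over heads.
        o_gate = self._gate(self.o_router, x)                         # (B, T, H, E)
        return torch.einsum('bthf,hefd,bthe->btd', ctx, self.o_experts, o_gate)
```

The sketch keeps the query/key path dense and few-headed while routing only the value and output projections through experts, which is the source of the compute and memory savings claimed above; a production implementation would dispatch tokens to selected experts rather than evaluating all of them.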