AITopics | switchhead

Collaborating Authors

switchhead

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Neural Information Processing SystemsMar-21-2026, 11:31:43 GMT

Despite many recent works on Mixture of Experts (MoEs) for resource-efficient Transformer language models, existing methods mostly focus on MoEs for feedforward layers. Previous attempts at extending MoE to the self-attention layer fail to match the performance of the parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that successfully reduces both the compute and memory requirements, achieving wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our novel MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE SwitchAll Transformers. For our 262M parameter model trained on C4, SwitchHead matches the perplexity of standard models with only 44% compute and 27% memory usage. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., achieving more than 3.5% absolute improvements on BliMP compared to the baseline with an equal compute resource.

artificial intelligence, natural language, proceedings, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.98)

Add feedback

87be61bf9338389702712f5e9754a986-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 09:57:49 GMT

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Germany > Berlin (0.04)
Europe > France (0.04)
(14 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)

Add feedback

87be61bf9338389702712f5e9754a986-Paper-Conference.pdf

Neural Information Processing SystemsOct-10-2025, 08:38:42 GMT

max 4, projection, switchhead, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Germany > Berlin (0.04)
Europe > France (0.04)
(14 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Add feedback

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Neural Information Processing SystemsMay-27-2025, 07:43:29 GMT

mixture-of-expert attention, switchhead, transformer, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.83)

Add feedback

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Csordás, Róbert, Piękos, Piotr, Irie, Kazuki, Schmidhuber, Jürgen

arXiv.org Artificial IntelligenceDec-14-2023

The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. Here we present SwitchHead--a novel method that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of baseline Transformers with the same parameter budget. Switch-Head uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention can also be combined with MoE MLP layers, resulting in an efficient fully-MoE "SwitchAll" Transformer model. Large language models (LLMs) have shown remarkable capabilities (Radford et al., 2019; Brown et al., 2020; OpenAI, 2022; 2023) and great versatility (Bubeck et al., 2023). However, training enormous Transformers (Vaswani et al., 2017; Schmidhuber, 1992) requires a considerable amount of computing power and memory, which is not accessible to most researchers, academic institutions, and even companies. Even running them in inference mode, which is much less resource-intensive, requires significant engineering effort (Gerganov, 2023). Accelerating big Transformers remains an important open research question. However, in these works, the parameter efficiency of MoEs has not been studied; MoE models have been typically compared to dense baselines with the same number of FLOPs but with much less parameters.

attention matrix, max 4, switchhead, (14 more...)

arXiv.org Artificial Intelligence

2312.07987

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Jordan (0.04)
(11 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.86)

Add feedback