A Statistical Theory of Gated Attention through the Lens of Hierarchical Mixture of Experts
Viet Nguyen, Tuan Minh Pham, Thinh Cao, Tan Dinh, Huy Nguyen, Nhat Ho, Alessandro Rinaldo
Self-attention has greatly contributed to the success of the widely used Transformer architecture by enabling learning from data with long-range dependencies. In an effort to improve performance, a gated attention model that leverages a gating mechanism within multi-head self-attention has recently been proposed as a promising alternative. Gated attention has been empirically shown to increase the expressiveness of the low-rank mappings in standard attention and even to eliminate the attention-sink phenomenon. Despite its efficacy, a clear theoretical understanding of the benefits of gated attention is still lacking. To close this gap, we rigorously show that each entry of a gated attention matrix or a multi-head self-attention matrix can be written as a hierarchical mixture of experts. By recasting learning as an expert estimation problem, we demonstrate that gated attention is more sample-efficient than multi-head self-attention: the former needs only polynomially many data points to estimate an expert, whereas the latter requires exponentially many to achieve the same estimation error. Furthermore, our analysis provides a theoretical justification for why gated attention yields higher performance when the gate is placed at the output of the scaled dot-product attention or of the value map, rather than at other positions in the multi-head self-attention architecture.
Feb-3-2026
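
The mixture-of-experts reading rests on a standard observation: for a single query q_i, softmax attention outputs sum_j softmax(<q_i, k_j>/sqrt(d)) v_j, so the softmax weights act as gating weights over the value vectors v_j as "experts", and combining heads adds a second level of the hierarchy. The sketch below illustrates the gate placement the abstract singles out, applying the gate to the output of the scaled dot-product attention. It is a minimal sketch under assumptions: the sigmoid, elementwise, input-dependent gate and all names here are illustrative choices, not the paper's exact formulation.

# A minimal sketch of gated multi-head self-attention, assuming the gate is a
# per-element sigmoid computed from the input tokens and applied to the scaled
# dot-product attention (SDPA) output. Illustrative only; the paper's exact
# gating form is not specified in this abstract.
import math
import torch
import torch.nn as nn


class GatedMultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Gate scores come from the input tokens; sigmoid keeps them in (0, 1).
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split into heads: (batch, n_heads, seq_len, d_head).
        def split(z):
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        heads = attn @ v  # SDPA output per head

        # Gate placed at the SDPA output: elementwise sigmoid modulation.
        g = torch.sigmoid(self.gate(x))
        heads = split(g) * heads

        heads = heads.transpose(1, 2).reshape(b, t, d)
        return self.out(heads)


# Usage: a (batch, seq_len, d_model) tensor in, same shape out.
x = torch.randn(2, 16, 64)
print(GatedMultiHeadSelfAttention(d_model=64, n_heads=4)(x).shape)  # torch.Size([2, 16, 64])

Moving the line "heads = split(g) * heads" before the QKV projection (gating the values) or after the output projection would realize the other gate placements the abstract compares.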