Improving Transformer with an Admixture of Attention Heads

Jan-18-2025, 14:41:57 GMT–Neural Information Processing Systems

Transformers with multi-head self-attention have achieved remarkable success in sequence modeling and beyond. However, computing a separate attention matrix at each head incurs high computational and memory costs. Recently, it has been shown that those attention matrices lie on a low-dimensional manifold and are therefore redundant. We propose Transformers with a Finite Admixture of Shared Heads (FiSHformers), a novel class of efficient and flexible transformers that allow attention matrices to be shared between attention heads. At the core of FiSHformers is a novel finite admixture model of shared heads (FiSH) that samples attention matrices from a set of global attention matrices.
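The idea of sharing attention matrices between heads can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says FiSH samples attention matrices from a set of global ones, whereas this sketch swaps in a simpler deterministic convex combination so that it stays short. All names (`attention_matrix`, `mix_heads`, `mixing_logits`) and sizes are hypothetical, and random vectors stand in for learned query and key projections.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(queries, keys):
    """Row-stochastic softmax(Q K^T / sqrt(d)) for lists of d-dimensional vectors."""
    d = len(queries[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


def mix_heads(global_attn, mixing_logits):
    """Build one attention matrix per head as a convex combination of the
    shared global matrices (an assumed stand-in for the paper's sampling)."""
    n = len(global_attn[0])
    heads = []
    for logits in mixing_logits:
        w = softmax(logits)  # mixing weights for this head, summing to 1
        heads.append([
            [sum(w[g] * global_attn[g][i][j] for g in range(len(global_attn)))
             for j in range(n)]
            for i in range(n)
        ])
    return heads


def random_vectors(rng, n, d):
    """Random stand-ins for projected token representations."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
seq_len, dim, num_global, num_heads = 4, 8, 2, 6  # illustrative sizes only

# Only num_global attention matrices are computed from queries and keys ...
global_attn = [
    attention_matrix(random_vectors(rng, seq_len, dim),
                     random_vectors(rng, seq_len, dim))
    for _ in range(num_global)
]
# ... while num_heads heads reuse them through per-head mixing weights.
mixing_logits = [[rng.gauss(0.0, 1.0) for _ in range(num_global)]
                 for _ in range(num_heads)]
head_attn = mix_heads(global_attn, mixing_logits)
```

In this sketch, six heads are served by two query-key products. Because each head's matrix is a convex combination of row-stochastic matrices, its rows still sum to one.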

admixture, attention matrix, transformer


Technology:
- Information Technology > Artificial Intelligence (0.62)