Sigmoid Self-Attention is Better than Softmax Self-Attention: A Mixture-of-Experts Perspective
Yan, Fanqi, Nguyen, Huy, Akbarian, Pedram, Ho, Nhat, Rinaldo, Alessandro
Transformer models [54] are the state-of-the-art architecture for a wide range of machine learning and deep learning applications, including language modeling [16, 3, 47, 51], computer vision [17, 4, 46, 35], and reinforcement learning [5, 31, 25]. One of the central components behind the success of Transformer models is the self-attention mechanism, which enables sequence-to-sequence models to concentrate on relevant parts of the input data. In particular, for each token in an input sequence, the self-attention mechanism computes a context vector as a weighted sum of the tokens, where tokens more relevant to the context are assigned larger weights (see Section 2.1 for a formal definition). Self-attention is therefore able to capture long-range dependencies and complex relationships within the data. However, since the weights in the context vector are normalized by the softmax function, there can be an undesirable competition among the tokens: an increase in the weight of one token forces a decrease in the weights of the others. As a consequence, the traditional softmax self-attention mechanism may focus on only a few aspects of the data and ignore other informative features [48]. Additionally, [22] observed that this inter-token dependence of the attention scores, induced by the softmax normalization, partly causes the attention sink phenomenon …
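As an illustrative sketch (not the paper's exact formulation), the snippet below contrasts softmax-normalized attention weights, which are coupled across tokens because each row must sum to one, with elementwise sigmoid weights, which are computed independently per token. The helper name `attention_weights` and the toy dimensions are assumptions made for this example.

```python
import torch

def attention_weights(q, k, use_sigmoid=False):
    """Compute attention weights for queries q and keys k (illustrative sketch).

    Softmax normalizes the scores across tokens, so raising one weight lowers
    the others; an elementwise sigmoid scores each token independently.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5   # (..., n_queries, n_keys)
    if use_sigmoid:
        return torch.sigmoid(scores)            # no competition across tokens
    return torch.softmax(scores, dim=-1)        # weights sum to 1 per query

# Toy example: one sequence of 4 tokens with 8-dimensional embeddings.
x = torch.randn(4, 8)
w_softmax = attention_weights(x, x)
w_sigmoid = attention_weights(x, x, use_sigmoid=True)
print(w_softmax.sum(dim=-1))  # each row sums to 1 (coupled weights)
print(w_sigmoid.sum(dim=-1))  # rows need not sum to 1 (independent weights)
```

In either case the context vector for a token would be the weighted sum of the (value-projected) tokens using these weights; the sigmoid variant simply removes the normalization that ties the weights together.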
arXiv.org Artificial Intelligence
Jan-31-2025