Rethinking Softmax: Self-Attention with Polynomial Activations
Hemanth Saratchandran, Jianqiao Zheng, Yiping Ji, Wenbo Zhang, Simon Lucey
This paper challenges the conventional belief that softmax attention in transformers is effective primarily because it generates a probability distribution for attention allocation. Instead, we theoretically show that its success lies in its ability to implicitly regularize the Frobenius norm of the attention matrix during training. We then explore alternative activations that regularize the Frobenius norm of the attention matrix, demonstrating that certain polynomial activations can achieve this effect, making them suitable for attention-based architectures. Empirical results indicate these activations perform comparably to or better than softmax across various computer vision and language tasks, suggesting new possibilities for attention mechanisms beyond softmax.

A key component of the transformer architecture is the softmax attention block, which enables transformers to evaluate the importance of individual input elements during output generation. This feature provides an efficient way to attend to diverse input elements throughout training, allowing transformers to effectively capture spatial dependencies within sequential data.
Oct-24-2024
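The central claim is concrete enough to sketch in code: replace the row-wise softmax over the scaled scores QK^T/sqrt(d) with an elementwise polynomial whose scaling keeps the Frobenius norm of the attention matrix controlled. The snippet below is a minimal NumPy illustration, not the paper's implementation; the cubic degree, the 1/n normalization, and the function names (`softmax_attention`, `poly_attention`) are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_attention(Q, K, V):
    # Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))
    return A @ V, A

def poly_attention(Q, K, V, degree=3):
    # Illustrative polynomial-activation attention: the softmax is replaced
    # by an elementwise power of the scaled scores. The 1/n normalization is
    # one way to keep the Frobenius norm of the attention matrix bounded as
    # the sequence length n grows; the paper's exact polynomial and scaling
    # may differ.
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    A = (scores ** degree) / n
    return A @ V, A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 16  # sequence length and head dimension
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

    out_sm, A_sm = softmax_attention(Q, K, V)
    out_poly, A_poly = poly_attention(Q, K, V)

    # Frobenius norms of the attention matrices: the quantity the paper
    # argues softmax implicitly controls during training.
    print("softmax attention ||A||_F:", np.linalg.norm(A_sm))
    print("poly attention    ||A||_F:", np.linalg.norm(A_poly))
```

Both variants produce an n x d output of the same shape; the point of the comparison is that the polynomial activation, with an appropriate scaling, keeps the attention matrix's Frobenius norm in a similar regime to softmax rather than producing a probability distribution.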