Softmax Attention with Constant Cost per Token
We propose a simple modification to the conventional attention mechanism used by Transformers: instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of exponentials. This modification linearizes attention with an exponential kernel feature map, whose corresponding feature function is infinite-dimensional. We show that the modified attention is expressible as a composition of log-sums of exponentials over a latent space of constant size, enabling application with constant time and space complexity per token. We implement the modification, verify that it works in practice, and conclude that it is a promising alternative to conventional attention.
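The construction described in the abstract admits a recurrent reading: because the attention weights reduce to normalized products of exp(q) and exp(k), the softmax numerator and denominator can be maintained as fixed-size running sums updated in log space with log-sum-exp operations. The sketch below illustrates that idea under simplifying assumptions (strictly positive values, unit scaling, a single head); it is not the authors' reference implementation, and the name constant_cost_attention is hypothetical.

```python
import torch

def constant_cost_attention(Q, K, V):
    """Causal attention with an elementwise-exp feature map.

    Q, K: (seq_len, d_key); V: (seq_len, d_val), assumed strictly
    positive here so its logarithm exists (the paper's full method
    handles arbitrary signs). State is two log-space accumulators of
    constant size, so each token costs O(d_key * d_val) time and space.
    """
    seq_len, d_key = Q.shape
    d_val = V.shape[-1]
    # log sum_j exp(k_j)        -> denominator terms, shape (d_key,)
    # log sum_j exp(k_j) * v_j  -> numerator terms,  shape (d_key, d_val)
    log_den = torch.full((d_key,), float('-inf'))
    log_num = torch.full((d_key, d_val), float('-inf'))
    outputs = []
    for t in range(seq_len):
        q, k, v = Q[t], K[t], V[t]
        # Fold the current key (and key-weighted value) into the running
        # log-space sums; logaddexp keeps the update numerically stable.
        log_den = torch.logaddexp(log_den, k)
        log_num = torch.logaddexp(log_num, k[:, None] + torch.log(v)[None, :])
        # Output_t = (sum_j exp(q)·exp(k_j) v_j) / (sum_j exp(q)·exp(k_j)),
        # evaluated as exp(log-numerator - log-denominator).
        num = torch.logsumexp(q[:, None] + log_num, dim=0)  # (d_val,)
        den = torch.logsumexp(q + log_den, dim=0)           # scalar
        outputs.append(torch.exp(num - den))
    return torch.stack(outputs)
```

Each step updates only the two fixed-size accumulators, never the full sequence of keys and values, which is what gives the constant per-token cost the abstract claims.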
arXiv.org Artificial Intelligence
Apr-27-2024