Universal Approximation with Softmax Attention
Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, Weimin Wu, Han Liu
We prove that either two-layer self-attention or one-layer self-attention followed by a softmax (each equipped only with linear transformations) is capable of approximating any continuous sequence-to-sequence function on a compact domain. In contrast to previous studies [Yun et al., 2019, Jiang and Li, 2023, Takakura and Suzuki, 2023, Kajitsuka and Sato, 2023, Hu et al., 2024], our results highlight the expressive power of Transformers derived from the attention module alone. By focusing exclusively on attention, our analysis shows that the softmax operation itself suffices as a piecewise linear approximator. Furthermore, we extend this framework to broader applications, such as in-context learning [Brown et al., 2020, Bai et al., 2024], using the same attention-only architecture. Prior studies of Transformer universality lean on deep attention stacks [Yun et al., 2019], on feed-forward (FFN) sub-layers [Kajitsuka and Sato, 2023, Hu et al., 2024], or on strong assumptions about the data or architecture [Takakura and Suzuki, 2023, Petrov et al., 2024]. These results leave it unclear whether attention alone is essential or merely auxiliary. To address this, we develop a new interpolation-based technique for analyzing attention.
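As a concrete illustration of the softmax-as-interpolator idea, the sketch below uses a single softmax attention step to approximate a one-dimensional continuous function by interpolating over a grid of anchor points. This is a minimal numerical sketch under stated assumptions, not the paper's construction: the target function, anchor grid, and sharpness parameter `beta` are illustrative choices, and the squared-distance score used here is equivalent inside the softmax to a dot-product score with a key-dependent bias.

```python
# Minimal sketch (illustrative, not the authors' construction): one
# softmax attention step acting as a soft interpolator. Keys hold anchor
# points of a target function f; values hold f at those anchors. With a
# sharp softmax, the output approaches piecewise interpolation of f.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_interpolator(queries, anchors, f_values, beta=200.0):
    """One softmax attention step: output = softmax(scores) @ f(anchors).

    scores = -beta * (q - k)^2. Expanding the square, the q^2 term is
    constant per query and cancels in the softmax, so this is equivalent
    to dot-product attention (2*beta*q*k) with a bias -beta*k^2 per key.
    """
    scores = -beta * (queries[:, None] - anchors[None, :]) ** 2
    return softmax(scores, axis=-1) @ f_values

f = np.sin                               # target continuous function
anchors = np.linspace(0, np.pi, 64)      # grid over a compact domain
x = np.random.uniform(0, np.pi, 1000)    # test inputs
approx = attention_interpolator(x, anchors, f(anchors))
print("max error:", np.abs(approx - f(x)).max())
```

Refining the anchor grid while increasing `beta` drives the maximum error toward zero: in the sharp-softmax limit the output snaps to the nearest anchor's value, mirroring the interpolation argument at a high level.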
Apr-22-2025