Universal Approximation with Softmax Attention
Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, Weimin Wu, Han Liu
We prove that either two-layer self-attention or one-layer self-attention followed by a softmax (each equipped only with linear transformations) is capable of approximating any continuous sequence-to-sequence function on a compact domain. In contrast to previous studies [Yun et al., 2019, Jiang and Li, 2023, Takakura and Suzuki, 2023, Kajitsuka and Sato, 2023, Hu et al., 2024], our results highlight the expressive power of Transformers derived from the attention module alone. By focusing exclusively on attention, our analysis shows that the softmax operation itself suffices as a piecewise linear approximator. Furthermore, we extend this framework to broader applications, such as in-context learning [Brown et al., 2020, Bai et al., 2024], using the same attention-only architecture. Prior studies of Transformer universality lean on deep attention stacks [Yun et al., 2019], on feed-forward (FFN) sub-layers [Kajitsuka and Sato, 2023, Hu et al., 2024], or on strong assumptions about the data or architecture [Takakura and Suzuki, 2023, Petrov et al., 2024]. These results leave it unclear whether attention alone is essential or merely auxiliary. To address this, we develop a new interpolation-based technique for analyzing attention.
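As a concrete illustration of the softmax-as-interpolator idea, the sketch below uses a single softmax attention step to approximate a one-dimensional continuous function by interpolating over a grid of anchor points. This is a minimal numerical sketch under stated assumptions, not the paper's construction: the target function, anchor grid, and sharpness parameter `beta` are illustrative choices, and the squared-distance score used here is equivalent inside the softmax to a dot-product score with a key-dependent bias.

```python
# Minimal sketch (illustrative, not the authors' construction): one
# softmax attention step acting as a soft interpolator. Keys hold anchor
# points of a target function f; values hold f at those anchors. With a
# sharp softmax, the output approaches piecewise interpolation of f.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_interpolator(queries, anchors, f_values, beta=200.0):
    """One softmax attention step: output = softmax(scores) @ f(anchors).

    scores = -beta * (q - k)^2. Expanding the square, the q^2 term is
    constant per query and cancels in the softmax, so this is equivalent
    to dot-product attention (2*beta*q*k) with a bias -beta*k^2 per key.
    """
    scores = -beta * (queries[:, None] - anchors[None, :]) ** 2
    return softmax(scores, axis=-1) @ f_values

f = np.sin                               # target continuous function
anchors = np.linspace(0, np.pi, 64)      # grid over a compact domain
x = np.random.uniform(0, np.pi, 1000)    # test inputs
approx = attention_interpolator(x, anchors, f(anchors))
print("max error:", np.abs(approx - f(x)).max())
```

Refining the anchor grid while increasing `beta` drives the maximum error toward zero: in the sharp-softmax limit the output snaps to the nearest anchor's value, mirroring the interpolation argument at a high level.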
Apr-22-2025