A phase transition between positional and semantic learning in a solvable model of dot-product attention
Cui, Hugo, Behrens, Freya, Krzakala, Florent, Zdeborová, Lenka
arXiv.org Artificial Intelligence
We investigate how a dot-product attention layer learns a positional attention matrix (with tokens attending to each other based on their respective positions) and a semantic attention matrix (with tokens attending to each other based on their meaning). For an algorithmic task, we experimentally show how the same simple architecture can learn to implement a solution using either the positional or the semantic mechanism. On the theoretical side, we study the learning of a non-linear self-attention layer with trainable tied and low-rank query and key matrices. In the asymptotic limit of high-dimensional data and a comparably large number of training samples, we provide a closed-form characterization of the global minimum of the non-convex empirical loss landscape. We show that this minimum corresponds to either a positional or a semantic mechanism and evidence an emergent phase transition from the former to the latter with increasing sample complexity. Finally, we compare the dot-product attention layer to a linear positional baseline, and show that it outperforms the latter using the semantic mechanism provided it has access to sufficient data.
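The abstract describes a single non-linear self-attention layer whose query and key matrices are tied and low-rank. Below is a minimal PyTorch sketch of such a layer, under stated assumptions: the class name, the presence of a value projection, and the softmax scaling are illustrative choices, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn


class TiedLowRankAttention(nn.Module):
    """Sketch of a dot-product attention layer with tied, low-rank query/key weights.

    Illustrative only: the value projection and scaling are assumptions,
    not necessarily the setup studied in the paper.
    """

    def __init__(self, dim: int, rank: int):
        super().__init__()
        # Shared low-rank projection: queries and keys use the same matrix (tied).
        self.qk_proj = nn.Parameter(torch.randn(dim, rank) / dim ** 0.5)
        self.value = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); positional information, if any, is assumed
        # to already be encoded in the token embeddings.
        qk = x @ self.qk_proj                      # (batch, seq_len, rank)
        scores = qk @ qk.transpose(-2, -1)         # tied dot-product scores
        attn = torch.softmax(scores / qk.shape[-1] ** 0.5, dim=-1)
        return attn @ self.value(x)


# Usage example (hypothetical dimensions).
layer = TiedLowRankAttention(dim=64, rank=4)
tokens = torch.randn(2, 10, 64)
out = layer(tokens)                                # (2, 10, 64)
```

Depending on the data available, training such a layer can drive the tied projection toward exploiting either positional structure or token semantics, which is the distinction the paper characterizes.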
Feb-6-2024