Dynamic metastability in the self-attention model
Borjan Geshkovski, Hugo Koubbi, Yury Polyanskiy, Philippe Rigollet
arXiv.org Artificial Intelligence
We consider the self-attention model, an interacting particle system on the unit sphere that serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23]: although particles collapse to a single cluster in infinite time, they remain trapped near a configuration of several clusters for an exponentially long period of time. By leveraging a gradient flow interpretation of the system, we also connect our result to an overarching framework of slow motion of gradient flows proposed by Otto and Reznikoff [OR07] in the context of coarsening and the Allen-Cahn equation. We finally probe the dynamics beyond the exponentially long period of metastability and illustrate that, under an appropriate time-rescaling, the energy reaches its global maximum in finite time and has a staircase profile, with trajectories manifesting saddle-to-saddle-like behavior, reminiscent of recent works on the analysis of training dynamics via gradient descent for two-layer neural networks.
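The interacting particle system described above can be sketched numerically. The following is a minimal NumPy simulation of self-attention dynamics on the unit sphere, assuming the commonly stated form dx_i/dt = P_{x_i}(Σ_j softmax_j(β⟨x_i, x_j⟩) x_j), where P_x projects onto the tangent space at x; the explicit-Euler discretization, parameter values, and function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def self_attention_flow(X, beta=4.0, dt=0.01, steps=2000):
    """Sketch of self-attention dynamics on the unit sphere:
    each particle drifts toward an attention-weighted average of
    all particles, projected onto its tangent space, then is
    renormalized back to the sphere (explicit Euler step)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # start on the sphere
    for _ in range(steps):
        A = np.exp(beta * X @ X.T)            # attention weights e^{beta <x_i, x_j>}
        A /= A.sum(axis=1, keepdims=True)     # row-wise softmax normalization
        V = A @ X                             # attention-weighted averages
        V -= np.sum(V * X, axis=1, keepdims=True) * X  # project to tangent space
        X = X + dt * V
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # retract to the sphere
    return X

# Random initial configuration of 16 particles on the 2-sphere.
rng = np.random.default_rng(0)
X0 = rng.standard_normal((16, 3))
XT = self_attention_flow(X0)
```

Running the flow for longer horizons (and tracking the interaction energy Σ_{i,j} e^{β⟨x_i, x_j⟩}) is one way to observe the metastable plateaus the paper analyzes: particles settle quickly into a few clusters and remain there for a long time before merging further.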
Oct-9-2024