Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Shai Zucker, Xiong Wang, Fei Lu, Inbar Seroussi

Oct-15-2025–arXiv.org Machine Learning

We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size; the rate depends only on the smoothness $\beta$ of the activation and is crucially independent of the token count, ambient dimension, and rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of the attention mechanism and its training.
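As a rough numerical illustration of the stated rate, the sketch below evaluates $M^{-\frac{2\beta}{2\beta+1}}$ for a few sample sizes and smoothness values. The function name `minimax_rate` and the chosen values of $M$ and $\beta$ are illustrative assumptions, not taken from the paper, and the constants in front of the rate, which the abstract does not specify, are ignored.

```python
def minimax_rate(num_samples: int, beta: float) -> float:
    """Evaluate M^(-2*beta / (2*beta + 1)) with all constants omitted.

    num_samples plays the role of M and beta the smoothness of the
    activation; both names are hypothetical, chosen for this sketch.
    """
    if num_samples <= 0 or beta <= 0:
        raise ValueError("num_samples and beta must be positive")
    exponent = 2.0 * beta / (2.0 * beta + 1.0)
    return num_samples ** (-exponent)


if __name__ == "__main__":
    # Illustrative values only; no token count, dimension, or rank appears
    # as an input because the stated rate does not depend on them.
    for beta in (0.5, 1.0, 2.0):
        for num_samples in (10**3, 10**4, 10**5):
            rate = minimax_rate(num_samples, beta)
            print(f"beta={beta:<4} M={num_samples:<7} rate~{rate:.3e}")
```

The exponent $\frac{2\beta}{2\beta+1}$ increases toward 1 as $\beta$ grows, so in this sketch a smoother activation gives a faster decay in $M$.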

artificial intelligence, machine learning, natural language

Country:
- North America > United States (0.28)
- Asia > Middle East (0.28)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language (1.00)
  - Machine Learning
    - Statistical Learning (1.00)
    - Neural Networks > Deep Learning (0.46)
