Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers
Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics in the finite-width thermodynamic limit, i.e., $N,P\rightarrow\infty$, $P/N = \mathcal{O}(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different "attention paths", defined as information pathways through different attention heads across layers. The kernels are weighted according to a "task-relevant kernel combination" mechanism that aligns the total kernel with the task labels.
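A minimal illustrative sketch of the kernel-combination idea described in the abstract: several per-path kernels are weighted by their alignment with the task labels and summed into a total kernel used by a Gaussian-process-style predictor. The alignment weighting (`y @ K @ y`, normalized) and the kernel stubs are hypothetical stand-ins, not the paper's exact equations.

```python
import numpy as np

def combine_path_kernels(path_kernels, y, eps=1e-8):
    """Weight each attention-path kernel by its (normalized) alignment with
    the labels y and return the combined kernel. This weighting rule is an
    assumed placeholder for the paper's 'task-relevant kernel combination'."""
    weights = np.array([y @ K @ y for K in path_kernels])
    weights = np.clip(weights, 0.0, None)
    weights = weights / (weights.sum() + eps)
    K_total = sum(w * K for w, K in zip(weights, path_kernels))
    return K_total, weights

def predictor_mean(K_train, K_test_train, y, noise=1e-2):
    """Posterior-mean-style predictor built on the combined kernel
    (kernel-ridge / GP-mean form)."""
    P = K_train.shape[0]
    alpha = np.linalg.solve(K_train + noise * np.eye(P), y)
    return K_test_train @ alpha

# Toy usage: random PSD matrices stand in for per-attention-path kernels,
# e.g. 2 heads x 2 layers -> 4 paths.
rng = np.random.default_rng(0)
P = 50
y = rng.choice([-1.0, 1.0], size=P)
path_kernels = []
for _ in range(4):
    A = rng.standard_normal((P, P))
    path_kernels.append(A @ A.T / P)
K_total, weights = combine_path_kernels(path_kernels, y)
print("path weights:", np.round(weights, 3))
```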
Neural Information Processing Systems
May-27-2025, 07:22:49 GMT