Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers
Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics in the finite-width thermodynamic limit, i.e., $N,P\rightarrow\infty$, $P/N = \mathcal{O}(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different "attention paths", defined as information pathways through different attention heads across layers. The kernels are weighted according to a "task-relevant kernel combination" mechanism that aligns the total kernel with the task labels.
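A minimal illustrative sketch of the kernel-combination idea described in the abstract: several per-path kernels are weighted by their alignment with the task labels and summed into a total kernel used by a Gaussian-process-style predictor. The alignment weighting (`y @ K @ y`, normalized) and the kernel stubs are hypothetical stand-ins, not the paper's exact equations.

```python
import numpy as np

def combine_path_kernels(path_kernels, y, eps=1e-8):
    """Weight each attention-path kernel by its (normalized) alignment with
    the labels y and return the combined kernel. This weighting rule is an
    assumed placeholder for the paper's 'task-relevant kernel combination'."""
    weights = np.array([y @ K @ y for K in path_kernels])
    weights = np.clip(weights, 0.0, None)
    weights = weights / (weights.sum() + eps)
    K_total = sum(w * K for w, K in zip(weights, path_kernels))
    return K_total, weights

def predictor_mean(K_train, K_test_train, y, noise=1e-2):
    """Posterior-mean-style predictor built on the combined kernel
    (kernel-ridge / GP-mean form)."""
    P = K_train.shape[0]
    alpha = np.linalg.solve(K_train + noise * np.eye(P), y)
    return K_test_train @ alpha

# Toy usage: random PSD matrices stand in for per-attention-path kernels,
# e.g. 2 heads x 2 layers -> 4 paths.
rng = np.random.default_rng(0)
P = 50
y = rng.choice([-1.0, 1.0], size=P)
path_kernels = []
for _ in range(4):
    A = rng.standard_normal((P, P))
    path_kernels.append(A @ A.T / P)
K_total, weights = combine_path_kernels(path_kernels, y)
print("path weights:", np.round(weights, 3))
```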
Neural Information Processing Systems
May-27-2025, 07:22:49 GMT