Phase transitions for the noisy transformer model in arbitrary dimension

Mun, Kyunghoo, Rosenzweig, Matthew

arXiv.org Machine Learning 

We study the McKean--Vlasov free energy on the unit sphere associated with the unnormalized self-attention (USA) model for noisy transformer dynamics. We prove a sharp global-minimizer dichotomy in every dimension $d\ge2$. There is a unique $\beta_*^{(d)}>0$ such that \begin{equation*} \frac{I_{d/2+1}(\beta_*^{(d)})}{I_{d/2}(\beta_*^{(d)})}=\frac1d, \end{equation*} where $I_\nu$ is the modified Bessel function of the first kind. For $0<\beta\le \beta_*^{(d)}$, the uniform density remains the unique global minimizer up to the linear-stability threshold \begin{equation*} K_\#^{(d)}(\beta)=\frac{\beta^{d/2}}{2^{d/2}\Gamma(d/2)I_{d/2}(\beta)}, \end{equation*} and the phase transition is continuous. For $\beta>\beta_*^{(d)}$, the uniform density is not globally minimizing at $K_\#^{(d)}(\beta)$, so the critical coupling satisfies $K_c
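
The two explicit quantities in the abstract, the Bessel-ratio root $\beta_*^{(d)}$ and the linear-stability threshold $K_\#^{(d)}(\beta)$, can be evaluated numerically. The following is a minimal sketch and not the authors' code. The function names (`bessel_i`, `bessel_ratio`, `beta_star`, `k_sharp`) are hypothetical. It assumes moderate values of $d$ and $\beta$, so that the plain power series for $I_\nu$ neither overflows nor underflows, and it assumes that the ratio $I_{d/2+1}/I_{d/2}$ is increasing in $\beta$, which is what makes bisection valid and is consistent with the uniqueness statement above.

```python
import math


def bessel_i(nu, x, tol=1e-16, max_terms=500):
    """Modified Bessel function of the first kind I_nu(x), x > 0,
    summed from its power series sum_k (x/2)^(2k+nu) / (k! Gamma(k+nu+1))."""
    half = 0.5 * x
    term = half**nu / math.gamma(nu + 1.0)  # k = 0 term
    total = term
    for k in range(max_terms):
        # Ratio of consecutive series terms: (x/2)^2 / ((k+1)(k+1+nu)).
        term *= half * half / ((k + 1) * (k + 1 + nu))
        total += term
        if term < tol * total:
            break
    return total


def bessel_ratio(beta, d):
    """Left-hand side of the defining equation: I_{d/2+1}(beta) / I_{d/2}(beta)."""
    return bessel_i(d / 2 + 1, beta) / bessel_i(d / 2, beta)


def beta_star(d, iters=200):
    """Bisection for the root of bessel_ratio(beta, d) = 1/d.
    Assumes the ratio is increasing in beta (illustrative assumption)."""
    target = 1.0 / d
    lo, hi = 0.0, 1.0  # lo = 0 is never evaluated, only used as a bracket end
    while bessel_ratio(hi, d) <= target:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bessel_ratio(mid, d) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def k_sharp(beta, d):
    """Linear-stability threshold beta^{d/2} / (2^{d/2} Gamma(d/2) I_{d/2}(beta))."""
    return beta ** (d / 2) / (2 ** (d / 2) * math.gamma(d / 2) * bessel_i(d / 2, beta))


if __name__ == "__main__":
    for d in (2, 3, 4):
        b = beta_star(d)
        print(f"d={d}: beta_star={b:.6f}, K_sharp(beta_star)={k_sharp(b, d):.6f}")
```

Two sanity checks are available without reference tables. The series should reproduce the closed form $I_{1/2}(x)=\sqrt{2/(\pi x)}\,\sinh x$, and the leading small-argument term $I_\nu(\beta)\approx(\beta/2)^\nu/\Gamma(\nu+1)$ implies $K_\#^{(d)}(\beta)\to d/2$ as $\beta\to0$. For large $d$ or $\beta$, a library routine such as `scipy.special.iv` or its exponentially scaled variant `scipy.special.ive` would be a more robust choice than the hand-rolled series.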