Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
Agazzi, Andrea; Bruno, Giuseppe; García, Eloy Mosig; Saviozzi, Samuele; Romito, Marco
The transformer architecture [52], which underlies present-day large language models, has been one of the main drivers of recent advances in machine learning and artificial intelligence. At each layer, the hidden state of the network is updated by sequentially applying two distinct operations: attention modules [3], which capture long-range interactions in the input sequence, and classical multilayer perceptrons (MLPs), which act separately on each element of that sequence. Despite their empirical success, the mechanisms governing information propagation through depth, and the way attention and MLP blocks jointly shape internal representations, remain only partially understood from a theoretical viewpoint. Recent progress has come from viewing transformers, in suitable scaling limits, as deterministic mean-field interacting particle systems that model the evolution of N tokens through the layers of the network (the so-called residual stream dynamics); see, among others, [46, 26, 27, 45]. In these descriptions, depth plays the role of a continuous time variable and, in the large-context regime (N → ∞), the evolution of token representations is encoded by a PDE for their empirical distribution. This viewpoint is closely connected to the literature on scaling laws, where the effect of various scaling exponents controlling the relative size of the network's hyperparameters (e.g., depth, width, context length) on the effective dynamics of the model
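The layer-wise picture described above (an attention update that couples all tokens, followed by an MLP update applied to each token separately, with depth acting as a time variable) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: identity query, key and value matrices, a single-layer tanh MLP with a hypothetical weight matrix `W`, an inverse-temperature parameter `beta`, and a residual step size of 1/depth.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(tokens, beta=1.0):
    """Map each token to a softmax-weighted average of all tokens.

    Assumption: query, key and value matrices are the identity, so the
    attention score between tokens x and y is beta * <x, y>.
    """
    dim = len(tokens[0])
    out = []
    for x in tokens:
        scores = [beta * sum(a * b for a, b in zip(x, y)) for y in tokens]
        weights = softmax(scores)
        out.append([sum(w * y[k] for w, y in zip(weights, tokens))
                    for k in range(dim)])
    return out


def mlp_step(x, W):
    """Token-wise map tanh(W x); it sees one token at a time."""
    return [math.tanh(sum(W[i][j] * x[j] for j in range(len(x))))
            for i in range(len(W))]


def residual_stream(tokens, W, depth, beta=1.0):
    """Apply `depth` layers of (attention, then MLP) residual updates.

    Assumption: each update is scaled by h = 1/depth, so the layer index
    behaves like a discretized time variable on [0, 1].
    """
    h = 1.0 / depth
    for _ in range(depth):
        attn = attention_step(tokens, beta)
        tokens = [[xi + h * ai for xi, ai in zip(x, a)]
                  for x, a in zip(tokens, attn)]
        tokens = [[xi + h * mi for xi, mi in zip(x, mlp_step(x, W))]
                  for x in tokens]
    return tokens


# Illustrative data: N = 3 tokens in dimension 2, hypothetical MLP weights.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
W = [[0.0, 1.0], [-1.0, 0.0]]
for token in residual_stream(tokens, W, depth=20):
    print(token)
```

In this sketch the attention update is the only place where tokens interact, which is what makes the system an interacting particle system; reordering the input tokens simply reorders the outputs.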
Apr-30-2026