deterministic equivalence
A Random Matrix Theory Perspective on the Consistency of Diffusion Models
Wang, Binxu, Zavatone-Veth, Jacob, Pehlevan, Cengiz
Diffusion models trained on different, non-overlapping subsets of a dataset often produce strikingly similar outputs when given the same noise seed. We trace this consistency to a simple linear effect: the shared Gaussian statistics across splits already predict much of the generated images. To formalize this, we develop a random matrix theory (RMT) framework that quantifies how finite datasets shape the expectation and variance of the learned denoiser and sampling map in the linear setting. For expectations, sampling variability acts as a renormalization of the noise level through a self-consistent relation $\sigma^2 \mapsto \kappa(\sigma^2)$, explaining why limited data overshrink low-variance directions and pull samples toward the dataset mean. For fluctuations, our variance formulas reveal three key factors behind cross-split disagreement: \textit{anisotropy} across eigenmodes, \textit{inhomogeneity} across inputs, and overall scaling with dataset size. Extending deterministic-equivalence tools to fractional matrix powers further allows us to analyze entire sampling trajectories. The theory sharply predicts the behavior of linear diffusion models, and we validate its predictions on UNet and DiT architectures in their non-memorization regime, identifying where and how samples deviate across training-data splits. This provides a principled baseline for reproducibility in diffusion training, linking spectral properties of the data to the stability of generative outputs.
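A minimal numerical sketch of the deterministic-equivalence effect described in the abstract, for the linear Gaussian setting: the denoiser built from a finite dataset's sample covariance, $\hat\Sigma(\hat\Sigma + \sigma^2 I)^{-1}$, is compared against the population denoiser evaluated at a renormalized noise level. The fixed-point equation used for $\kappa(\sigma^2)$ is the standard self-consistent relation for sample-covariance resolvents, and the dimensions, spectrum, and function names below are illustrative assumptions rather than the paper's definitions.

```python
import numpy as np

def kappa(sigma2, eigs, n, iters=200):
    """Solve the (assumed) self-consistent relation
    kappa = sigma^2 + (kappa / n) * tr[Sigma (Sigma + kappa I)^{-1}]
    by fixed-point iteration; Sigma is diagonal with eigenvalues `eigs`."""
    k = sigma2 + 1.0
    for _ in range(iters):
        k = sigma2 + (k / n) * np.sum(eigs / (eigs + k))
    return k

rng = np.random.default_rng(0)
d, n, sigma2 = 200, 400, 0.5
eigs = np.exp(rng.normal(0.0, 1.5, size=d))          # anisotropic population spectrum

# Finite dataset drawn from N(0, diag(eigs)) and its sample covariance.
X = rng.normal(size=(n, d)) * np.sqrt(eigs)
Sigma_hat = X.T @ X / n

# Linear denoiser built from the *empirical* covariance vs. its deterministic
# equivalent: the population denoiser evaluated at the renormalized level kappa.
k = kappa(sigma2, eigs, n)
shrink_emp = np.diag(Sigma_hat @ np.linalg.inv(Sigma_hat + sigma2 * np.eye(d)))
shrink_det = eigs / (eigs + k)

print(f"kappa(sigma^2) = {k:.3f}   vs   sigma^2 = {sigma2}")
print(f"mean shrinkage, empirical denoiser      : {shrink_emp.mean():.3f}")
print(f"mean shrinkage, deterministic equivalent: {shrink_det.mean():.3f}")
```

Because $\kappa(\sigma^2) \ge \sigma^2$, directions with small population variance are shrunk more strongly than the infinite-data denoiser would shrink them, which is the pull toward the dataset mean the abstract describes.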
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Atanasov, Alexander, Bordelon, Blake, Zavatone-Veth, Jacob A., Paquette, Courtney, Pehlevan, Cengiz
Modern deep learning practice is governed by the surprising predictability of performance improvements with increases in the scale of data, model size, and compute [17]. Often, the scaling of performance as a function of these quantities exhibits remarkably regular power-law behavior, termed a neural scaling law [2, 6, 12, 13, 15, 16, 18, 19, 22, 32]. Here, performance is usually measured by some differentiable loss on the model's predictions on a held-out test set representative of the population. Given the relatively universal behavior of the exponents across architectures and optimizers [11, 18, 19], one might hope that comparatively simple models of information-processing systems could recover the same types of scaling laws. The (stochastic) gradient descent (SGD) dynamics of random feature models were analyzed in recent works [7, 20, 26]; this setting exhibits a surprising breadth of scaling behavior and captures several interesting phenomena in deep network training. Each of the above works isolates effects that can hurt performance relative to the idealized infinite-data and infinite-model-size limits. The model was first studied in [7], where the bottlenecks due to finite width and finite dataset size were computed and shown, for certain data structures, to yield a Chinchilla-type scaling result as in [18].
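A toy illustration of the finite-width and finite-dataset bottlenecks mentioned above: one-pass SGD on the readout of a linear random-feature model, with test risk limited by whichever of the sample budget n or the feature count p is more constraining. The dimensions, learning rate, and the specific one-pass protocol are illustrative choices, not the setting analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 400                                      # input dimension
w_star = rng.normal(size=d) / np.sqrt(d)     # linear teacher, ||w_star|| ~ 1

def sgd_test_risk(n, p, lr=0.5, n_test=2000):
    """One-pass SGD on the readout of a random-feature model with p features,
    seeing n fresh samples; returns the test risk after the single pass."""
    W = rng.normal(size=(d, p)) / np.sqrt(d)     # fixed random projection
    a = np.zeros(p)                              # trainable readout weights
    for _ in range(n):
        x = rng.normal(size=d)
        f = x @ W
        a -= lr * (f @ a - x @ w_star) * f / p   # squared-loss gradient step
    Xte = rng.normal(size=(n_test, d))
    return np.mean(((Xte @ W) @ a - Xte @ w_star) ** 2)

# Risk is bottlenecked by the smaller resource: few samples or few features.
for n, p in [(200, 50), (200, 1000), (4000, 50), (4000, 1000)]:
    print(f"n={n:5d}  p={p:5d}  test risk ~ {sgd_test_risk(n, p):.3f}")
```

In this toy, enlarging one resource while the other stays small leaves the risk essentially unchanged, the qualitative signature of the data and width bottlenecks computed in [7].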
Re-examining Double Descent and Scaling Laws under Norm-based Capacity via Deterministic Equivalence
Wang, Yichen, Chen, Yudong, Rosasco, Lorenzo, Liu, Fanghui
The number of parameters, i.e., model size, provides a basic measure of the capacity of a machine learning (ML) model. However, it is well known that it might not describe the effective model capacity (Bartlett, 1998), especially for over-parameterized neural networks (Belkin et al., 2018; Zhang et al., 2021) and large language models (Brown et al., 2020). Focusing on the number of parameters therefore gives an inaccurate characterization of the relationship between the test risk $R$, the training data size $n$, and the model size $p$, a relationship that is central in ML for understanding the bias-variance trade-off (Vapnik, 1995), double descent (Belkin et al., 2019), and scaling laws (Kaplan et al., 2020; Xiao, 2024). For example, even for the same architecture (model size), the test error behavior can be totally different (Nakkiran et al., 2020, 2021); e.g., double descent may disappear. Here we shift the focus from model size to the weights and consider their norm, a perspective pioneered by the classical results of Bartlett (1998). Indeed, norm-based capacity/complexity measures are widely considered more effective in characterizing generalization behavior; see, e.g.
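A minimal sketch of the classical experiment this discussion alludes to: minimum-norm least squares using the first p input coordinates as the "model", swept across the interpolation threshold p = n. The teacher, noise level, and dimensions are illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise = 400, 100, 0.5
w_star = rng.normal(size=d) / np.sqrt(d)
Xtr = rng.normal(size=(n, d)); ytr = Xtr @ w_star + noise * rng.normal(size=n)
Xte = rng.normal(size=(5000, d)); yte = Xte @ w_star            # noiseless test targets

print("   p   test risk   ||w_hat||")
for p in [25, 50, 90, 100, 110, 200, 400]:
    # "Model size" p = number of input coordinates used; fit the
    # minimum-norm least-squares solution via the pseudoinverse.
    w_hat = np.linalg.pinv(Xtr[:, :p]) @ ytr
    risk = np.mean((Xte[:, :p] @ w_hat - yte) ** 2)
    print(f"{p:4d}   {risk:9.3f}   {np.linalg.norm(w_hat):8.3f}")
```

In this toy, both the test risk and the weight norm spike near p = n and fall again in the over-parameterized regime, which is why tracking a norm-based capacity rather than the parameter count p can give a cleaner account of generalization.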
Risk and cross validation in ridge regression with correlated samples
Atanasov, Alexander, Zavatone-Veth, Jacob A., Pehlevan, Cengiz
Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging recent techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross validation estimator (GCV) fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high-dimensional data.
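A toy check of the failure mode the abstract describes, under an AR(1) correlation model chosen purely for illustration: ridge regression on samples correlated across rows, with the standard GCV estimator compared against the out-of-sample risk on independent test points. The correlation model, parameter values, and variable names are assumptions of this sketch; the paper's corrected estimator (CorrGCV) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rho, noise = 300, 100, 0.9, 0.5
w_star = rng.normal(size=d) / np.sqrt(d)

def ar1_rows(n, d, rho):
    """Rows correlated across samples by an AR(1) process; each row is N(0, I_d) marginally."""
    X = np.zeros((n, d))
    X[0] = rng.normal(size=d)
    for t in range(1, n):
        X[t] = rho * X[t - 1] + np.sqrt(1 - rho ** 2) * rng.normal(size=d)
    return X

Xtr = ar1_rows(n, d, rho)
eps = noise * ar1_rows(n, 1, rho)[:, 0]          # noise shares the sample correlations
ytr = Xtr @ w_star + eps
Xte = rng.normal(size=(5000, d))                 # independent test points
yte = Xte @ w_star + noise * rng.normal(size=5000)

print(" lambda      GCV   out-of-sample risk")
for lam in [1e-3, 1e-2, 1e-1, 1.0]:
    A = Xtr.T @ Xtr + n * lam * np.eye(d)
    S = Xtr @ np.linalg.solve(A, Xtr.T)          # ridge "hat" matrix
    gcv = np.mean((ytr - S @ ytr) ** 2) / (1 - np.trace(S) / n) ** 2   # standard GCV
    w_hat = np.linalg.solve(A, Xtr.T @ ytr)
    risk = np.mean((Xte @ w_hat - yte) ** 2)
    print(f"{lam:7.3f}  {gcv:7.3f}   {risk:10.3f}")
```

With independent rows (rho = 0) the two columns track each other closely; under strong sample correlation the standard GCV stops being a reliable proxy for the out-of-sample risk, which is the gap CorrGCV is designed to close.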