 local learning coefficient


Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory

arXiv.org Machine Learning

We study neural network compressibility by using singular learning theory to extend the minimum description length (MDL) principle to singular models like neural networks. Through extensive experiments on the Pythia suite with quantization, factorization, and other compression techniques, we find that complexity estimates based on the local learning coefficient (LLC) are closely, and in some cases linearly, correlated with compressibility. Our results provide a path toward rigorously evaluating the limits of model compression.


From Global to Local: A Scalable Benchmark for Local Posterior Sampling

arXiv.org Machine Learning

Degeneracy is an inherent feature of the loss landscape of neural networks, but it is not well understood how stochastic gradient MCMC (SGMCMC) algorithms interact with this degeneracy. In particular, current global convergence guarantees for common SGMCMC algorithms rely on assumptions which are likely incompatible with degenerate loss landscapes. In this paper, we argue that this gap requires a shift in focus from global to local posterior sampling, and, as a first step, we introduce a novel scalable benchmark for evaluating the local sampling performance of SGMCMC algorithms. We evaluate a number of common algorithms, and find that RMSProp-preconditioned SGLD is most effective at faithfully representing the local geometry of the posterior distribution. Although we lack theoretical guarantees about global sampler convergence, our empirical results show that we are able to extract non-trivial local information in models with up to O(100M) parameters.
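As a rough illustration of what "RMSProp-preconditioned SGLD" can look like, the sketch below uses one commonly used form of the update: a running average of squared gradients rescales both the drift and the injected noise per coordinate. It is not taken from the paper. The function name `psgld_sample`, the hyperparameter values, the two-dimensional Gaussian target, and the use of exact rather than minibatch gradients are all assumptions made for the example, and the curvature-correction term of the full preconditioned algorithm is omitted.

```python
import numpy as np


def psgld_sample(grad_u, theta0, step=1e-2, alpha=0.999, eps=1e-3,
                 n_steps=30000, burn_in=5000, seed=0):
    """Sample from exp(-U) with an RMSProp-style diagonal preconditioner.

    grad_u returns the gradient of the potential U (hypothetical interface).
    The correction term involving the derivative of the preconditioner is
    omitted, so this is only an approximate sampler.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    v = np.ones_like(theta)  # running average of squared gradients
    samples = []
    for t in range(n_steps):
        g = grad_u(theta)
        v = alpha * v + (1.0 - alpha) * g**2
        precond = 1.0 / (eps + np.sqrt(v))  # per-coordinate step scaling
        noise = rng.standard_normal(theta.shape)
        # Langevin step: preconditioned drift plus matching Gaussian noise.
        theta = theta - 0.5 * step * precond * g + np.sqrt(step * precond) * noise
        if t >= burn_in:
            samples.append(theta.copy())
    return np.array(samples)


# Toy target (assumption): U(theta) = 0.5 * sum(curvature * theta**2),
# i.e. a Gaussian that is sharp in the first coordinate and flat in the second.
curvature = np.array([100.0, 1.0])
samples = psgld_sample(lambda th: curvature * th, theta0=[0.0, 0.0])
print(samples.var(axis=0))
```

The preconditioner shrinks the step in the sharp direction and leaves it larger in the flat one, so a single step size can serve both. That is the usual motivation for preconditioning when the local geometry is strongly anisotropic.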


Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent

arXiv.org Artificial Intelligence

We show that the behavior of stochastic gradient descent is related to Bayesian statistics by showing that SGD is effectively diffusion on a fractal landscape, where the fractal dimension can be accounted for in a purely Bayesian way. By doing this we show that SGD can be regarded as a modified Bayesian sampler which accounts for accessibility constraints induced by the fractal structure of the loss landscape. We verify our results experimentally by examining the diffusion of weights during training. These results offer insight into the factors which determine the learning process, and seemingly answer the question of how SGD and purely Bayesian sampling are related.


Estimating the Local Learning Coefficient at Scale

arXiv.org Artificial Intelligence

The local learning coefficient (LLC) is a principled way of quantifying model complexity, originally derived in the context of Bayesian statistics using singular learning theory (SLT). Several methods are known for numerically estimating the local learning coefficient, but so far these methods have not been extended to the scale of modern deep learning architectures or data sets. Using a method developed in arXiv:2308.12108 [stat.ML], we empirically show how the LLC may be measured accurately and self-consistently for deep linear networks (DLNs) up to 100M parameters. We also show that the estimated LLC has the rescaling invariance that holds for the theoretical quantity.


The Developmental Landscape of In-Context Learning

arXiv.org Artificial Intelligence

We show that in-context learning emerges in transformers in discrete developmental stages, when they are trained on either language modeling or linear regression tasks. We introduce two methods for detecting the milestones that separate these stages, by probing the geometry of the population loss in both parameter space and function space. We study the stages revealed by these new methods using a range of behavioral and structural metrics to establish their validity.


Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition

arXiv.org Artificial Intelligence

The apparent simplicity of the Toy Model of Superposition (TMS) proposed in Elhage et al. (2022) conceals a remarkably intricate phase structure. During training, a plateau in the loss is often followed by a sudden discrete drop, suggesting some development in the network's internal structure. To shed light on these transitions and their significance, this paper examines the dynamical transitions in TMS during SGD training, connecting them to phase transitions of the Bayesian posterior with respect to sample size n. While the former transitions have been observed in several recent works in deep learning (Olsson et al., 2022; McGrath et al., 2022; Wei et al., 2022a), their formal status has remained elusive. In contrast, phase transitions of the Bayesian posterior are mathematically well-defined in Singular Learning Theory (SLT) (Watanabe, 2009). Using SLT, we can show formally that the Bayesian posterior is subject to an internal model selection mechanism in the following sense: the posterior prefers, for small training sample size n, critical points with low complexity but potentially high loss. The opposite is true for high n where the posterior prefers low loss critical points at the cost of higher complexity. The measure of complexity here is very specific: it is the local learning coefficient, λ, of the critical points, first alluded to by Watanabe (2009, 7.6) and clarified recently in Lau et al. (2023). We can think of this internal model selection as a discrete dynamical process: at various critical sample sizes the posterior concentration "jumps" from one region W
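The trade-off described in this abstract can be illustrated with the leading-order free energy asymptotics from singular learning theory, F_n ≈ n·L + λ·log n, where L is the loss at a critical point and λ its local learning coefficient. The sketch below drops lower-order terms. The two "phases" and their loss and λ values are hypothetical numbers chosen only to show the crossover, and are not taken from the paper.

```python
import math


def free_energy(n, loss, llc):
    """Leading-order free energy n*L + lambda*log(n); lower-order terms dropped."""
    return n * loss + llc * math.log(n)


def preferred_phase(n, phases):
    """Return the name of the phase with the lowest approximate free energy."""
    return min(phases, key=lambda name: free_energy(n, *phases[name]))


# Hypothetical critical points as (loss, local learning coefficient).
phases = {
    "simple": (0.10, 1.0),   # higher loss, lower complexity
    "complex": (0.05, 5.0),  # lower loss, higher complexity
}

for n in (100, 1000, 10000):
    print(n, preferred_phase(n, phases))
```

Under this approximation the preference switches where n / log n equals the ratio of the complexity gap to the loss gap, which is 4 / 0.05 = 80 for the numbers above.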


Quantifying degeneracy in singular models via the learning coefficient

arXiv.org Artificial Intelligence

Deep neural networks (DNNs) are singular statistical models which exhibit complex degeneracies. In this work, we illustrate how a quantity known as the learning coefficient, introduced in singular learning theory, quantifies precisely the degree of degeneracy in deep neural networks. Importantly, we will demonstrate that degeneracy in DNNs cannot be accounted for by simply counting the number of "flat" directions. We propose a computationally scalable approximation of a localized version of the learning coefficient using stochastic gradient Langevin dynamics. To validate our approach, we demonstrate its accuracy in low-dimensional models with known theoretical values. Importantly, the local learning coefficient can correctly recover the ordering of degeneracy between various parameter regions of interest. An experiment on MNIST shows the local learning coefficient can reveal the inductive bias of stochastic optimizers for more or less degenerate critical points.
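A minimal sketch of a sampling-based estimate in this spirit is given below, assuming the estimator takes the form λ̂ = nβ(E[L(w)] − L(w*)) with β = 1/log n, where the expectation is over a posterior localized around w* by a quadratic penalty of strength γ. To keep it checkable it uses one-dimensional toy losses with known learning coefficients (1/2 for w², 1/4 for w⁴) and full-gradient Langevin dynamics without minibatch noise, which is a simplification of SGLD. The function name `estimate_llc` and all hyperparameter values are assumptions made for the example.

```python
import numpy as np


def estimate_llc(loss, grad_loss, w_star, n, gamma=1.0, step=2e-4,
                 n_chains=2000, n_steps=3000, burn_in=1500, seed=0):
    """Estimate n*beta*(E[L(w)] - L(w_star)) with beta = 1/log(n).

    Samples come from a density proportional to
    exp(-n*beta*L(w) - gamma/2 * (w - w_star)**2), using many parallel
    Langevin chains that all start at w_star.
    """
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)
    w = np.full(n_chains, float(w_star))
    loss_sum = 0.0
    for t in range(n_steps):
        # Gradient of the localized, tempered potential.
        drift = n * beta * grad_loss(w) + gamma * (w - w_star)
        noise = rng.standard_normal(n_chains)
        w = w - 0.5 * step * drift + np.sqrt(step) * noise
        if t >= burn_in:
            loss_sum += loss(w).mean()
    mean_loss = loss_sum / (n_steps - burn_in)
    return float(n * beta * (mean_loss - loss(np.float64(w_star))))


# Toy losses (assumption): a regular minimum and a more degenerate one.
llc_quadratic = estimate_llc(lambda w: w**2, lambda w: 2 * w, w_star=0.0, n=1000)
llc_quartic = estimate_llc(lambda w: w**4, lambda w: 4 * w**3, w_star=0.0, n=1000)
print(llc_quadratic, llc_quartic)
```

Both minima are isolated points with zero gradient, yet the quartic one should come out with the lower estimate. A second-order count of flat directions sees the quartic direction only as flat and does not register how flat it is.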