Murfet, Daniel
You Are What You Eat -- AI Alignment Requires Understanding How Data Shapes Structure and Generalisation
Lehalleur, Simon Pepin, Hoogland, Jesse, Farrugia-Roberts, Matthew, Wei, Susan, Oldenziel, Alexander Gietelink, Wang, George, Carroll, Liam, Murfet, Daniel
In this position paper, we argue that understanding the relation between structure in the data distribution and structure in trained models is central to AI alignment. First, we discuss how two neural networks can have equivalent performance on the training set but compute their outputs in essentially different ways and thus generalise differently. For this reason, standard testing and evaluation are insufficient for obtaining assurances of safety for widely deployed generally intelligent systems. We argue that to progress beyond evaluation to a robust mathematical science of AI alignment, we need to develop statistical foundations for an understanding of the relation between structure in the data distribution, internal structure in models, and how these structures underlie generalisation.
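As a concrete illustration of the first point (our own toy example, not an experiment from the paper), the sketch below trains two small networks to near-identical, near-zero loss on the same data and then compares their predictions away from the training distribution, where they are free to disagree.

```python
# Illustrative sketch (ours, not the paper's): models with matching training
# loss can still compute different functions, and so generalise differently.
import torch
import torch.nn as nn

x_train = torch.linspace(-1, 1, 8).unsqueeze(1)
y_train = torch.sin(3 * x_train)

def fit(width, seed):
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(5000):
        opt.zero_grad()
        loss = ((net(x_train) - y_train) ** 2).mean()
        loss.backward()
        opt.step()
    return net, loss.item()

net_a, loss_a = fit(width=16, seed=1)
net_b, loss_b = fit(width=256, seed=2)
print(f"training losses: {loss_a:.2e} vs {loss_b:.2e}")   # both near zero

x_ood = torch.tensor([[2.5]])   # a query outside the training interval [-1, 1]
with torch.no_grad():
    print("off-distribution predictions:", net_a(x_ood).item(), net_b(x_ood).item())
```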
Dynamics of Transient Structure in In-Context Linear Regression Transformers
Carroll, Liam, Hoogland, Jesse, Farrugia-Roberts, Matthew, Murfet, Daniel
Modern deep neural networks display striking examples of rich internal computational structure. Uncovering principles governing the development of such structure is a priority for the science of deep learning. In this paper, we explore the transient ridge phenomenon: when transformers are trained on in-context linear regression tasks with intermediate task diversity, they initially behave like ridge regression before specializing to the tasks in their training distribution. This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis. Further, we draw on the theory of Bayesian internal model selection to suggest a general explanation for the phenomena of transient structure in transformers, based on an evolving tradeoff between loss and complexity. We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
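As a rough rendering of the two kinds of solution discussed here (the distributional details below are our assumptions, not necessarily the paper's setup), the sketch contrasts the general ridge-regression predictor with a predictor specialized to a finite pool of training tasks.

```python
# Sketch (assumptions ours): in-context linear regression with tasks w ~ N(0, I).
# "General" solution: ridge regression on the context.
# "Specialized" solution: a Bayesian average over a finite pool of training tasks.
import numpy as np

rng = np.random.default_rng(0)
d, k, T, sigma = 4, 8, 16, 0.25          # dimension, context length, #train tasks, noise std

train_tasks = rng.normal(size=(T, d))    # the finite task pool seen during training
w_true = train_tasks[3]                  # the query context is generated by a pool task
X = rng.normal(size=(k, d))
y = X @ w_true + sigma * rng.normal(size=k)
x_query = rng.normal(size=d)

# Ridge regression = posterior mean under a Gaussian task prior.
w_ridge = np.linalg.solve(X.T @ X + sigma**2 * np.eye(d), X.T @ y)
pred_ridge = x_query @ w_ridge

# Specialized predictor: weight each training task by its likelihood on the context.
log_post = -((y[None, :] - train_tasks @ X.T) ** 2).sum(axis=1) / (2 * sigma**2)
post = np.exp(log_post - log_post.max())
post /= post.sum()
pred_specialized = x_query @ (post @ train_tasks)

print(pred_ridge, pred_specialized, x_query @ w_true)
```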
Open Problems in Mechanistic Interpretability
Sharkey, Lee, Chughtai, Bilal, Batson, Joshua, Lindsey, Jack, Wu, Jeff, Bushnaq, Lucius, Goldowsky-Dill, Nicholas, Heimersheim, Stefan, Ortega, Alejandro, Bloom, Joseph, Biderman, Stella, Garriga-Alonso, Adria, Conmy, Arthur, Nanda, Neel, Rumbelow, Jessica, Wattenberg, Martin, Schoots, Nandi, Miller, Joseph, Michaud, Eric J., Casper, Stephen, Tegmark, Max, Saunders, William, Bau, David, Todd, Eric, Geiger, Atticus, Geva, Mor, Hoogland, Jesse, Murfet, Daniel, McGrath, Tom
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing. This review collects the perspectives of its various authors and represents a synthesis of their views by Apollo Research on behalf of Schmidt Sciences. The perspectives presented here do not necessarily reflect the views of any individual author or the institutions with which they are affiliated.
Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient
Wang, George, Hoogland, Jesse, van Wingerden, Stan, Furman, Zach, Murfet, Daniel
Structure in the data distribution has long been recognized as central to the development of internal structure in artificial and biological neural networks (Rumelhart et al., 1986; Olshausen & Field, 1996; Rogers & McClelland, 2004). Recent observations have renewed interest in this topic: language models progress through distinct stages of development during training, acquiring increasingly sophisticated linguistic and reasoning abilities in ways that seem to reflect the structure of the data distribution (Olsson et al., 2022; Chen et al., 2024; Belrose et al., 2024; Tigges et al., 2024; Edelman et al., 2024; Hoogland et al., 2024). A deeper understanding of how structure in the data determines internal structure in trained models requires tools that provide information about which components of a model are being shaped in response to what structure in the data distribution. Our foundation for the study of such questions begins with the local learning coefficient (LLC; Lau et al. 2023) from singular learning theory (SLT; Watanabe 2009), which is a measure of model complexity. In this paper, we introduce the refined local learning coefficient (rLLC), which measures the complexity of a component of the model with respect to an arbitrary data distribution. We focus mainly on the rLLCs of individual attention heads and demonstrate the utility of these metrics in studying the progressive differentiation and specialization of heads. The diversity of attention heads at the end of training has been established in recent years through mechanistic interpretability, which has provided numerous examples of attention heads that appear to have specialized functions, including previous-token heads (Voita et al., 2019; Clark et al., 2019) and induction heads (Olsson et al., 2022) among other kinds (Wang et al., 2023; Gould et al., 2024).
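Schematically, and in our own notation rather than necessarily the paper's, a weight- and data-refined LLC for an attention head $h$ can be written as an SGLD-based estimate in which only the head's weights $w_h$ are resampled, the remaining weights are frozen at their trained values $w^*_{-h}$, and the empirical loss $L_n^{q}$ is computed on samples from a chosen distribution $q$:
\[
\hat\lambda_h(w^*; q) \;=\; n\beta\left( \mathbb{E}^{\beta}_{w_h \mid w_h^*}\!\left[ L_n^{q}(w_h, w^*_{-h}) \right] - L_n^{q}(w^*) \right), \qquad \beta = \frac{1}{\log n},
\]
where the expectation is taken over a localized, tempered posterior in the head's weights.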
The Developmental Landscape of In-Context Learning
Hoogland, Jesse, Wang, George, Farrugia-Roberts, Matthew, Carroll, Liam, Wei, Susan, Murfet, Daniel
We show that in-context learning emerges in transformers in discrete developmental stages, when they are trained on either language modeling or linear regression tasks. We introduce two methods for detecting the milestones that separate these stages, by probing the geometry of the population loss in both parameter space and function space. We study the stages revealed by these new methods using a range of behavioral and structural metrics to establish their validity.
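As a gesture at the function-space side of this probing (our own minimal construction, not the paper's code), one can evaluate each training checkpoint on a fixed probe batch, stack the flattened outputs into a matrix, and examine the leading principal components of the resulting trajectory.

```python
# Sketch (ours): principal components of a model's trajectory through function
# space, approximated by its outputs on a fixed probe batch at each checkpoint.
import numpy as np

def trajectory_pca(checkpoint_outputs, num_components=3):
    """checkpoint_outputs: (num_checkpoints, num_probe_outputs); row t holds the
    flattened outputs of checkpoint t on the probe batch."""
    F = np.asarray(checkpoint_outputs, dtype=float)
    F = F - F.mean(axis=0, keepdims=True)                 # center the trajectory
    U, S, Vt = np.linalg.svd(F, full_matrices=False)
    coords = U[:, :num_components] * S[:num_components]   # per-checkpoint PC coordinates
    explained = S[:num_components] ** 2 / (S ** 2).sum()  # explained variance ratios
    return coords, explained

# Toy usage with a fake trajectory of 100 checkpoints and 50 probe outputs.
fake = np.cumsum(np.random.default_rng(0).normal(size=(100, 50)), axis=0)
coords, explained = trajectory_pca(fake)
print(coords.shape, explained)
```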
Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition
Chen, Zhongtian, Lau, Edmund, Mendel, Jake, Wei, Susan, Murfet, Daniel
The apparent simplicity of the Toy Model of Superposition (TMS) proposed in Elhage et al. (2022) conceals a remarkably intricate phase structure. During training, a plateau in the loss is often followed by a sudden discrete drop, suggesting some development in the network's internal structure. To shed light on these transitions and their significance, this paper examines the dynamical transitions in TMS during SGD training, connecting them to phase transitions of the Bayesian posterior with respect to sample size n. While the former transitions have been observed in several recent works in deep learning (Olsson et al., 2022; McGrath et al., 2022; Wei et al., 2022a), their formal status has remained elusive. In contrast, phase transitions of the Bayesian posterior are mathematically well-defined in Singular Learning Theory (SLT) (Watanabe, 2009). Using SLT, we can show formally that the Bayesian posterior is subject to an internal model selection mechanism in the following sense: for small training sample size n, the posterior prefers critical points with low complexity but potentially high loss; for large n, the posterior instead prefers low-loss critical points at the cost of higher complexity. The measure of complexity here is very specific: it is the local learning coefficient, λ, of the critical points, first alluded to by Watanabe (2009, 7.6) and clarified recently in Lau et al. (2023). We can think of this internal model selection as a discrete dynamical process: at various critical sample sizes the posterior concentration "jumps" from one region of parameter space to another.
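The loss/complexity tradeoff can be summarized with the standard SLT asymptotics for the free energy of a neighbourhood $W_\alpha$ of a critical point with loss $L_\alpha$ and local learning coefficient $\lambda_\alpha$ (we state the leading-order form in our own notation):
\[
F_n(W_\alpha) \;\approx\; n L_\alpha + \lambda_\alpha \log n,
\qquad
F_n(W_\alpha) - F_n(W_\beta) \;\approx\; n\,(L_\alpha - L_\beta) + (\lambda_\alpha - \lambda_\beta)\log n .
\]
For small $n$ the complexity difference $(\lambda_\alpha - \lambda_\beta)\log n$ can dominate, favouring the simpler region even if its loss is higher, while for large $n$ the loss difference $n(L_\alpha - L_\beta)$ eventually wins; the posterior preference switches near the critical sample size where the two terms balance.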
Quantifying degeneracy in singular models via the learning coefficient
Lau, Edmund, Murfet, Daniel, Wei, Susan
Deep neural networks (DNNs) are singular statistical models which exhibit complex degeneracies. In this work, we illustrate how a quantity known as the \emph{learning coefficient}, introduced in singular learning theory, precisely quantifies the degree of degeneracy in deep neural networks. Importantly, we demonstrate that degeneracy in DNNs cannot be accounted for by simply counting the number of "flat" directions. We propose a computationally scalable approximation of a localized version of the learning coefficient using stochastic gradient Langevin dynamics. To validate our approach, we demonstrate its accuracy in low-dimensional models with known theoretical values. Importantly, the local learning coefficient can correctly recover the ordering of degeneracy between various parameter regions of interest. An experiment on MNIST shows the local learning coefficient can reveal the inductive bias of stochastic optimizers for more or less degenerate critical points.
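The following is a minimal, self-contained sketch of this kind of SGLD-based estimator; the toy model, the hyperparameters, and the exact form of the localisation term are our illustrative choices rather than the paper's experimental setup.

```python
# Sketch (illustrative choices ours): estimating a local learning coefficient
# with SGLD around a trained parameter w*. Here MSE stands in for the negative
# log-likelihood up to an overall scale.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
n = 1024
X = torch.randn(n, 2)
y = torch.sin(X[:, :1]) + 0.1 * torch.randn(n, 1)
model = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()

# Train to a local minimum w* of the empirical loss L_n.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()
w_star = [p.detach().clone() for p in model.parameters()]
with torch.no_grad():
    L_star = loss_fn(model(X), y).item()

# SGLD targeting a localized, tempered posterior
#   ~ exp(-n * beta * L_n(w) - (gamma / 2) * ||w - w*||^2),  beta = 1 / log(n).
beta = 1.0 / math.log(n)
eps, gamma, steps, batch = 1e-5, 100.0, 2000, 256
chain_losses = []
for _ in range(steps):
    idx = torch.randint(0, n, (batch,))
    model.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    with torch.no_grad():
        for p, p0 in zip(model.parameters(), w_star):
            drift = n * beta * p.grad + gamma * (p - p0)
            p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
        chain_losses.append(loss_fn(model(X), y).item())

# lambda_hat = n * beta * (E_SGLD[L_n(w)] - L_n(w*)), averaging the second half of the chain.
tail = chain_losses[len(chain_losses) // 2:]
lambda_hat = n * beta * (sum(tail) / len(tail) - L_star)
print(f"estimated local learning coefficient: {lambda_hat:.2f}")
```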
Geometry of Program Synthesis
Clift, James, Murfet, Daniel, Wallbridge, James
We re-evaluate universal computation based on the synthesis of Turing machines. This leads to a view of programs as singularities of analytic varieties or, equivalently, as phases of the Bayesian posterior of a synthesis problem. This new point of view reveals unexplored directions of research in program synthesis, of which neural networks are a subset, for example in relation to phase transitions, complexity and generalisation. We also lay the empirical foundations for these new directions.

When we say the code on the description tape of the physical UTM "is" w, what we actually mean, adopting the thermodynamic language, is that the system is in a phase (a local minimum of the free energy) including the microstate c we associate to w. However, when the system is in this phase its microstate is not equal to c but rather undergoes rapid spontaneous transitions between many microstates "near" c. So in any possible physical realisation of a UTM, a program is realised by a phase of the physical system. Does this have any computational significance?
Deep Learning is Singular, and That's Good
Murfet, Daniel, Wei, Susan, Gong, Mingming, Li, Hui, Gell-Redman, Jesse, Quella, Thomas
In singular models, the optimal set of parameters forms an analytic set with singularities and classical statistical inference cannot be applied to such models. This is significant for deep learning as neural networks are singular and thus "dividing" by the determinant of the Hessian or employing the Laplace approximation are not appropriate. Despite its potential for addressing fundamental issues in deep learning, singular learning theory appears to have made little inroads into the developing canon of deep learning theory. Via a mix of theory and experiment, we present an invitation to singular learning theory as a vehicle for understanding deep learning and suggest important future work to make singular learning theory directly applicable to how deep learning is performed in practice.
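To spell out why "dividing by the determinant of the Hessian" breaks down, recall the Laplace/BIC expansion of the Bayesian free energy for a regular model and the asymptotic that replaces it in singular learning theory (standard forms, stated in our notation; see Watanabe 2009 for precise hypotheses). For a regular model with $d$ parameters, prior $\varphi$, and maximum likelihood estimate $\hat w$,
\[
F_n = -\log \int e^{-n L_n(w)}\,\varphi(w)\, dw \;\approx\; n L_n(\hat w) + \frac{d}{2}\log n + \frac{1}{2}\log\det \nabla^2 L_n(\hat w) + O(1),
\]
which is meaningless when the Hessian is degenerate, $\det \nabla^2 L_n(\hat w) = 0$, as it is for neural networks. Singular learning theory gives instead
\[
F_n \;=\; n L_n(w_0) + \lambda \log n - (m - 1)\log\log n + O_p(1),
\]
where the learning coefficient $\lambda$ (with multiplicity $m$) replaces $d/2$, and $\lambda \le d/2$ with equality for regular models.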
Logic and the $2$-Simplicial Transformer
Clift, James, Doryn, Dmitry, Murfet, Daniel, Wallbridge, James
The most successful examples of such representations, those learned by convolutional neural networks, are structured by the scale and translational symmetries of the underlying space (e.g. a two-dimensional Euclidean space for images). It has been suggested that in humans the ability to make rich inferences based on abstract reasoning is rooted in the same neural mechanisms underlying relational reasoning in space [16, 19, 6, 7] and more specifically that abstract reasoning is facilitated by the learning of structural representations which serve to organise other learned representations in the same way that space organises the representations that enable spatial navigation [68, 41]. This raises a natural question: are there any ideas from mathematics that might be useful in designing general inductive biases for learning such structural representations? As a motivating example we take the recent progress on natural language tasks based on the Transformer architecture [66] which simultaneously learns to represent both entities (typically words) and relations between entities (for instance the relation between "cat" and "he" in the sentence "There was a cat and he liked to sleep"). These representations of relations take the form of query and key vectors governing the passing of messages between entities; messages update entity representations over several rounds of computation until the final representations reflect not just the meaning of words but also their context in a sentence.
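For readers unfamiliar with the mechanism alluded to here, the block below is a bare-bones rendering (ours) of one round of standard dot-product attention, in which query and key vectors govern the messages passed between entity representations; the paper's $2$-simplicial variant extends this to messages computed from pairs of entities.

```python
# Minimal sketch of one round of standard dot-product attention: queries and
# keys score the messages (values) passed between entity representations.
import numpy as np

def attention_update(E, Wq, Wk, Wv):
    """E: (num_entities, d_model) entity representations; Wq, Wk, Wv: projections."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # pairwise relation scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over senders
    return weights @ V                                         # aggregated message per entity

rng = np.random.default_rng(0)
d = 8
E = rng.normal(size=(5, d))                                    # five entities (e.g. tokens)
update = attention_update(E, *(rng.normal(size=(d, d)) for _ in range(3)))
print(update.shape)                                            # (5, 8)
```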