Plotting

 Goldt, Sebastian


Feature learning from non-Gaussian inputs: the case of Independent Component Analysis in high dimensions

arXiv.org Machine Learning

Deep neural networks learn structured features from complex, non-Gaussian inputs, but the mechanisms behind this process remain poorly understood. Our work is motivated by the observation that the first-layer filters learnt by deep convolutional neural networks from natural images resemble those learnt by independent component analysis (ICA), a simple unsupervised method that seeks the most non-Gaussian projections of its inputs. This similarity suggests that ICA provides a simple, yet principled model for studying feature learning. Here, we leverage this connection to investigate the interplay between data structure and optimisation in feature learning for the most popular ICA algorithm, FastICA, and stochastic gradient descent (SGD), which is used to train deep networks. We rigorously establish that FastICA requires at least $n\gtrsim d^4$ samples to recover a single non-Gaussian direction from $d$-dimensional inputs on a simple synthetic data model. We show that vanilla online SGD outperforms FastICA, and prove that the optimal sample complexity $n \gtrsim d^2$ can be reached by smoothing the loss, albeit in a data-dependent way. We finally demonstrate the existence of a search phase for FastICA on ImageNet, and discuss how the strong non-Gaussianity of said images compensates for the poor sample complexity of FastICA.


On How Iterative Magnitude Pruning Discovers Local Receptive Fields in Fully Connected Neural Networks

arXiv.org Artificial Intelligence

Iterative magnitude pruning (IMP) [1] has emerged as a powerful tool for identifying sparse subnetworks ("winning tickets") that can be trained to perform as well as the dense model they are extracted from [2, 3]. That IMP, despite its simplicity, is more robust in discovering such winning tickets than other, more complex pruning schemes [4] suggests that its iterative coarse-graining [5] is especially capable of extracting and maintaining strong inductive biases. This perspective is strengthened by observations that winning tickets discovered by IMP: 1) have properties that make them transferable across related tasks [6-13] and architectures [14]; 2) can outperform dense models on classes with limited data [15]; 3) have less overconfident predictions [16]. The first direct evidence for IMP discovering good inductive biases came from studying the winning tickets extracted by IMP in fully connected neural networks (FCNs) [17]. Pellegrini and Biroli (2022) [17] found that the sparse subnetworks identified by IMP had local receptive field (RF) structure (Figure 1A), an architectural feature found in visual cortex [18] and convolutional neural networks (CNNs) [19]. Comparing IMP derived winning tickets with the sparse subnetworks found by oneshot pruning (Figure 1B), Pellegrini and Biroli (2022) [17] argued that the iterative nature of IMP was essential for refining the local RF structure. However, to-date, an understanding of how IMP, a pruning method based purely on the magnitude of the network parameters, is able to "sift out" non-localized weights remains unknown. Resolving this will not only shed light on the effect of IMP on FCNs, but also will provide new insight on the success of IMP more broadly.


A distributional simplicity bias in the learning dynamics of transformers

arXiv.org Artificial Intelligence

The remarkable capability of over-parameterised neural networks to generalise effectively has been explained by invoking a ``simplicity bias'': neural networks prevent overfitting by initially learning simple classifiers before progressing to more complex, non-linear functions. While simplicity biases have been described theoretically and experimentally in feed-forward networks for supervised learning, the extent to which they also explain the remarkable success of transformers trained with self-supervised techniques remains unclear. In our study, we demonstrate that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions while continuing to learn high-degree interactions. To conduct this analysis, we develop a procedure to generate \textit{clones} of a given natural language data set, which rigorously capture the interactions between tokens up to a specified order. This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond.


Sliding down the stairs: how correlated latent variables accelerate learning with neural networks

arXiv.org Machine Learning

Neural networks extract features from data using stochastic gradient descent (SGD). In particular, higher-order input cumulants (HOCs) are crucial for their performance. However, extracting information from the $p$th cumulant of $d$-dimensional inputs is computationally hard: the number of samples required to recover a single direction from an order-$p$ tensor (tensor PCA) using online SGD grows as $d^{p-1}$, which is prohibitive for high-dimensional inputs. This result raises the question of how neural networks extract relevant directions from the HOCs of their inputs efficiently. Here, we show that correlations between latent variables along the directions encoded in different input cumulants speed up learning from higher-order correlations. We show this effect analytically by deriving nearly sharp thresholds for the number of samples required by a single neuron to weakly-recover these directions using online SGD from a random start in high dimensions. Our analytical results are confirmed in simulations of two-layer neural networks and unveil a new mechanism for hierarchical learning in neural networks.


Learning from higher-order statistics, efficiently: hypothesis tests, random features, and neural networks

arXiv.org Machine Learning

Neural networks excel at discovering statistical patterns in high-dimensional data sets. In practice, higher-order cumulants, which quantify the non-Gaussian correlations between three or more variables, are particularly important for the performance of neural networks. But how efficient are neural networks at extracting features from higher-order cumulants? We study this question in the spiked cumulant model, where the statistician needs to recover a privileged direction or "spike" from the order-$p\ge 4$ cumulants of~$d$-dimensional inputs. We first characterise the fundamental statistical and computational limits of recovering the spike by analysing the number of samples~$n$ required to strongly distinguish between inputs from the spiked cumulant model and isotropic Gaussian inputs. We find that statistical distinguishability requires $n\gtrsim d$ samples, while distinguishing the two distributions in polynomial time requires $n \gtrsim d^2$ samples for a wide class of algorithms, i.e. those covered by the low-degree conjecture. These results suggest the existence of a wide statistical-to-computational gap in this problem. Numerical experiments show that neural networks learn to distinguish the two distributions with quadratic sample complexity, while "lazy" methods like random features are not better than random guessing in this regime. Our results show that neural networks extract information from higher-order correlations in the spiked cumulant model efficiently, and reveal a large gap in the amount of data required by neural networks and random features to learn from higher-order cumulants.


What does self-attention learn from Masked Language Modelling?

arXiv.org Machine Learning

Transformers are neural networks which revolutionised natural language processing and machine learning. They process sequences of inputs, like words, using a mechanism called self-attention, which is trained via masked language modelling (MLM). In MLM, a word is randomly masked in an input sequence, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear what type of data distribution self-attention can learn efficiently. Here, we show analytically that if one decouples the treatment of word positions and embeddings, a single layer of self-attention learns the conditionals of a generalised Potts model with interactions between sites and Potts colours. Moreover, we show that training this neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method, well known in statistical physics. Using this mapping, we compute the generalisation error of self-attention in a model scenario analytically using the replica method.


The impact of memory on learning sequence-to-sequence tasks

arXiv.org Machine Learning

The recent success of neural networks in natural language processing has drawn renewed attention to learning sequence-to-sequence (seq2seq) tasks. While there exists a rich literature that studies classification and regression tasks using solvable models of neural networks, seq2seq tasks have not yet been studied from this perspective. Here, we propose a simple model for a seq2seq task that has the advantage of providing explicit control over the degree of memory, or non-Markovianity, in the sequences -- the stochastic switching-Ornstein-Uhlenbeck (SSOU) model. We introduce a measure of non-Markovianity to quantify the amount of memory in the sequences. For a minimal auto-regressive (AR) learning model trained on this task, we identify two learning regimes corresponding to distinct phases in the stationary state of the SSOU process. These phases emerge from the interplay between two different time scales that govern the sequence statistics. Moreover, we observe that while increasing the integration window of the AR model always improves performance, albeit with diminishing returns, increasing the non-Markovianity of the input sequences can improve or degrade its performance. Finally, we perform experiments with recurrent and convolutional neural networks that show that our observations carry over to more complicated neural network architectures.


Attacks on Online Learners: a Teacher-Student Analysis

arXiv.org Machine Learning

Machine learning models are famously vulnerable to adversarial attacks: small ad-hoc perturbations of the data that can catastrophically alter the model predictions. While a large literature has studied the case of test-time attacks on pre-trained models, the important case of attacks in an online learning setting has received little attention so far. In this work, we use a control-theoretical perspective to study the scenario where an attacker may perturb data labels to manipulate the learning dynamics of an online learner. We perform a theoretical analysis of the problem in a teacher-student setup, considering different attack strategies, and obtaining analytical results for the steady state of simple linear learners. These results enable us to prove that a discontinuous transition in the learner's accuracy occurs when the attack strength exceeds a critical threshold. We then study empirically attacks on learners with complex architectures using real data, confirming the insights of our theoretical analysis. Our findings show that greedy attacks can be extremely efficient, especially when data stream in small batches.


The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions

arXiv.org Artificial Intelligence

Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much theory of RL has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty - analogous to annealing schemes and curricula during training in RL - and show that the model exhibits rich behaviour, including delayed learning under sparse rewards; a variety of learning regimes depending on reward baselines; and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL.


Quantifying lottery tickets under label noise: accuracy, calibration, and complexity

arXiv.org Artificial Intelligence

Pruning deep neural networks is a widely used strategy to alleviate the computational burden in machine learning. Overwhelming empirical evidence suggests that pruned models retain very high accuracy even with a tiny fraction of parameters. However, relatively little work has gone into characterising the small pruned networks obtained, beyond a measure of their accuracy. In this paper, we use the sparse double descent approach to identify univocally and characterise pruned models associated with classification tasks. We observe empirically that, for a given task, iterative magnitude pruning (IMP) tends to converge to networks of comparable sizes even when starting from full networks with sizes ranging over orders of magnitude. We analyse the best pruned models in a controlled experimental setup and show that their number of parameters reflects task difficulty and that they are much better than full networks at capturing the true conditional probability distribution of the labels. On real data, we similarly observe that pruned models are less prone to overconfident predictions. Our results suggest that pruned models obtained via IMP not only have advantageous computational properties but also provide a better representation of uncertainty in learning.