
Learning Theory Can (Sometimes) Explain Generalisation in Graph Neural Networks

Neural Information Processing Systems

In recent years, several results in the supervised learning setting suggested that classical statistical learning-theoretic measures, such as VC dimension, do not adequately explain the performance of deep learning models, which prompted a slew of work in the infinite-width and iteration regimes. However, there is little theoretical explanation for the success of neural networks beyond the supervised setting. In this paper we argue that, under some distributional assumptions, classical learning-theoretic measures can sufficiently explain generalisation for graph neural networks in the transductive setting. In particular, we provide a rigorous analysis of the performance of neural networks in the context of transductive inference, specifically by analysing the generalisation properties of graph convolutional networks for the problem of node classification. While VC-dimension results in trivial generalisation error bounds in this setting as well, we show that transductive Rademacher complexity can explain the generalisation properties of graph convolutional networks for stochastic block models. We further use the generalisation error bounds based on transductive Rademacher complexity to demonstrate the role of graph convolutions and network architectures in achieving smaller generalisation error, and provide insights into when the graph structure can help in learning.
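For context, the transductive Rademacher complexity invoked in this abstract is standardly defined (following El-Yaniv and Pechyony) for a hypothesis class over m labelled and u unlabelled points as below; the paper's exact normalisation may differ:

```latex
% Transductive Rademacher complexity (standard definition, stated here
% as background; the paper's precise variant may differ).
\mathfrak{R}_{m+u}(\mathcal{H})
  = \left(\frac{1}{m} + \frac{1}{u}\right)
    \mathbb{E}_{\boldsymbol{\sigma}}
    \left[\, \sup_{h \in \mathcal{H}} \sum_{i=1}^{m+u} \sigma_i\, h(x_i) \right],
\qquad
\sigma_i =
\begin{cases}
  +1 & \text{with probability } p,\\
  -1 & \text{with probability } p,\\
  \phantom{+}0 & \text{with probability } 1 - 2p,
\end{cases}
\quad p \in [0, \tfrac{1}{2}].
```

Unlike the inductive case, the "Rademacher" variables may be zero, reflecting that each point in the fixed sample can play the role of either a labelled or an unlabelled example.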


How does Weight Correlation Affect Generalisation Ability of Deep Neural Networks?

Neural Information Processing Systems

This paper studies the novel concept of weight correlation in deep neural networks and discusses its impact on the networks' generalisation ability. For fully-connected layers, the weight correlation is defined as the average cosine similarity between the weight vectors of neurons; for convolutional layers, it is defined as the cosine similarity between filter matrices. Theoretically, we show that weight correlation can, and should, be incorporated into the PAC-Bayesian framework for the generalisation of neural networks, and that the resulting generalisation bound is monotonic with respect to the weight correlation. We formulate a new complexity measure, which lifts the PAC-Bayes measure with weight correlation, and experimentally confirm that it ranks the generalisation errors of a set of networks more precisely than existing measures. More importantly, we develop a new regulariser for training, and provide extensive experiments showing that the generalisation error can be greatly reduced with our novel approach.
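As a concrete illustration of the fully-connected case, here is a minimal sketch of the quantity described above, assuming plain averaging of cosine similarities over distinct neuron pairs (the paper's exact definition, e.g. any absolute value or weighting, may differ):

```python
import numpy as np

def weight_correlation(W):
    """Average cosine similarity between the weight vectors (rows of W)
    of the neurons in a fully-connected layer.

    Illustrative sketch only -- the paper's exact definition may differ
    (e.g. it may average absolute cosine similarities).
    """
    U = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm rows
    G = U @ U.T                                       # pairwise cosine similarities
    n = W.shape[0]
    # Exclude the diagonal (self-similarity = 1) and average over ordered pairs.
    return (G.sum() - n) / (n * (n - 1))
```

For example, parallel weight vectors yield a correlation of 1, while mutually orthogonal ones yield 0.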


Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation

Barbier, Jean, Camilli, Francesco, Nguyen, Minh-Toan, Pastore, Mauro, Skerk, Rudy

arXiv.org Machine Learning

For four decades, statistical physics has provided a framework to analyse neural networks. A long-standing question remained about its capacity to tackle deep learning models that capture rich feature-learning effects, thus going beyond the narrow networks or kernel methods analysed until now. We answer this question positively through the study of the supervised learning of a multi-layer perceptron. Importantly, (i) its width scales as the input dimension, making it more prone to feature learning than ultra-wide networks, and more expressive than narrow ones or ones with fixed embedding layers; and (ii) we focus on the challenging interpolation regime where the numbers of trainable parameters and data are comparable, which forces the model to adapt to the task. We consider the matched teacher-student setting. We provide the fundamental limits of learning random deep neural network targets and identify the sufficient statistics describing what is learnt by an optimally trained network as the data budget increases. A rich phenomenology emerges with various learning transitions. With enough data, optimal performance is attained through the model's "specialisation" towards the target, but it can be hard to reach for training algorithms, which get attracted by sub-optimal solutions predicted by the theory. Specialisation occurs inhomogeneously across layers, propagating from shallow towards deep ones, and also across neurons within each layer. Furthermore, deeper targets are harder to learn. Despite its simplicity, the Bayes-optimal setting provides insights into how depth, non-linearity and finite (proportional) width influence neural networks in the feature-learning regime, insights that are potentially relevant in much more general settings.
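The matched teacher-student setup described above can be sketched as follows. This is a hypothetical minimal instance: the depth, the tanh activation, and the 1/sqrt scalings are illustrative choices, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 100          # input dimension
width = d        # hidden width proportional to d (the proportional-width regime)
n_samples = 200  # data budget comparable to the number of parameters

# Random teacher: a multi-layer perceptron with independent Gaussian weights.
W1 = rng.standard_normal((width, d)) / np.sqrt(d)
W2 = rng.standard_normal((width, width)) / np.sqrt(width)
a = rng.standard_normal(width) / np.sqrt(width)

def teacher(x):
    """Target value generated by the random deep network."""
    return a @ np.tanh(W2 @ np.tanh(W1 @ x))

# Supervised dataset drawn from the teacher; in the matched setting, a
# student with the same architecture is then trained on (X, y).
X = rng.standard_normal((n_samples, d))
y = np.array([teacher(x) for x in X])
```

The interpolation regime corresponds to choosing n_samples on the order of the number of teacher parameters, so the student must genuinely adapt its features to fit the data.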


4e8eaf897c638d519710b1691121f8cb-Supplemental.pdf

Neural Information Processing Systems

Supplementary material for 'Locality defeats the curse of dimensionality in convolutional teacher-student scenarios'. The appendix provides additional details on the derivation of Eq. (8), including the case where all parameters are initialised independently from a standard Normal distribution (Eq. (13)). It then proves the eigendecompositions introduced in Lemma 3.3 and Lemma 3.4: in each case, it first establishes the orthonormality of the eigenfunctions and then shows that the eigenfunctions and eigenvalues defined in Eq. (17) and Eq. (19) satisfy the kernel eigenproblem. Finally, both lemmas are extended to kernels with overlapping patches.
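For orientation, the eigendecompositions referred to above have the generic Mercer form, stated here as standard background (the supplement's specific kernels, measures and index sets are not reproduced):

```latex
% Mercer-type kernel eigendecomposition: orthonormal eigenfunctions
% \phi_k with eigenvalues \lambda_k satisfying the kernel eigenproblem.
\int K(x, x')\, \phi_k(x')\, d\mu(x') = \lambda_k\, \phi_k(x),
\qquad
\langle \phi_j, \phi_k \rangle_{L^2(\mu)} = \delta_{jk},
\qquad
K(x, x') = \sum_k \lambda_k\, \phi_k(x)\, \phi_k(x').
```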




is Lipschitz-continuous and

Neural Information Processing Systems

We cannot solve the ODEs in closed form; they do not have a known closed-form solution. It is not always true, and we still have to understand the range of cases in which it is. The reviewer is right that the "distributional" fixed points corresponding to the mean-field analysis persist even down to relatively small system sizes. We intended to verify the qualitative validity of our result in Eq. (10).


