Rotondo, Pietro
Proportional infinite-width infinite-depth limit for deep linear neural networks
Bassetti, Federico, Ladelli, Lucia, Rotondo, Pietro
We study the distributional properties of linear neural networks with random parameters in the context of large networks, where the number of layers diverges in proportion to the number of neurons per layer. Prior works have shown that in the infinite-width regime, where the number of neurons per layer grows to infinity while the depth remains fixed, neural networks converge to a Gaussian process, known as the Neural Network Gaussian Process. However, this Gaussian limit sacrifices descriptive power, as it lacks the ability to learn dependent features and produce output correlations that reflect observed labels. Motivated by these limitations, we explore the joint proportional limit in which both depth and width diverge but maintain a constant ratio, yielding a non-Gaussian distribution that retains correlations between outputs. Our contribution extends previous works by rigorously characterizing, for linear activation functions, the limiting distribution as a nontrivial mixture of Gaussians.
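A minimal numerical sketch of the regime discussed above (an illustration of the setting, not the paper's analysis): draw deep linear networks with i.i.d. Gaussian weights, keep the depth a fixed fraction of the width, and inspect the distribution of one output coordinate at a fixed input. The width, the depth-to-width ratio alpha, and the 1/width weight variance below are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def deep_linear_output(x, width, depth, rng):
    """Propagate the input through `depth` random linear layers of size
    `width` with i.i.d. N(0, 1/width) weights (an illustrative scaling)."""
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        h = W @ h
    return h

width = 50
alpha = 0.5                      # illustrative depth-to-width ratio
depth = int(alpha * width)
x = rng.normal(size=width)

# Sample many independent networks at the same input and look at one output
# coordinate: in the proportional regime the sample is visibly heavier-tailed
# than the Gaussian obtained at fixed depth and infinite width.
samples = np.array([deep_linear_output(x, width, depth, rng)[0] for _ in range(1000)])
print("kurtosis (3 for a Gaussian):", np.mean(samples**4) / np.mean(samples**2) ** 2)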
Feature learning in finite-width Bayesian deep linear networks with multiple outputs and convolutional layers
Bassetti, Federico, Gherardi, Marco, Ingrosso, Alessandro, Pastore, Mauro, Rotondo, Pietro
Deep linear networks have been extensively studied, as they provide simplified models of deep learning. However, little is known in the case of finite-width architectures with multiple outputs and convolutional layers. In this manuscript, we provide rigorous results for the statistics of functions implemented by the aforementioned class of networks, thus moving closer to a complete characterization of feature learning in the Bayesian setting. Our results include: (i) an exact and elementary non-asymptotic integral representation for the joint prior distribution over the outputs, given in terms of a mixture of Gaussians; (ii) an analytical formula for the posterior distribution in the case of squared error loss function (Gaussian likelihood); (iii) a quantitative description of the feature learning infinite-width regime, using large deviation theory. From a physical perspective, deep architectures with multiple outputs or convolutional layers represent different manifestations of kernel shape renormalization, and our work provides a dictionary that translates this physics intuition and terminology into rigorous Bayesian statistics.
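As a hedged illustration of the mixture-of-Gaussians structure in point (i): conditionally on the hidden weights, the outputs of a linear network with a Gaussian readout are jointly Gaussian with a weight-dependent kernel, and marginalizing over the hidden weights yields a mixture over such kernels. The sketch below uses a one-hidden-layer linear network with a single output and arbitrary small widths; it is a Monte-Carlo illustration, not the paper's exact integral representation.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative one-hidden-layer linear network with a single output:
#   f(x) = v . (W x) / sqrt(n1 * n0),  entries of W and v i.i.d. standard normal.
n0, n1 = 20, 50                      # input and hidden widths (arbitrary choices)
X = rng.normal(size=(5, n0))         # a few inputs at which the prior is examined

def conditional_kernel(W):
    """Given the hidden weights W, the outputs at the rows of X are jointly
    Gaussian (over the readout v) with this covariance: one mixture component."""
    Phi = X @ W.T / np.sqrt(n0 * n1)
    return Phi @ Phi.T

# Monte-Carlo over W: the prior over the outputs is the mixture of zero-mean
# Gaussians with these random covariance matrices.
kernels = [conditional_kernel(rng.normal(size=(n1, n0))) for _ in range(1000)]
print("average component kernel (estimates the infinite-width kernel X X^T / n0):")
print(np.mean(kernels, axis=0).round(2))
print("spread of K[0, 0] across components (shrinks as n1 grows):",
      np.std([K[0, 0] for K in kernels]).round(3))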
Inversion dynamics of class manifolds in deep learning reveals tradeoffs underlying generalisation
Ciceri, Simone, Cassani, Lorenzo, Pizzochero, Pierre, Osella, Matteo, Rotondo, Pietro, Gherardi, Marco
Supervised deep learning excels in the baffling task of disentangling the training data, so as to reach near-zero training error, while still achieving good accuracy on the classification of unseen data. How this feat is achieved, particularly in relation to the geometry and structure of the training data, is currently a topic of debate and partly still an open question [1-6]. Activations of hidden layers in response to input examples, i.e., the internal representations of the data, evolve during training to facilitate eventual linear separation in the last layer. This requires a gradual segregation of points belonging to different classes, in what can be pictured as a disentangling motion between their class manifolds. Segregation of class manifolds is a powerful conceptualisation that informs the design of distance-based losses in metric learning and contrastive learning [7-11] and underlies several approaches aimed at quantifying expressivity and generalisation, in artificial neural networks as well as in neuroscience [12-17]. Several recent efforts have leveraged this picture to characterise information processing along the layers of a deep network, particularly focusing on metrics such as intrinsic dimensionality and curvature [18-22]. In Ref. [19], for instance, two descriptors of manifold geometry, related to the intrinsic dimension and to the extension of the manifolds, are shown to undergo a dramatic reduction as a result of training in deep convolutional neural networks. Such shrinking (together with intermanifold correlations, which we neglect in this manuscript) decisively supports the model's capacity in a memorisation task. Yet, this appears to be just one side of the coin.
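The two descriptors mentioned in connection with Ref. [19] can be illustrated with generic proxies computed directly from a layer's activations for a single class: the sketch below uses a PCA participation ratio for the dimension and a radius of gyration for the extension. These are standard proxies chosen here for illustration, not necessarily the estimators used in Ref. [19].

import numpy as np

def manifold_descriptors(activations):
    """Generic proxies for the two descriptors of class-manifold geometry:
    a PCA participation-ratio estimate of the dimension and a radius-of-gyration
    estimate of the extension (rows of `activations` are examples of one class,
    columns are hidden units)."""
    centered = activations - activations.mean(axis=0)
    eig = np.clip(np.linalg.eigvalsh(np.cov(centered, rowvar=False)), 0.0, None)
    dimension = eig.sum() ** 2 / (eig ** 2).sum()   # participation ratio
    extension = np.sqrt(eig.sum())                  # radius of gyration
    return dimension, extension

# Example on synthetic activations of a single class:
rng = np.random.default_rng(2)
acts = rng.normal(size=(500, 64)) @ np.diag(np.linspace(1.0, 0.01, 64))
print(manifold_descriptors(acts))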
Intrinsic dimension estimation for locally undersampled data
Erba, Vittorio, Gherardi, Marco, Rotondo, Pietro
High-dimensional data are ubiquitous in contemporary science and finding methods to compress them is one of the primary goals of machine learning. Given a dataset lying in a high-dimensional space (in principle, hundreds to several thousand dimensions), it is often useful to project it onto a lower-dimensional manifold, without loss of information. Identifying the minimal dimension of such a manifold is a challenging problem known in the literature as intrinsic dimension estimation (IDE). Traditionally, most IDE algorithms are either based on multiscale principal component analysis (PCA) or on the notion of correlation dimension (and, more generally, on k-nearest-neighbors distances). These methods are affected, in different ways, by a severe curse of dimensionality. In particular, none of the existing algorithms can provide accurate ID estimates in the extreme locally undersampled regime, i.e., in the limit where the number of samples in any local patch of the manifold is less than (or of the same order as) the ID of the dataset. Here we introduce a new ID estimator that leverages simple properties of the tangent space of a manifold to overcome these shortcomings. The method is based on the full correlation integral, going beyond the small-radius limit used for the estimation of the correlation dimension. Our estimator alleviates the extreme undersampling problem, intractable with other methods. Based on this insight, we explore a multiscale generalization of the algorithm. We show that it is capable of (i) identifying multiple dimensionalities in a dataset, and (ii) providing accurate estimates of the ID of extremely curved manifolds. In particular, we test the method on manifolds generated from global transformations of high-contrast images, relevant for invariant object recognition and considered a challenge for state-of-the-art ID estimators.
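A rough, self-contained sketch of the idea behind a full-correlation-integral estimator: compute the empirical correlation integral over the whole range of pair distances and pick the dimension whose reference curve matches best. The published method fits a closed-form model curve; here, as a simplifying assumption, the reference curves are obtained by Monte-Carlo sampling of uniformly distributed points on d-spheres (SciPy is used only for pairwise distances).

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)

def full_correlation_integral(points, radii):
    """Empirical correlation integral C(r): fraction of point pairs at distance
    below r, over the full range of r rather than only the small-r regime used
    for the classical correlation dimension. Distances are rescaled by their
    mean so that curves from different datasets are comparable."""
    d = pdist(points)
    d = d / d.mean()
    return np.array([(d < r).mean() for r in radii])

def estimate_id(points, candidate_dims, radii=np.linspace(0.0, 2.0, 50), n_ref=1000):
    """Crude FCI-style estimate: pick the dimension whose reference curve best
    matches the empirical one. The reference curves here come from Monte-Carlo
    samples of uniform points on d-spheres, a simplifying assumption made to
    keep the sketch self-contained."""
    emp = full_correlation_integral(points, radii)
    errors = []
    for dim in candidate_dims:
        ref = rng.normal(size=(n_ref, dim + 1))
        ref /= np.linalg.norm(ref, axis=1, keepdims=True)     # uniform on S^dim
        errors.append(np.mean((emp - full_correlation_integral(ref, radii)) ** 2))
    return candidate_dims[int(np.argmin(errors))]

# Locally undersampled example: 50 points on a 10-sphere, embedded (approximately
# isometrically) in 100 ambient dimensions by a random linear map.
d_true = 10
pts = rng.normal(size=(50, d_true + 1))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
pts = pts @ rng.normal(size=(d_true + 1, 100)) / np.sqrt(100)
print("estimated intrinsic dimension:", estimate_id(pts, candidate_dims=list(range(2, 21))))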