eigenspectrum






A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point

Couto, Carlos, Mourão, José, Figueiredo, Mário A. T., Ribeiro, Pedro

arXiv.org Machine Learning

Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for some classes of teacher-student problems, when the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we analytically establish that for large networks the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
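The linear-network result above can be checked numerically in a simple special case: for a linear model trained with squared loss on isotropic Gaussian inputs, the Hessian at the optimum is the empirical input covariance, whose eigenvalues follow the Marchenko-Pastur law. A minimal sketch (the sizes and aspect ratio are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 500              # samples, parameters; aspect ratio q = p/n

# For squared loss L(w) = ||Xw - y||^2 / (2n), the Hessian is X^T X / n,
# independent of w: a Wishart matrix whose spectrum is Marchenko-Pastur.
X = rng.standard_normal((n, p))
H = X.T @ X / n
eigs = np.linalg.eigvalsh(H)

# Marchenko-Pastur support edges for aspect ratio q
q = p / n
lo, hi = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
print(eigs.min(), eigs.max(), (lo, hi))
```

The empirical eigenvalues land inside the predicted support (0.25, 2.25) up to edge fluctuations; the smallest of them set the slowest time scales of gradient descent near the optimum, which is the long-time behaviour the abstract refers to.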


An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models

Mao, Jialin, Griniasty, Itay, Sun, Yan, Transtrum, Mark K., Sethna, James P., Chaudhari, Pratik

arXiv.org Artificial Intelligence

Recent experiments have shown that training trajectories of multiple deep neural networks with different architectures, optimization algorithms, hyper-parameter settings, and regularization methods evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold in the space of probability distributions. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show, using tools in dynamical systems theory, that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent. By analytically computing and bounding the contributions of these quantities, we characterize phase boundaries of the region where hyper-ribbons are to be expected. We also extend our analysis to kernel machines and linear models that are trained with stochastic gradient descent.
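Factor (i) can be seen at work in a toy version of this setup: run gradient descent on linear regression with a fast-decaying input correlation spectrum, then measure the effective dimensionality of the weight trajectory by PCA. A hedged sketch (the sizes, decay rate, and 90%-variance cutoff are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, steps, lr = 500, 30, 200, 0.1

# Inputs whose correlation matrix has power-law-decaying eigenvalues (assumption)
evals = 1.0 / np.arange(1, d + 1) ** 2
X = rng.standard_normal((n, d)) * np.sqrt(evals)
w_true = np.ones(d)
y = X @ w_true

# Plain gradient descent on the squared loss, recording the weight trajectory
w = np.zeros(d)
traj = []
for _ in range(steps):
    grad = X.T @ (X @ w - y) / n
    w -= lr * grad
    traj.append(w.copy())
traj = np.array(traj)

# Effective dimensionality: number of principal components needed for 90% variance
traj_c = traj - traj.mean(axis=0)
svals = np.linalg.svd(traj_c, compute_uv=False)
var = svals ** 2 / (svals ** 2).sum()
k90 = int(np.searchsorted(np.cumsum(var), 0.9)) + 1
print(k90)
```

Despite living in a 30-dimensional parameter space, the trajectory is captured by a handful of principal components, the linear-model analogue of the "hyper-ribbon" geometry described above; flattening the eigenvalue decay inflates k90.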



A Additional Figures for Section 4.1

Neural Information Processing Systems

Figure captions: Star indicates the dimension at which the cumulative variance exceeds 90%; the shaded gray area marks the eigenvalues that are not regularized. A) Eigenspectrum of the first hidden layer. B) Eigenspectrum of the second hidden layer.


We asked (a) whether having a 1/n neural code makes neural networks more robust, and (b) how the neural code

Neural Information Processing Systems

We thank the reviewers for their insightful comments and suggestions. As pointed out by R1, R2 & R3, our experiments were only run on MNIST. We would like to draw the attention of R5 to this particular case. We apologize for the confusion. To clarify, the whitening employed in section 4.2 is used to investigate the ... BN was only used for the shallow neural networks in section 4.1, as we found that ... According to the theory developed by Stringer et al., having


Why all roads don't lead to Rome: Representation geometry varies across the human visual cortical hierarchy

Ghosh, Arna, Chorghay, Zahraa, Bakhtiari, Shahab, Richards, Blake A.

arXiv.org Artificial Intelligence

Biological and artificial intelligence systems navigate the fundamental efficiency-robustness tradeoff for optimal encoding, i.e., they must efficiently encode numerous attributes of the input space while also being robust to noise. This challenge is particularly evident in hierarchical processing systems like the human brain. With a view towards understanding how systems navigate the efficiency-robustness tradeoff, we turned to a population geometry framework for analyzing representations in the human visual cortex alongside artificial neural networks (ANNs). In the ventral visual stream, we found general-purpose, scale-free representations characterized by a power law-decaying eigenspectrum in most areas. However, certain higher-order visual areas did not have scale-free representations, indicating that scale-free geometry is not a universal property of the brain. In parallel, ANNs trained with a self-supervised learning objective also exhibited scale-free geometry, but not after fine-tuning on a specific task. Based on these empirical results and our analytical insights, we posit that a system's representation geometry is not a universal property and instead depends upon the computational objective.
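The power-law-decaying eigenspectrum claim can be probed with a synthetic check: draw responses with a covariance whose eigenvalues decay as a power of rank, then recover the exponent from a log-log fit to the empirical eigenspectrum. A sketch under assumed sizes (the exponent alpha = 1 mirrors the 1/n codes discussed in this literature; all other numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_neurons = 20000, 200
alpha = 1.0  # assumed true decay exponent (illustrative)

# Responses whose covariance eigenspectrum decays as rank^(-alpha)
evals_true = np.arange(1, n_neurons + 1) ** -alpha
R = rng.standard_normal((n_samples, n_neurons)) * np.sqrt(evals_true)

# Empirical covariance and its sorted eigenspectrum
cov = R.T @ R / n_samples
evals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Recover the exponent from a log-log fit over mid-range ranks,
# skipping the noisiest extremes of the spectrum
ranks = np.arange(1, n_neurons + 1)
sel = slice(4, 100)
slope, _ = np.polyfit(np.log(ranks[sel]), np.log(evals[sel]), 1)
alpha_hat = -slope
print(alpha_hat)
```

The same log-log fit applied to measured neural or ANN representations is the kind of diagnostic that distinguishes the scale-free areas from the higher-order ones in the study above; a poor linear fit in log-log coordinates signals a departure from scale-free geometry.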