eigenspectrum
A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point
Couto, Carlos, Mourão, José, Figueiredo, Mário A. T., Ribeiro, Pedro
Near an optimal learning point of a neural network, the learning performance of gradient-descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for several classes of teacher-student problems in which the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we establish analytically that, in the large-network limit, the spectrum asymptotically follows the convolution of a scaled chi-squared distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
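The linear-network result can be checked numerically in its simplest setting. The sketch below is an illustrative assumption, not the paper's architecture: a one-layer linear student evaluated at the teacher's weights with Gaussian inputs, where the Hessian eigenvalues reduce to copies of the empirical input-covariance spectrum and only the Marchenko-Pastur part of the limiting law appears.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's): a one-layer linear student
# y = W x fit to a matching linear teacher. At the optimum, the Hessian of the
# mean squared loss w.r.t. vec(W) is I_{d_out} (Kronecker) Sigma_hat, with
# Sigma_hat = X X^T / n the empirical input covariance, so its eigenvalues are
# those of Sigma_hat, each repeated d_out times. For i.i.d. Gaussian inputs
# these follow the Marchenko-Pastur law with ratio q = d_in / n.
rng = np.random.default_rng(0)
d_in, d_out, n = 200, 5, 1000
X = rng.standard_normal((d_in, n))
Sigma_hat = X @ X.T / n

hessian_eigs = np.repeat(np.linalg.eigvalsh(Sigma_hat), d_out)

# Marchenko-Pastur density for comparison
q = d_in / n
lam_minus, lam_plus = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
grid = np.linspace(lam_minus, lam_plus, 400)
mp_density = np.sqrt((lam_plus - grid) * (grid - lam_minus)) / (2 * np.pi * q * grid)

hist, edges = np.histogram(hessian_eigs, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("max |empirical - MP| density gap:",
      np.abs(hist - np.interp(centers, grid, mp_density)).max())
```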
- Europe > Portugal > Lisbon > Lisbon (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models
Mao, Jialin, Griniasty, Itay, Sun, Yan, Transtrum, Mark K., Sethna, James P., Chaudhari, Pratik
Recent experiments have shown that training trajectories of multiple deep neural networks with different architectures, optimization algorithms, hyper-parameter settings, and regularization methods evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold in the space of probability distributions. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show, using tools in dynamical systems theory, that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent. By analytically computing and bounding the contributions of these quantities, we characterize phase boundaries of the region where hyper-ribbons are to be expected. We also extend our analysis to kernel machines and linear models that are trained with stochastic gradient descent.
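A toy version of this picture can be reproduced with plain linear regression. The sketch below is a hedged stand-in for the paper's analysis, not its actual procedure: it runs gradient descent from several random initializations on inputs whose correlation spectrum decays (quantity (i)), with small initial weights relative to the ground truth (quantity (ii)) and a fixed number of steps (quantity (iii)), then measures how few principal components capture the spread of the prediction trajectories.

```python
import numpy as np

# Minimal sketch under assumed settings: several linear models trained by
# gradient descent on the same data; we record the prediction vector X @ w at
# every step and check the effective dimensionality of all trajectory points.
rng = np.random.default_rng(1)
n, d, steps, runs, lr = 500, 50, 200, 10, 0.05

# Inputs with a fast-decaying correlation spectrum (condition (i))
eigs = 1.0 / np.arange(1, d + 1) ** 1.5
X = rng.standard_normal((n, d)) * np.sqrt(eigs)
w_true = rng.standard_normal(d)
y = X @ w_true

trajectories = []
for _ in range(runs):
    w = 0.1 * rng.standard_normal(d)      # small init relative to w_true (condition (ii))
    for _ in range(steps):                # number of GD steps (condition (iii))
        w -= lr * X.T @ (X @ w - y) / n
        trajectories.append(X @ w)        # trajectory point in prediction space

T = np.array(trajectories)
T -= T.mean(axis=0)
sv = np.linalg.svd(T, compute_uv=False)
explained = np.cumsum(sv ** 2) / np.sum(sv ** 2)
print("variance explained by top 3 components:", explained[2])
```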
- North America > United States > Pennsylvania (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
We asked (a) whether having a 1/n neural code makes neural networks more robust, and (b) how the neural code …
We thank the reviewers for their insightful comments and suggestions. As pointed out by R1, R2 & R3, our experiments were only run on MNIST. We would like to draw the attention of R5 to this particular case. "We apologize for the confusion. To clarify, the whitening employed in section 4.2 is used to investigate …" "BN was only used for the shallow neural networks in section 4.1, as we found that …" According to the theory developed by Stringer et al., having …
Why all roads don't lead to Rome: Representation geometry varies across the human visual cortical hierarchy
Ghosh, Arna, Chorghay, Zahraa, Bakhtiari, Shahab, Richards, Blake A.
Biological and artificial intelligence systems navigate the fundamental efficiency-robustness tradeoff for optimal encoding, i.e., they must efficiently encode numerous attributes of the input space while also being robust to noise. This challenge is particularly evident in hierarchical processing systems like the human brain. With a view towards understanding how systems navigate the efficiency-robustness tradeoff, we turned to a population geometry framework for analyzing representations in the human visual cortex alongside artificial neural networks (ANNs). In the ventral visual stream, we found general-purpose, scale-free representations characterized by a power law-decaying eigenspectrum in most areas. However, certain higher-order visual areas did not have scale-free representations, indicating that scale-free geometry is not a universal property of the brain. In parallel, ANNs trained with a self-supervised learning objective also exhibited scale-free geometry, but not after fine-tuning on a specific task. Based on these empirical results and our analytical insights, we posit that a system's representation geometry is not a universal property and instead depends upon the computational objective.
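The power-law diagnostic invoked here can be estimated directly from a responses matrix. The sketch below is a generic, assumed implementation (not the authors' code): it computes the covariance eigenspectrum of a stimuli-by-units matrix and fits its decay exponent, with alpha near 1 indicating a scale-free, 1/n-like spectrum in the spirit of Stringer et al.

```python
import numpy as np

# Minimal sketch under assumptions: estimate the power-law exponent alpha of a
# representation's covariance eigenspectrum from a (stimuli x units) matrix.
def powerlaw_exponent(responses: np.ndarray, rank_range=(10, 100)) -> float:
    responses = responses - responses.mean(axis=0)
    eigs = np.linalg.svd(responses, compute_uv=False) ** 2 / responses.shape[0]
    ranks = np.arange(1, eigs.size + 1)
    lo, hi = rank_range
    slope, _ = np.polyfit(np.log(ranks[lo:hi]), np.log(eigs[lo:hi]), deg=1)
    return -slope  # alpha ~ 1 suggests a scale-free (1/n) spectrum

# Synthetic check: responses constructed to have an approximately 1/n spectrum.
rng = np.random.default_rng(2)
fake_responses = rng.standard_normal((2000, 500)) @ np.diag(1 / np.sqrt(np.arange(1, 501)))
print("estimated power-law exponent:", powerlaw_exponent(fake_responses))
```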
- North America > Canada > Quebec > Montreal (0.15)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Middle East > Jordan (0.04)