 order parameter


Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks

Bocchi, Dario, Regimbeau, Theotime, Lucibello, Carlo, Saglietti, Luca, Cammarota, Chiara

arXiv.org Machine Learning

We analyze the one-pass stochastic gradient descent dynamics of a two-layer neural network with quadratic activations in a teacher--student framework. In the high-dimensional regime, where the input dimension $N$ and the number of samples $M$ diverge at fixed ratio $\alpha = M/N$, and for finite hidden widths $(p,p^*)$ of the student and teacher, respectively, we study the low-dimensional ordinary differential equations that govern the evolution of the student--teacher and student--student overlap matrices. We show that overparameterization ($p>p^*$) only modestly accelerates escape from a plateau of poor generalization by modifying the prefactor of the exponential decay of the loss. We then examine how unconstrained weight norms introduce a continuous rotational symmetry that results in a nontrivial manifold of zero-loss solutions for $p>1$. From this manifold the dynamics consistently selects the closest solution to the random initialization, as enforced by a conserved quantity in the ODEs governing the evolution of the overlaps. Finally, a Hessian analysis of the population-loss landscape confirms that the plateau and the solution manifold correspond to saddles with at least one negative eigenvalue and to marginal minima in the population-loss geometry, respectively.
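The setting described in the abstract can be illustrated with a small finite-size simulation. The following is a minimal sketch, not the authors' code: the output normalisation (mean of squared pre-activations), the squared-error loss, the Gaussian inputs, the learning rate, and the names `forward` and `run_one_pass_sgd` are all assumptions made for illustration. It runs one-pass SGD (each sample used exactly once) and returns the student--teacher and student--student overlap matrices.

```python
import numpy as np

def forward(W, x):
    """Quadratic-activation network; the 1/sqrt(N) and mean scalings are assumed."""
    n = x.shape[0]
    pre = W @ x / np.sqrt(n)          # pre-activations, one per hidden unit
    return pre, np.mean(pre ** 2)     # output = average of squared pre-activations

def run_one_pass_sgd(N=50, p=3, p_star=2, alpha=10.0, lr=0.02, seed=0):
    """One-pass SGD of a width-p student on a width-p_star teacher (illustrative)."""
    rng = np.random.default_rng(seed)
    W_star = rng.standard_normal((p_star, N))   # fixed teacher weights
    W = rng.standard_normal((p, N))             # random student initialisation
    steps = int(alpha * N)                      # M = alpha * N fresh samples
    losses = np.empty(steps)
    for t in range(steps):
        x = rng.standard_normal(N)              # fresh sample, never reused
        _, y_teacher = forward(W_star, x)
        pre, y_student = forward(W, x)
        err = y_student - y_teacher
        losses[t] = 0.5 * err ** 2
        # Gradient of 0.5 * err^2 with respect to the student weights.
        grad = err * (2.0 / p) * np.outer(pre, x) / np.sqrt(N)
        W -= lr * grad
    M = W @ W_star.T / N    # student--teacher overlaps, shape (p, p_star)
    Q = W @ W.T / N         # student--student overlaps, shape (p, p)
    return losses, M, Q
```

Calling `run_one_pass_sgd(p=3, p_star=2)` gives an overparameterized student; comparing the per-sample loss trace with the `p=2` case is one way to probe the plateau escape numerically, keeping in mind that finite $N$ only approximates the ODE description.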




ClassSuperstat

KCL

Neural Information Processing Systems

In this Appendix, we derive the fixed-point equations for the order parameters presented in the main text, following and generalising the analysis in Ref. [...].

Saddle-point equations

The saddle-point equations are derived straightforwardly from the obtained free energy by functionally extremising it with respect to all parameters. The zero-regularisation limit of the logistic loss can help us study the separability transition. [...] (66) As a result, given that [...] $\in (0, 1]$, the smallest value for which $E$ is finite is [...]. This result was generalised immediately afterwards by Pesce et al. [...] Ref. [59] for the Gaussian case, we can obtain the following fixed-point equations, [...]

Mean universality

Following Ref. [...] In our case, this condition is simpler than in Ref. [...]. We see that mean-independence in this setting is indeed verified.

Numerical experiments

Numerical experiments regarding the quadratic loss with ridge regularisation were performed by computing the Moore-Penrose pseudoinverse solution.
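A pseudoinverse-based ridge fit of the kind mentioned above might look as follows. This is a sketch under assumptions, not the procedure used in the paper: the function name `ridge_solution`, the unnormalised objective $\|Xw - y\|^2 + \lambda \|w\|^2$, and the choice to fall back on `np.linalg.pinv` only at $\lambda = 0$ are illustrative.

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Minimiser of ||X w - y||^2 + lam * ||w||^2 (assumed, unnormalised objective)."""
    n, d = X.shape
    if lam == 0.0:
        # Zero regularisation: the Moore-Penrose pseudoinverse gives the
        # least-squares solution of minimum norm.
        return np.linalg.pinv(X) @ y
    # Positive regularisation: solve the regularised normal equations.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Usage on synthetic, noiseless data (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_true = rng.standard_normal(20)
y = X @ w_true
w_hat = ridge_solution(X, y, 0.0)
```

With more samples than features and no noise, the pseudoinverse solution recovers `w_true` up to floating-point error; with fewer samples than features it returns the interpolating solution of smallest norm.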