Reviews: Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup
–Neural Information Processing Systems
This paper studies the learning dynamics of two-layer neural networks in the teacher-student scenario under the assumption that the inputs are i.i.d. The dynamics considered is online learning, i.e., stochastic gradient descent (SGD) with a mini-batch of a single sample, and the dataset is assumed to be large enough that the parameters remain uncorrelated with forthcoming samples. Thanks to these assumptions, the dynamics is governed only by the overlaps (covariances) between the student and teacher weights, and closed-form macroscopic equations for these overlaps can be derived from the SGD dynamics itself. Using these macroscopic equations, the generalization error, which is itself characterized by the overlaps alone, can be computed accurately. Moreover, when both layers of the student are learned, the generalization ability depends strongly on the choice of activation function: for the sigmoid activation, the generalization error decreases as the overparameterization level increases, while for the other activations it stays almost constant.
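The setup summarized above can be sketched in a minimal simulation: a fixed two-layer teacher, an overparameterized student trained by one-sample online SGD on fresh i.i.d. Gaussian inputs, and the generalization error estimated by Monte Carlo. This is an illustrative sketch, not the paper's method: the activation (`tanh` as a sigmoid-like choice), widths, learning-rate scalings, and the Monte Carlo estimate of the error (which the paper instead computes in closed form from the overlaps) are all assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)

D, M, K = 50, 2, 4        # input dim, teacher width, student width (K > M: overparameterized)
g = np.tanh               # sigmoid-like activation (a choice made for this sketch)

# Fixed teacher network: y(x) = v* . g(W* x / sqrt(D))
W_star = rng.standard_normal((M, D))
v_star = rng.standard_normal(M)

# Student network; both layers are trained, as in the "both layers learned" setting
W = 0.1 * rng.standard_normal((K, D))
v = 0.1 * rng.standard_normal(K)

def teacher(X):
    return g(X @ W_star.T / np.sqrt(D)) @ v_star

def student(X):
    return g(X @ W.T / np.sqrt(D)) @ v

def gen_error(n=4000):
    # Monte Carlo estimate of eps_g = (1/2) E[(student - teacher)^2] over fresh inputs;
    # in the paper this expectation is a function of the overlaps only
    X = rng.standard_normal((n, D))
    return 0.5 * np.mean((student(X) - teacher(X)) ** 2)

eps_init = gen_error()

lr = 0.2                  # learning rate and per-layer scalings are assumptions
for _ in range(20000):
    x = rng.standard_normal(D)            # fresh i.i.d. sample: one-sample online SGD
    a = W @ x / np.sqrt(D)                # student pre-activations
    delta = v @ g(a) - v_star @ g(W_star @ x / np.sqrt(D))
    # Gradients of the single-sample loss (1/2) * delta^2, using the pre-update weights
    dv = delta * g(a)
    dW = np.outer(delta * v * (1.0 - g(a) ** 2), x) / np.sqrt(D)
    v -= lr * dv / D
    W -= lr * dW

eps_final = gen_error()
print(eps_init, eps_final)
```

Because each step uses a never-before-seen sample, the gradient noise is uncorrelated with the current weights, which is exactly the assumption that lets the paper close the macroscopic equations.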
Jan-27-2025, 01:29:56 GMT