SGD on Neural Networks Learns Functions of Increasing Complexity

Neural Information Processing Systems

We perform an experimental study of the dynamics of Stochastic Gradient Descent (SGD) in learning deep neural networks for several real and synthetic classification tasks. We show that in the initial epochs, almost all of the performance improvement of the classifier obtained by SGD can be explained by a linear classifier. More generally, we give evidence for the hypothesis that, as iterations progress, SGD learns functions of increasing complexity. This hypothesis can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime. We also show that the linear classifier learned in the initial stages is "retained" throughout the execution even if training is continued to the point of zero training error, and complement this with a theoretical result in a simplified model. Key to our work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information.
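The abstract does not spell out the measure itself, so the following is only a rough sketch of how a quantity based on conditional mutual information could be estimated from discrete predictions. The function names are hypothetical, and the combination I(F;Y) - I(F;Y|L) is an assumed form chosen for illustration, where F is the network's predictions, L is the simpler (e.g. linear) classifier's predictions, and Y is the true labels. The estimator is a plain plug-in estimate from empirical counts, which can be biased on small samples.

```python
import numpy as np


def entropy(*columns):
    """Plug-in Shannon entropy (in bits) of the joint empirical
    distribution of one or more discrete 1-D arrays."""
    joint = np.stack([np.asarray(c) for c in columns], axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(a, b)


def conditional_mutual_information(a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return entropy(a, c) + entropy(b, c) - entropy(a, b, c) - entropy(c)


def explained_information(f, l, y):
    """ASSUMED form, for illustration only: the part of I(F;Y) that
    disappears once we condition on the simpler classifier L."""
    return mutual_information(f, y) - conditional_mutual_information(f, y, l)


# Toy data (made up): balanced binary labels.
y = np.array([0, 0, 1, 1])
f = y.copy()                      # network predicts the labels exactly
l_same = y.copy()                 # simple classifier agrees with the network
l_const = np.zeros_like(y)        # simple classifier carries no information

same_score = explained_information(f, l_same, y)
const_score = explained_information(f, l_const, y)
```

In this toy setup, a simple classifier that matches the network accounts for all of the network's information about the labels, while a constant classifier accounts for none of it.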


Reviews: SGD on Neural Networks Learns Functions of Increasing Complexity

Neural Information Processing Systems

There is a lot of support for the paper in the reviews. While much "folklore knowledge" exists around implicit regularization of SGD (e.g. [...] Some suggestions for improvement should be taken seriously, but all in all the paper makes a valuable contribution towards understanding the interplay of optimization and representational power (types of functions).


Reviews: SGD on Neural Networks Learns Functions of Increasing Complexity

Neural Information Processing Systems

Update: I have read the authors' response and am keeping my score. Originality: The idea that SGD biases networks towards simple classifiers has been discussed extensively in the community, but lacked a crisp formalization. This paper proposes a formal description for this conjecture (and also extends it -- Conjecture 1 and Claim 2) using a rich and simple information-theoretic framework. This work is roughly related to recent work on understanding the inductive bias of SGD in function space, e.g. Valle-Perez et al. and Savarese et al.; both analyze only the solution returned by SGD (under different assumptions), and not the intermediate iterates as this paper does. Defining the 'simplicity' of a neural network solely through the mutual information between its predictions and those of another classifier leads to objects that are invariant to reparameterization of the network and depend only on the function that it implements.

