Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks
Gauthier Gidel, Francis Bach, Simon Lacoste-Julien
When optimizing over-parameterized models such as deep neural networks, many different parameter settings achieve zero training error. However, these minimizers lead to different values of the test error and thus have distinct generalization properties. More specifically, Neyshabur [2017, Part II] argues that the choice of the optimization algorithm (and its respective hyperparameters) provides an implicit regularization with respect to its geometry: it biases the training toward a particular minimizer of the objective. In this work, we use the same setting as Saxe et al. [2018]: a regression problem with a least-squares loss on a multidimensional output. Our prediction is made either by a linear model or by a two-layer linear neural network [Saxe et al., 2018]. Our goal is to extend their work on the continuous gradient dynamics in order to understand the behavior of the discrete dynamics induced by these two models. We show that, with a vanishing initialization and a small enough step size, the gradient dynamics of the two-layer linear neural network sequentially learns components that can be ranked according to a hierarchical structure, whereas the gradient dynamics of the linear model learns all the components at the same time, missing this notion of hierarchy between components. The path followed by the two-layer formulation actually corresponds to successively solving the initial regression problem under a growing low-rank constraint, which is also known as reduced-rank regression [Izenman, 1975]. Note that this notion of path followed by the dynamics of a whole network is different from the notion of path introduced by Neyshabur et al. [2015a], which refers to the paths from input units to output units of the network.
Apr-30-2019
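To illustrate the phenomenon described in the abstract, here is a minimal NumPy sketch (not code from the paper): it runs gradient descent on the same least-squares regression task for a plain linear model and for a two-layer linear network, both from a tiny initialization with a small step size, and records the top singular values of the learned matrix (or of the product of the two layers). The dimensions, the synthetic rank-3 target with component strengths 5, 2, and 0.5, and the hyperparameters are all illustrative assumptions; under them, the two-layer product tends to pick up the strongest component first and the weaker ones later, while the linear model's singular values grow together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task: Y ~= X @ W_star, where W_star has three
# components of decreasing strength (singular values 5, 2, 0.5).
n, d, k, h = 200, 20, 10, 30
U = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :3]
V = np.linalg.qr(rng.standard_normal((k, k)))[0][:, :3]
W_star = U @ np.diag([5.0, 2.0, 0.5]) @ V.T
X = rng.standard_normal((n, d))
Y = X @ W_star + 0.01 * rng.standard_normal((n, k))

def train_linear(steps=500, lr=1e-2, scale=1e-4, every=50):
    """Gradient descent on the linear model: min_W ||X W - Y||^2 / (2n)."""
    W = scale * rng.standard_normal((d, k))
    svals = []
    for t in range(steps):
        W -= lr * X.T @ (X @ W - Y) / n
        if (t + 1) % every == 0:
            svals.append(np.linalg.svd(W, compute_uv=False)[:3])
    return np.array(svals)

def train_two_layer(steps=5000, lr=1e-2, scale=1e-4, every=250):
    """Gradient descent on the two-layer model: min_{W1,W2} ||X W1 W2 - Y||^2 / (2n)."""
    W1 = scale * rng.standard_normal((d, h))
    W2 = scale * rng.standard_normal((h, k))
    svals = []
    for t in range(steps):
        R = (X @ W1 @ W2 - Y) / n                    # scaled residual
        g1, g2 = X.T @ R @ W2.T, W1.T @ X.T @ R      # gradients w.r.t. W1, W2
        W1, W2 = W1 - lr * g1, W2 - lr * g2
        if (t + 1) % every == 0:
            svals.append(np.linalg.svd(W1 @ W2, compute_uv=False)[:3])
    return np.array(svals)

# Top-3 singular values over training: they grow together for the linear
# model, but appear one after the other (strongest first) for the two-layer
# network, mirroring a growing low-rank (reduced-rank regression) path.
print("linear model:\n", np.round(train_linear(), 2))
print("two-layer network:\n", np.round(train_two_layer(), 2))
```

The tiny initialization scale and the small step size are the key ingredients here: with a larger initialization or larger steps, the separation between the learning phases of the different components is much less pronounced.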