On regularization of gradient descent, layer imbalance and flat minima
We analyze the training dynamics of deep linear networks using a new metric, layer imbalance, which defines the flatness of a solution. We demonstrate that different regularization methods, such as weight decay and noise data augmentation, behave in a similar way. Training has two distinct phases: 1) optimization and 2) regularization. First, during the optimization phase, the loss function decreases monotonically and the trajectory moves toward a minima manifold. Then, during the regularization phase, the layer imbalance decreases and the trajectory moves along the minima manifold toward a flat area. Finally, we extend the analysis to stochastic gradient descent and show that SGD works similarly to noise regularization.
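The two phases can be illustrated with a minimal sketch on the smallest deep linear network: two scalar layers, y = w1 * w2 * x, trained by gradient descent with weight decay on the single pair (x, y) = (1, 1). The abstract does not give the formula for layer imbalance, so the sketch assumes it is measured as |w1^2 - w2^2| for this two-layer case. The function name `train`, the initial weights, the learning rate, and the weight decay value are all hypothetical choices made for illustration.

```python
def train(w1, w2, lr=0.1, weight_decay=0.01, steps=5000):
    """Gradient descent with weight decay on the scalar model y = w1 * w2 * x.

    Uses a single training pair (x, y) = (1, 1), so the minima of the
    unregularized loss form the curve w1 * w2 = 1.
    Returns a list of (loss, imbalance) tuples, one per step.
    """
    history = []
    for _ in range(steps + 1):
        err = w1 * w2 - 1.0
        loss = 0.5 * err ** 2
        # Assumed definition of layer imbalance for two scalar layers.
        imbalance = abs(w1 ** 2 - w2 ** 2)
        history.append((loss, imbalance))
        # Gradient of the loss plus the weight decay term for each layer.
        g1 = err * w2 + weight_decay * w1
        g2 = err * w1 + weight_decay * w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return history


# Start from a deliberately unbalanced point: w1 is much larger than w2.
history = train(w1=2.0, w2=0.1)
for step in (0, 50, 5000):
    loss, imbalance = history[step]
    print(f"step {step:5d}  loss {loss:.2e}  imbalance {imbalance:.2e}")
```

In this toy setting the loss falls to a small value within a few dozen steps while the imbalance stays close to its starting value, and only afterwards does the imbalance shrink slowly as the weights drift along the curve w1 * w2 ≈ 1 toward w1 ≈ w2. Setting `weight_decay=0.0` in the same sketch leaves the imbalance almost unchanged once the loss has converged.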
Jul-17-2020
- Country:
- North America > United States
- California > Santa Clara County > Santa Clara (0.04)
- Europe > Germany
- Bavaria > Upper Bavaria > Munich (0.04)
- Genre:
- Research Report (0.40)
- Technology: