Three Mechanisms of Weight Decay Regularization
Zhang, Guodong; Wang, Chaoqi; Xu, Bowen; Grosse, Roger
We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.

Weight decay has long been a standard trick to improve the generalization performance of neural networks (Krogh & Hertz, 1992; Bos & Chug, 1996) by encouraging the weights to be small in magnitude. However, several findings cast doubt on this interpretation:

- Weight decay has sometimes been observed to improve training accuracy, not just generalization performance (e.g. [...]).
- [...] In principle, weight decay regularization should have no effect in this case, since one can scale the weights by a small factor without changing the network's predictions. Hence, it does not meaningfully constrain the network's capacity.

The effect of weight decay remains poorly understood, and we lack clear guidelines for which tasks and architectures it is likely to help or hurt. A better understanding of the role of weight decay would help us design more efficient and robust neural network architectures.
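The scaling argument in the excerpt ("one can scale the weights by a small factor without changing the network's predictions") can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes the case in question is a linear layer followed by a batch-wise normalization step (the excerpt does not name the case), and `normalized_layer`, `loss`, and `numerical_grad` are hypothetical helper names. It also shows that, for such a scale-invariant function, shrinking the weights by a factor c enlarges the gradient by 1/c, which is one possible way to read mechanism (1), a larger effective learning rate.

```python
import numpy as np


def normalized_layer(x, w):
    """Linear map followed by per-feature standardization over the batch.

    Assumption: this stands in for a weight layer followed by a
    normalization layer; it is not the paper's exact setup.
    """
    z = x @ w
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0))


def numerical_grad(f, w, h=1e-6):
    """Central-difference gradient of scalar function f at w."""
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        step = np.zeros_like(w)
        step[idx] = h
        g[idx] = (f(w + step) - f(w - step)) / (2 * h)
    return g


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))   # batch of 32 inputs with 5 features
w = rng.normal(size=(5, 3))    # weights of the linear map
t = rng.normal(size=(32, 3))   # fixed targets for a toy linear loss


def loss(weights):
    """Toy scalar loss on the normalized outputs (hypothetical)."""
    return float(np.sum(normalized_layer(x, weights) * t))


c = 0.1  # shrink the weights by a factor of 10

# 1) Predictions are unchanged by rescaling the weights.
max_diff = float(np.max(np.abs(normalized_layer(x, w) - normalized_layer(x, c * w))))

# 2) The gradient at the shrunken weights is 1/c times larger.
grad = numerical_grad(loss, w)
grad_small = numerical_grad(loss, c * w)
ratio_ok = bool(np.allclose(grad_small, grad / c, rtol=1e-4, atol=1e-6))

print("max output difference:", max_diff)
print("gradient scales by 1/c:", ratio_ok)
```

Under these assumptions, the norm of the weights does not affect what the layer computes, only how large a step a fixed learning rate takes relative to the weights. Finite differences are used here only to keep the sketch dependency-free; an autodiff library would serve equally well.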
Oct-29-2018
- Country:
- North America > Canada > Ontario > Toronto (0.14)
- Genre:
- Research Report > New Finding (0.88)
- Technology: