Diagonal Linear Networks and the Lasso Regularization Path

Berthier, Raphaël

arXiv.org Machine Learning 

The composition of layers enables a neural network to learn a modular representation of the data. However, the reasons why learning the components of this representation is computationally amenable through gradient descent methods and why it leads to excellent generalization properties are still not fully understood [3]. In order to address this questions, machine learning theory has studied the gradient flow training of linear networks--where the activation is linear-- and diagonal linear networks (DLNs)--where, in addition, weight matrices are diagonal. These studies have pointed out that a certain implicit regularization phenomenon appears when training neural networks: the gradient descent minimizes certain regularization penalties, although not explicitly present in the loss function [31]. This phenomenon has been shown to be beneficial for the generalization properties of the trained networks, and thus suggested to be a key ingredient in the success of neural networks. See, e.g., [16, 21, 9, 20] for contributions to the implicit regularization of neural networks and [32, 34, 33, 17, 20, 2, 27, 28, 25, 10, 4, 26] for more contributions specifically on DLNs. We now briefly recall the form of this phenomenon in the case of DLNs.