Optimization Insights into Deep Diagonal Linear Networks
Hippolyte Labarrière, Cesare Molinari, Lorenzo Rosasco, Silvia Villa, Cristian Vega
In recent years, the application of deep networks has revolutionized the field of machine learning, particularly in tasks involving complex data such as images and natural language. These models, typically trained using stochastic gradient descent, have demonstrated remarkable performance on various benchmarks, raising questions about the underlying mechanisms that contribute to their success. Despite their practical efficacy, the theoretical understanding of these models remains relatively limited, creating a pressing need for deeper insights into their generalization abilities. Classical theory attributes generalization to regularization, the standard way to impose a priori knowledge on the model and to favour "simple" solutions. While regularization is usually achieved either by choosing simple models or by explicitly adding a penalty term to the empirical risk during training, this is not the case for deep neural networks, which are trained simply by minimizing the empirical risk. A new perspective has therefore emerged in the recent literature, relating regularization directly to the optimization procedure (gradient-based methods). The main idea is to show that the training dynamics themselves exhibit self-regularizing properties, inducing an implicit regularization (bias) that favours generalizing solutions (see [Vardi, 2023] for an extended review of the importance of implicit bias in machine learning). In this context, a common approach is to study simplified models that approximate the networks used in practice. Analyzing the implicit bias of optimization algorithms is more tractable for such networks, yet can still shed light on the good performance of neural networks in various scenarios.
Dec-21-2024
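The simplified model named in the title can be made concrete. The sketch below is illustrative and not code from the paper: a depth-2 diagonal linear network replaces a linear predictor w by the elementwise product u * v, and plain gradient descent is run on (u, v) rather than on w directly. All names, dimensions, and hyperparameters here are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative sketch (not from the paper): a depth-2 "diagonal linear
# network" parameterizes the predictor as w = u * v (elementwise), and
# plain gradient descent is run over the factors (u, v) from a small,
# balanced initialization.
rng = np.random.default_rng(0)
n, d = 20, 50                       # fewer samples than features
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:3] = 1.0                    # sparse ground-truth predictor
y = X @ w_star

u = np.full(d, 1e-3)                # small, balanced initialization
v = np.full(d, 1e-3)
lr = 1e-2
for _ in range(50_000):
    r = X @ (u * v) - y             # residual of the effective predictor w = u * v
    g = X.T @ r / n                 # gradient of the squared loss w.r.t. w
    # chain rule on the factorization, with a simultaneous update of u and v
    u, v = u - lr * g * v, v - lr * g * u
```

In runs of this sketch the training loss is driven to (near) zero, and analyses of the kind discussed above make precise which interpolating solution such dynamics select; for diagonal linear networks with small initialization, the selected solution is known to be biased toward small l1 norm, i.e. toward sparsity.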