Optimization Insights into Deep Diagonal Linear Networks
Labarrière, Hippolyte, Molinari, Cesare, Rosasco, Lorenzo, Villa, Silvia, Vega, Cristian
In recent years, the application of deep networks has revolutionized the field of machine learning, particularly in tasks involving complex data such as images and natural language. These models, typically trained using stochastic gradient descent, have demonstrated remarkable performance on various benchmarks, raising questions about the underlying mechanisms that contribute to their success. Despite their practical efficacy, the theoretical understanding of these models remains relatively limited, creating a pressing need for deeper insights into their generalization abilities. Classical theory shows that generalization is a consequence of regularization, which is the way to impose a priori knowledge into the model and to favour "simple" solutions. While regularization is usually achieved either by choosing simple models or by explicitly adding a penalty term to the empirical risk during training, this is not the case for deep neural networks, which are trained simply by minimizing the empirical risk. A new perspective has therefore emerged in the recent literature, relating regularization directly to the optimization procedure (gradient-based methods). The main idea is to show that the training dynamics themselves exhibit self-regularizing properties, by inducing an implicit regularization (bias) which prefers generalizing solutions (see [Vardi, 2023] for an extended review of the importance of implicit bias in machine learning). In this context, a common approach is to study simplified models that approximate the networks used in practice. Analyzing the implicit bias of optimization algorithms is easier for such networks, and can still give insights into the good performance of neural networks in various scenarios.
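Since the object of study lends itself to a compact simulation, here is a minimal sketch (our illustration, not code from the paper) of the phenomenon on a two-layer diagonal linear network $w = u\odot u - v\odot v$: plain gradient descent on the unpenalized least-squares risk, started from a small initialization, returns a sparse solution.

```python
import numpy as np

# Minimal sketch (our illustration, not code from the paper): gradient descent
# on a two-layer diagonal linear network w = u*u - v*v for unpenalized least
# squares. With a small initialization, the recovered w tends to be sparse:
# the implicit bias comes from the dynamics, not from an explicit penalty.
rng = np.random.default_rng(0)
n, d = 50, 100
A = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:5] = 1.0                          # sparse ground truth
y = A @ w_star

alpha, lr = 1e-3, 1e-2                    # small init scale, step size
u = alpha * np.ones(d)
v = alpha * np.ones(d)
for _ in range(20000):
    w = u * u - v * v
    grad_w = A.T @ (A @ w - y) / n        # gradient of the risk in w
    u, v = u - lr * 2 * grad_w * u, v + lr * 2 * grad_w * v  # chain rule
w = u * u - v * v
print("recovered support:", np.flatnonzero(np.abs(w) > 0.1))
```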
Variance reduction techniques for stochastic proximal point algorithms
Traoré, Cheik, Apidopoulos, Vassilis, Salzo, Saverio, Villa, Silvia
In the context of finite-sum minimization, variance reduction techniques are widely used to improve the performance of state-of-the-art stochastic gradient methods. Their practical impact is clear, as are their theoretical properties. Stochastic proximal point algorithms have been studied as an alternative to stochastic gradient algorithms since they are more stable with respect to the choice of the step size, but a proper variance-reduced version is missing. In this work, we propose the first study of variance reduction techniques for stochastic proximal point algorithms. We introduce a stochastic proximal version of SVRG, SAGA, and some of their variants for smooth and convex functions. We provide several convergence results for the iterates and the objective function values. In addition, under the Polyak-{\L}ojasiewicz (PL) condition, we obtain linear convergence rates for the iterates and the function values. Our numerical experiments demonstrate the advantages of the proximal variance reduction methods over their gradient counterparts, especially in terms of stability with respect to the choice of the step size.
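To fix ideas, here is a minimal sketch (our simplified rendition; see the paper for the exact SVRG/SAGA-type schemes and their analysis) of an SVRG-style correction combined with a stochastic proximal point step on a least-squares problem, where the proximal operator of a single summand has a closed form.

```python
import numpy as np

# Minimal sketch (ours; the paper's schemes and step-size rules may differ):
# SVRG-style variance reduction combined with a stochastic proximal point
# step for f(x) = (1/n) * sum_i 0.5*(a_i @ x - b_i)^2. The prox of a single
# summand is available in closed form (rank-one Sherman-Morrison update).
rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)     # unit-norm rows
x_true = rng.standard_normal(d)
b = A @ x_true

def prox_fi(z, i, gamma):
    # argmin_w 0.5*(a_i @ w - b_i)^2 + ||w - z||^2 / (2*gamma)
    a = A[i]
    return z - gamma * a * (a @ z - b[i]) / (1.0 + gamma * (a @ a))

gamma, epochs = 0.5, 30
x = np.zeros(d)
for _ in range(epochs):
    x_snap = x.copy()
    g_snap = A.T @ (A @ x_snap - b) / n           # full gradient at snapshot
    for _ in range(n):
        i = rng.integers(n)
        g_i = A[i] * (A[i] @ x_snap - b[i])       # per-sample grad at snapshot
        z = x - gamma * (g_snap - g_i)            # variance-reduced shift
        x = prox_fi(z, i, gamma)                  # proximal step on f_i
print("distance to solution:", np.linalg.norm(x - x_true))
```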
Snacks: a fast large-scale kernel SVM solver
Tanji, Sofiane, Della Vecchia, Andrea, Glineur, François, Villa, Silvia
Kernel methods provide a powerful framework for nonparametric learning. They are based on kernel functions and allow learning in a rich functional space while applying linear statistical learning tools, such as Ridge Regression or Support Vector Machines. However, standard kernel methods suffer from quadratic time and memory complexity in the number of data points, and thus have limited applications in large-scale learning. In this paper, we propose Snacks, a new large-scale solver for Kernel Support Vector Machines. Specifically, Snacks relies on a Nystr\"om approximation of the kernel matrix and an accelerated variant of the stochastic subgradient method. We demonstrate, through a detailed empirical evaluation, that it competes with other SVM solvers on a variety of benchmark datasets.
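For concreteness, a minimal sketch of the two ingredients follows (our simplified rendition, not the Snacks code: the acceleration is omitted and a plain Pegasos-style subgradient step is used; the data, kernel width and all constants are placeholders).

```python
import numpy as np

# Minimal sketch (ours, not the Snacks code; acceleration omitted): Nystrom
# features for an RBF kernel, then a Pegasos-style stochastic subgradient
# method on the regularized hinge loss.
rng = np.random.default_rng(0)
n, d, m = 2000, 10, 100                   # samples, input dim, Nystrom centers
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(n))
sig = np.sqrt(d)                          # RBF bandwidth (placeholder choice)

def rbf(P, Q):
    sq = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sig ** 2))

centers = X[rng.choice(n, m, replace=False)]
evals, evecs = np.linalg.eigh(rbf(centers, centers) + 1e-8 * np.eye(m))
T = evecs / np.sqrt(evals)                # so that Phi @ Phi.T ~ K (Nystrom)
Phi = rbf(X, centers) @ T                 # n x m feature matrix

lam, w = 1e-3, np.zeros(m)
for t in range(1, 20001):                 # Pegasos step size: 1/(lam*t)
    i = rng.integers(n)
    if y[i] * (Phi[i] @ w) < 1:           # hinge subgradient is active
        w = (1 - 1 / t) * w + (1 / (lam * t)) * y[i] * Phi[i]
    else:
        w = (1 - 1 / t) * w
print("train accuracy:", np.mean(np.sign(Phi @ w) == y))
```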
Iterative regularization in classification via hinge loss diagonal descent
Apidopoulos, Vassilis, Poggio, Tomaso, Rosasco, Lorenzo, Villa, Silvia
Estimating a quantity of interest from finite measurements is a central problem in a number of fields, including machine learning, statistics, and signal processing. In this context, a key idea is that reliable estimation requires imposing some prior assumptions on the problem at hand. The theory of inverse problems provides a principled framework to formalize this idea [27]. The quantity of interest is typically seen as a function, or a vector, and prior assumptions take the form of suitable functionals, called regularizers. Following this idea, Tikhonov regularization provides a classic approach to estimating solutions [83, 84]. Indeed, the latter are found by minimizing an empirical objective in which a data-fit term is penalized by adding the chosen regularizer. Other regularization approaches are classic in inverse problems; in particular, iterative regularization has become popular in machine learning, see e.g.
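To make the scheme explicit (our notation, for illustration): given a linear model $A$ and data $y$, Tikhonov regularization selects $\hat{x}_\lambda \in \mathrm{argmin}_x\, \|Ax - y\|^2 + \lambda R(x)$, where $R$ is the chosen regularizer and $\lambda>0$ balances data fit against the prior assumptions. Iterative regularization, by contrast, replaces the explicit choice of $\lambda$ with the choice of a stopping time for an optimization method applied to the data-fit term alone.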
Stochastic Zeroth order Descent with Structured Directions
Rando, Marco, Molinari, Cesare, Villa, Silvia, Rosasco, Lorenzo
We introduce and analyze Structured Stochastic Zeroth order Descent (S-SZD), a finite-difference approach which approximates a stochastic gradient on a set of $l\leq d$ orthogonal directions, where $d$ is the dimension of the ambient space. These directions are randomly chosen and may change at each step. For smooth convex functions we prove almost sure convergence of the iterates and a convergence rate on the function values of the form $O((d/l)\,k^{-c})$ for every $c<1/2$, which is arbitrarily close to that of Stochastic Gradient Descent (SGD) in terms of the number of iterations. Our bound also shows the benefit of using $l$ multiple directions instead of one. For non-convex functions satisfying the Polyak-{\L}ojasiewicz condition, we establish the first convergence rates for stochastic zeroth order algorithms under such an assumption. We corroborate our theoretical findings in numerical simulations, where the assumptions are satisfied, and on the real-world problem of hyper-parameter optimization, observing that S-SZD has very good practical performance.
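A minimal sketch of the core step follows (our illustration of the idea behind S-SZD; the step-size schedule, scaling constants and test function are placeholders, not the paper's choices).

```python
import numpy as np

# Minimal sketch (ours; constants and schedules differ from the paper): at
# each iteration draw l <= d random orthonormal directions via a QR
# factorization, estimate the directional derivatives by finite differences,
# and take a descent step along the resulting surrogate gradient.
rng = np.random.default_rng(0)
d, l = 20, 5
curv = np.linspace(0.5, 2.0, d)

def f(x):                                  # black-box objective (example)
    return 0.5 * np.sum(curv * x ** 2)

x = rng.standard_normal(d)
h = 1e-6                                   # finite-difference discretization
for k in range(1, 3001):
    Q, _ = np.linalg.qr(rng.standard_normal((d, l)))   # d x l, orthonormal
    fx = f(x)
    g = np.zeros(d)
    for j in range(l):
        g += (f(x + h * Q[:, j]) - fx) / h * Q[:, j]   # surrogate gradient
    g *= d / l                             # E[Q @ Q.T] = (l/d) I, so rescale
    x -= 0.2 / np.sqrt(k) * g              # diminishing step size
print("f(x) after 3000 steps:", f(x))
```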
Iterative regularization for low complexity regularizers
Molinari, Cesare, Massias, Mathurin, Rosasco, Lorenzo, Villa, Silvia
Iterative regularization exploits the implicit bias of an optimization algorithm to regularize ill-posed problems. Constructing algorithms with such built-in regularization mechanisms is a classic challenge in inverse problems, but also in modern machine learning, where it provides both a new perspective on algorithm analysis and significant speed-ups compared to explicit regularization. In this work, we propose and study the first iterative regularization procedure able to handle biases described by nonsmooth and non-strongly-convex functionals, which are prominent in low-complexity regularization. Our approach is based on a primal-dual algorithm, for which we analyze convergence and stability properties, even in the case where the original problem is infeasible. The general results are illustrated considering the special case of sparse recovery with the $\ell_1$ penalty. Our theoretical results are complemented by experiments showing the computational benefits of our approach.
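As an illustration of the sparse recovery special case, here is a minimal sketch (ours; the paper studies a more general primal-dual scheme with stability guarantees) of a Chambolle-Pock-type iteration for $\min \|x\|_1$ subject to $Ax=y$, used with early stopping on noisy data.

```python
import numpy as np

# Minimal sketch (ours; the paper's algorithm and analysis are more general):
# Chambolle-Pock-type primal-dual iteration for min ||x||_1 s.t. Ax = y.
# With noisy data, the iteration count plays the role of the regularization
# parameter (iterative regularization), so one would stop early.
rng = np.random.default_rng(0)
n, d = 50, 200
A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[:5] = 1.0
y = A @ x_true + 1e-3 * rng.standard_normal(n)    # noisy measurements

def soft(v, t):                                    # prox of t*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

tau = sigma = 0.99 / np.linalg.norm(A, 2)          # tau*sigma*||A||^2 < 1
x, u = np.zeros(d), np.zeros(n)
for k in range(2000):
    x_new = soft(x - tau * (A.T @ u), tau)         # primal (proximal) step
    u += sigma * (A @ (2 * x_new - x) - y)         # dual ascent step
    x = x_new
print("recovered support:", np.flatnonzero(np.abs(x) > 0.1))
```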
Iterative regularization for convex regularizers
Molinari, Cesare, Massias, Mathurin, Rosasco, Lorenzo, Villa, Silvia
Machine learning often reduces to estimating some model parameters. This approach raises at least two kinds of questions: first, multiple solutions may exist, among which a specific one must be selected; second, potential instabilities with respect to noise and sampling must be controlled. A classical way to achieve both goals is to consider explicitly penalized or constrained objective functions. In machine learning, this leads to regularized empirical risk minimization (Shalev-Shwartz and Ben-David, 2014). A more recent approach is based on directly exploiting an iterative optimization procedure for an unconstrained/unpenalized problem. This approach is shared by several related ideas. One is implicit regularization (Mahoney, 2012; Gunasekar et al., 2017), stemming from the observation that the bias is controlled by increasing the number of iterations, just as in penalized methods it is controlled by decreasing the penalty parameter. Another is early stopping (Yao et al., 2007; Raskutti et al., 2014), which emphasizes the fact that running the iterates to convergence might lead to instabilities in the presence of noise. Yet another, more classical, idea is iterative regularization, where both aspects (convergence and stability) are considered relevant (Engl et al., 1996; Kaltenbacher et al., 2008).
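The iterations-versus-penalty analogy can be seen in a minimal simulation (our illustration, using the standard Landweber/gradient iteration; all constants are placeholders): on an ill-conditioned noisy least-squares problem, the distance to the clean solution first decreases and then increases as the iterates begin to fit the noise.

```python
import numpy as np

# Minimal sketch (ours): semi-convergence of gradient descent (Landweber
# iteration) on an ill-conditioned noisy least-squares problem. The error to
# the noise-free solution decreases, then grows as noise is fitted, so the
# iteration number acts like a regularization parameter.
rng = np.random.default_rng(0)
n = 100
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -2, n)) @ U.T       # ill-conditioned operator
x_true = rng.standard_normal(n)
y = A @ x_true + 0.05 * rng.standard_normal(n)     # noisy data

x, lr, errors = np.zeros(n), 1.0, []
for k in range(20000):
    x -= lr * A.T @ (A @ x - y)
    errors.append(np.linalg.norm(x - x_true))
best = int(np.argmin(errors))
print(f"best error {errors[best]:.2f} at iter {best}; final {errors[-1]:.2f}")
```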
Convergence of the Forward-Backward Algorithm: Beyond the Worst Case with the Help of Geometry
Garrigos, Guillaume, Rosasco, Lorenzo, Villa, Silvia
We provide a comprehensive study of the convergence of the forward-backward algorithm under suitable geometric conditions leading to fast rates. We present several new results and collect in a unified view a variety of results scattered in the literature, often providing simplified proofs. Novel contributions include the analysis of infinite-dimensional convex minimization problems, allowing for the case where minimizers might not exist. Further, we analyze the relation between different geometric conditions, and discuss novel connections with a priori conditions in linear inverse problems, including source conditions, restricted isometry properties and partial smoothness.
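As a reference point for the object of study, a minimal sketch of the forward-backward (proximal gradient) iteration follows (ours, instantiated on the lasso; the paper's setting is more general and includes the infinite-dimensional case).

```python
import numpy as np

# Minimal sketch (ours): forward-backward iteration for min_x f(x) + g(x)
# with f smooth and g proximable, instantiated on the lasso:
# f(x) = 0.5*||Ax - y||^2 and g(x) = lam*||x||_1. Under geometric (error
# bound / Lojasiewicz-type) conditions such as those studied in the paper,
# the iterates converge faster than the worst-case O(1/k) rate.
rng = np.random.default_rng(0)
n, d = 100, 50
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)
lam = 0.1

def soft(v, t):                            # prox of t*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1/L with L = ||A||^2
x = np.zeros(d)
for _ in range(1000):
    x = soft(x - step * A.T @ (A @ x - y), step * lam)  # forward, backward
print("objective:",
      0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x)))
```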
Learning with Incremental Iterative Regularization
Rosasco, Lorenzo, Villa, Silvia
Within a statistical learning setting, we propose and study an iterative regularization algorithm for least squares defined by an incremental gradient method. In particular, we show that, if all other parameters are fixed a priori, the number of passes over the data (epochs) acts as a regularization parameter, and prove strong universal consistency, i.e. almost sure convergence of the risk, as well as sharp finite sample bounds for the iterates. Our results are a step towards understanding the effect of multiple epochs in stochastic gradient techniques in machine learning and rely on integrating statistical and optimization results.
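A minimal sketch of the algorithm's main ingredient follows (our illustration; the data, step size, and validation-based monitoring are placeholders, not the paper's statistical setting): incremental gradient passes over the data for least squares, where the number of epochs is the tuning knob.

```python
import numpy as np

# Minimal sketch (ours): incremental gradient for least squares, one pass
# (epoch) at a time. The epoch count plays the role of the regularization
# parameter; a held-out set is used here to monitor it, mimicking early
# stopping.
rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.5 * rng.standard_normal(n)
Xv = rng.standard_normal((n, d))
yv = Xv @ w_true                          # clean validation data

w, lr = np.zeros(d), 1e-3
for epoch in range(1, 51):
    for i in range(n):                    # one incremental pass over the data
        w -= lr * X[i] * (X[i] @ w - y[i])
    if epoch % 10 == 0:
        val = np.mean((Xv @ w - yv) ** 2)
        print(f"epoch {epoch:3d}  validation MSE {val:.4f}")
```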