Fercoq, Olivier
Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems
Pethick, Thomas, Latafat, Puya, Patrinos, Panagiotis, Fercoq, Olivier, Cevher, Volkan
This paper introduces a new extragradient-type algorithm for a class of nonconvex-nonconcave minimax problems. It is well known that finding a local solution for general minimax problems is computationally intractable. This observation has recently motivated the study of structures sufficient for convergence of first-order methods in the more general setting of variational inequalities when the so-called weak Minty variational inequality (MVI) holds. This problem class captures non-trivial structures, as we demonstrate with examples for which a large family of existing algorithms provably converges to limit cycles. Our results require a less restrictive parameter range in the weak MVI than was previously known, thus extending the applicability of our scheme. The proposed algorithm is applicable to constrained and regularized problems and involves an adaptive stepsize allowing for potentially larger stepsizes. Our scheme also converges globally even in settings where the underlying operator exhibits limit cycles.
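As a point of reference, the weak MVI assumption is commonly stated as follows (a background sketch; the exact scaling of the constant varies between papers, and the precise parameter range is part of the paper's contribution):

\[
\exists\, z^\star \in \mathcal{Z} \;\; \text{such that} \;\; \langle F(z),\, z - z^\star \rangle \,\ge\, \rho\, \lVert F(z) \rVert^2 \quad \text{for all } z \in \mathcal{Z},
\]

where $F$ is the operator associated with the minimax problem and $\rho \le 0$ is allowed; $\rho = 0$ recovers the classical (star-)Minty condition, while more negative $\rho$ permits a higher degree of nonmonotonicity.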
Solving stochastic weak Minty variational inequalities without increasing batch size
Pethick, Thomas, Fercoq, Olivier, Latafat, Puya, Patrinos, Panagiotis, Cevher, Volkan
This paper introduces a family of stochastic extragradient-type algorithms for a class of nonconvex-nonconcave problems characterized by the weak Minty variational inequality (MVI). Unlike existing results on extragradient methods in the monotone setting, employing diminishing stepsizes is no longer possible in the weak MVI setting. This has led to approaches such as increasing batch sizes per iteration, which can, however, be prohibitively expensive. In contrast, our proposed method involves two stepsizes and only requires one additional oracle evaluation per iteration. We show that it is possible to keep one fixed stepsize while only the second stepsize is taken to be diminishing, making the scheme interesting even in the monotone setting. Almost sure convergence is established, and we provide a unified analysis for this family of schemes which contains a nonlinear generalization of the celebrated primal-dual hybrid gradient algorithm. Stochastic first-order methods have been at the core of the current success in deep learning applications. These methods are by now mostly well understood for minimization problems. This is even the case in the nonconvex setting, where there exist matching upper and lower bounds on the complexity of finding an approximately stable point (Arjevani et al., 2019). The picture becomes less clear when moving beyond minimization into nonconvex-nonconcave minimax problems, or more generally nonmonotone variational inequalities. Even in the deterministic case, finding a stationary point is in general intractable (Daskalakis et al., 2021; Hirsch & Vavasis, 1987). This is in stark contrast with minimization, where only global optimality is NP-hard. An interesting nonmonotone class for which we do have efficient algorithms is characterized by the so-called weak Minty variational inequality (MVI) (Diakonikolas et al., 2021). This problem class captures nontrivial structures such as attracting limit cycles and is governed by a parameter ρ whose negativity increases the degree of nonmonotonicity. In other words, it seems that we need to take the stepsize γ large to guarantee convergence for a large class. This reliance on a large stepsize is at the core of why the community has struggled to provide stochastic variants for weak MVIs. The only known results effectively increase the batch size at every iteration (Diakonikolas et al., 2021). Pethick et al. (2022) proposed (SEG+), which attempts to tackle the noise by only diminishing the second stepsize.
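To make the two-stepsize idea concrete, here is a minimal, hedged sketch of such an update in Python (an illustrative scheme with a fixed exploration stepsize gamma and a diminishing second stepsize alpha_k; it is not the exact algorithm or stepsize rule analyzed in the paper):

```python
import numpy as np

def stochastic_extragradient_two_stepsizes(F_stoch, z0, gamma, alphas, num_iter, rng):
    """Schematic two-stepsize stochastic extragradient update.

    F_stoch(z, rng) returns a stochastic estimate of the operator F at z;
    gamma is the fixed exploration stepsize and alphas[k] is the (possibly
    diminishing) second stepsize. Illustrative sketch only.
    """
    z = np.asarray(z0, dtype=float)
    for k in range(num_iter):
        z_bar = z - gamma * F_stoch(z, rng)              # exploration step, fixed stepsize
        z = z - alphas[k] * gamma * F_stoch(z_bar, rng)  # update step, diminishing stepsize
    return z

# Toy usage: bilinear saddle point min_x max_y x*y, with F(x, y) = (y, -x) plus noise.
rng = np.random.default_rng(0)
F = lambda z, rng: np.array([z[1], -z[0]]) + 0.01 * rng.standard_normal(2)
alphas = [1.0 / np.sqrt(k + 1) for k in range(1000)]
z_out = stochastic_extragradient_two_stepsizes(F, [1.0, 1.0], 0.5, alphas, 1000, rng)
```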
Screening Rules and its Complexity for Active Set Identification
Ndiaye, Eugene, Fercoq, Olivier, Salmon, Joseph
In learning problems involving a large number of variables, sparse models such as the Lasso and Support Vector Machines (SVM) allow one to select the most important variables. For instance, the Lasso estimator depends only on a subset of features that have maximal absolute correlation with the residual, whereas the SVM classifier depends only on a subset of samples (the support vectors) that characterize the margin. The remaining features/variables make no contribution to the optimal solution. Thus, early detection of those non-influential variables can lead to significant simplifications of the problem and savings in memory and computational resources. Noticeable examples are the facial reduction preprocessing steps used for accelerating linear programming solvers [4, 22] and conic programming [3]; we refer to [6, 25] for recent reviews.
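As a small numerical illustration of the Lasso statement above (using scikit-learn's Lasso, whose objective scales the quadratic term by 1/(2n); an illustrative check, not part of the paper):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Numerically check the Lasso optimality conditions: with objective
# (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 (scikit-learn's convention),
# any feature j with |x_j^T (y - X b)| / n strictly below alpha is inactive (b_j = 0).
rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

alpha = 0.1
model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(X, y)
corr = np.abs(X.T @ (y - X @ model.coef_)) / n

inactive = corr < alpha - 1e-8               # strictly below the threshold
print(np.all(model.coef_[inactive] == 0.0))  # expected: True
```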
Random extrapolation for primal-dual coordinate descent
Alacaoglu, Ahmet, Fercoq, Olivier, Cevher, Volkan
We introduce a randomly extrapolated primal-dual coordinate descent method that adapts to sparsity of the data matrix and to favorable structures of the objective function. Our method updates only a subset of primal and dual variables with sparse data, and it uses large step sizes with dense data, retaining the benefits of the specific methods designed for each case. In addition to adapting to sparsity, our method attains fast convergence guarantees in favorable cases without any modifications. In particular, we prove linear convergence under metric subregularity, which applies to strongly convex-strongly concave problems and piecewise linear quadratic functions. We show almost sure convergence of the sequence and optimal sublinear convergence rates for the primal-dual gap and objective values in the general convex-concave case. Numerical evidence demonstrates the state-of-the-art empirical performance of our method in sparse and dense settings, matching and improving on existing methods.
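For orientation, the deterministic primal-dual hybrid gradient (PDHG) template that such coordinate methods randomize reads as follows (a background sketch for $\min_x \max_y \langle Ax, y\rangle + g(x) - f^*(y)$ with proximable $g$ and $f^*$; the paper replaces the full-vector updates and the extrapolation step with randomized coordinate-wise counterparts):

\[
\begin{aligned}
y^{k+1} &= \operatorname{prox}_{\sigma f^*}\!\big(y^k + \sigma A \bar{x}^k\big),\\
x^{k+1} &= \operatorname{prox}_{\tau g}\!\big(x^k - \tau A^\top y^{k+1}\big),\\
\bar{x}^{k+1} &= x^{k+1} + \theta\,\big(x^{k+1} - x^k\big).
\end{aligned}
\]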
Joint quantile regression in vector-valued RKHSs
Sangnier, Maxime, Fercoq, Olivier, d'Alché-Buc, Florence
Addressing the need for a more complete picture than the average relationship provided by standard regression, a novel framework for estimating and predicting several conditional quantiles simultaneously is introduced. The proposed methodology leverages kernel-based multi-task learning to curb the embarrassing phenomenon of quantile crossing, with a one-step estimation procedure and no post-processing. Moreover, this framework comes with theoretical guarantees and an efficient coordinate descent learning algorithm. Numerical experiments on benchmark and real datasets highlight the improvements of our approach in terms of prediction error, crossing occurrences and training time.
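For context, a single conditional quantile of level $\tau \in (0,1)$ is classically estimated by minimizing the pinball loss (stated here as background, not as the paper's exact vector-valued objective):

\[
\ell_\tau(r) \;=\; \max\big(\tau\, r,\ (\tau - 1)\, r\big), \qquad r = y - f(x),
\]

and joint estimation of several levels $\tau_1 < \dots < \tau_m$ is what creates the risk of quantile crossing that the proposed multi-task formulation is designed to curb.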
GAP Safe screening rules for sparse multi-task and multi-class models
Ndiaye, Eugene, Fercoq, Olivier, Gramfort, Alexandre, Salmon, Joseph
Screening rules leverage the known sparsity of the solution by ignoring some variables in the optimization, hence speeding up solvers. When the procedure is proven not to discard features wrongly, the rules are said to be safe. In this paper we derive new safe rules for generalized linear models regularized with L1 and L1/L2 norms. The rules are based on duality gap computations and spherical safe regions whose diameters converge to zero. This makes it possible to safely discard more variables, in particular for low regularization parameters.
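In the plain Lasso case, writing the primal as $P_\lambda(\beta) = \tfrac{1}{2}\lVert y - X\beta\rVert^2 + \lambda \lVert \beta \rVert_1$ and given any dual-feasible point $\theta$, the Gap Safe sphere test takes the following form (a sketch for this particular scaling of the loss; constants change with other formulations):

\[
\lvert x_j^\top \theta \rvert \;+\; \lVert x_j \rVert_2\, \frac{\sqrt{2\,\mathrm{Gap}_\lambda(\beta, \theta)}}{\lambda} \;<\; 1 \quad \Longrightarrow \quad \hat{\beta}_j = 0,
\]

so that the set of screened features grows as the duality gap, and hence the sphere radius, shrinks along the iterations.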
Stochastic Conditional Gradient Method for Composite Convex Minimization
Locatello, Francesco, Yurtsever, Alp, Fercoq, Olivier, Cevher, Volkan
In this paper, we propose the first practical algorithm to minimize stochastic composite optimization problems over compact convex sets. This template allows for affine constraints and therefore covers stochastic semidefinite programs (SDPs), which are vastly applicable in both machine learning and statistics. In this setup, stochastic algorithms with convergence guarantees are either not known or not tractable. We tackle this general problem and propose a convergent, easy-to-implement and tractable algorithm. We prove an $\mathcal{O}(k^{-1/3})$ convergence rate in expectation on the objective residual and $\mathcal{O}(k^{-5/12})$ in expectation on the feasibility gap. These rates are achieved without increasing the batch size, which can contain a single sample. We present extensive empirical evidence demonstrating the superiority of our algorithm on a broad range of applications, including optimization of stochastic SDPs.
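The single-sample gradient-averaging building block of such conditional gradient methods can be sketched as follows (an illustrative Python sketch only; the paper's method additionally handles affine constraints via smoothing and a homotopy, and its stepsize/averaging schedules differ):

```python
import numpy as np

def stochastic_frank_wolfe(grad_stoch, lmo, x0, num_iter, rng):
    """Schematic stochastic Frank-Wolfe / conditional gradient sketch.

    grad_stoch(x, rng): unbiased stochastic gradient of the smooth part.
    lmo(d): linear minimization oracle, argmin over the feasible set of <d, s>.
    """
    x = np.asarray(x0, dtype=float)
    d = np.zeros_like(x)
    for k in range(num_iter):
        rho = 4.0 / (k + 8) ** (2.0 / 3.0)            # averaging weight (illustrative schedule)
        eta = 2.0 / (k + 2)                            # Frank-Wolfe stepsize
        d = (1 - rho) * d + rho * grad_stoch(x, rng)   # running gradient estimate
        s = lmo(d)                                     # vertex returned by the oracle
        x = x + eta * (s - x)                          # convex combination keeps x feasible
    return x

# Toy usage: minimize E ||x - (a + noise)||^2 over the probability simplex.
rng = np.random.default_rng(0)
a = np.array([0.1, 0.7, 0.2])
grad = lambda x, rng: 2.0 * (x - (a + 0.05 * rng.standard_normal(3)))
lmo = lambda d: np.eye(3)[np.argmin(d)]               # simplex vertex minimizing <d, s>
x_out = stochastic_frank_wolfe(grad, lmo, np.ones(3) / 3.0, 2000, rng)
```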
Safe Grid Search with Optimal Complexity
Ndiaye, Eugene, Le, Tam, Fercoq, Olivier, Salmon, Joseph, Takeuchi, Ichiro
Popular machine learning estimators involve regularization parameters that can be challenging to tune, and standard strategies rely on grid search for this task. In this paper, we revisit techniques for approximating the regularization path up to a predefined tolerance $\epsilon$ in a unified framework and show that its complexity is $O(1/\sqrt[d]{\epsilon})$ for uniformly convex losses of order $d>0$ and $O(1/\sqrt{\epsilon})$ for generalized self-concordant functions. This framework encompasses least squares but also logistic regression (a case that, to the best of our knowledge, was not handled as precisely by previous works). We leverage our technique to provide refined bounds on the validation error as well as a practical algorithm for hyperparameter tuning. The latter has a global convergence guarantee when targeting a prescribed accuracy on the validation set. Last but not least, our approach helps relieve the practitioner from the (often neglected) task of selecting a stopping criterion when optimizing over the training set: our method automatically calibrates it based on the targeted accuracy on the validation set.
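The overall structure of such an approximation-path construction can be sketched as follows (a hedged Python sketch; solve and gap are hypothetical, problem-specific helpers, and the paper derives the admissible step in closed form rather than by trial shrinking):

```python
def approximate_path(lmbda_max, lmbda_min, eps, solve, gap, shrink=0.95):
    """Schematic adaptive grid over the regularization path.

    solve(lmbda, w_init, tol) returns an approximate solution at lmbda;
    gap(w, lmbda) returns the duality gap of w for the problem at lmbda.
    """
    path = []
    lmbda, w = lmbda_max, None
    while lmbda >= lmbda_min:
        w = solve(lmbda, w, eps / 2.0)                 # warm-started solve to eps/2 accuracy
        path.append((lmbda, w))
        nxt = lmbda * shrink                           # always make one geometric step
        while nxt > lmbda_min and gap(w, nxt * shrink) <= eps:
            nxt *= shrink                              # keep going while w stays certified eps-optimal
        lmbda = nxt
    return path
```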
Smooth Primal-Dual Coordinate Descent Algorithms for Nonsmooth Convex Optimization
Alacaoglu, Ahmet, Dinh, Quoc Tran, Fercoq, Olivier, Cevher, Volkan
We propose a new randomized coordinate descent method for a convex optimization template with broad applications. Our analysis relies on a novel combination of four ideas applied to the primal-dual gap function: smoothing, acceleration, homotopy, and coordinate descent with non-uniform sampling. As a result, our method features the first convergence rate guarantees among coordinate descent methods that are the best known under a variety of common structural assumptions on the template. We provide numerical evidence supporting the theoretical results, with a comparison to state-of-the-art algorithms.
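For context, the smoothing ingredient refers to Nesterov-type smoothing of a nonsmooth composite term $h(Ax)$ (a background sketch with a fixed prox-center $\dot{y}$ and smoothness parameter $\mu > 0$; homotopy then drives $\mu$ to zero along the iterations):

\[
h_\mu(Ax) \;=\; \max_{y}\ \Big\{ \langle Ax, y \rangle - h^*(y) - \frac{\mu}{2}\,\lVert y - \dot{y} \rVert^2 \Big\},
\]

which is differentiable in $x$ with a gradient that is Lipschitz continuous with constant proportional to $\lVert A \rVert^2 / \mu$.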
Gap Safe screening rules for sparsity enforcing penalties
Ndiaye, Eugene, Fercoq, Olivier, Gramfort, Alexandre, Salmon, Joseph
In high-dimensional regression settings, sparsity enforcing penalties have proved useful to regularize the data-fitting term. A recently introduced technique called screening rules proposes to ignore some variables in the optimization, leveraging the expected sparsity of the solutions and consequently leading to faster solvers. When the procedure is guaranteed not to discard variables wrongly, the rules are said to be safe. In this work, we propose a unifying framework for generalized linear models regularized with standard sparsity enforcing penalties such as $\ell_1$ or $\ell_1/\ell_2$ norms. Our technique makes it possible to safely discard more variables than previously considered safe rules, particularly for low regularization parameters. Our proposed Gap Safe rules (so called because they rely on duality gap computations) can cope with any iterative solver but are particularly well suited to (block) coordinate descent methods. Applied to many standard learning tasks, such as the Lasso, Sparse-Group Lasso, multi-task Lasso, and binary and multinomial logistic regression, we report significant speed-ups compared to previously proposed safe rules on all tested data sets.
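To illustrate how such a rule plugs into a coordinate descent solver, here is a minimal, hedged Lasso sketch (objective $\tfrac{1}{2}\lVert y - Xb\rVert^2 + \lambda \lVert b\rVert_1$; an illustrative dynamic-screening loop, not the paper's implementation):

```python
import numpy as np

def lasso_cd_gap_safe(X, y, lmbda, num_epochs=100, tol=1e-8):
    """Schematic Lasso coordinate descent with Gap Safe screening."""
    n, p = X.shape
    b = np.zeros(p)
    active = np.ones(p, dtype=bool)
    col_sq = (X ** 2).sum(axis=0)
    col_norm = np.sqrt(col_sq)
    for _ in range(num_epochs):
        r = y - X @ b
        theta = r / max(lmbda, np.max(np.abs(X.T @ r)))   # dual-feasible rescaled residual
        primal = 0.5 * r @ r + lmbda * np.abs(b).sum()
        dual = 0.5 * (y @ y) - 0.5 * lmbda ** 2 * ((theta - y / lmbda) ** 2).sum()
        gap = primal - dual
        if gap <= tol:
            break
        # Gap Safe test: features whose score stays below 1 over the whole safe
        # sphere cannot be active at the optimum and are removed from the problem.
        radius = np.sqrt(2.0 * gap) / lmbda
        scores = np.abs(X.T @ theta) + radius * col_norm
        active &= scores >= 1.0
        b[~active] = 0.0
        r = y - X @ b                                     # refresh residual after screening
        # One pass of coordinate descent over the remaining (active) features.
        for j in np.where(active)[0]:
            r += X[:, j] * b[j]                           # partial residual without feature j
            z = X[:, j] @ r
            b[j] = np.sign(z) * max(abs(z) - lmbda, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

# Usage: b_hat = lasso_cd_gap_safe(X, y, lmbda=0.1 * np.max(np.abs(X.T @ y)))
```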