
 global linear convergence rate


Iteratively Reweighted Least Squares for Basis Pursuit with Global Linear Convergence Rate

Neural Information Processing Systems

The recovery of sparse data is at the core of many applications in machine learning and signal processing. While such problems can be tackled using $\ell_1$-regularization as in the LASSO estimator and in the Basis Pursuit approach, specialized algorithms are typically required to solve the corresponding high-dimensional non-smooth optimization for large instances. Iteratively Reweighted Least Squares (IRLS) is a widely used algorithm for this purpose due to its excellent numerical performance. However, while existing theory is able to guarantee convergence of this algorithm to the minimizer, it does not provide a global convergence rate. In this paper, we prove that a variant of IRLS converges \emph{with a global linear rate} to a sparse solution, i.e., with a linear error decrease occurring immediately from any initialization if the measurements fulfill the usual null space property assumption. We support our theory by numerical experiments showing that our linear rate captures the correct dimension dependence. We anticipate that our theoretical findings will lead to new insights for many other use cases of the IRLS algorithm, such as in low-rank matrix recovery.
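As a rough illustration of the IRLS scheme this abstract refers to (a minimal sketch, not the authors' specific variant), the snippet below solves the basis-pursuit problem min ||x||_1 subject to Ax = y by alternating between updating smoothed weights w_i ≈ 1/|x_i| and solving the resulting weighted least-squares problem in closed form; the smoothing schedule, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

def irls_basis_pursuit(A, y, n_iters=50, eps=1.0, eps_decay=0.9):
    """Sketch of IRLS for basis pursuit: min ||x||_1 subject to Ax = y.

    Not the paper's exact variant; the eps schedule is an ad hoc choice.
    """
    x = np.linalg.lstsq(A, y, rcond=None)[0]        # least-squares initialization
    for _ in range(n_iters):
        # Smoothed weights w_i ~ 1/|x_i|; eps keeps them finite near zero.
        w = 1.0 / np.sqrt(x ** 2 + eps ** 2)
        D = np.diag(1.0 / w)
        # Weighted least-squares step: argmin sum_i w_i x_i^2  s.t.  Ax = y,
        # whose closed form is x = D A^T (A D A^T)^{-1} y.
        x = D @ A.T @ np.linalg.solve(A @ D @ A.T, y)
        eps *= eps_decay                            # shrink the smoothing parameter
    return x

# Toy example: recover a 5-sparse vector from 40 Gaussian measurements in dimension 100.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[rng.choice(100, size=5, replace=False)] = rng.standard_normal(5)
x_hat = irls_basis_pursuit(A, A @ x_true)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

In this sketch each iteration costs one m-by-m linear solve, which is one reason IRLS is attractive when the number of measurements m is much smaller than the ambient dimension n.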


On the Global Linear Convergence of Frank-Wolfe Optimization Variants

Lacoste-Julien, Simon, Jaggi, Martin

Neural Information Processing Systems

The Frank-Wolfe (FW) optimization algorithm has lately re-gained popularity thanks in particular to its ability to nicely handle the structured constraints appearing in machine learning applications. However, its convergence rate is known to be slow (sublinear) when the solution lies at the boundary. A simple less-known fix is to add the possibility to take 'away steps' during optimization, an operation that importantly does not require a feasibility oracle. In this paper, we highlight and clarify several variants of the Frank-Wolfe optimization algorithm that have been successfully applied in practice: away-steps FW, pairwise FW, fully-corrective FW and Wolfe's minimum norm point algorithm, and prove for the first time that they all enjoy global linear convergence, under a weaker condition than strong convexity of the objective. The constant in the convergence rate has an elegant interpretation as the product of the (classical) condition number of the function with a novel geometric quantity that plays the role of a 'condition number' of the constraint set. We provide pointers to where these algorithms have made a difference in practice, in particular with the flow polytope, the marginal polytope and the base polytope for submodular optimization. The Frank-Wolfe algorithm [9] (also known as conditional gradient) is one of the earliest existing methods for constrained convex optimization, and has seen an impressive revival recently due to its nice properties compared to projected or proximal gradient methods, in particular for sparse optimization and machine learning applications. On the other hand, the classical projected gradient and proximal methods have been known to exhibit a very nice adaptive acceleration property, namely that the convergence rate becomes linear for strongly convex objectives, i.e. that the optimization error of the same algorithm after $t$ iterations will decrease geometrically as $O((1-\rho)^t)$.
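To make the away-step mechanism mentioned above concrete (a minimal sketch over the probability simplex only, not the paper's general polytope setting or its other variants), the snippet below compares the classical Frank-Wolfe direction with an away direction at each iteration and moves along whichever gives the steeper descent; the quadratic objective and exact line search are illustrative assumptions.

```python
import numpy as np

def away_steps_fw(M, b, n_iters=300):
    """Sketch of away-steps Frank-Wolfe for min 0.5*||Mx - b||^2 over the simplex.

    The simplex vertices are the standard basis vectors, so the iterate x
    doubles as its own vector of convex-combination weights; general polytopes
    need an explicit active set, which this sketch omits.
    """
    n = M.shape[1]
    x = np.full(n, 1.0 / n)                         # start at the barycenter
    for _ in range(n_iters):
        g = M.T @ (M @ x - b)                       # gradient of the quadratic
        s = int(np.argmin(g))                       # Frank-Wolfe atom e_s
        support = np.flatnonzero(x > 1e-12)
        a = support[np.argmax(g[support])]          # away atom e_a (worst active vertex)
        d_fw = -x.copy()
        d_fw[s] += 1.0                              # FW direction   e_s - x
        d_aw = x.copy()
        d_aw[a] -= 1.0                              # away direction x - e_a
        if -g @ d_fw >= -g @ d_aw:                  # pick the steeper descent direction
            d, gamma_max = d_fw, 1.0
        else:
            d, gamma_max = d_aw, x[a] / (1.0 - x[a])    # cap so weight on e_a stays >= 0
        Md = M @ d
        if Md @ Md <= 1e-16:                        # degenerate direction: stop
            break
        gamma = np.clip((-g @ d) / (Md @ Md), 0.0, gamma_max)  # exact line search
        x = x + gamma * d
    return x

# Toy example: least-squares projection of a random point onto the probability simplex.
rng = np.random.default_rng(0)
print(away_steps_fw(np.eye(10), rng.standard_normal(10)))
```

The away step shifts weight off the currently active vertex that the gradient penalizes most, which is what lets the iterate settle exactly on a face of the feasible set instead of zig-zagging toward it.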


A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization

Li, Zhize, Li, Jian

arXiv.org Machine Learning

We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is given by the summation of a differentiable (possibly nonconvex) component, together with a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. The algorithm is a slight variant of the ProxSVRG algorithm [Reddi et al., 2016b]. Our main contribution lies in the analysis of ProxSVRG+. It recovers several existing convergence results (in terms of the number of stochastic gradient oracle calls and proximal operations), and improves/generalizes some others. In particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei et al., 2017] for the smooth nonconvex case. ProxSVRG+ is more straightforward than SCSG and yields a simpler analysis. Moreover, ProxSVRG+ outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem proposed in [Reddi et al., 2016b]. Finally, for nonconvex functions satisfying the Polyak-Łojasiewicz condition, we show that ProxSVRG+ achieves a global linear convergence rate without restart. ProxSVRG+ is always no worse than ProxGD and ProxSVRG/SAGA, and sometimes outperforms them (and generalizes the results of SCSG) in this case.
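The snippet below sketches the kind of proximal stochastic variance-reduced gradient loop that ProxSVRG+ builds on, using an ℓ1 term as the convex non-smooth component (this is a generic prox-SVRG iteration, not the paper's exact procedure or tuning; ProxSVRG+ additionally allows the reference gradient to be estimated on a large minibatch instead of the full data set).

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (the non-differentiable convex component)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_svrg(grad_batch, n_samples, x0, step, lam,
              epochs=20, epoch_len=100, batch=8, seed=0):
    """Generic prox-SVRG sketch for min_x (1/n) sum_i f_i(x) + lam * ||x||_1.

    grad_batch(x, idx) must return the average gradient of f_i over indices idx.
    Illustrative loop only; step size, epoch length, and batch size are ad hoc.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        x_ref = x.copy()
        full_grad = grad_batch(x_ref, np.arange(n_samples))   # reference gradient
        for _ in range(epoch_len):
            idx = rng.integers(0, n_samples, size=batch)      # sample a minibatch
            # Variance-reduced gradient estimate.
            v = grad_batch(x, idx) - grad_batch(x_ref, idx) + full_grad
            x = soft_threshold(x - step * v, step * lam)      # proximal gradient step
    return x

# Toy example: sparse least squares, f_i(x) = 0.5 * (a_i @ x - y_i)^2.
rng = np.random.default_rng(1)
A, y = rng.standard_normal((200, 50)), rng.standard_normal(200)

def grad_batch_ls(x, idx):
    return A[idx].T @ (A[idx] @ x - y[idx]) / len(idx)

x_hat = prox_svrg(grad_batch_ls, 200, np.zeros(50), step=1e-2, lam=0.1)
```

The correction term grad_batch(x, idx) - grad_batch(x_ref, idx) + full_grad is what reduces the variance of the minibatch gradient while keeping it unbiased, which is the property the abstract's oracle-complexity comparisons rest on.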

