Linear Convergence of SVRG in Statistical Estimation

Machine Learning

SVRG and its variants are among the state-of-the-art optimization algorithms for large-scale machine learning problems. It is well known that SVRG converges linearly when the objective function is strongly convex. However, this setup can be restrictive, and does not include several important formulations such as Lasso, group Lasso, logistic regression, and some non-convex models including corrected Lasso and SCAD. In this paper, we prove that, for a class of statistical M-estimators covering the examples mentioned above, SVRG solves the formulation with {\em a linear convergence rate} without strong convexity or even convexity. Our analysis makes use of {\em restricted strong convexity}, under which we show that SVRG converges linearly to the fundamental statistical precision of the model, i.e., the difference between the true unknown parameter $\theta^*$ and the optimal solution $\hat{\theta}$ of the model.
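As a rough illustration of the kind of method this abstract refers to, the following is a minimal sketch of proximal SVRG applied to the Lasso. It is not taken from the paper: the function names, the last-iterate snapshot rule, and the step-size choice are assumptions made for the example.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * |v| for a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def lasso_objective(X, y, theta, lam):
    """(1 / 2n) * ||X theta - y||^2 + lam * ||theta||_1."""
    n = len(X)
    loss = sum((dot(x, theta) - yi) ** 2 for x, yi in zip(X, y)) / (2 * n)
    return loss + lam * sum(abs(t) for t in theta)


def prox_svrg_lasso(X, y, lam, step, epochs, inner_steps, seed=0):
    """Hypothetical helper: proximal SVRG on the Lasso, starting from zero."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    snapshot = [0.0] * d
    for _ in range(epochs):
        # Residuals and full gradient of the smooth loss at the snapshot.
        snap_res = [dot(x, snapshot) - yi for x, yi in zip(X, y)]
        full_grad = [sum(snap_res[i] * X[i][j] for i in range(n)) / n
                     for j in range(d)]
        theta = snapshot[:]
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # grad_i(theta) - grad_i(snapshot) equals delta * x_i for least squares.
            delta = dot(X[i], theta) - y[i] - snap_res[i]
            theta = [
                soft_threshold(
                    theta[j] - step * (delta * X[i][j] + full_grad[j]),
                    step * lam,
                )
                for j in range(d)
            ]
        # Assumption: use the last inner iterate as the next snapshot
        # (averaging the inner iterates is another common option).
        snapshot = theta
    return snapshot
```

A conservative step size for this sketch is a small multiple of 1 / max_i ||x_i||^2, for example `step = 0.1 / max(dot(x, x) for x in X)`.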

SAGA and Restricted Strong Convexity

Machine Learning

SAGA is a fast incremental gradient method for the finite-sum problem, and its effectiveness has been tested on a vast number of applications. In this paper, we analyze SAGA on a class of non-strongly convex and non-convex statistical problems such as Lasso, group Lasso, logistic regression with $\ell_1$ regularization, linear regression with SCAD regularization, and corrected Lasso. We prove that SAGA enjoys a linear convergence rate up to the statistical estimation accuracy, under the assumption of restricted strong convexity (RSC). This significantly extends the applicability of SAGA in convex and non-convex optimization.
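To make the incremental update concrete, here is a minimal sketch of proximal SAGA on the Lasso objective (1 / 2n) ||X theta - y||^2 + lam ||theta||_1. It is an illustration under assumptions, not the paper's code: the function name `saga_lasso` is hypothetical, and the gradient table is stored as one scalar residual per sample, which works only because the least-squares gradient of sample i is its residual times x_i.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * |v| for a scalar v."""
    return max(v - t, 0.0) - max(-v - t, 0.0)


def saga_lasso(X, y, lam, step, num_steps, seed=0):
    """Hypothetical helper: proximal SAGA on the Lasso, starting from zero."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    # At theta = 0 the residual of sample i is -y[i].
    stored_res = [-yi for yi in y]
    # Running average of the stored per-sample gradients.
    avg_grad = [sum(stored_res[i] * X[i][j] for i in range(n)) / n
                for j in range(d)]
    for _ in range(num_steps):
        i = rng.randrange(n)
        x = X[i]
        res = sum(a * b for a, b in zip(x, theta)) - y[i]
        # New gradient minus stored gradient of sample i equals delta * x.
        delta = res - stored_res[i]
        theta = [
            soft_threshold(theta[j] - step * (delta * x[j] + avg_grad[j]),
                           step * lam)
            for j in range(d)
        ]
        # Refresh the table entry and the running average in O(d).
        for j in range(d):
            avg_grad[j] += delta * x[j] / n
        stored_res[i] = res
    return theta
```

Unlike SVRG, this sketch never recomputes a full gradient after initialization; the price is the per-sample table, which here costs only n scalars. A commonly used step size is 1 / (3 L), where L bounds the smoothness of each per-sample loss (for least squares, max_i ||x_i||^2).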

A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

Neural Information Processing Systems

High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M-estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M-estimators have fast convergence rates.
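For orientation, the regularized M-estimator and the two properties named in the abstract can be written roughly as follows. The abstract itself does not fix notation, so the symbols here (loss $\mathcal{L}$, regularizer $\mathcal{R}$, subspace pair $(\mathcal{M}, \bar{\mathcal{M}}^{\perp})$, curvature $\kappa_{\mathcal{L}}$, tolerance $\tau_{\mathcal{L}}$, restricted set $\mathbb{C}$) are assumptions chosen to follow common usage.

```latex
% Regularized M-estimator: loss plus weighted regularizer
\hat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p}
  \bigl\{ \mathcal{L}(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \bigr\}

% Decomposability of R with respect to the subspace pair
\mathcal{R}(\theta + \gamma) = \mathcal{R}(\theta) + \mathcal{R}(\gamma)
  \quad \text{for all } \theta \in \mathcal{M},\ \gamma \in \bar{\mathcal{M}}^{\perp}

% Restricted strong convexity: curvature is required only on the set C
\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*)
  - \langle \nabla \mathcal{L}(\theta^*), \Delta \rangle
  \ge \kappa_{\mathcal{L}} \|\Delta\|^2 - \tau_{\mathcal{L}}^2(\theta^*)
  \quad \text{for all } \Delta \in \mathbb{C}
```

For example, the $\ell_1$ norm is decomposable with respect to the subspace of vectors supported on a fixed index set $S$, since $\|\theta + \gamma\|_1 = \|\theta\|_1 + \|\gamma\|_1$ whenever $\theta$ is supported on $S$ and $\gamma$ on its complement.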

Rest-Katyusha: Exploiting the Solution's Structure via Scheduled Restart Schemes

Neural Information Processing Systems

We propose a structure-adaptive variant of Katyusha, a state-of-the-art stochastic variance-reduced gradient algorithm, for regularized empirical risk minimization. The proposed method is able to exploit the intrinsic low-dimensional structure of the solution, such as sparsity or low rank, which is enforced by a non-smooth regularization, to achieve an even faster convergence rate. This provable algorithmic improvement is achieved by restarting the Katyusha algorithm according to restricted strong-convexity (RSC) constants. We also propose an adaptive-restart variant which is able to estimate the RSC on the fly and adjust the restart period automatically. We demonstrate the effectiveness of our approach via numerical experiments.
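The scheduled-restart idea can be sketched on a simpler accelerated method. The example below swaps in deterministic FISTA (accelerated proximal gradient) for Katyusha and restarts it on a fixed, user-supplied period for the Lasso. It is not Rest-Katyusha: the function names are hypothetical, and `restart_period` is a plain input here rather than a value derived from RSC constants.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * |v| for a scalar v."""
    return max(v - t, 0.0) - max(-v - t, 0.0)


def lasso_objective(X, y, theta, lam):
    """(1 / 2n) * ||X theta - y||^2 + lam * ||theta||_1."""
    n = len(X)
    loss = sum((sum(a * b for a, b in zip(x, theta)) - yi) ** 2
               for x, yi in zip(X, y)) / (2 * n)
    return loss + lam * sum(abs(t) for t in theta)


def smooth_grad(X, y, theta):
    """Gradient of the least-squares part of the objective."""
    n, d = len(X), len(X[0])
    res = [sum(a * b for a, b in zip(x, theta)) - yi for x, yi in zip(X, y)]
    return [sum(res[i] * X[i][j] for i in range(n)) / n for j in range(d)]


def restarted_fista_lasso(X, y, lam, step, restart_period, num_restarts):
    """Hypothetical helper: FISTA with momentum reset every restart_period steps."""
    d = len(X[0])
    theta = [0.0] * d
    for _ in range(num_restarts):
        # Restart: warm-start from the current iterate and drop the momentum.
        z, t = theta[:], 1.0
        for _ in range(restart_period):
            g = smooth_grad(X, y, z)
            new_theta = [soft_threshold(z[j] - step * g[j], step * lam)
                         for j in range(d)]
            new_t = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
            # Extrapolate along the last displacement (Nesterov momentum).
            z = [new_theta[j] + (t - 1.0) / new_t * (new_theta[j] - theta[j])
                 for j in range(d)]
            theta, t = new_theta, new_t
    return theta
```

In this sketch a safe step size is 1 / L for any upper bound L on the largest eigenvalue of X^T X / n; the trace bound L = sum_i ||x_i||^2 / n is crude but always valid.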

Fast global convergence rates of gradient methods for high-dimensional statistical recovery

Neural Information Processing Systems

Many statistical $M$-estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension $d$ to grow with (and possibly exceed) the sample size $n$. This high-dimensional structure precludes the usual global assumptions---namely, strong convexity and smoothness conditions---that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov's first-order method~\cite{Nesterov07} has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter $\theta^*$ and the optimal solution $\widehat{\theta}$. This globally linear rate is substantially faster than previous analyses of global convergence for specific methods that yielded only sublinear rates. Our analysis applies to a wide range of $M$-estimators and statistical models, including sparse linear regression using Lasso ($\ell_1$-regularized regression), group Lasso, block sparsity, and low-rank matrix recovery using nuclear norm regularization. Overall, this result reveals an interesting connection between statistical precision and computational efficiency in high-dimensional estimation.
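The statement "globally geometric rate of convergence up to the statistical precision" can be read schematically as follows. This is a sketch of the shape of the result, not the theorem itself: the symbols $\mathcal{L}_n$, $\mathcal{R}$, $\lambda_n$, $\gamma_u$ (a smoothness or step-size parameter) and $\kappa$ (a contraction factor) are assumed notation, and constants, side constraints, and the precise conditions are omitted.

```latex
% Composite (proximal) gradient update for loss L_n and regularizer R
\theta^{t+1} = \arg\min_{\theta}
  \Bigl\{ \langle \nabla \mathcal{L}_n(\theta^t), \theta \rangle
  + \frac{\gamma_u}{2} \|\theta - \theta^t\|_2^2
  + \lambda_n \mathcal{R}(\theta) \Bigr\}

% Schematic guarantee for some contraction factor \kappa \in (0, 1)
\|\theta^t - \widehat{\theta}\|_2^2
  \le \kappa^t \, \|\theta^0 - \widehat{\theta}\|_2^2
  + o\bigl(\|\widehat{\theta} - \theta^*\|_2^2\bigr)
```

The first term decays geometrically in the iteration count $t$, while the second is of smaller order than the statistical error $\|\widehat{\theta} - \theta^*\|_2^2$, so iterating beyond that point yields no statistical benefit.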