Goto

Collaborating Authors

 polyak


Momentum Provably Improves Error Feedback!

Neural Information Processing Systems

Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when untreated, the errors caused by compression propagate, and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al. [2014] proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field in the last decade, our understanding is far from complete. In this work we address one of the most pressing issues.


Sparse Polyak with optimal thresholding operators for high-dimensional M-estimation

Qiao, Tianqi, Maros, Marie

arXiv.org Machine Learning

We propose and analyze a variant of Sparse Polyak for high dimensional M-estimation problems. Sparse Polyak proposes a novel adaptive step-size rule tailored to suitably estimate the problem's curvature in the high-dimensional setting, guaranteeing that the algorithm's performance does not deteriorate when the ambient dimension increases. However, convergence guarantees can only be obtained by sacrificing solution sparsity and statistical accuracy. In this work, we introduce a variant of Sparse Polyak that retains its desirable scaling properties with respect to the ambient dimension while obtaining sparser and more accurate solutions.



Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation

Qiao, Tianqi, Maros, Marie

arXiv.org Machine Learning

We propose and study Sparse Polyak, a variant of Polyak's adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak step size performs poorly, requiring an increasing number of iterations to achieve optimal statistical precision-even when, the problem remains well conditioned and/or the achievable precision itself does not degrade with problem size. We trace this limitation to a mismatch in how smoothness is measured: in high dimensions, it is no longer effective to estimate the Lipschitz smoothness constant. Instead, it is more appropriate to estimate the smoothness restricted to specific directions relevant to the problem (restricted Lipschitz smoothness constant). Sparse Polyak overcomes this issue by modifying the step size to estimate the restricted Lipschitz smoothness constant. We support our approach with both theoretical analysis and numerical experiments, demonstrating its improved performance.


Entropic Mirror Descent for Linear Systems: Polyak's Stepsize and Implicit Bias

Malitsky, Yura, Posch, Alexander

arXiv.org Machine Learning

This paper focuses on applying entropic mirror descent to solve linear systems, where the main challenge for the convergence analysis stems from the unboundedness of the domain. To overcome this without imposing restrictive assumptions, we introduce a variant of Polyak-type stepsizes. Along the way, we strengthen the bound for $\ell_1$-norm implicit bias, obtain sublinear and linear convergence results, and generalize the convergence result to arbitrary convex $L$-smooth functions. We also propose an alternative method that avoids exponentiation, resembling the original Hadamard descent, but with provable convergence.


Reviews: The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Neural Information Processing Systems

Originality: I'm not an expert on this subfield, but as far as I know the work is original. Quality: I think this is a very nice paper. The results are interesting and clearly stated. And the work addresses an important question about the difference between theoretical learning rate schedules / averaging schemes and those used in practice. I have a couple of major comments: 1) organisation of the paper.


Momentum Provably Improves Error Feedback!

Neural Information Processing Systems

Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when untreated, the errors caused by compression propagate, and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al. [2014] proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field in the last decade, our understanding is far from complete. In this work we address one of the most pressing issues.


Improving Stochastic Cubic Newton with Momentum

Chayti, El Mahdi, Doikov, Nikita, Jaggi, Martin

arXiv.org Artificial Intelligence

We study stochastic second-order methods for solving general non-convex optimization problems. We propose using a special version of momentum to stabilize the stochastic gradient and Hessian estimates in Newton's method. We show that momentum provably improves the variance of stochastic estimates and allows the method to converge for any noise level. Using the cubic regularization technique, we prove a global convergence rate for our method on general non-convex problems to a second-order stationary point, even when using only a single stochastic data sample per iteration. This starkly contrasts with all existing stochastic second-order methods for non-convex problems, which typically require large batches. Therefore, we are the first to demonstrate global convergence for batches of arbitrary size in the non-convex case for the Stochastic Cubic Newton. Additionally, we show improved speed on convex stochastic problems for our regularized Newton methods with momentum.


Polyak's Heavy Ball Method Achieves Accelerated Local Rate of Convergence under Polyak-Lojasiewicz Inequality

Kassing, Sebastian, Weissmann, Simon

arXiv.org Artificial Intelligence

In this work, we consider the convergence of Polyak's heavy ball method, both in continuous and discrete time, on a non-convex objective function. We recover the convergence rates derived in [Pol64] for strongly convex objective functions, assuming only validity of the Polyak-Lojasiewicz inequality. In continuous time our result holds for all initializations, whereas in the discrete time setting we conduct a local analysis around the global minima. Our results demonstrate that the heavy ball method does, in fact, accelerate on the class of objective functions satisfying the Polyak-Lojasiewicz inequality. This holds even in the discrete time setting, provided the method reaches a neighborhood of the global minima. Instead of the usually employed Lyapunov-type arguments, our approach leverages a new differential geometric perspective of the Polyak-Lojasiewicz inequality proposed in [RB24].


Reviews: Integration Methods and Optimization Algorithms

Neural Information Processing Systems

The paper provides an interpretation of a number of accelerated gradient algorithms (and other convex optimization algorithms) based on the theory of numerical integration and numerical solution of ODEs. In particular, the paper focuses on the well-studied class of multi-step methods to interpret Nesterov's method and Polyak's heavy ball method as discretizations of the natural gradient flow dynamics. The authors argue that this interpretation is more beneficial than existing interpretations based on different dynamics (e.g. Notice that the novelty here lies in the focus on multistep methods, as it was already well-known that accelerated algorithms can be obtained by appropriately applying Euler's discretization to certain dynamics (again, see Krichene et al). A large part of the paper is devoted to introducing important properties of numerical discretization schemes for ODEs, including consistency and zero-stability, and their instantiation in the case of multistep methods.