Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback
Barakat, Anas, Kontogiannis, Andreas, Pollatos, Vasilis, Panageas, Ioannis, Varvitsiotis, Antonios
We study adversarial online learning with hidden-convex losses, i.e., nonconvex losses that become convex after a nonlinear reparameterization. Ghai, Lu and Hazan (2022) proved that, under geometric and smoothness assumptions, online gradient descent (OGD) on such nonconvex losses approximately simulates online mirror descent (OMD) on the underlying convex losses with a suitable regularizer, yielding $\mathcal{O}(T^{2/3})$ regret. They left open whether the optimal $\Theta(\sqrt{T})$ regret from online convex optimization can be recovered in this hidden-convex setting. We answer this question affirmatively. More specifically, via a sharper discrete-time algorithmic equivalence argument, we prove that OGD achieves $\mathcal{O}(\sqrt{T})$ regret under the same assumptions, matching the optimal worst-case rate for adversarial online convex optimization. We also address another open question of Ghai, Lu and Hazan (2022) by clarifying the geometry required for this algorithmic equivalence. We replace the diagonal-Jacobian sufficient condition with a necessary-and-sufficient Hessian compatibility condition, thereby expanding the class of admissible reparameterizations. We complement our tight regret bound with a lower bound showing that the Hessian compatibility assumption is essential for OGD; when it fails, we construct a smooth reparameterization and an adversarial sequence of hidden-convex losses for which OGD suffers $\Omega(T)$ regret. Finally, we extend our analysis to one-point bandit feedback and prove a $\mathcal{O}(T^{3/4})$ expected regret bound for bandit OGD with spherical smoothing, matching its classical rate on convex losses.
Anytime Training with Schedule-Free Spectral Optimization
Apte, Anuj, Deshpande, Pranav, Kumar, Niraj, Chakrabarti, Shouvanik, Kim, Junhyung Lyle
Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.
Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise
Agrawal, Shubhada, Maguluri, Siva Theja, Zubeldia, Martin
We establish maximal concentration bounds for the iterates generated by stochastic approximation algorithms with general step sizes, where the noise has a finite-state Markovian component plus a martingale-difference component. When the martingale-difference noise is bounded, we show that the tail of the error can be sub-Gaussian, sub-Weibull, or something lighter than any Pareto but heavier than any Weibull, depending on the step size sequence and on whether the random operator is almost surely contractive, almost surely non-expansive, or expansive with positive probability. Our analysis relies on a novel Lyapunov function involving the moment-generating function of the solution to a Poisson equation, together with an auxiliary projected algorithm. We complement the upper bounds with worst-case examples showing that qualitatively sharper bounds are impossible. We further study the case of unbounded martingale-difference noise when the average operator is contractive and the step sizes are of order $1/k$. In this setting, we show that if the random operator is almost surely non-expansive, then the error tail is at most three times heavier than the noise tail, whereas if the random operator is expansive with positive probability, then the error may have substantially heavier tails. These results are obtained through a novel black-box truncation argument that reduces the unbounded-noise setting to the bounded-noise case.
Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation
Levin, Ilya, Shuklin, Maksim, Moulines, Eric, Mangold, Paul, Samsonov, Sergey
In this paper, we establish Berry-Esseen-type bounds for federated linear stochastic approximation (LSA). Our results provide the first federated Gaussian approximations for LSA that explicitly capture communication-computation trade-offs and heterogeneity-aware error terms, quantifying the effects of local step size, number of local updates, and heterogeneity on convergence rates. We present results for both (i) constant step size regime and (ii) decreasing step size with an increasing number of local iterations, recovering the recent rates of Bonnerjee et al. [2025] as a special case. As a primary application of our results, we develop an online multiplier bootstrap procedure for inference on the last iterate, which avoids explicit estimation of the asymptotic covariance matrix, and obtain non-asymptotic validity guarantees for this procedure.
Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle
Richtárik, Peter, Gruntkowska, Kaja, Li, Hanmin
We design Local LMO - a new projection-free gradient-type method for constrained optimization. The key algorithmic idea is to replace the global linear minimization oracle over the constraint set used by Frank-Wolfe (FW) with a local linear minimization oracle over the intersection of the constraint set and a "small" ball centered at the current iterate. In particular, when minimizing $f:\mathbb{R}^d\to \mathbb{R}$ over a constraint $\emptyset\neq\mathcal{X}\subseteq\mathbb{R}^d$, Local LMO performs the iteration \[x_{k+1}\in \arg\min_{z\in\mathcal{X}\cap\mathcal{B}(x_{k},t_k)}\langle\nabla f(x_{k}), z \rangle,\] where $x_0\in\mathcal{X}$, and $t_k>0$ is a suitably chosen radius which can be interpreted as an effective stepsize. While designed as an alternative to FW, Local LMO is perhaps best viewed as a generalization of Gradient Descent (GD) rather than a modification of FW. Indeed, it is easy to see that Local LMO reduces to GD in the unconstrained setting and, more generally, to GD restricted to an affine subspace if the constraint $\mathcal{X}$ is affine. We prove that this simple algorithmic scheme transfers the known (unaccelerated) convergence rates of Projected Gradient Descent (PGD) to the projection-free world in several important regimes, some of which are beyond the reach of FW. In contrast to FW theory, i) our guarantees hold without requiring the feasible set $\mathcal{X}$ to be bounded, ii) our theory does not require the "curvature" assumption, which allows us to establish a standard sublinear rate for convex functions with bounded gradients, iii) we obtain a linear rate in the smooth strongly convex regime. Furthermore, we obtain sharp sublinear rates in the smooth convex and non-convex regimes, in the $(L_0,L_1)$-smooth convex regime, and in stochastic and non-differentiable settings.
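The unconstrained special case mentioned in the abstract can be illustrated with a short sketch. It assumes a Euclidean ball, in which case the minimizer of $\langle\nabla f(x_k), z\rangle$ over $\mathcal{B}(x_k,t_k)$ is $x_k - t_k \nabla f(x_k)/\|\nabla f(x_k)\|$, so choosing the radius $t_k=\gamma\|\nabla f(x_k)\|$ gives a plain gradient step with stepsize $\gamma$. The function names, the quadratic test objective, and the radius rule are illustrative assumptions, not taken from the paper.

```python
import math


def local_lmo_step(x, grad, radius):
    """Minimize <grad, z> over the Euclidean ball B(x, radius).

    Unconstrained case only (X = R^d); the Euclidean norm is an assumption.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # The linear objective is constant, so the current point is a minimizer.
        return list(x)
    return [xi - radius * gi / norm for xi, gi in zip(x, grad)]


def grad_f(x):
    # Gradient of the illustrative objective f(x) = 0.5 * ||x||^2.
    return list(x)


def run_local_lmo(x0, stepsize, num_steps):
    """Iterate Local LMO with the assumed radius rule t_k = stepsize * ||grad f(x_k)||."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad_f(x)
        radius = stepsize * math.sqrt(sum(gi * gi for gi in g))
        x = local_lmo_step(x, g, radius)
    return x
```

With this radius rule each step coincides with `x - stepsize * grad_f(x)`, so on the quadratic above the iterates contract by a factor `1 - stepsize` per step. A constrained version would need a solver for the linear problem over the intersection of the ball with $\mathcal{X}$, which this sketch does not attempt.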
A Rod Flow Model for Adam at the Edge of Stability
Neural networks are trained by minimizing loss functions with gradient-based optimizers. Cohen et al. [2021] observed that full-batch gradient descent operates at the edge of stability (EoS): the largest eigenvalue of the Hessian, called the sharpness, first rises (a phase called progressive sharpening) and then hovers at the stability threshold 2/η where η is the learning rate. Cohen et al. [2022] extended this picture to momentum methods and adaptive gradient methods, showing that each optimizer exhibits its own edge of stability. Rather than hovering at 2/η, the relevant quantity--the preconditioned sharpness--hovers at a hyperparameter-dependent threshold that depends on the optimizer (Table 2). In practice, the dominant optimizer in machine learning is Adam [Kingma and Ba, 2015], which differs from gradient descent in two respects.
Convexity in Disguise: A Theoretical Framework for Nonconvex Low-Rank Matrix Estimation
Nonconvex methods have emerged as a dominant approach for low-rank matrix estimation, a problem that arises widely in machine learning and AI for learning and representing high-dimensional data. Existing analyses for these methods often require additional regularization to mitigate nonconvexity, even though such regularization is often unnecessary in practice. Moreover, most analyses rely on problem-specific arguments that are difficult to generalize to more complex settings. In this paper, we develop a theoretical framework for studying nonconvex procedures across a broad class of low-rank matrix estimation problems. Rather than focusing on a specific model, we reveal a fundamental mechanism that explains why nonconvex procedures can behave well in low-rank estimation. Our key device is a {\it benign regularizer} that does not alter the original update rule, but yields an equivalent locally strongly convex formulation of the algorithm. This perspective uncovers a disguised convexity inherent in the nonconvex procedure and provides a new route to theoretical guarantees for nonconvex low-rank matrix estimation.
Gradient Regularized Newton Boosting Trees with Global Convergence
Zozoulenko, Nikita, Falkowski, Daniel, Cass, Thomas, Gonon, Lukas
Gradient Boosting Decision Trees (GBDTs) dominate tabular machine learning, with modern implementations like XGBoost, LightGBM, and CatBoost being based on Newton boosting: a second-order descent step in the space of decision trees. Despite its empirical success, the global convergence of Newton boosting is poorly understood compared to first-order boosting. In this paper, we introduce Restricted Newton Descent, which studies convex optimization with Newton's method on Hilbert spaces with inexact iterates, based on the concepts of cosine angle and weak gradient edge. Within this framework, we recover Newton boosting with GBDTs and classical finite-dimensional theory as special cases. We first prove that vanilla Newton boosting achieves a linear rate of convergence for smooth, strongly convex losses that satisfy a Hessian-dominance condition. To handle general convex losses with Lipschitz Hessians, we extend a recent gradient regularized Newton scheme to the restricted weak learner setting. This scheme minimally modifies the classical algorithm by introducing an adaptive $\ell_2$-regularization term proportional to the square root of the gradient norm at each iteration. We establish a $\mathcal{O}(\frac{1}{k^2})$ rate for this scheme, thereby obtaining a globally convergent second-order GBDT algorithm with a rate matching that of first-order boosting with Nesterov momentum. In numerical experiments, we show that our scheme converges while vanilla Newton boosting may diverge.
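The contrast between vanilla and gradient regularized Newton steps can be seen on a one-dimensional toy problem. The sketch below is a finite-dimensional illustration only, not the boosting algorithm of the paper: it minimizes $f(x)=\sqrt{1+x^2}$, for which the pure Newton update is $x \mapsto -x^3$ and diverges whenever $|x_0|>1$. The regularized step adds $\sqrt{c\,|f'(x)|}$ to the second derivative; the constant `c = 1.0` and the function names are assumptions made for illustration.

```python
import math


def grad(x):
    # f(x) = sqrt(1 + x^2), so f'(x) = x / sqrt(1 + x^2).
    return x / math.sqrt(1.0 + x * x)


def hess(x):
    # f''(x) = (1 + x^2)^(-3/2), strictly positive but vanishing for large |x|.
    return (1.0 + x * x) ** -1.5


def newton(x, num_steps):
    """Vanilla Newton: x <- x - f'(x) / f''(x), which equals -x^3 for this f."""
    for _ in range(num_steps):
        x = x - grad(x) / hess(x)
    return x


def grad_regularized_newton(x, num_steps, c=1.0):
    """Newton step with an added regularizer sqrt(c * |f'(x)|) (c is assumed)."""
    for _ in range(num_steps):
        g = grad(x)
        reg = math.sqrt(c * abs(g))
        x = x - g / (hess(x) + reg)
    return x
```

Starting from `x = 2.0`, three vanilla Newton steps already produce an iterate of magnitude above $10^8$, while the regularized iteration moves monotonically toward the minimizer at 0. The regularizer shrinks together with the gradient, so the step approaches a pure Newton step near the optimum.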
Decoupled Descent: Exact Test Error Tracking Via Approximate Message Passing
In modern parametric model training, full-batch gradient descent (and its variants) suffers due to progressively stronger biasing towards the exact realization of training data; this drives the systematic ``generalization gap'', where the train error becomes an unreliable proxy for test error. Existing approaches either argue this gap is benign through complex analysis or sacrifice data to a validation set. In contrast, we introduce decoupled descent (DD), a novel theory-based training algorithm that satisfies a train-test identity -- enforcing the train error to asymptotically track the test error for stylized Gaussian mixture models. Within this specific regime, leveraging approximate message passing theory, DD iteratively cancels the biases due to data reuse, rigorously demonstrating the feasibility of zero-cost validation and $100\%$ data utilization. Moreover, DD is governed by a low-dimensional state evolution recursion, rendering the dynamics of the algorithm transparent and tractable. We validate DD on XOR classification, yielding superior performance compared to GD; additionally, we implement noisy MNIST and non-linear probing of CIFAR-10, demonstrating that even when our stylized assumptions are relaxed, DD narrows the generalization gap compared to GD.
Regularized Nonlinear Acceleration
Damien Scieur, Alexandre d'Aspremont, Francis Bach
We describe a convergence acceleration technique for generic optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple and small linear system, whose solution can be updated online. This acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. Numerical experiments are detailed on classical classification problems.
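A minimal sketch of the kind of nonlinear averaging described above, under stated assumptions: the weights solve a small regularized linear system built from the Gram matrix of successive differences of the iterates, and are then normalized to sum to one. The regularization value `reg`, the absence of any rescaling of the Gram matrix, and the helper `gd_iterates` are illustrative choices and may differ from the paper's exact formulation.

```python
import numpy as np


def rna_extrapolate(iterates, reg=1e-8):
    """Combine iterates x_0..x_N into one extrapolated point.

    Weights c minimize ||sum_i c_i (x_{i+1} - x_i)||^2 + reg * ||c||^2
    subject to sum_i c_i = 1; the output is sum_i c_i x_i over x_0..x_{N-1}.
    """
    X = np.asarray(iterates, dtype=float)  # shape (N + 1, d), one iterate per row
    R = X[1:] - X[:-1]                     # residuals, shape (N, d)
    gram = R @ R.T                         # small N x N system matrix
    z = np.linalg.solve(gram + reg * np.eye(len(R)), np.ones(len(R)))
    weights = z / z.sum()                  # enforce that the weights sum to one
    return weights @ X[:-1]


def gd_iterates(x0, x_star, curvatures, stepsize, num_steps):
    """Gradient descent on f(x) = 0.5 * sum_j curvatures[j] * (x[j] - x_star[j])**2."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(num_steps):
        x = x - stepsize * curvatures * (x - x_star)
        history.append(x.copy())
    return history
```

On a two-dimensional quadratic, four gradient descent iterates are enough for the extrapolated point to land much closer to the optimum than the last iterate, because the iteration is linear and a short combination of iterates can cancel its error modes. The system is only N by N, whatever the dimension of the iterates.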