rmsprop
A Rod Flow Model for Adam at the Edge of Stability
Neural networks are trained by minimizing loss functions with gradient-based optimizers. Cohen et al. [2021] observed that full-batch gradient descent operates at the edge of stability (EoS): the largest eigenvalue of the Hessian, called the sharpness, first rises (a phase called progressive sharpening) and then hovers at the stability threshold 2/ฮท where ฮท is the learning rate. Cohen et al. [2022] extended this picture to momentum methods and adaptive gradient methods, showing that each optimizer exhibits its own edge of stability. Rather than hovering at 2/ฮท, the relevant quantity--the preconditioned sharpness--hovers at a hyperparameter-dependent threshold that depends on the optimizer (Table 2). In practice, the dominant optimizer in machine learning is Adam [Kingma and Ba, 2015], which differs from gradient descent in two respects.
Resetting the Optimizer in Deep RL: An Empirical Study
We focus on the task of approximating the optimal value function in deep reinforcement learning. This iterative process is comprised of solving a sequence of optimization problems where the loss function changes per iteration. The common approach to solving this sequence of problems is to employ modern variants of the stochastic gradient descent algorithm such as Adam. These optimizers maintain their own internal parameters such as estimates of the first-order and the secondorder moments of the gradient, and update them over time. Therefore, information obtained in previous iterations is used to solve the optimization problem in the current iteration. We demonstrate that this can contaminate the moment estimates because the optimization landscape can change arbitrarily from one iteration to the next one. To hedge against this negative effect, a simple idea is to reset the internal parameters of the optimizer when starting a new iteration. We empirically investigate this resetting idea by employing various optimizers in conjunction with the Rainbow algorithm. We demonstrate that this simple modification significantly improves the performance of deep RL on the Atari benchmark.
76c073d8a82d9ddaf993300be03ac70f-Paper.pdf
We prove the O(t 1/2) rate of convergence for the squared norm of the gradient of Moreau envelope, which is the standard stationarity measure for this class of problems. It matches the known rates that adaptive algorithms enjoy for the specific case ofunconstrained smoothnonconvexstochastic optimization.
ANO : Faster is Better in Noisy Landscape
Stochastic optimizers are central to deep learning, yet widely used methods such as Adam and Adan can degrade in non-stationary or noisy environments, partly due to their reliance on momentum-based magnitude estimates. We introduce Ano, a novel optimizer that decouples direction and magnitude: momentum is used for directional smoothing, while instantaneous gradient magnitudes determine step size. This design improves robustness to gradient noise while retaining the simplicity and efficiency of first-order methods. We further propose Anolog, which removes sensitivity to the momentum coefficient by expanding its window over time via a logarithmic schedule. We establish non-convex convergence guarantees with a convergence rate similar to other sign-based methods, and empirically show that Ano provides substantial gains in noisy and non-stationary regimes such as reinforcement learning, while remaining competitive on low-noise tasks.