How Does Adaptive Optimization Impact Local Neural Network Geometry?

Neural Information Processing Systems 

Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the *global* geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a *local* trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates.
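
To make the idea of a local, trajectory-dependent conditioning measure concrete, the following minimal sketch tracks the ordinary Hessian condition number along a gradient descent trajectory. It is an illustration only and does not implement $R^{\text{OPT}}_{\text{med}}$, whose definition is not given in this passage. The toy loss $f(x, y) = \tfrac{1}{4}x^4 + \tfrac{1}{2}x^2 + 5y^2 + xy$, the step size, the starting point, and all function names are assumptions chosen for illustration.

```python
import math

# Hypothetical toy loss (not from the paper):
#   f(x, y) = 0.25*x**4 + 0.5*x**2 + 5*y**2 + x*y
# Its Hessian depends on x, so conditioning changes along the trajectory.

def grad(x, y):
    """Gradient of the toy loss."""
    return (x**3 + x + y, 10.0 * y + x)

def hessian(x, y):
    """Hessian of the toy loss as a symmetric 2x2 tuple of rows."""
    return ((3.0 * x**2 + 1.0, 1.0), (1.0, 10.0))

def condition_number(h):
    """Ratio of largest to smallest eigenvalue of a symmetric
    positive-definite 2x2 matrix, using the closed-form eigenvalues."""
    (a, b), (_, d) = h
    tr = a + d
    det = a * d - b * b
    disc = math.sqrt(tr * tr - 4.0 * det)
    lam_max = (tr + disc) / 2.0
    lam_min = (tr - disc) / 2.0
    return lam_max / lam_min

def gd_condition_numbers(x, y, lr=0.05, steps=50):
    """Run plain gradient descent and record the Hessian condition
    number at each iterate before the update is applied."""
    kappas = []
    for _ in range(steps):
        kappas.append(condition_number(hessian(x, y)))
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return kappas

kappas = gd_condition_numbers(2.0, 1.0)
print(f"first iterate: {kappas[0]:.3f}, last iterate: {kappas[-1]:.3f}")
```

Because the Hessian of this toy loss varies with the iterate, the recorded values differ from step to step. A single global condition number would miss this variation, which is the kind of behavior a local trajectory analysis is meant to capture.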