Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations

Heredia, Carlos

arXiv.org Artificial Intelligence 

In this paper, we propose a continuous-time formulation of the AdaGrad, RMSProp, and Adam optimization algorithms by modeling them as first-order integro-differential equations. We perform numerical simulations of these equations to demonstrate their validity as accurate approximations of the original algorithms. Our results show strong agreement between the behavior of the continuous-time models and the discrete implementations, providing a new perspective on the theoretical understanding of adaptive optimization methods.

Finding the global minima of non-convex objective functions presents a significant challenge due to the inherent complexity of the loss landscape. Gradient Descent (GD) remains one of the most prominent algorithms for minimizing a function f by iteratively updating the parameters θ (Boyd & Vandenberghe, 2004). It adjusts the parameters in the direction of steepest descent of f with a fixed step size α (the learning rate). At each iteration, the algorithm computes the gradient of f with respect to θ and uses it to update the parameters, progressively decreasing f (Rumelhart et al., 1986):

θ_{k+1} = θ_k − α ∇_θ f(θ_k).

The continuous nature of such methods permits a more direct application of differential-equation techniques. For readers interested in a continuous description of the stochastic method, we refer to Sirignano & Spiliopoulos (2017). Adaptive optimization methods such as AdaGrad (Duchi et al., 2011) and RMSProp (Hinton, 2012) have been pivotal in advancing gradient-based algorithms.
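To make the connection between the discrete update and its continuous-time counterpart concrete, the following sketch (not from the paper; a minimal illustration on a quadratic objective) compares GD iterates with the gradient-flow ODE θ'(t) = −∇f(θ(t)), whose solution for f(θ) = θ²/2 is θ(t) = θ₀e^{−t}. At t = kα the two agree up to O(α):

```python
import numpy as np

def gradient_descent(grad, theta0, alpha, n_steps):
    """Plain GD: theta_{k+1} = theta_k - alpha * grad(theta_k)."""
    theta = float(theta0)
    for _ in range(n_steps):
        theta -= alpha * grad(theta)
    return theta

# f(theta) = theta^2 / 2, so grad f(theta) = theta.
# Discrete GD yields (1 - alpha)^k * theta0; the gradient flow
# theta'(t) = -theta(t) yields theta0 * exp(-t).
alpha, k, theta0 = 0.01, 200, 1.0
discrete = gradient_descent(lambda th: th, theta0, alpha, k)
continuous = theta0 * np.exp(-alpha * k)
print(discrete, continuous)  # close for small alpha
```

For a fixed time horizon t = kα, shrinking α tightens the match, which is the same discrete-to-continuous correspondence the paper develops for the adaptive methods.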