Stochastic Gradient Flow Dynamics of Test Risk and its Exact Solution for Weak Features

Veiga, Rodrigo, Remizova, Anastasia, Macris, Nicolas

arXiv.org Artificial Intelligence 

In supervised learning of neural networks and regression models, understanding the dynamics of optimization algorithms, and in particular stochastic gradient descent (SGD), is of utmost importance. However, despite much progress in a number of directions, this remains a highly challenging theoretical problem. A fruitful approach that allows analytical progress consists of approximating SGD by a suitable continuous-time stochastic process, henceforth called stochastic gradient flow (SGF). In this contribution, we build on this approach to develop a general formalism characterizing the dynamics of the stochastic process, and apply it to the investigation of the test risk (or generalization error) as a function of time. As is well known, the classical bias-variance trade-off has been challenged in a number of models displaying the double descent phenomenon [1, 2, 3]. Analytical derivations of double descent curves have been achieved for relatively simple models, but are limited to the use of least squares estimators (no dynamics) and pure gradient flow (GF) approximations of gradient descent (GD). The present work goes one step further by investigating the effects of stochasticity on the double descent curve. Our main contributions are summarized as follows:

C1. We consider a general Itô stochastic differential equation (SDE) and represent the Markovian transition probability as a path integral, Eq. (12). A general 'explicit' formula for the transition probability, Eq. (18), is derived in the limit of a small learning rate by using a Laplace approximation.
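The continuous-time approximation referred to above can be illustrated with a minimal sketch: discrete SGD on a toy quadratic loss alongside an Euler-Maruyama integration of the corresponding Itô SDE. All numerical values (learning rate, noise scale, loss) are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: quadratic loss L(theta) = theta^2 / 2, so grad L(theta) = theta.
# Stochastic gradients are modeled as grad L plus Gaussian noise of scale sigma.
eta = 0.05     # learning rate (hypothetical value)
sigma = 0.5    # gradient-noise scale (hypothetical value)
steps = 2000

def noisy_grad(theta):
    """Stochastic gradient: true gradient plus Gaussian noise."""
    return theta + sigma * rng.standard_normal()

# Discrete SGD: theta_{k+1} = theta_k - eta * g_k
theta_sgd = 3.0
for _ in range(steps):
    theta_sgd -= eta * noisy_grad(theta_sgd)

# SGF-style Ito SDE: d theta = -grad L(theta) dt + sqrt(eta) * sigma dW,
# integrated with Euler-Maruyama using time step dt = eta, so that one
# integration step matches one SGD step in physical time.
theta_sgf = 3.0
dt = eta
for _ in range(steps):
    dW = np.sqrt(dt) * rng.standard_normal()
    theta_sgf += -theta_sgf * dt + np.sqrt(eta) * sigma * dW

# Both trajectories fluctuate around the minimizer theta = 0 with a
# stationary spread controlled by eta and sigma.
print(abs(theta_sgd), abs(theta_sgf))
```

The diffusion coefficient scales with the square root of the learning rate, which is why the noise term vanishes and SGF reduces to plain gradient flow as the learning rate goes to zero.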