A dynamic view of some anomalous phenomena in SGD

Borkar, Vivek Shripad

arXiv.org Artificial Intelligence 

It has been observed by Belkin et al. that over-parametrized neural networks exhibit a 'double descent' phenomenon. That is, as the model complexity (as reflected in the number of features) increases, the test error initially decreases, then increases, and then decreases again. A counterpart of this phenomenon in the time domain has been noted in the context of epoch-wise training, viz., the test error decreases with the number of iterates, then increases, then decreases again. Another anomalous phenomenon is that of grokking, wherein two regimes of descent are interrupted by a third regime in which the mean loss remains almost constant. This note presents a plausible explanation for these and related phenomena by using the theory of two time scale stochastic approximation, applied to the continuous time limit of the gradient dynamics. This gives a novel perspective on an already well studied theme.

Key words: stochastic gradient descent; temporal double descent; grokking; overparametrized neural networks; stochastic approximation; singularly perturbed differential equations; two time scales

1. Introduction

Many anomalous phenomena regarding the temporal evolution of stochastic gradient descent (SGD) as applied to over-parametrized neural networks have been pointed out in the literature. We specifically consider the following:

1. Temporal double descent: Beginning with Belkin et al. [9], the phenomenon of 'double descent' in the training of over-parametrized neural networks using SGD has been flagged and extensively studied from various angles [1, 8, 10, 18, 20, 25, 31, 32, 36, 42].