On exponential convergence of SGD in non-convex over-parametrized learning
Raef Bassily, Mikhail Belkin, Siyuan Ma
Stochastic Gradient Descent (SGD) and its variants have become a staple of the algorithmic foundations of machine learning. Yet many of their properties are not fully understood, particularly in the non-convex settings common in modern practice. In this note, we study the convergence of SGD for the class of functions satisfying the Polyak-Łojasiewicz (PL) condition. This class contains all strongly convex functions as well as a broad range of non-convex functions, including those used in machine learning applications (see the discussion below). The primary purpose of this note is to show that in the interpolation setting (common in modern over-parametrized machine learning and studied in our previous work [8]), SGD with a fixed step size converges exponentially for functions satisfying the PL condition. To the best of our knowledge, this is the first such exponential convergence result for a class of non-convex functions. Below, we discuss and highlight a number of aspects of the PL condition that differentiate it from the convex setting and make it more relevant to the practice and requirements of many machine learning problems.
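For concreteness, here is a schematic form of the two notions involved (notation ours; constants simplified and not those of the paper's theorem). A loss $f$ with minimum value $f^*$ satisfies the PL condition with parameter $\mu > 0$ if
\[ \tfrac{1}{2}\,\|\nabla f(w)\|^2 \;\ge\; \mu\,\bigl(f(w) - f^*\bigr) \quad \text{for all } w, \]
and exponential (linear) convergence of fixed-step-size SGD means
\[ \mathbb{E}\bigl[f(w_t) - f^*\bigr] \;\le\; (1 - c)^t\,\bigl(f(w_0) - f^*\bigr) \]
for some constant $c \in (0,1)$ depending on the step size, $\mu$, and the smoothness of the losses.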
Nov-5-2018