From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent

Neural Information Processing Systems 

Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.