The Two Phases of Gradient Descent in Deep Learning

@machinelearnbot 

Thanks to great experimental work by several research groups studying the behavior of Stochastic Gradient Descent (SGD), we are collectively gaining a much clearer understanding as to what happens in the neighborhood of training convergence. The story begins with the best paper award winner for ICLR 2017, "Rethinking Generalization". This paper I first discussed several months ago in a blog post "Rethinking Generalization in Deep Learning". One interesting observation in that paper is the role of SGD. Indeed, in neural networks, we almost always choose our model as the output of running stochastic gradient descent.