The Two Phases of Gradient Descent in Deep Learning
Thanks to great experimental work by several research groups studying the behavior of Stochastic Gradient Descent (SGD), we are collectively gaining a much clearer understanding of what happens in the neighborhood of training convergence. I first discussed this paper several months ago in a blog post, "Rethinking Generalization in Deep Learning". Leslie Smith and Nicholay Topin recently submitted a paper to the ICLR 2017 workshop track, "Exploring Loss Function Topology with Cyclical Learning Rates", in which they report some peculiar convergence behavior: as the learning rate is monotonically increased and then decreased, there is a transition near the convergence regime where a large enough learning rate perturbs the system right out of its basin and into a region of much higher loss. There is, however, one pragmatic takeaway from this paper: "Averaging two models within a basin tend to give a error that is the average of the two models (or less). Averaging two models between basins tend to give an error that is higher than both models".
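To make the averaging takeaway concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration and is not taken from the paper: the "model" is a single scalar weight, the loss is a hypothetical double-well function with basins around w = -1 and w = +1, and the names `loss`, `average_weights`, and `compare` are invented here. Real networks have far more complicated loss surfaces, so treat this as intuition only.

```python
def loss(w: float) -> float:
    """Toy double-well loss (an assumption): two basins, minima at w = -1 and w = +1."""
    return (w * w - 1.0) ** 2


def average_weights(w_a: float, w_b: float) -> float:
    """Average two 'models' in parameter space; each model is one scalar here."""
    return 0.5 * (w_a + w_b)


def compare(w_a: float, w_b: float) -> tuple[float, float, float]:
    """Return the loss of model A, model B, and their parameter average."""
    w_avg = average_weights(w_a, w_b)
    return loss(w_a), loss(w_b), loss(w_avg)


# Two models sitting in the same basin (both near w = +1).
same_basin = compare(0.9, 1.1)

# Two models sitting in different basins (one near w = -1, one near w = +1).
across_basins = compare(-1.1, 0.9)

print("same basin    (loss A, loss B, loss of average):", same_basin)
print("across basins (loss A, loss B, loss of average):", across_basins)
```

In this toy setting, the within-basin average lands near the bottom of the shared basin, so its loss is no worse than the mean of the two individual losses. The across-basin average lands on the barrier between the wells, so its loss is higher than that of either model.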
May-20-2017, 21:20:28 GMT