The learning rate is one of the most important hyper-parameters to tune when training deep neural networks. In this post, I describe a simple and powerful way to find a reasonable learning rate, which I learned from fast.ai. It's not available to the general public yet, but will be at the end of the year at course.fast.ai. There are many variations of stochastic gradient descent (Adam, RMSProp, Adagrad, etc.), and all of them let you set the learning rate.
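As a preview of the idea: train for a number of iterations while increasing the learning rate exponentially each step, record the loss, and look for the rate at which the loss stops falling and starts to blow up. Here is a minimal sketch on a toy quadratic loss; the function names and the toy problem are mine, for illustration only, not fast.ai's actual API.

```python
import numpy as np

def lr_range_test(loss_fn, grad_fn, w0, lr_min=1e-5, lr_max=10.0, steps=100):
    """Increase the learning rate exponentially each step while training,
    recording the loss. The loss typically falls, bottoms out, then blows up."""
    lrs = np.geomspace(lr_min, lr_max, steps)
    w = np.asarray(w0, dtype=float)
    losses = []
    for lr in lrs:
        losses.append(loss_fn(w))
        w = w - lr * grad_fn(w)  # plain SGD update
    return lrs, np.array(losses)

# Toy quadratic: loss(w) = 0.5 * ||w||^2, so grad(w) = w
lrs, losses = lr_range_test(loss_fn=lambda w: 0.5 * float(w @ w),
                            grad_fn=lambda w: w,
                            w0=[1.0, -2.0])
# For this quadratic the loss bottoms out near lr ~ 2, the divergence
# threshold; a common heuristic is to pick a rate well below that point.
```

In practice you would plot `losses` against `lrs` on a log scale and choose a learning rate roughly an order of magnitude below where the loss is lowest.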
Much of this post is based on material written by others at fast.ai; this is a concise version, arranged so you can quickly get to the meat of it. Do go over the references for more details. First off, what is a learning rate? The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient.
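That one-sentence definition is just the familiar update rule: new_weight = weight - learning_rate * gradient. A minimal one-dimensional illustration (the toy loss and numbers are mine, chosen only to show the mechanics):

```python
def sgd_step(w, grad, lr):
    """One step of gradient descent: move against the gradient,
    scaled by the learning rate."""
    return w - lr * grad

# Toy loss L(w) = (w - 4)**2, with gradient dL/dw = 2 * (w - 4)
w = 0.0
for _ in range(50):
    w = sgd_step(w, grad=2 * (w - 4), lr=0.1)
# After 50 steps, w is very close to the minimum at w = 4
```

With a much smaller learning rate the same 50 steps would leave `w` far from the minimum; with one that is too large, the iterates would overshoot and diverge, which is exactly why this knob matters.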
Right now, Jeremy Howard – the co-founder of fast.ai – is falling down the rankings. Why? His own students are beating him. And their names can now be found across the tops of leaderboards all over Kaggle. So what are these secrets that are allowing novices to implement world-class algorithms in mere weeks, leaving experienced deep learning practitioners behind in their GPU-powered wake? Allow me to tell you in ten simple steps.
Setting the hyper-parameters requires expertise and extensive trial and error; there are no simple, reliable recipes -- specifically for the learning rate, batch size, momentum, and weight decay. Before discussing ways to find their optimal values, let us first understand what they are. These hyper-parameters act as knobs that can be tweaked during training, and for our model to produce its best results, we need to find good values for each of them.
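To make the knobs concrete, here is what a single SGD step looks like once momentum and weight decay are included. This is a hedged sketch of the standard textbook formulation, not any particular library's API; batch size enters implicitly through how `grad` is averaged over a mini-batch.

```python
def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One SGD step with momentum and L2 weight decay.
    `grad` is assumed to be the loss gradient averaged over a mini-batch,
    so batch size shows up through the noise and scale of `grad`."""
    grad = grad + weight_decay * w           # weight decay: pull weights toward 0
    velocity = momentum * velocity - lr * grad  # momentum: smooth the updates
    return w + velocity, velocity

# Toy loss L(w) = (w - 4)**2, gradient dL/dw = 2 * (w - 4)
w, v = 0.0, 0.0
for _ in range(300):
    w, v = sgd_momentum_step(w, grad=2 * (w - 4), velocity=v, lr=0.05)
# w converges to (approximately) the minimum at w = 4
```

Each knob is visible here: the learning rate scales every update, momentum controls how much past updates carry over, and weight decay adds a small pull toward zero. Note that some libraries fold these terms together slightly differently, so treat this as the general shape rather than a drop-in implementation.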