When building a deep learning project, the most common problem we all face is choosing the right hyperparameters (the settings we fix before training, such as those of the optimizer). This is critical because the hyperparameters largely determine how well the machine learning model performs. In Machine Learning (ML hereafter), a hyperparameter is a configuration variable that is external to the model and whose value is not estimated from the data. Hyperparameters are an essential part of the process of estimating model parameters and are usually set by the practitioner. When we tune an ML algorithm for a specific problem, for example with grid search or random search, we are searching over hyperparameter values to find the ones that give the most accurate predictions.
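To make the grid search and random search idea concrete, here is a minimal sketch in plain Python. The `validation_score` function is a hypothetical stand-in for "train a model and return its validation accuracy"; its shape, the candidate values, and the search ranges are all assumptions chosen for illustration.

```python
import itertools
import math
import random


def validation_score(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and scoring it.

    A real project would train and evaluate here. This toy function
    simply peaks at learning_rate=0.01 and batch_size=64.
    """
    lr_penalty = abs(math.log10(learning_rate) - math.log10(0.01))
    bs_penalty = abs(math.log2(batch_size) - math.log2(64))
    return 1.0 / (1.0 + lr_penalty + 0.1 * bs_penalty)


def grid_search(grid):
    """Evaluate every combination of the listed values; keep the best."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = validation_score(**config)
        if best is None or score > best[0]:
            best = (score, config)
    return best


def random_search(n_trials, seed=0):
    """Evaluate randomly sampled configurations; keep the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = {
            # Sample the learning rate on a log scale between 1e-4 and 1e-1.
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "batch_size": rng.choice([16, 32, 64, 128, 256]),
        }
        score = validation_score(**config)
        if best is None or score > best[0]:
            best = (score, config)
    return best


grid = {"learning_rate": [1e-4, 1e-3, 1e-2, 1e-1], "batch_size": [16, 64, 256]}
grid_best = grid_search(grid)
random_best = random_search(n_trials=12)
print("grid search:  ", grid_best)
print("random search:", random_best)
```

Both searches spend 12 evaluations here. Grid search only ever tries the listed values, while random search can land between them, which is why it is often preferred when only a few hyperparameters really matter.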
The process of setting the hyperparameters requires expertise and extensive trial and error. There is no simple and easy way to set them, in particular the learning rate, batch size, momentum, and weight decay. Before discussing how to find their optimal values, let us first understand what these four hyperparameters do. They act as knobs that can be tweaked during the training of the model. For our model to give the best result, we need to find the optimal value of each of them.
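As a rough illustration of where each of the four knobs enters training, here is a small sketch of mini-batch SGD for the one-parameter model y = w * x. The function names, the default values, and the toy data are assumptions made for this example. Weight decay is applied as an L2 term added to the gradient, which is one common convention and not the only one.

```python
import random


def sgd_step(w, velocity, batch, learning_rate, momentum, weight_decay):
    """One SGD update for y = w * x with mean squared error loss."""
    # Gradient of the mean squared error over the mini-batch.
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    # Weight decay (assumed here to be an L2 penalty) pulls w toward zero.
    grad += weight_decay * w
    # Momentum carries over part of the previous update direction.
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity


def train(data, learning_rate=0.05, batch_size=8, momentum=0.9,
          weight_decay=0.0, epochs=100, seed=0):
    """Run mini-batch SGD and return the learned weight."""
    rng = random.Random(seed)
    w, velocity = 0.0, 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        # Batch size sets how many examples feed each gradient estimate.
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w, velocity = sgd_step(w, velocity, batch,
                                   learning_rate, momentum, weight_decay)
    return w


# Noise-free toy data generated from y = 3 * x.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(-10, 11)]
w_plain = train(data)
w_decayed = train(data, weight_decay=0.5)
print("no weight decay:  ", w_plain)
print("with weight decay:", w_decayed)
```

Without weight decay the learned weight settles close to the true slope of 3, while a strong weight decay pulls it noticeably toward zero. That is the regularizing effect this knob trades against fit.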
Although deep learning has produced dazzling successes in image, speech, and video processing in the past few years, most training runs use suboptimal hyperparameters and therefore take unnecessarily long. Setting the hyperparameters remains a black art that takes years of experience to acquire. This report proposes several efficient ways to set the hyperparameters that significantly reduce training time and improve performance. Specifically, it shows how to examine the validation/test loss during training for subtle clues of underfitting and overfitting, and suggests guidelines for moving toward the optimal balance point. It then discusses how to increase or decrease the learning rate and momentum to speed up training. Our experiments show that it is crucial to balance every form of regularization for each dataset and architecture. Weight decay is used as a sample regularizer to show how its optimal value is tightly coupled with the learning rate and momentum.
Much of this post is based on earlier writing from fast.ai. This is a concise version of it, arranged so that one can quickly get to the meat of the material. Do go over the references for more details.
First off, what is a learning rate? The learning rate is a hyperparameter that controls how much we adjust the weights of our network with respect to the loss gradient.
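In its simplest form, this is the gradient descent update new_weight = weight - learning_rate * gradient. The sketch below applies that rule to the toy loss f(w) = w**2, whose gradient is 2*w. The loss, the starting point, and the three learning rates are assumptions picked only to show the effect of the step size.

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    """Minimise f(w) = w**2 from w0 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # step against the gradient
    return w


small = gradient_descent(0.01)  # tiny steps: still far from the minimum at 0
good = gradient_descent(0.3)    # moderate steps: ends very close to 0
large = gradient_descent(1.1)   # oversized steps: overshoots and diverges

print("lr=0.01:", small)
print("lr=0.3: ", good)
print("lr=1.1: ", large)
```

A learning rate that is too small wastes training time, and one that is too large makes the loss blow up. Finding a value between those extremes is what the tuning is about.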