When building a deep learning project, the most common problem we all face is choosing the right hyperparameters. This is critical because the hyperparameters largely determine how well the machine learning model performs. In Machine Learning (ML hereafter), a hyperparameter is a configuration variable that is external to the model and whose value is not estimated from the data. Hyperparameters govern the process of estimating the model parameters and are usually set by the practitioner. When we tune an ML algorithm for a specific problem, for example with a grid search or a random search, we are searching over the hyperparameters of the model to discover the values that give the most accurate predictions.
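To make the difference between grid search and random search concrete, here is a minimal sketch in plain Python. The function `validation_loss` is a hypothetical stand-in for "train the model with these settings and measure its validation loss"; the toy formula inside it, the candidate values, and the search ranges are all assumptions chosen for illustration, not recommendations.

```python
import itertools
import math
import random

# Candidate batch sizes (illustrative assumption).
BATCH_SIZES = [16, 32, 64, 128, 256]


def validation_loss(lr, batch_size):
    """Hypothetical stand-in for training a model and returning its
    validation loss. A toy bowl whose minimum is at lr=0.1, batch_size=64."""
    return (math.log10(lr) + 1.0) ** 2 + (math.log2(batch_size) - 6.0) ** 2


def grid_search(lrs, batch_sizes):
    """Evaluate every combination on a fixed grid and keep the best one."""
    grid = itertools.product(lrs, batch_sizes)
    return min(grid, key=lambda cfg: validation_loss(*cfg))


def random_search(n_trials, seed=0):
    """Sample configurations at random and keep the best one.
    The learning rate is drawn log-uniformly from [1e-4, 1]."""
    rng = random.Random(seed)
    trials = [
        (10 ** rng.uniform(-4, 0), rng.choice(BATCH_SIZES))
        for _ in range(n_trials)
    ]
    return min(trials, key=lambda cfg: validation_loss(*cfg))


best_grid = grid_search([0.001, 0.01, 0.1, 1.0], BATCH_SIZES)
best_random = random_search(n_trials=20)
print("grid search:  ", best_grid)
print("random search:", best_random)
```

Grid search can only return values that were placed on the grid, while random search can land anywhere in the range but offers no guarantee of hitting a particular point. In a real project each call to `validation_loss` would be a full training run, so the number of trials is the main cost.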
Setting the hyper-parameters requires expertise and extensive trial and error. There are no simple and easy ways to set them, specifically the learning rate, batch size, momentum, and weight decay. Before discussing how to find their optimal values, let us first understand what these hyper-parameters are. They act as knobs that can be tweaked during the training of the model. For our model to give the best result, we need to find the optimal value of each of them.
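To see where each of the four knobs enters training, here is a minimal sketch of mini-batch SGD on a one-weight toy problem (fitting y = 3x). It uses one common formulation of momentum and L2 weight decay; other formulations exist. The helper names `minibatch_grad` and `train`, the toy data, and every numeric setting are assumptions made for illustration.

```python
import random


def minibatch_grad(w, batch):
    """Gradient of the mean of 0.5 * (w*x - y)**2 over one mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def train(data, lr, batch_size, momentum, weight_decay, epochs=100, seed=0):
    """Fit a single weight w with mini-batch SGD (illustrative sketch)."""
    rng = random.Random(seed)
    data = list(data)
    w, v = 0.0, 0.0  # weight and its velocity
    for _ in range(epochs):
        rng.shuffle(data)
        # batch size: how many examples are averaged per update
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # weight decay: extra gradient term pulling w toward zero
            g = minibatch_grad(w, batch) + weight_decay * w
            # momentum: velocity accumulates a decaying sum of past gradients
            v = momentum * v + g
            # learning rate: scales the size of each step
            w -= lr * v
    return w


# Toy data on the line y = 3x, with x in 0.1, 0.2, ..., 2.0
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 21)]

w_plain = train(data, lr=0.05, batch_size=4, momentum=0.9, weight_decay=0.0)
w_decay = train(data, lr=0.05, batch_size=4, momentum=0.9, weight_decay=0.1)
print("without weight decay:", w_plain)
print("with weight decay:   ", w_decay)
```

Without weight decay the weight should settle close to 3; with weight decay it is pulled toward zero and settles somewhat lower. Changing any one of the four arguments changes how fast, and where, training settles, which is why they have to be tuned together.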
In the previous story (part A) we discussed the structure and the three main building blocks of a Neural Network. This story will take you through the elements that really make Neural Networks a useful force and separate them from the rest of the Machine Learning algorithms. These are the values which you must set manually. If you think of an NN as a machine, the knobs that change the behavior of the machine would be the hyper-parameters of the NN. A hyper-parameter is a value that your model requires but about which we usually have very little idea in advance.
Although deep learning has produced dazzling successes for applications of image, speech, and video processing in the past few years, most training runs use suboptimal hyper-parameters and therefore take unnecessarily long. Setting the hyper-parameters remains a black art that requires years of experience to acquire. This report proposes several efficient ways to set the hyper-parameters that significantly reduce training time and improve performance. Specifically, this report shows how to examine the training validation/test loss function for subtle clues of underfitting and overfitting and suggests guidelines for moving toward the optimal balance point. Then it discusses how to increase/decrease the learning rate/momentum to speed up training. Our experiments show that it is crucial to balance every manner of regularization for each dataset and architecture. Weight decay is used as a sample regularizer to show how its optimal value is tightly coupled with the learning rates and momentums.
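One possible way to increase and then decrease the learning rate during training, while moving the momentum in the opposite direction, is a simple triangular schedule. The sketch below is an assumption about what such a schedule could look like, not the report's exact procedure; the function name `schedule` and the ranges 0.01 to 0.1 for the learning rate and 0.85 to 0.95 for the momentum are hypothetical values chosen for illustration.

```python
def schedule(step, total_steps,
             lr_min=0.01, lr_max=0.1,
             mom_min=0.85, mom_max=0.95):
    """Return (learning_rate, momentum) for the given training step.

    The learning rate rises linearly from lr_min to lr_max over the first
    half of training and falls back over the second half. The momentum
    does the opposite: it starts at mom_max, dips to mom_min at the
    midpoint, and returns to mom_max at the end.
    """
    half = total_steps / 2.0
    # frac goes 0 -> 1 over the first half, then 1 -> 0 over the second
    frac = step / half if step <= half else (total_steps - step) / half
    lr = lr_min + (lr_max - lr_min) * frac
    mom = mom_max - (mom_max - mom_min) * frac
    return lr, mom


# Print the schedule at a few points of a 100-step run
for step in (0, 25, 50, 75, 100):
    lr, mom = schedule(step, total_steps=100)
    print(step, round(lr, 4), round(mom, 4))
```

In a training loop, the two returned values would be written into the optimizer before each update. Whether a schedule like this helps depends on the dataset and architecture, which is exactly why the loss curves still need to be watched.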