The learning rate is often considered the most important hyper-parameter when training a model. Choosing a good learning rate can greatly improve training and prevent unstable behavior, such as divergence or oscillation, during stochastic gradient descent. Stochastic gradient descent (SGD) is an optimization algorithm that drives the loss function toward a minimum; for a convex loss this is the global minimum, while for the non-convex losses typical of neural networks it usually settles at a local minimum or saddle point. It behaves like gradient descent, but computes each update on a small mini-batch of examples instead of the entire training set, which makes each step far cheaper to compute.
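To make the idea concrete, here is a minimal sketch of mini-batch SGD on a simple linear-regression problem. All names (`learning_rate`, `batch_size`, the synthetic data) are illustrative assumptions, not from the original text:

```python
import numpy as np

# Synthetic linear-regression data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)
learning_rate = 0.1   # the hyper-parameter discussed above
batch_size = 32       # size of each mini-batch

for epoch in range(100):
    # Shuffle, then split the training set into mini-batches.
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient of the mean-squared-error loss on this mini-batch only.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
        # One gradient-descent step, scaled by the learning rate.
        w -= learning_rate * grad

print(w)  # converges close to true_w on this convex problem
```

Because this loss is convex, SGD does reach the global minimum here; raising `learning_rate` too high would make the updates overshoot and diverge, which is exactly the odd behavior a well-chosen learning rate prevents.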
Sep-20-2021, 00:15:20 GMT