r/MachineLearning - [D] Research shows SGD with too large of a mini batch can lead to huge overfitting in deep learning. Why doesn't batch gradient descent have this problem?
Strictly speaking, SGD is defined for one sample at a time: each update uses the gradient from a single example. Batch gradient descent applies the same update rule to the gradient averaged over the whole batch, so the "weighting and normalisation" amounts to dividing the summed gradient by the batch size. Most DL frameworks expose both under a single name (SGD). The optimizer doesn't switch between two algorithms; it just applies the update to whatever gradient it receives, and the batch size you declare determines whether that is a single-sample, mini-batch, or full-batch step.
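To make the relationship between the single-sample and batch variants concrete, here's a minimal NumPy sketch. It assumes a plain linear model with a mean squared error loss, and the helper names `grad_mse` and `sgd_step` are made up for illustration, not taken from any framework. It only shows that the batch gradient with mean normalisation equals the average of the per-sample gradients, so one update function covers both cases.

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)**2) with respect to w."""
    residual = X @ w - y
    # Dividing by the number of rows is the batch normalisation step.
    return X.T @ residual / len(y)

def sgd_step(w, X_batch, y_batch, lr=0.1):
    """One gradient descent update; the batch may hold 1 row or all of them."""
    return w - lr * grad_mse(w, X_batch, y_batch)

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

# Per-sample gradients (batch size 1), one row per example.
per_sample = np.stack(
    [grad_mse(w, X[i:i + 1], y[i:i + 1]) for i in range(len(y))]
)

# Full-batch gradient computed in one go.
full_batch = grad_mse(w, X, y)

# The mean of the per-sample gradients matches the full-batch gradient.
grads_match = np.allclose(per_sample.mean(axis=0), full_batch)

# Same function, different batch sizes.
w_single = sgd_step(w, X[:1], y[:1])   # "stochastic" step
w_full = sgd_step(w, X, y)             # "batch" step
```

The only thing that changes between the two calls at the end is how many rows get passed in. The single-sample step uses a noisier estimate of the same gradient that the full-batch step computes exactly.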
Aug-29-2019, 10:34:14 GMT