r/MachineLearning - [D] Research shows SGD with too large a mini-batch can lead to huge overfitting in deep learning. Why doesn't batch gradient descent have this problem?
