How Can Increased Randomness in Stochastic Gradient Descent Improve Generalization?
