SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs

Dec-25-2025, 03:11:47 GMT–Neural Information Processing Systems

Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the method of choice for learning with large over-parameterized models. A popular theory for explaining why SGD works well in practice is that the algorithm has an implicit regularization that biases its output towards a good solution. Perhaps the theoretically most well understood learning setting for SGD is that of Stochastic Convex Optimization (SCO), where it is well known that SGD learns at a rate of $O(1/\sqrt{n})$, where $n$ is the number of samples. In this paper, we consider the problem of SCO and explore the role of implicit regularization, batch size and multiple epochs for SGD. Our main contributions are threefold: * We show that for any regularizer, there is an SCO problem for which Regularized Empirical Risk Minimzation fails to learn.

batch-size and multiple-epoch, implicit regularization, sgd, (8 more...)

Neural Information Processing Systems

Dec-25-2025, 03:11:47 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)