Quantitative $W_1$ Convergence of Langevin-Like Stochastic Processes with Non-Convex Potential State-Dependent Noise
Xiang Cheng, Dong Yin, Peter L. Bartlett, Michael I. Jordan
Stochastic Gradient Descent (SGD) is one of the workhorses of modern-day machine learning. In many nonconvex optimization problems, such as training deep neural networks, SGD is able to produce solutions with good generalization error. Further, there is evidence that the generalization error of an SGD solution can be significantly better than that of a Gradient Descent (GD) solution [12]. This suggests that, to understand the behavior of SGD, it is not enough to consider the limiting cases (such as small step-size or large batch-size) in which it degenerates to GD. We take an alternate view of SGD as a sampling algorithm, and aim to understand its convergence to an appropriate stationary distribution.
Jul-13-2019