Closing the convergence gap of SGD without replacement
Rajput, Shashank, Gupta, Anant, Papailiopoulos, Dimitris
Withand without-replacement sampling of the individual component functions are regarded as some of the most popular variants of SGD. During SGD with replacement sampling, the stochastic gradient is equal to g(x, ξ i) f ξi (x) and ξ i is a uniform number in {1,..., n}, i.e., a with-replacement sample from the set of gradients f 1,..., f n . In the case of without-replacement sapling, the stochastic gradient is equal to g(x, ξ i) f ξi (x) and ξ i is the i-th ordered element in a random permutation of the numbers in {1,..., n}, i.e., a without-replacement sample. In practice, SGD without replacement is much more widely used compared to its with-replacement counterpart, as it can empirically converge significantly faster [1, 2, 3]. However, in the land of theoretical guarantees, with-replacement SGD has been the focal point of convergence analyses. The reason for this is that analyzing stochastic gradients born with replacement is significantly more tractable for a simple reason: in expectation, the stochastic gradient is equal to the "true" gradient of F, i.e., E ξi f ξi (x) F (x). This makes SGD amenable to analyses very similar to that of vanilla gradient descent (GD), which has been extensively studied under a large variety of function classes and geometric assumptions, e.g., see [4]. Unfortunately, the same cannot be said for SGD without replacement, which has long resisted nonvacuous convergence guaranteess.
Mar-5-2020
- Country:
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- Genre:
- Research Report (0.64)