Closing the convergence gap of SGD without replacement

Rajput, Shashank, Gupta, Anant, Papailiopoulos, Dimitris

arXiv.org Machine Learning 

Withand without-replacement sampling of the individual component functions are regarded as some of the most popular variants of SGD. During SGD with replacement sampling, the stochastic gradient is equal to g(x, ξ i) f ξi (x) and ξ i is a uniform number in {1,..., n}, i.e., a with-replacement sample from the set of gradients f 1,..., f n . In the case of without-replacement sapling, the stochastic gradient is equal to g(x, ξ i) f ξi (x) and ξ i is the i-th ordered element in a random permutation of the numbers in {1,..., n}, i.e., a without-replacement sample. In practice, SGD without replacement is much more widely used compared to its with-replacement counterpart, as it can empirically converge significantly faster [1, 2, 3]. However, in the land of theoretical guarantees, with-replacement SGD has been the focal point of convergence analyses. The reason for this is that analyzing stochastic gradients born with replacement is significantly more tractable for a simple reason: in expectation, the stochastic gradient is equal to the "true" gradient of F, i.e., E ξi f ξi (x) F (x). This makes SGD amenable to analyses very similar to that of vanilla gradient descent (GD), which has been extensively studied under a large variety of function classes and geometric assumptions, e.g., see [4]. Unfortunately, the same cannot be said for SGD without replacement, which has long resisted nonvacuous convergence guaranteess.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found