


Review for NeurIPS paper: Robustness Analysis of Non-Convex Stochastic Gradient Descent using Biased Expectations

Neural Information Processing Systems

Weaknesses: While the "biased expectation" appears to be a powerful tool, the overall results are restricted to the gradients of the algorithm at _some_ time t in the last T iterates. While this is a common outcome of the standard analysis of SGD, it would be nice if (with some additional assumptions on f) the results could be transposed to f(x_t) or x_t within some basin of attraction. The special case of s = 0 needs a much more detailed treatment. While the authors point out in the supplement that \phi is continuous at s = 0, much of the document switches between looking at s \to 0 or s = 0 without explanation. Assumption 1: I see that the authors need to control \|X_t\|^2 in Thm 1. (Eq.


Review for NeurIPS paper: Robustness Analysis of Non-Convex Stochastic Gradient Descent using Biased Expectations

Neural Information Processing Systems

After significant discussions, the reviewers unanimously appreciated the simplicity and cleanliness of the approach presented by the paper. However, the authors are strongly encouraged to improve the presentation of the paper, especially the crucial proof of Lemma 1: multiple steps have been contracted in the presentation, and clarifying them is necessary. Furthermore, the case of the diminishing step-size scheme should be fleshed out in the theory rather than left as a straightforward extension. Lastly, the reviewers suggested using heavier-tailed distributions, such as the Lévy distribution, to verify the theory better.
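A minimal sketch of how such a verification experiment might draw heavier-tailed gradient noise. The use of scipy's levy_stable sampler and all constants here are my own illustrative assumptions, not the authors' experimental setup:

```python
# Sketch (not from the paper): drawing alpha-stable noise to stress-test the
# theory on heavier tails, as the meta-review suggests. alpha in (1, 2) with
# beta=0 gives centered, integrable noise with unbounded variance; smaller
# alpha means heavier tails (the Levy case alpha=1/2 is no longer integrable).
import numpy as np
from scipy.stats import levy_stable

def heavy_tailed_noise(shape, alpha=1.3, seed=0):
    """Symmetric alpha-stable samples (beta=0, loc=0 => centered)."""
    return levy_stable.rvs(alpha, 0.0, size=shape, random_state=seed)

# Example: gradient perturbations for a d=10 problem over T=1000 steps.
noise = heavy_tailed_noise((1000, 10))
print(noise.mean(), np.abs(noise).max())  # mean near 0, occasional huge spikes
```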


Robustness Analysis of Non-Convex Stochastic Gradient Descent using Biased Expectations

Neural Information Processing Systems

This work proposes a novel analysis of stochastic gradient descent (SGD) for non-convex and smooth optimization. In the case of sub-Gaussian and centered noise, we prove that, with probability 1-\delta, the number of iterations to reach a precision \varepsilon for the squared gradient norm is O(\varepsilon^{-2}\ln(1/\delta)). In the case of centered and integrable heavy-tailed noise, we show that, while the expectation of the iterates may be infinite, the squared gradient norm still converges with probability 1-\delta in O(\varepsilon^{-p}\delta^{-q}) iterations, where p, q > 2. This result shows that heavy-tailed noise on the gradient slows down the convergence of SGD without preventing it, proving that SGD is robust to gradient noise with unbounded variance, a setting of interest for Deep Learning. In addition, it indicates that choosing a step size proportional to T^{-1/b}, where b is the tail parameter of the noise and T is the number of iterations, leads to the best convergence rates.
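A minimal sketch of how the prescribed step size proportional to T^{-1/b} might be used in practice. The toy non-convex test function, the Pareto-tailed noise model, and all constants below are illustrative assumptions of mine, not the paper's experiments:

```python
# Sketch: SGD on a smooth non-convex function with centered, integrable,
# heavy-tailed gradient noise, using a step size eta proportional to T^{-1/b}.
import numpy as np

def noisy_grad(x, rng, b=1.5):
    """Gradient of f(x) = sum(x_i^2 / (1 + x_i^2)) plus centered Pareto noise.

    With tail parameter b in (1, 2) the noise has a finite mean but
    unbounded variance, i.e. the heavy-tailed regime discussed above."""
    grad = 2 * x / (1 + x**2) ** 2
    noise = rng.pareto(b, size=x.shape) - 1.0 / (b - 1)  # mean-zero Lomax noise
    return grad + noise

def sgd(T=10_000, b=1.5, d=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=d)
    eta = 0.1 * T ** (-1.0 / b)           # step size proportional to T^{-1/b}
    best_sq_grad = np.inf
    for _ in range(T):
        x = x - eta * noisy_grad(x, rng, b)
        g = 2 * x / (1 + x**2) ** 2        # true gradient at the new iterate
        best_sq_grad = min(best_sq_grad, np.sum(g**2))
    # Track the best squared gradient norm over the run, matching guarantees
    # stated for *some* iterate among the last T.
    return best_sq_grad

if __name__ == "__main__":
    print(sgd())
```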