Gradient Descent


Review for NeurIPS paper: Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes

Neural Information Processing Systems

All the reviewers agree that the paper presents a worthwhile theoretical contribution, which may facilitate and motivate further work on more challenging problems. The main limitation of the work is its practical impact, as the proposed analysis does not apply to the lengthscales. Although R3 stands by their comments, they expressed their willingness to accept and, during the discussion, recognized this work as an excellent attempt at the problem. Overall, I believe the NeurIPS community will benefit from this work, and I recommend that the authors take the reviewers' suggestions and comments into consideration.


Reviews: Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent

Neural Information Processing Systems

The other reviewers also convinced me that, despite not having the right assumptions for the mentioned applications, the work might still be useful in other applications. I request that the authors remove the applications mentioned in the introduction or state explicitly that their assumptions are not satisfied for them. Based on these points, I increase my score from 4 to 6. Let me also clarify why I believe having the right assumptions is important and what I dislike about the theory. SARAH is an interesting method because it does not require bounded gradients and, at the same time, there are settings where its known complexity is better than that of SGD.
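For readers unfamiliar with SARAH, the recursive gradient estimator the reviewer refers to can be sketched as follows. This is a minimal illustration on a plain finite-sum least-squares problem, not the compositional algorithm of the paper under review. The problem, the step size, the loop lengths, and the helper names (`grad_i`, `full_grad`, `loss`, `sarah`) are all assumptions chosen for illustration, and the sketch restarts each outer loop from the last inner iterate, which is a simplification.

```python
import random

def grad_i(w, x, y):
    """Gradient of the single-sample loss 0.5 * (<x, w> - y)^2."""
    r = sum(a * b for a, b in zip(x, w)) - y
    return [r * a for a in x]

def full_grad(w, X, Y):
    """Average of the per-sample gradients over the whole dataset."""
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        g = [a + b / len(X) for a, b in zip(g, grad_i(w, x, y))]
    return g

def loss(w, X, Y):
    """Average least-squares loss."""
    return sum((sum(a * b for a, b in zip(x, w)) - y) ** 2
               for x, y in zip(X, Y)) / (2 * len(X))

def sarah(X, Y, step=0.02, outer=10, inner=200, seed=0):
    """SARAH-style loop (illustrative hyperparameters, last-iterate restart)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(outer):
        # Each outer loop starts from a full gradient.
        v = full_grad(w, X, Y)
        w_prev, w = w, [a - step * b for a, b in zip(w, v)]
        for _ in range(inner):
            i = rng.randrange(len(X))
            g_new = grad_i(w, X[i], Y[i])
            g_old = grad_i(w_prev, X[i], Y[i])
            # Recursive estimator: v_t = g_i(w_t) - g_i(w_{t-1}) + v_{t-1}.
            v = [gn - go + vv for gn, go, vv in zip(g_new, g_old, v)]
            w_prev, w = w, [a - step * b for a, b in zip(w, v)]
    return w

# Synthetic noiseless regression data (hypothetical, for illustration only).
data_rng = random.Random(1)
w_star = [1.0, -1.0, 0.5, 2.0, -0.5]
X = [[data_rng.gauss(0.0, 1.0) for _ in w_star] for _ in range(200)]
Y = [sum(a * b for a, b in zip(x, w_star)) for x in X]

w_hat = sarah(X, Y)
initial_loss = loss([0.0] * len(w_star), X, Y)
final_loss = loss(w_hat, X, Y)
```

The estimator only ever uses differences of gradients at consecutive iterates, which is why its analysis can avoid a bounded-gradient assumption and rely on smoothness instead.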


Reviews: Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent

Neural Information Processing Systems

This paper was discussed in depth by the reviewers and myself. After a lengthy discussion, and thanks to the authors' rebuttal, the reviewers were convinced that the proposed algorithm and its analysis are novel, interesting, and worth publishing at NeurIPS. However, the reviewers also noted the mismatch between the motivating examples in the introduction and the assumptions in the analysis. Note that it is not enough to state that the assumptions hold in the "domain of optimization", because there is no guarantee that such a domain is bounded. So, please take the reviewers' comments carefully into account when preparing the camera-ready version.


Review for NeurIPS paper: Stochasticity of Deterministic Gradient Descent: Large Learning Rate for Multiscale Objective Function

Neural Information Processing Systems

Additional Feedback: [After rebuttal] I appreciate the additional explanations in the rebuttal. I think the example (a more complete version) will go a long way in improving the paper, but as presented I think not enough detail is given for a proper evaluation, so I look forward to reading a revised version of this work. Note that my tautology comment does not say that the proof is trivial; it says that the way it is written masks the potential insights the proof may give. In particular, there should be a result showing that such a limit in Cons 1 exists under some general conditions characterising the data and the model architecture. I believe the example provided in the rebuttal may be useful for formalising this. On first reading, these conditions do not appear well motivated.


Review for NeurIPS paper: Stochasticity of Deterministic Gradient Descent: Large Learning Rate for Multiscale Objective Function

Neural Information Processing Systems

Before the author response, all the reviewers seemed to agree that the results were quite interesting (and I agree), but had a concern about the connection to ML. The author response included examples which mostly addressed this concern, so two reviewers recommended acceptance, while another (reviewer 1) recommended rejection, but was borderline. However, I feel the remaining concerns of reviewer 1 are rather minor.


Review for NeurIPS paper: Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model

Neural Information Processing Systems

Weaknesses: While the paper is written very clearly, there are several questions I'd like to raise. Firstly, in discussing the applicability of the results, the paper mentions 'some basic vision or sound recognition tasks' (line 33); I'd like to ask for some examples of such tasks. Looking at the statement of Theorem 1, it seems that it should be applicable in finite-dimensional spaces with invertible covariance matrices. If so, then I do not understand the results. In particular, for X with finite support and an identity covariance matrix, conditions (a) and (b) hold for arbitrarily large positive \alpha. The theorem statement then implies that the estimates go to zero at an arbitrarily large polynomial rate, which is not true.


Review for NeurIPS paper: Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model

Neural Information Processing Systems

The submission considers noiseless (or low-noise) linear regression with a non-linear transformation of the input data, and shows that in this setting SGD achieves faster convergence rates. This is a very nice contribution with applicability to important problems, as mentioned by the authors in their feedback. We urge the authors to incorporate the points they made in response to the reviews.


Review for NeurIPS paper: Online Robust Regression via SGD on the l1 loss

Neural Information Processing Systems

The paper concerns robust linear regression in the online setting, where the data follows a Gaussian linear model with corruptions. It is shown that stochastic gradient descent on the absolute loss converges to the true parameter at a rate of order O(1/n). The paper received a universally positive evaluation from the reviewers, who acknowledged the novelty of the results, the theoretical justification of the proposed approach and the scalability of the algorithm. The main issue raised in the reviews concerns the quite restrictive assumptions on the data distribution (Gaussian linear model, and the centered data assumption).
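The setting summarized above can be illustrated with a minimal sketch of online SGD on the absolute loss. Everything specific here is an assumption made for illustration and is not taken from the paper: the step-size schedule 1/sqrt(t+1), the averaging of iterates, the additive outlier model, the corruption probability, and the names `l1_sgd` and `corrupted_stream`.

```python
import math
import random

def l1_sgd(stream, dim, step=lambda t: 1.0 / math.sqrt(t + 1)):
    """Averaged online SGD on the absolute loss |y - <x, theta>|."""
    theta = [0.0] * dim
    avg = [0.0] * dim
    for t, (x, y) in enumerate(stream):
        residual = y - sum(xi * wi for xi, wi in zip(x, theta))
        sign = (residual > 0) - (residual < 0)
        # A subgradient of |y - <x, theta>| in theta is -sign(residual) * x,
        # so the descent step adds step * sign(residual) * x.
        theta = [wi + step(t) * sign * xi for xi, wi in zip(x, theta)]
        # Running average of the iterates.
        avg = [ai + (wi - ai) / (t + 1) for ai, wi in zip(avg, theta)]
    return avg

def corrupted_stream(theta_star, n, noise=0.1, corrupt_prob=0.1, seed=0):
    """Centered Gaussian features, Gaussian noise, sparse additive outliers."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in theta_star]
        y = sum(xi * wi for xi, wi in zip(x, theta_star)) + rng.gauss(0.0, noise)
        if rng.random() < corrupt_prob:
            # Hypothetical outlier model: large additive shift, independent of x.
            y += rng.gauss(5.0, 10.0)
        yield x, y

theta_star = [1.0, -2.0, 0.5]
estimate = l1_sgd(corrupted_stream(theta_star, 20000), dim=3)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(estimate, theta_star)))
```

Because each update uses only the sign of the residual, a single outlier can move the iterate by at most one step length times the feature norm, which is the intuition behind the robustness of the absolute loss. The sketch also relies on the features being centered, the assumption the reviewers flagged as restrictive.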


Reviews: Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks

Neural Information Processing Systems

After rebuttal: I have carefully read the comments from the other reviewers and the feedback from the authors. My main concern was the generalization ability of NGD, but the experiments in the feedback are a bit confusing to me, because GD does not seem to achieve zero training loss whereas NGD converges to 0 very quickly in MNIST regression. I would suggest that the authors provide more details about the experimental setting, e.g., how the hyperparameters were selected. Thus, I would like to keep my score unchanged. The framework for the proof follows the recent line of work on over-parametrization, e.g., the papers by Du et al., Li and Liang, and Allen-Zhu et al., the core of which is the Gram matrix.


Reviews: Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks

Neural Information Processing Systems

This paper proves fast convergence of natural gradient descent for over-parameterized neural networks, together with a generalization error bound. This paper is on the borderline and was carefully discussed. The main concerns are the novelty of the paper and the lack of detail in the experiments. After the author response and reviewer discussion, the paper gathered enough support from the reviewers to merit acceptance.