Reviews: Understanding the Role of Momentum in Stochastic Gradient Methods

Neural Information Processing Systems

INDIVIDUAL COMMENTS / QUESTIONS 1) I really appreciate how the paper ties up loose ends by unifying the analysis of several momentum-based methods in the stochastic setting. I am not closely familiar with the literature analyzing momentum methods, but there is a lot of work out there (e.g., the line of research studying momentum methods in the continuous-time limit). A brief review would be very helpful to position the paper within the existing work. To me this implies that the analysis would go through for more general functions. I don't find it obvious that it would.


Reviews: Understanding the Role of Momentum in Stochastic Gradient Methods

Neural Information Processing Systems

The reviewers agree that the topic tackled in the paper is interesting and the mathematical results are promising. Overall, this submission is a good attempt at deriving a mathematical understanding of QHM, but the results are often only partially investigated and discussed. For instance, the main result of Section 3 (the convergence rate for quadratics) is hard to parse, and the accompanying discussion leaves its practical value unclear. The paper also makes a number of conjectures that are not backed up, and the authors are therefore advised to tone down their claims. This includes "we conjecture that the optimal convergence rate is a monotonically decreasing function of nu" as well as the quality of the approximation in Section 4. In conclusion, all three reviewers liked the paper but also highlighted some shortcomings, which justifies acceptance as a poster but not as an oral.
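
For readers unfamiliar with QHM (quasi-hyperbolic momentum), the update that the conjecture about nu refers to can be sketched as below. This is a minimal illustration rather than code from the paper: the function name `qhm_minimize`, the one-dimensional quadratic objective, and all parameter values are assumptions chosen for the example. The parameter `nu` interpolates between plain SGD (`nu = 0`) and normalized momentum SGD (`nu = 1`).

```python
import random

def qhm_minimize(grad, x0, alpha=0.1, beta=0.9, nu=0.7, steps=300, noise=0.0, seed=0):
    """Quasi-hyperbolic momentum on a 1-D objective (illustrative sketch)."""
    rng = random.Random(seed)
    x, buf = x0, 0.0
    for _ in range(steps):
        g = grad(x) + noise * rng.gauss(0.0, 1.0)   # stochastic gradient (exact if noise == 0)
        buf = beta * buf + (1.0 - beta) * g          # exponential moving average of gradients
        x -= alpha * ((1.0 - nu) * g + nu * buf)     # weighted mix of raw gradient and buffer
    return x

# Hypothetical test objective: f(x) = 0.5 * (x - 3)^2, so grad f(x) = x - 3.
x_final = qhm_minimize(lambda x: x - 3.0, x0=0.0)
```

Sweeping `nu` over [0, 1] on such a quadratic is one simple way to probe empirically how the convergence rate depends on nu; it does not substitute for the analysis the reviewers ask for.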


Review for NeurIPS paper: Federated Accelerated Stochastic Gradient Descent

Neural Information Processing Systems

Summary and Contributions: The paper proposes a new version of the Local-SGD/Federated Averaging algorithm: Federated Accelerated SGD (FedAc). In particular, the algorithm solves a smooth convex expectation minimization problem in a distributed/federated fashion: M workers access stochastic gradients of the objective function in parallel and periodically communicate with a parameter server. FedAc is a combination of the AC-SA method from (Ghadimi and Lan, 2012) and Federated Averaging. The authors provide the first analysis of this method for general strongly convex functions (in the convex case this method was analyzed in (Woodworth et al., 2020), but only for quadratic objectives) under the assumption that the variance of the stochastic gradients is uniformly bounded. The derived bounds outperform the state-of-the-art results for federated methods in this setting, and the rates are close to the accelerated ones. Moreover, the authors show how their bounds improve under the additional assumptions that the Hessian is Lipschitz continuous and that the fourth central moment of the stochastic gradient is bounded, and they extend known results for Local-SGD (FedAvg) to this case.
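
The communication pattern described in this summary (M workers taking local stochastic gradient steps and periodically averaging through a server) can be sketched as follows. Note that this is plain Local SGD / Federated Averaging, not FedAc: the acceleration step from AC-SA is deliberately omitted. The function name `local_sgd`, the one-dimensional objective, and the parameter values are assumptions made for illustration.

```python
import random

def local_sgd(grad, x0, workers=4, rounds=20, local_steps=10, lr=0.1, noise=0.1, seed=0):
    """Plain Local SGD (FedAvg) on a 1-D objective; no acceleration (illustrative sketch)."""
    rng = random.Random(seed)
    x_server = x0
    for _ in range(rounds):
        local_iterates = []
        for _ in range(workers):                 # workers are simulated sequentially here
            x = x_server                         # each worker starts from the server iterate
            for _ in range(local_steps):
                g = grad(x) + noise * rng.gauss(0.0, 1.0)   # bounded-variance stochastic gradient
                x -= lr * g
            local_iterates.append(x)
        x_server = sum(local_iterates) / workers  # one communication round: average the iterates
    return x_server

# Hypothetical strongly convex objective: f(x) = 0.5 * (x - 2)^2.
x_avg = local_sgd(lambda x: x - 2.0, x0=0.0)
```

The number of communication rounds is `rounds`, while each worker performs `rounds * local_steps` gradient steps in total; bounds of the kind discussed in the review quantify how the error depends on both.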


Reviews: A Kernel Loss for Solving the Bellman Equation

Neural Information Processing Systems

Originality: The derivation of the loss function is original; the resulting loss function has some close similarities with the coupled formulation of LSTD, which should be discussed. Quality: The claims seem to be accurate (I briefly verified the proofs of Theorem 3.1, Proposition 3.3, Proposition 3.4; I did not verify Theorem 3.2 and Corollary 3.5). Clarity: The paper is well-written and clear. Significance: The addressed problem is important; the insights are also useful. SUMMARY: The paper addresses the problem of designing a new loss function for RL.


Reviews: A Kernel Loss for Solving the Bellman Equation

Neural Information Processing Systems

There is general consensus that the idea introduced in the paper is novel and interesting. Yet, I encourage the authors to read the reviewers' comments carefully and take them into consideration in the camera-ready version. In particular, the connection with the nested formulation of LSTD should be discussed to better frame the contribution of the paper.


Review for NeurIPS paper: Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm

Neural Information Processing Systems

Additional Feedback: Post rebuttal: The authors addressed my comments. Therefore, I keep my score as 'accept' but not higher, as I think the clarity of the writing should be improved. When G is nonsmooth and proximable, using proximal maps leads to much faster convergence than using subgradients in the optimization case. It is therefore an important problem to investigate the sampling analogue of this scheme, which is the topic of this paper. As mentioned, some previous work has been done on this problem, but this paper presents the approach that is the most general to date (in terms of G being supported on a more general set).
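
The remark about proximal maps versus subgradients can be illustrated in the optimization setting only (not the sampling setting the paper studies). The sketch below assumes the nonsmooth term is G(x) = lam * |x|, whose proximal map is soft-thresholding, and a hypothetical smooth term f(x) = 0.5 * (x - 0.5)^2. The names `soft_threshold`, `proximal_gradient`, and `subgradient_method` are introduced for this example.

```python
def soft_threshold(v, t):
    """Proximal map of t * |x| evaluated at v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def proximal_gradient(grad_f, x0, lam, step, iters):
    """Gradient step on the smooth part, then the proximal map of the nonsmooth part."""
    x = x0
    for _ in range(iters):
        x = soft_threshold(x - step * grad_f(x), step * lam)
    return x

def subgradient_method(grad_f, x0, lam, step, iters):
    """Constant-step subgradient method on f(x) + lam * |x|."""
    x = x0
    for _ in range(iters):
        sign = (x > 0) - (x < 0)          # a subgradient of |x| (0 is chosen at x == 0)
        x -= step * (grad_f(x) + lam * sign)
    return x

# Hypothetical problem: minimize 0.5 * (x - 0.5)^2 + 1.0 * |x|, whose minimizer is x = 0.
grad_f = lambda x: x - 0.5
x_prox = proximal_gradient(grad_f, x0=2.0, lam=1.0, step=0.5, iters=10)
x_sub = subgradient_method(grad_f, x0=2.0, lam=1.0, step=0.5, iters=10)
```

On this toy problem the proximal iteration reaches the minimizer exactly within a few steps, whereas the constant-step subgradient iteration has no fixed point and keeps oscillating around zero.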


Review for NeurIPS paper: Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm

Neural Information Processing Systems

This paper presents a new analysis of the proximal stochastic gradient Langevin algorithm. All the reviewers recognized its intellectual merits and raised only minor concerns. I suggest that the authors carefully revise their paper based on the reviewers' comments.


Reviews: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates

Neural Information Processing Systems

UPDATE: I've read the other reviews and the rebuttal. I am keeping my score: this is a good paper. The study of Stochastic Gradient Descent in the overparametrized setting is a popular and important trend in the recent development of huge-scale optimization for deep learning. The authors propose a very basic and classical method, built from well-known algorithmic blocks (SGD and Armijo-type line search), together with its first theoretical justification under the "interpolation assumption". The proof of convergence (for example, Theorem 2) consists mainly of standard arguments (those used in the proof for the classical non-stochastic Gradient Method under Lipschitz-continuous gradients).
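
The combination the review describes, SGD with a backtracking Armijo-type line search evaluated on the sampled function, can be sketched as below. This is an illustrative reading, not the authors' implementation: the name `sgd_armijo`, the one-dimensional least-squares data, and the constants `eta_max`, `c`, and `shrink` are assumptions. The data are chosen so that interpolation holds, meaning a single `w` fits every sample exactly.

```python
import random

def sgd_armijo(data, w0, eta_max=1.0, c=0.5, shrink=0.5, epochs=50, seed=0):
    """SGD with stochastic Armijo backtracking on f_i(w) = 0.5 * (a_i * w - b_i)^2."""
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs * len(data)):
        a, b = rng.choice(data)                    # sample one data point
        loss = lambda v: 0.5 * (a * v - b) ** 2    # loss of the sampled function only
        g = a * (a * w - b)                        # its gradient at the current w
        eta = eta_max                              # restart from the largest step each time
        # Backtrack until the Armijo condition holds on the sampled function.
        while eta > 1e-12 and loss(w - eta * g) > loss(w) - c * eta * g * g:
            eta *= shrink
        w -= eta * g
    return w

# Hypothetical interpolating data: b_i = 2 * a_i, so w = 2 fits every sample exactly.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_hat = sgd_armijo(data, w0=0.0)
```

No step size is tuned by hand here: the line search picks `eta` per sample, which is the practical appeal the review points to.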


Reviews: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates

Neural Information Processing Systems

This paper brings a classic idea into the present and makes progress on a vexing problem with SGD: setting the step size. The authors provide both theoretical and empirical evidence that their method is useful. The assumptions may be somewhat limiting; one version requires strong convexity, and when that is relaxed, other assumptions must be made. But this work points to a path that may be useful in the long run. An important way to contribute in ML is by bridging fields; that could mean bringing in ideas that are state-of-the-art in other fields, or it could mean revisiting classic ideas in new ways.


Review for NeurIPS paper: SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence

Neural Information Processing Systems

Summary and Contributions: The paper makes the following contributions: 1) An interpretation (up to a constant factor of 2) of SVGD as a (kernelized) gradient flow of the chi-squared divergence, called CSF. 2) Exponential ergodicity of CSF (continuous case) with respect to the KL divergence and the chi-squared divergence, under a Poincaré condition (or LSI) on the target. Indeed, this is an issue with any kernel method (from SVM to MMD to SVGD), and it has been addressed in various ways. If one were critical, there is still no "nice" way to pick a kernel. Indeed, as mentioned in lines 16 and 17, a single integral operator depending on the target \pi is good (in a way this is also along expected lines; for example, in the MMD context something similar leads to optimality properties). However, I do not fully agree with lines 27-28 that "solving high-dimensional PDEs is precisely the target of intensive research in modern numerical PDE", which is my main concern with the practical applicability of the proposed work. To the best of the reviewer's knowledge, there is no "concrete" progress in this direction despite several recent ad hoc approaches.
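
As background for the kernel-choice discussion, the standard SVGD particle update can be sketched in one dimension as below. This is the usual discrete-time SVGD with a fixed-bandwidth RBF kernel, not the chi-squared flow (CSF) analyzed in the paper. The name `svgd_1d`, the bandwidth, the step size, and the standard normal target are assumptions made for the example.

```python
import math

def svgd_1d(score, particles, step=0.1, bandwidth=1.0, iters=500):
    """Standard SVGD in 1-D with kernel k(x, y) = exp(-(x - y)^2 / (2 * bandwidth^2))."""
    xs = list(particles)
    n = len(xs)
    for _ in range(iters):
        updates = []
        for xi in xs:
            phi = 0.0
            for xj in xs:
                k = math.exp(-(xj - xi) ** 2 / (2.0 * bandwidth ** 2))
                grad_k = -(xj - xi) / bandwidth ** 2 * k   # derivative of k with respect to xj
                phi += k * score(xj) + grad_k              # attraction term plus repulsion term
            updates.append(phi / n)
        xs = [x + step * u for x, u in zip(xs, updates)]   # move all particles simultaneously
    return xs

# Hypothetical target: standard normal, whose score is d/dx log pi(x) = -x.
start = [3.0 + 0.1 * i for i in range(20)]
final = svgd_1d(lambda x: -x, start)
final_mean = sum(final) / len(final)
```

The kernel and its bandwidth enter both the attraction and the repulsion term, which is why the choice of kernel that the review raises affects the behavior of the particles directly.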