Goto

Collaborating Authors

 Mathematical & Statistical Methods


Quasi-Newton Methods for Saddle Point Problems Luo

Neural Information Processing Systems

The design and analysis of the proposed algorithm are based on estimating the square of the indefinite Hessian matrix, which differs from classical quasi-Newton methods for convex optimization.
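
As a rough numerical illustration of this structural point (my own toy construction, not the paper's algorithm): for a strongly-convex-strongly-concave quadratic, the Hessian is symmetric but indefinite, while its square is symmetric positive definite with eigenvalues bounded below by the square of the strong convexity modulus, which is what makes the squared Hessian a viable target for secant-style estimation.

import numpy as np

rng = np.random.default_rng(0)
nx, ny, mu = 4, 3, 0.5
# Toy strongly-convex-strongly-concave quadratic
#   f(x, y) = 0.5 x'Ax + x'By - 0.5 y'Cy   with A >= mu*I and C >= mu*I
# (hypothetical construction, for illustration only).
A = rng.standard_normal((nx, nx)); A = A @ A.T + mu * np.eye(nx)
C = rng.standard_normal((ny, ny)); C = C @ C.T + mu * np.eye(ny)
B = rng.standard_normal((nx, ny))

H = np.block([[A, B], [B.T, -C]])   # Hessian of f: symmetric but indefinite
print(np.linalg.eigvalsh(H))        # mixed-sign eigenvalues
print(np.linalg.eigvalsh(H @ H))    # all eigenvalues >= mu**2: positive definite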



Greedy and Random Quasi-Newton Methods with Faster Explicit Superlinear Convergence Dachao Lin, Haishan Ye

Neural Information Processing Systems

In this paper, we follow the work of Rodomanov and Nesterov [19] to study quasi-Newton methods. We focus on the common SR1 and BFGS quasi-Newton methods to establish better explicit (local) superlinear convergence rates. First, based on the greedy quasi-Newton update, which greedily selects the direction that maximizes a certain measure of progress, we improve the convergence rate to a condition-number-free superlinear rate. Second, based on the random quasi-Newton update, which selects the direction randomly from a spherically symmetric distribution, we establish the same superlinear convergence rate. Our analysis is closely related to the approximation of a given Hessian matrix, unconstrained quadratic objectives, and general strongly convex, smooth, and strongly self-concordant functions.
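
The two update rules in the abstract can be sketched on the Hessian-approximation subproblem it mentions. Below is a minimal, hedged Python sketch: the SR1 update toward a fixed positive definite target A, driven either by a random spherically symmetric direction (as in the abstract) or by a simple greedy rule based on the diagonal of G - A, which is my simplification of the papers' measure of progress, not necessarily their exact rule.

import numpy as np

def sr1_update(G, A, u):
    # One SR1 step moving the approximation G toward the target matrix A
    # along direction u; the update is skipped if the correction vanishes.
    r = (G - A) @ u
    d = u @ r
    if abs(d) < 1e-12:
        return G
    return G - np.outer(r, r) / d

rng = np.random.default_rng(0)
n = 50
# Toy target: a random positive definite "Hessian" A, initialized with G = L*I >= A.
Q = rng.standard_normal((n, n))
A = Q @ Q.T + np.eye(n)
L = np.linalg.eigvalsh(A)[-1]
G = L * np.eye(n)

for k in range(200):
    # Random rule from the abstract: a spherically symmetric (Gaussian) direction.
    u = rng.standard_normal(n)
    # A greedy alternative (my simplification, not the papers' exact rule) would
    # instead pick the basis vector with the largest diagonal gap:
    # u = np.eye(n)[np.argmax(np.diag(G - A))]
    G = sr1_update(G, A, u)

print("approximation error:", np.linalg.norm(G - A))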


Reviews: Understanding the Role of Momentum in Stochastic Gradient Methods

Neural Information Processing Systems

INDIVIDUAL COMMENTS / QUESTIONS

1) I really appreciate how the paper ties up loose ends by unifying the analysis of several momentum-based methods in the stochastic setting. I am not very closely familiar with the literature analyzing momentum methods, but there is a lot of work out there (e.g., the line of research studying momentum methods in the continuous-time limit). A brief review would be very helpful to position the paper within the existing work.

To me this implies that the analysis would go through for more general functions; I don't find it obvious that it would.


Understanding the Role of Momentum in Stochastic Gradient Methods

Neural Information Processing Systems

The use of momentum in stochastic gradient methods has become a widespread practice in machine learning. Different variants of momentum, including heavy-ball momentum, Nesterov's accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM), have demonstrated success on various tasks. Despite these empirical successes, there is a lack of clear understanding of how the momentum parameters affect convergence and various performance measures of different algorithms. In this paper, we use the general formulation of QHM to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions. In addition, by combining the results on convergence rates and stationary distributions, we obtain practical guidelines, some of them counter-intuitive, for setting the learning rate and momentum parameters.
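
For reference, here is a minimal sketch of the QHM formulation the abstract builds on (as given by Ma and Yarats, 2019; the step-size and momentum values below are hypothetical): the momentum buffer is an exponential average of gradients, and the step mixes the current gradient with that buffer through the parameter nu.

import numpy as np

def qhm(grad, theta0, alpha=0.01, beta=0.9, nu=0.7, iters=1000, rng=None):
    # Quasi-hyperbolic momentum (QHM) on a stochastic gradient oracle `grad`.
    # nu = 0 gives plain SGD; nu = 1 gives SGD with exponentially averaged
    # momentum (heavy-ball up to a rescaling of the learning rate); nu = beta
    # recovers a form of Nesterov's accelerated gradient.
    rng = rng or np.random.default_rng(0)
    theta = np.array(theta0, dtype=float)
    d = np.zeros_like(theta)
    for _ in range(iters):
        g = grad(theta, rng)
        d = beta * d + (1.0 - beta) * g             # momentum buffer
        theta -= alpha * ((1.0 - nu) * g + nu * d)  # quasi-hyperbolic step
    return theta

# Toy stochastic quadratic: f(theta) = 0.5 * ||theta||^2 with additive gradient noise.
noisy_grad = lambda theta, rng: theta + 0.1 * rng.standard_normal(theta.shape)
print(qhm(noisy_grad, np.ones(5)))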


Reviews: Understanding the Role of Momentum in Stochastic Gradient Methods

Neural Information Processing Systems

The reviewers agree that the topic tackled in the paper is interesting and the mathematical results are promising. Overall, this submission is a good attempt at deriving a mathematical understanding of QHM, but the results are often only partially investigated and commented on. For instance, in Section 3 the main result (i.e., the convergence rate for quadratics) is really hard to parse and is poorly commented on, in the sense that its practical value is unclear. The paper also makes a number of conjectures that are not backed up, and the authors are therefore advised to tone down their claims. This includes "we conjecture that the optimal convergence rate is a monotonically decreasing function of nu" as well as the quality of the approximation in Section 4. In conclusion, all three reviewers liked the paper but also highlighted some shortcomings, therefore justifying acceptance as a poster but not an oral.


Review for NeurIPS paper: Federated Accelerated Stochastic Gradient Descent

Neural Information Processing Systems

Summary and Contributions: The paper proposes a new version of the Local-SGD/Federated Averaging algorithm -- Federated Accelerated SGD (FedAc). In particular, the algorithm solves a smooth convex expectation minimization problem in a distributed/federated fashion: M workers in parallel can access stochastic gradients of the objective function and periodically communicate with a parameter server. FedAc is a combination of the AC-SA method from (Ghadimi and Lan, 2012) and Federated Averaging. The authors give the first analysis of this method for general strongly convex functions (in the convex case this method was analyzed in (Woodworth et al., 2020), but only for quadratic objectives) under the assumption that the variance of the stochastic gradients is uniformly bounded. The derived bounds outperform the state-of-the-art results for federated methods in this setting, and these rates are close to the accelerated ones. Moreover, the authors show how their bounds improve under the additional assumptions that the Hessian is Lipschitz continuous and the 4th central moment of the stochastic gradient is bounded, and they also extend the known results for Local-SGD (FedAvg) to this case.
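
For orientation, here is a minimal sketch of the local-update / periodic-averaging skeleton that FedAvg and FedAc share (all names and constants are hypothetical; FedAc itself additionally maintains AC-SA-style accelerated iterate sequences on each worker, which this sketch omits).

import numpy as np

def local_sgd(grad, theta0, M=8, rounds=20, local_steps=10, lr=0.05, rng=None):
    # Local SGD / FedAvg skeleton: M workers run `local_steps` stochastic
    # gradient steps independently, then synchronize by averaging.
    rng = rng or np.random.default_rng(0)
    theta = np.array(theta0, dtype=float)
    for _ in range(rounds):
        local_models = []
        for m in range(M):
            w = theta.copy()
            for _ in range(local_steps):
                w -= lr * grad(w, rng)       # local stochastic gradient step
            local_models.append(w)
        theta = np.mean(local_models, axis=0)  # periodic averaging (communication)
    return theta

# Toy problem: strongly convex quadratic with noisy gradients.
noisy_grad = lambda w, rng: w + 0.1 * rng.standard_normal(w.shape)
print(local_sgd(noisy_grad, np.ones(4)))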


Reviews: A Kernel Loss for Solving the Bellman Equation

Neural Information Processing Systems

Originality: The derivation of the loss function is original; the resulting loss function has some close similarities with the coupled formulation of LSTD, which should be discussed.
Quality: The claims seem to be accurate (I briefly verified the proofs of Theorem 3.1, Proposition 3.3, and Proposition 3.4; I did not verify Theorem 3.2 and Corollary 3.5).
Clarity: The paper is well-written and clear.
Significance: The addressed problem is important; the insights are also useful.
SUMMARY: The paper addresses the problem of designing a new loss function for RL.
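
For context, here is a hedged sketch of the general kernel-weighted Bellman loss form the review refers to, pairing Bellman residuals through a positive definite kernel. This is a plug-in estimate of the general idea only; the paper's exact estimator (e.g., its bias handling) and kernel choice may differ.

import numpy as np

def rbf_kernel(X, Y, bw=1.0):
    # Gaussian (RBF) kernel matrix between two sets of states.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

def kernel_bellman_loss(V, S, R, S_next, gamma=0.99, bw=1.0):
    # Plug-in estimate of a kernel-weighted Bellman loss:
    #   L(V) ~= (1/n^2) * sum_{i,j} delta_i K(s_i, s_j) delta_j,
    # where delta_i = r_i + gamma * V(s_i') - V(s_i) is the Bellman residual.
    delta = R + gamma * V(S_next) - V(S)
    K = rbf_kernel(S, S, bw)
    n = len(delta)
    return float(delta @ K @ delta) / (n * n)

# Hypothetical usage with a linear value function V(s) = s @ w:
rng = np.random.default_rng(0)
S, S_next = rng.standard_normal((32, 3)), rng.standard_normal((32, 3))
R, w = rng.standard_normal(32), rng.standard_normal(3)
print(kernel_bellman_loss(lambda X: X @ w, S, R, S_next))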


Reviews: A Kernel Loss for Solving the Bellman Equation

Neural Information Processing Systems

There is general consensus that the idea introduced in the paper is novel and interesting. Yet, I encourage the authors to read carefully the reviewers' comments and take them into consideration in the camera ready. In particular, the connection with the nested formulation of LSTD should be discussed to frame the contribution of the paper better.