Gradient Descent
Review for NeurIPS paper: Optimal Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization
The main result in the paper extends a classical result of Hazan et al to a O(1/T) convergence bound for the duality gap of non-smooth strongly convex-strongly concave min-max problem (instead of the objective gap), proposing a min-max adaptation of the algorithm (Epoch-GDA). They also provide a related bound for finding approximate stationary points in weakly convex-strongly concave problems. Overall the reviewers found the contribution to be a significant and challenging extension over the existing result of Hazan et al. with specific challenges to be overcome in the duality gap version. The authors are strongly recommended to make the promised revisions in the rebutttal as they will embellish the paper.
Reviews: Understanding the Role of Momentum in Stochastic Gradient Methods
INDIVIDUAL COMMENTS / QUESTIONS 1) I really appreciate how the paper ties up loose ends by unifying the analysis of several momentum-based methods in the stochastic setting. I am not very closely familiar with the literature analyzing momentum methods, but there's a lot of work out there (e.g., the line of research studying momentum methods in the continuous time limit). A brief review would be very helpful to position the paper within the existing work. To me this implies that the analysis would go through for more general functions. I don't find it obvious that it would.
Reviews: Understanding the Role of Momentum in Stochastic Gradient Methods
The reviewers agree that the topic tackled in the paper is interesting and the mathematical results are promising. Overall, this submission is a good attempt in deriving a mathematical understanding of QHM, but the results are often only partially investigated and commented. For instance, in section 3 the main result (i.e. the convergence rate for quadratics) is really hard to parse and is poorly commented in the sense that its practical value is unclear. The paper also makes a number of conjectures that are not backed up and the authors are therefore advised to tone down their claims. This includes "we conjecture that the optimal convergence rate is a monotonically decreasing function of nu" as well as the quality of the approximation in Section 4. In conclusion, all three reviewers liked the paper but also highlighted some shortcomings, therefore justifying acceptance as a poster but not an oral.
Review for NeurIPS paper: Agnostic Learning of a Single Neuron with Gradient Descent
Summary and Contributions: The paper considers the problem of agnostically learning a single neuron with respect to the squared loss via gradient descent (GD). The focus of the paper is on specifically understanding the guarantees GD obtains. Under only boundedness of the input distribution, the authors show that for strictly increasing (gradient bounded below by a constant) activation functions, GD finds a point that achieves error O(\sqrt{opt}) \eps where opt is the loss of the best fitting neuron. They extend the result to ReLU under standard anti-concentration assumptions (similar to [1,2]). For ReLU, it is known that you in fact get O(opt) \eps under essentially the same assumptions (see [1]) using an alternate algorithm.
Review for NeurIPS paper: Agnostic Learning of a Single Neuron with Gradient Descent
The reviewers agree that the techniques in this paper are interesting and novel, and that the paper is well-written. For the camera ready version, the authors should include discussion for the additional references pointed out by Reviewer #2. In particular, it would be good to point out that one of the main strengths of this work is the fact that it handles arbitrary distributions over x. Per the comments of Reviewers 3 and 4, it would also be nice to include some additional discussion of the challenges of extending these techniques to multiple neurons.
Review for NeurIPS paper: Federated Accelerated Stochastic Gradient Descent
Summary and Contributions: The paper proposes a new version of Local-SGD/Federated Averaging algorithm -- Federated Accelerated SGD (FedAc). In particular, the algorithm solves a smooth convex expectation minimization problem in a distributed/federated fashion: M workers in parallel can access the stochastic gradients of the objective function and periodically communicate with a parameter-server. FedAc is a combination of AC-SA method from (Ghadimi and Lan, 2012) and Federated Averaging. Authors propose a first analysis of this method for generally strongly convex functions (in the convex case this method was analyzed in (Woodworth et al., 2020), but only for quadratic objectives) under the assumption that the variance of the stochastic gradients is uniformly bounded. The derived bounds outperform the state-of-the-art result for federated methods in this setting, and these rates are close to the accelerated ones. Moreover, authors show how their bounds improve under the additional assumption that the Hessian is Lipschitz continuous, and the 4-th central moment of the stochastic gradient is bounded and also extended known results for Local-SGD (FedAvg) to this case.
Reviews: Limitations of the empirical Fisher approximation for natural gradient descent
Originality: the paper lacks a sound and novel contribution. Theoretically, there is only one minor result as stated above. Technically, there is not a systematical experimental study on real deep networks. The main contribution is on discussing two different formulations of the Fisher matrix. The main trick on making these two formulations different (despite that the authors took a sophisticated approach going though GGN) is that the so called empirical Fisher relies on y_n (target of neural network output), and if one consider y_n to be randomly distributed with fixed variance based on the neural network output, the two formulations are equivalent, otherwise there is a scale parameter in eq.(3) which is shrinking making the two formulations different because of the shrinking and damping.
Reviews: Limitations of the empirical Fisher approximation for natural gradient descent
All reviewers were positive about the paper. The paper corrects several common incorrect assertions and misleading derivations in the natural gradient algorithms literature. The exposition is remarkably clear, with a potential to serve as a reference paper on the topic. The paper is clearly of broad interest to the machine learning community. We recommend to take the reviewers' comments and suggestions into account while preparing the camera ready final version of the paper.
Review for NeurIPS paper: Stochastic Optimization for Performative Prediction
Additional Feedback: This paper presents that different time intervals at which they deploy models trained with stochastic gradient descent in performative prediction leads to qualitatively different algorithms. Moreover, a series of experimental results confirm the result of theoretical analysis of greedy and lazy deploy. Although the paper is theoretically sound, there are still some questions need to be discussed in this paper: 1. This paper uses stochastic gradient methods to solve performative prediction, which is similar to the previous work [R1]. In addition, the proof in this paper seems an extension of the prior work [R1] on performative prediction. The authors should give the key contribution of this paper and discuss the difference between them.