Gradient Descent
A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems
Nonconvex-concave min-max problems arise in many machine learning applications, including minimizing a pointwise maximum of a set of nonconvex functions and robust adversarial training of neural networks. A popular approach to solving such problems is the gradient descent-ascent (GDA) algorithm, which unfortunately can oscillate in the nonconvex case. In this paper, we introduce a "smoothing" scheme that can be combined with GDA to stabilize the oscillation and ensure convergence to a stationary solution.
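To make the idea of combining a smoothing scheme with GDA concrete, here is a minimal sketch for scalar x and y. It assumes the smoothing takes the form of a proximal term (p/2)(x - z)^2 added to the objective, with an auxiliary variable z that is averaged toward x after each step. That specific form, the function name `smoothed_gda`, the box constraint on y, and all step sizes are assumptions for illustration; the abstract above does not state them.

```python
def clip(v, lo, hi):
    """Project a scalar onto the interval [lo, hi]."""
    return max(lo, min(hi, v))


def smoothed_gda(grad_x, grad_y, x0, y0, steps,
                 c=0.05, alpha=0.05, p=1.0, beta=0.1,
                 y_bounds=(-1.0, 1.0)):
    """Single-loop descent-ascent with an assumed proximal smoothing term.

    grad_x, grad_y: callables returning the partial derivatives of f(x, y).
    Setting p = 0 recovers plain (alternating) projected GDA.
    Returns the list of (x, y, z) iterates.
    """
    x, y, z = x0, y0, x0
    history = []
    for _ in range(steps):
        # Descent step on f(x, y) + (p / 2) * (x - z)**2 with respect to x.
        x = x - c * (grad_x(x, y) + p * (x - z))
        # Projected ascent step on y, using the freshly updated x.
        y = clip(y + alpha * grad_y(x, y), *y_bounds)
        # Move the auxiliary point z a fraction beta toward x.
        z = z + beta * (x - z)
        history.append((x, y, z))
    return history


if __name__ == "__main__":
    # Toy objective f(x, y) = x * y with y restricted to [-1, 1].
    iterates = smoothed_gda(lambda x, y: y, lambda x, y: x, 1.0, 0.0, steps=500)
    print(iterates[-1])
```

Running the same loop with `p=0` gives the unsmoothed baseline, which makes it easy to compare the two trajectories on a problem of interest.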
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The paper proposes a supervised learning algorithm that uses stochastic gradient descent and periodically expands the hypothesis space by introducing new basis functions and adding corresponding components to the weight vector. As a result, it fits more complex models as it processes more data. The hypothesis space considered here consists of polynomials, and higher-order monomials are gradually introduced into the model. The concept of growing the hypothesis space as more data arrives is not new (training kernel methods with SGD exhibits this behavior), but in the proposed method, choosing which monomials to add to the hypothesis space is very cheap.
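The mechanism summarized in this review (SGD that periodically adds basis functions and matching weight components) can be sketched as follows. This is only an illustration under simplifying assumptions: the input is one-dimensional, and the next monomial is always the next power of x. The review does not describe the paper's actual rule for choosing monomials, so the names `features` and `grow_sgd` and the `grow_every` schedule are hypothetical.

```python
def features(x, degree):
    """Monomial features [1, x, x**2, ..., x**degree] for a scalar input."""
    return [x ** k for k in range(degree + 1)]


def grow_sgd(data, max_degree, grow_every, lr=0.05):
    """SGD on squared loss with a polynomial model that grows over time.

    data: iterable of (x, y) pairs, processed once in order.
    Every `grow_every` examples, one higher-order monomial is added
    (with weight initialized to zero) until `max_degree` is reached.
    """
    w = [0.0]  # start with the constant model
    for t, (x, y) in enumerate(data, start=1):
        phi = features(x, len(w) - 1)
        err = sum(wi * pi for wi, pi in zip(w, phi)) - y
        # Gradient of 0.5 * err**2 with respect to w is err * phi.
        w = [wi - lr * err * pi for wi, pi in zip(w, phi)]
        if t % grow_every == 0 and len(w) - 1 < max_degree:
            w.append(0.0)  # expand the hypothesis space by one monomial
    return w


if __name__ == "__main__":
    samples = [(i / 50.0 - 1.0, (i / 50.0 - 1.0) ** 2) for i in range(100)]
    print(grow_sgd(samples * 20, max_degree=2, grow_every=200))
```

Initializing each new weight at zero leaves the current predictions unchanged at the moment the model grows, so the expansion does not disturb what has already been learned.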
between the correctness of autodiff systems and that of applications (e.g., gradient descent) built upon autodiff systems?
We thank the reviewers for their constructive and inspiring feedback. As we cannot see R2 (i.e., Reviewer #2), we respond to the reviews by R1, R3, and R4 only. The correctness of autodiff systems defined in the paper could be misleading to practitioners. We agree with the reviewers' points that (i) the correctness of the applications built upon autodiff systems is as important ... Also, we do not claim that our correctness condition is "the" ... Rather, we are just suggesting "a" correctness condition that can serve as a reasonable (possibly minimal) ... We will clarify this limitation in the revised version of the paper. Here are detailed responses to point (ii) on the applications mentioned in the reviews.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. This paper seems to essentially combine three published ideas: accelerated gradient, the stochastic gradient variance reduction technique of Johnson and Zhang, and variance reduction via minibatching. Hence, on a conceptual level at least, it's a fairly incremental paper (I don't want to minimize the effort that may have gone into developing the convergence proof). That said, it's well-done, mostly well-written, and has good theoretical and experimental results. In terms of quality, originality and significance, it's as I said above: they're combining pre-existing ideas, but doing it well, and they include a convergence proof with a slightly improved rate over the competition.
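For readers unfamiliar with two of the ingredients named in this review, the sketch below combines the snapshot-based variance reduction of Johnson and Zhang with minibatching on a scalar least-squares problem. It deliberately omits the acceleration (momentum) step, so it is not the reviewed paper's algorithm. The function names, the objective, and all parameter values are assumptions chosen for illustration.

```python
import random


def grad_i(w, a, b):
    """Gradient of the single-example loss 0.5 * (a * w - b)**2 w.r.t. w."""
    return a * (a * w - b)


def minibatch_svrg(a, b, w0=0.0, lr=0.5, epochs=5, inner_steps=4,
                   batch_size=2, seed=0):
    """Variance-reduced minibatch SGD on (1/n) * sum_i 0.5 * (a_i*w - b_i)**2."""
    rng = random.Random(seed)
    n = len(a)
    w = w0
    for _ in range(epochs):
        # Snapshot the iterate and compute the full gradient there.
        w_snap = w
        mu = sum(grad_i(w_snap, a[i], b[i]) for i in range(n)) / n
        for _ in range(inner_steps):
            batch = rng.sample(range(n), batch_size)
            # Minibatch gradient difference between current point and snapshot.
            correction = sum(
                grad_i(w, a[i], b[i]) - grad_i(w_snap, a[i], b[i])
                for i in batch
            ) / batch_size
            # Variance-reduced gradient estimate: correction plus full gradient.
            w -= lr * (correction + mu)
    return w


if __name__ == "__main__":
    a = [1.0, -1.0, 1.0, -1.0]
    b = [2.0, -4.0, 3.0, -3.0]
    print(minibatch_svrg(a, b))
```

The estimate `correction + mu` is unbiased for the full gradient, and its variance shrinks as the iterate and the snapshot both approach the minimizer, which is what permits a constant step size.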