Gradient Descent
Review for NeurIPS paper: A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems
Originally, the paper got three positive scores: 7,7,6, all with very high confidences. Basically, there was no critical issues raised by the reviewers, except suggesting the enhancement on experiments and clarifying some points. During discussion, all the reviewers agreed that the paper should be accepted and Reviewer #3 raised his/her score to 7. So all the reviewers reached a consensus. Thus the AC decided to accept the paper. However, the AC also have the following comments after a quick reading.
Reviews: Stein Variational Gradient Descent With Matrix-Valued Kernels
This has clarified some of my comments, and thus I have increased my score by one level. However, a negative aspect from the author response was the excuse given for the absence of a comparison against SVN; "the results we obtained were much worse than our methods and baselines, and hence did not investigate it further". To me this seems like a very obvious thing to include and discuss in the paper, lending strong support to the new method, so I am a bit suspicious about why it wasn't included. I hope the authors will include it in the revised manuscript, if accepted. Another reviewer commented on the lack of wall-time comparison, and again I feel that the authors did not give a valid excuse for omitting this information from the manuscript (the reader can, I think, be trusted to adjust for different computational setups and hardware when interpreting the wall-time data).
Reviews: Minimal Variance Sampling in Stochastic Gradient Boosting
Update: I read authors' responce RE:sampling rate does not tell the whole story - i was suggesting to add information about on average how many instances were used for each of the splits (because it is not equal to sampling rate * total dataset size). I am keeping my accept rating, hoping that authors do make the changes to improve the derivations/clarity in the final submission Summary: this paper is concerned with a common trick that a lot of GBDT implementation apply - subsampling instances in order to speed up calculations for finding the best split. The authors formulate the problem of choosing the instances to sample as an optimization problem and derive a modified sampling scheme that is aimed at mimicking the gain that would be assigned to a split on all the of the data by using a gain calculated only on a subsampled instances. The experiments demonstrate good results. The paper is well written and easy to follow, apart from a couple of places in derivations(see my questions).
Reviews: Minimal Variance Sampling in Stochastic Gradient Boosting
The authors propose a non-uniform sampling strategy for stochastic gradient boosted decision trees. In particular, sampling probability of the training data is optimized towards maximizing the estimation accuracy of the splitting score of decision trees. The optimization problem allows an approximate closed-form solution. Experiment results demonstrate superior performance of the proposed strategy. The reviewers agree that the paper can not only help understand sampling within GBDT from a more rigorous perspective but also improve GBDT implementations in practice.
Decoupled SGDA for Games with Intermittent Strategy Communication
Zindari, Ali, Yazdkhasti, Parham, Rodomanov, Anton, Chavdarova, Tatjana, Stich, Sebastian U.
We focus on reducing communication overhead in multiplayer games, where frequently exchanging strategies between players is not feasible and players have noisy or outdated strategies of the other players. We introduce Decoupled SGDA, a novel adaptation of Stochastic Gradient Descent Ascent (SGDA). In this approach, players independently update their strategies based on outdated opponent strategies, with periodic synchronization to align strategies. For Strongly-Convex-Strongly-Concave (SCSC) games, we demonstrate that Decoupled SGDA achieves near-optimal communication complexity comparable to the best-known GDA rates. For weakly coupled games where the interaction between players is lower relative to the non-interactive part of the game, Decoupled SGDA significantly reduces communication costs compared to standard SGDA. Our findings extend to multi-player games. To provide insights into the effect of communication frequency and convergence, we extensively study the convergence of Decoupled SGDA for quadratic minimax problems. Lastly, in settings where the noise over the players is imbalanced, Decoupled SGDA significantly outperforms federated minimax methods.
Context-Aware Neural Gradient Mapping for Fine-Grained Instruction Processing
Boldo, David, Pemberton, Lily, Thistledown, Gabriel, Fairchild, Jacob, Kowalski, Felix
The integration of contextual embeddings into the optimization processes of large language models is an advancement in natural language processing. The Context-Aware Neural Gradient Mapping framework introduces a dynamic gradient adjustment mechanism, incorporating contextual embeddings directly into the optimization process. This approach facilitates real-time parameter adjustments, enhancing task-specific generalization even in the presence of sparse or noisy data inputs. The mathematical foundation of this framework relies on gradient descent modifications, where contextual embeddings are derived from a supplementary neural network trained to map input features to optimal adaptation gradients. By employing differential geometry principles, high-dimensional input dependencies are encoded into low-dimensional gradient manifolds, enabling efficient adaptation without necessitating the retraining of the entire model. Empirical evaluations demonstrate that the proposed framework consistently outperforms baseline models across various metrics, including accuracy, robustness to noise, and computational efficiency. The integration of context-specific embeddings allows for a more complex understanding of language, thereby improving the model's ability to handle diverse linguistic phenomena. Furthermore, the computational efficiency achieved through this method demonstrates its scalability for large-scale language models operating under diverse constraints.
Reviews: What Can ResNet Learn Efficiently, Going Beyond Kernels?
Dear Authors: I read your rebuttal. I do indeed understand the point of your paper. I also agree that solving linear equations is not a good example of something kernel methods can't do. The'isotonic regression' algorithm for learning a ReLU is a simple SGD algorithm that uses a straight-through estimator. The high-level message of your paper is that there are problems that gradient-based methods can solve, but kernel methods cannot.
Reviews: Competitive Gradient Descent
This paper deals with the computation of Nash Equilibria in competitive two-player games where the x player is minimizing a function f(x,y) and the y player is minimizing a function g(x,y) . Such problems arise in a wide variety of domains, notably in training GANs, and there has been much recent interest in developing algorithms for solving such problems. Gradient Descent Ascent (GDA) is a natural candidate algorithm for finding Nash Equilibrium, but it will provably oscillate or diverge even in simple settings. As such, many recent works have modified GDA or proposed different algorithms or schemes to guarantee convergence. This paper proposes a new algorithm called Competitive Gradient Descent (CGD), which updates each player's iterates by adding the Nash Equilibrium of a regularized bilinear approximation of the game at the current iterates. CGD requires Hessian-vector products, making it a second-order algorithm.
Reviews: Competitive Gradient Descent
Overall, the reviewers appreciated the novelty and simplicity of the algorithm. The paper is also well written and the methods are well motivated. In terms of content for revision: The lack of discussion around Newton-style approximation and dropping of diagonal terms in the Jacobian was raised again the discussions (see updated reviews). Additionally, we also sought an expert reviewer during discussions. The additional reviewer included below: ---- Additional review: The paper proposes an interesting new method for solving games Min_x F(x,y) Min_y G(x,y) which is, in general, different from running online gradient descent on x and y in parallel, or other existing methods.