Gradient Descent
Review for NeurIPS paper: A Non-Asymptotic Analysis for Stein Variational Gradient Descent
Additional Feedback: Post rebuttal: Thank you for you response. I keep my score as it is (which is accept) -- but like R2, I really would like to see the finite-N results in the extra page that would be provided if this paper is accepted. The analysis looks sound and nicely done. Some elaboration on what satisfies this inequality is provided before Sec. Therefore, these results are still not precisely about the SVGD, but on its infinite-particle limit.
Review for NeurIPS paper: Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses
Weaknesses: - Below eq (3), for the upper bound of \delta_t the right-hand side should be 2\sum_s\eta_sa_s instead of 2\sum_s\eta_sa_s\delta_s . It would be interesting to add some discussions or comparison with these references mentioned below: 1. "Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent". In this paper, their work relaxes the smoothness to \alpha -Holder continuity of (sub)gradients, which include the non-smooth loss functions in this paper as \alpha 0 . Their stability analysis also improves the optimal generalization bounds O(1/\sqrt{n}) for multi-pass SGD with T O(n 2) . It seems to me that the main technical novelty appeared in the proof of Lemma 3 which studied \delta_t 2 (as opposed to the study of \delta_t in Hardt et al's paper) using the approximate contraction for the gradient mapping for the non-smooth loss which has already explored in the above paper. Similar ideas have already explored in the above reference in a more general setting.
Review for NeurIPS paper: Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses
This paper got high scores: 9,7,6,8, all with high confidence. The major concerns are from Reviewer #3, who asked about the relationship with "Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent" ([*]) and whether the technique of using Poisson sampling in [1] can be used in the current work which uses uniform sampling. While the authors clarified in the rebuttal that their work is stronger than [*] and some of the results in [*] may not be generalized to the case in the current paper, and the technique in [1] can be applied straightforwardly, Reviewer #3 further refuted on the first claim and doubted on the second claim during discussion. The AC confirmed that this paper is concurrent with [*] and deemed that Reviewer #3 may have missed the sketched proof in the "Privacy Analysis" section of the rebuttal. During further discussion, Reviewer #3 acknowledged that the sketched proof made sense and supported acceptance.
Fast and Provable Tensor-Train Format Tensor Completion via Precondtioned Riemannian Gradient Descent
Bian, Fengmiao, Cai, Jian-Feng, Zhang, Xiaoqun, Zhang, Yuanwei
Low-rank tensor completion aims to recover a tensor from partially observed entries, and it is widely applicable in fields such as quantum computing and image processing. Due to the significant advantages of the tensor train (TT) format in handling structured high-order tensors, this paper investigates the low-rank tensor completion problem based on the TT-format. We proposed a preconditioned Riemannian gradient descent algorithm (PRGD) to solve low TT-rank tensor completion and establish its linear convergence. Experimental results on both simulated and real datasets demonstrate the effectiveness of the PRGD algorithm. On the simulated dataset, the PRGD algorithm reduced the computation time by two orders of magnitude compared to existing classical algorithms. In practical applications such as hyperspectral image completion and quantum state tomography, the PRGD algorithm significantly reduced the number of iterations, thereby substantially reducing the computational time.
Communication-Efficient Stochastic Distributed Learning
Ren, Xiaoxing, Bastianello, Nicola, Johansson, Karl H., Parisini, Thomas
We address distributed learning problems, both nonconvex and convex, over undirected networks. In particular, we design a novel algorithm based on the distributed Alternating Direction Method of Multipliers (ADMM) to address the challenges of high communication costs, and large datasets. Our design tackles these challenges i) by enabling the agents to perform multiple local training steps between each round of communications; and ii) by allowing the agents to employ stochastic gradients while carrying out local computations. We show that the proposed algorithm converges to a neighborhood of a stationary point, for nonconvex problems, and of an optimal point, for convex problems. We also propose a variant of the algorithm to incorporate variance reduction thus achieving exact convergence. We show that the resulting algorithm indeed converges to a stationary (or optimal) point, and moreover that local training accelerates convergence. We thoroughly compare the proposed algorithms with the state of the art, both theoretically and through numerical results.
Reviews: Splitting Steepest Descent for Growing Neural Architectures
The paper introduces an algorithm for splitting neurons when training neural networks using gradient descent, in order to progressively grow a neural network architecture for a better fit to the data. Some theoretical connections to wasserstein-infinity gradient steps are presented, as well as multiple experiments for learning interpretable or lightweight neural networks. The proposed method is original and interesting, and the experiments suggest that it is promising for various applications. However, the current proposed method seems far from practical in large scale settings, and the presented theoretical results are quite limited: * computationally, the algorithm as presented in Algorithm 1 requires (i) to check if a local optima is reached on the full training loss (and it seems that gradient updates are performed on the full training loss, which is costly), (ii) eigenvector computation on splitting matrices, (iii) dealing with dynamically growing weight matrices. These aspects seem to make the algorithm impractical compared to simply running SGD on an over-parameterized network.
Review for NeurIPS paper: Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm
Additional Feedback: Post rebuttal: The authors addressed my comments. Therefore, I keep my score as'accept' but not higher as I think the clarity of the writing should be improved. When G is nonsmooth and proximable, using proximal maps lead to much faster convergence compared to using subgradients in the optimization case. It is therefore an important problem to investigate the sampling analogue of this scheme which is the topic of this paper. As mentioned, some previous work has been done on this problem, but this paper presents an approach that is most general (in terms of G being supported on a more general set) to date.
Reviews: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
UPDATE: I've read the other reviews and the rebuttal. I am keeping my score - this is a good paper. The study of Stochastic Gradient Descent in overparametrized setting is a popular and important trend in a recent development of huge-scale optimization for deep-learning. The authors propose a very basic and classical method, consisting from the well-known algorithmical blocks (SGD Armijo-type line search) together with its first theoretical justification under "interpolation assumption". The proof of convergence (for example, Theorem 2) mainly consists from the standard arguments (which are used for the proof of the classical non-stochastic Gradient Method under Lipschitz-continuous gradients).