Gradient Descent
Data-Parallel Neural Network Training via Nonlinearly Preconditioned Trust-Region Method
Alegrรญa, Samuel A. Cruz, Trotti, Ken, Kopaniฤรกkovรก, Alena, Krause, Rolf
Parallel training methods are increasingly relevant in machine learning (ML) due to the continuing growth in model and dataset sizes. We propose a variant of the Additively Preconditioned Trust-Region Strategy (APTS) for training deep neural networks (DNNs). The proposed APTS method utilizes a data-parallel approach to construct a nonlinear preconditioner employed in the nonlinear optimization strategy. In contrast to the common employment of Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam), which are both variants of gradient descent (GD) algorithms, the APTS method implicitly adjusts the step sizes in each iteration, thereby removing the need for costly hyperparameter tuning. We demonstrate the performance of the proposed APTS variant using the MNIST and CIFAR-10 datasets. The results obtained indicate that the APTS variant proposed here achieves comparable validation accuracy to SGD and Adam, all while allowing for parallel training and obviating the need for expensive hyperparameter tuning.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
This paper proposes a new adaptive learning rate scheme for optimizing nonlinear objective functions that arise during the training of deep neural networks. The main argument is based on recent results that indicate that the difficulty of the optimization stems from the presence of saddle points rather than local minima in the optimization path. The saddle points slow down training since the objective function tends to be flat in many directions and ill-conditioned in the neighbourhood of the saddle points. The authors propose a new method for reducing the ill-conditioning (the problem of pathological curvature) by "preconditioning" the objective function through a linear change of variables, which reduces to left-multiplying the gradient descent update step with a learned preconditioning matrix D. They focus specifically on the case where D is diagonal, and they show how a diagonal D reduces to methods for learning parameter-specific learning rates, such as the well-known Jacobi preconditioner or RMSProp. This is a nice framework within which to consider different schemes for adaptive learning rates.
Review for NeurIPS paper: Decentralized Accelerated Proximal Gradient Descent
Additional Feedback: edit I have read the rebuttal, and I would have liked the authors to point out precisely *what* technical points change and are difficult to handle. I think it would be great to actually highlight them in a revision of the paper. On a side note, I still believe that it is possible to get rid of the consensus step on y_t, and closeness between y_t and \bar{y}_t should be enforceable by the consensus step on x_t. This should be better in practice, since the consensus steps that are currently performed on y_t would also benefit x_t. A comparison with higher values of K would also have been welcome.
Review for NeurIPS paper: Decentralized Accelerated Proximal Gradient Descent
The paper gives an accelerated gradient method for decentralized optimization on composite objectives. It achieves this by mimicking centralized accelerated proximal gradient descent. Slight concerns remained about the level of novelty over the Mudag algorithm, which should be expanded in the discussion more precisely, as well as the (theory) requirement of K 1 communications after every step and the not yet fully explained dependence of K on the graph parameter. We expect the authors to incorporate the feedback and improvement suggestions from the 4 reviews in the camera ready version.
Review for NeurIPS paper: An Improved Analysis of Stochastic Gradient Descent with Momentum
Weaknesses: The ideas of the paper could be interesting however the paper loses some points in terms of presentation. Also some claims are not really justified. For example the title mentioned "An Improved Analysis" but it was never really explained in detail why the proposed analysis justifies the word "improved". There are some limitations of existing papers in the Intro but this should be more clear in the main contributions of the work. In line 74, the authors mentioned: "To the best of our knowledge, this is the first convergence (and acceleration) guarantee for SGDM in the multistage setting."
Review for NeurIPS paper: An Improved Analysis of Stochastic Gradient Descent with Momentum
The paper studies the convergence of SGD with momentum, which is of strong research interest. It shows that SGD with momentum converges as fast as SGD for smooth strongly-convex/non-convex objectives, and faster in a multi-stage scenario with learning-rate decay. While the core contribution was liked by all reviewers, Reviewer 3 brought a serious issue in the proof of Lemma 1 to our attention, which forms the foundation for the main results. After the feedback and additional clarification by the authors and longer discussions, we share the impression with the authors that the issue can be fixed by replacing E[m k] by v k throughout the paper and adjusting minor constants. We expect trust the authors to perform these changes and should any issues remain, withdraw the paper. Additionally, we hope the detailed feedback with improvement suggestions from the 4 reviews will be implemented for the camera ready version.
Efficient Over-parameterized Matrix Sensing from Noisy Measurements via Alternating Preconditioned Gradient Descent
Liu, Zhiyu, Han, Zhi, Tang, Yandong, Zhang, Hai, Tang, Shaojie, Wang, Yao
We consider the noisy matrix sensing problem in the over-parameterization setting, where the estimated rank $r$ is larger than the true rank $r_\star$. Specifically, our main objective is to recover a matrix $ X_\star \in \mathbb{R}^{n_1 \times n_2} $ with rank $ r_\star $ from noisy measurements using an over-parameterized factorized form $ LR^\top $, where $ L \in \mathbb{R}^{n_1 \times r}, \, R \in \mathbb{R}^{n_2 \times r} $ and $ \min\{n_1, n_2\} \ge r > r_\star $, with the true rank $ r_\star $ being unknown. Recently, preconditioning methods have been proposed to accelerate the convergence of matrix sensing problem compared to vanilla gradient descent, incorporating preconditioning terms $ (L^\top L + \lambda I)^{-1} $ and $ (R^\top R + \lambda I)^{-1} $ into the original gradient. However, these methods require careful tuning of the damping parameter $\lambda$ and are sensitive to initial points and step size. To address these limitations, we propose the alternating preconditioned gradient descent (APGD) algorithm, which alternately updates the two factor matrices, eliminating the need for the damping parameter and enabling faster convergence with larger step sizes. We theoretically prove that APGD achieves near-optimal error convergence at a linear rate, starting from arbitrary random initializations. Through extensive experiments, we validate our theoretical results and demonstrate that APGD outperforms other methods, achieving the fastest convergence rate. Notably, both our theoretical analysis and experimental results illustrate that APGD does not rely on the initialization procedure, making it more practical and versatile.
Private Minimum Hellinger Distance Estimation via Hellinger Distance Differential Privacy
Deng, Fengnan, Vidyashankar, Anand N.
Objective functions based on Hellinger distance yield robust and efficient estimators of model parameters. Motivated by privacy and regulatory requirements encountered in contemporary applications, we derive in this paper \emph{private minimum Hellinger distance estimators}. The estimators satisfy a new privacy constraint, namely, Hellinger differential privacy, while retaining the robustness and efficiency properties. We demonstrate that Hellinger differential privacy shares several features of standard differential privacy while allowing for sharper inference. Additionally, for computational purposes, we also develop Hellinger differentially private gradient descent and Newton-Raphson algorithms. We illustrate the behavior of our estimators in finite samples using numerical experiments and verify that they retain robustness properties under gross-error contamination.
Comparing privacy notions for protection against reconstruction attacks in machine learning
Biswas, Sayan, Dras, Mark, Faustini, Pedro, Fernandes, Natasha, McIver, Annabelle, Palamidessi, Catuscia, Sadeghi, Parastoo
Within the machine learning community, reconstruction attacks are a principal concern and have been identified even in federated learning (FL), which was designed with privacy preservation in mind. In response to these threats, the privacy community recommends the use of differential privacy (DP) in the stochastic gradient descent algorithm, termed DP-SGD. However, the proliferation of variants of DP in recent years\textemdash such as metric privacy\textemdash has made it challenging to conduct a fair comparison between different mechanisms due to the different meanings of the privacy parameters $\epsilon$ and $\delta$ across different variants. Thus, interpreting the practical implications of $\epsilon$ and $\delta$ in the FL context and amongst variants of DP remains ambiguous. In this paper, we lay a foundational framework for comparing mechanisms with differing notions of privacy guarantees, namely $(\epsilon,\delta)$-DP and metric privacy. We provide two foundational means of comparison: firstly, via the well-established $(\epsilon,\delta)$-DP guarantees, made possible through the R\'enyi differential privacy framework; and secondly, via Bayes' capacity, which we identify as an appropriate measure for reconstruction threats.
Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints
Sung, Kyle, Kratsios, Anastasis, Forman, Noah
We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR with a sub-linear dependence on its number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where surprisingly, we observe that networks trained with constant step size GD exhibit similar learning and regularity properties to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.