AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Data-Parallel Neural Network Training via Nonlinearly Preconditioned Trust-Region Method

Alegría, Samuel A. Cruz, Trotti, Ken, Kopaničáková, Alena, Krause, Rolf

arXiv.org Artificial IntelligenceFeb-7-2025

Parallel training methods are increasingly relevant in machine learning (ML) due to the continuing growth in model and dataset sizes. We propose a variant of the Additively Preconditioned Trust-Region Strategy (APTS) for training deep neural networks (DNNs). The proposed APTS method utilizes a data-parallel approach to construct a nonlinear preconditioner employed in the nonlinear optimization strategy. In contrast to the common employment of Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam), which are both variants of gradient descent (GD) algorithms, the APTS method implicitly adjusts the step sizes in each iteration, thereby removing the need for costly hyperparameter tuning. We demonstrate the performance of the proposed APTS variant using the MNIST and CIFAR-10 datasets. The results obtained indicate that the APTS variant proposed here achieves comparable validation accuracy to SGD and Adam, all while allowing for parallel training and obviating the need for expensive hyperparameter tuning.

artificial intelligence, machine learning, minibatch, (17 more...)

arXiv.org Artificial Intelligence

2502.05133

Country:

Europe > Switzerland (0.05)
North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

Add feedback

Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing SystemsFeb-6-2025, 22:45:54 GMT

This paper proposes a new adaptive learning rate scheme for optimizing nonlinear objective functions that arise during the training of deep neural networks. The main argument is based on recent results that indicate that the difficulty of the optimization stems from the presence of saddle points rather than local minima in the optimization path. The saddle points slow down training since the objective function tends to be flat in many directions and ill-conditioned in the neighbourhood of the saddle points. The authors propose a new method for reducing the ill-conditioning (the problem of pathological curvature) by "preconditioning" the objective function through a linear change of variables, which reduces to left-multiplying the gradient descent update step with a learned preconditioning matrix D. They focus specifically on the case where D is diagonal, and they show how a diagonal D reduces to methods for learning parameter-specific learning rates, such as the well-known Jacobi preconditioner or RMSProp. This is a nice framework within which to consider different schemes for adaptive learning rates.

author feedback and meta-review, contribution, objective function, (12 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.38)

Add feedback

Review for NeurIPS paper: Decentralized Accelerated Proximal Gradient Descent

Neural Information Processing SystemsFeb-6-2025, 17:21:46 GMT

Additional Feedback: edit I have read the rebuttal, and I would have liked the authors to point out precisely *what* technical points change and are difficult to handle. I think it would be great to actually highlight them in a revision of the paper. On a side note, I still believe that it is possible to get rid of the consensus step on y_t, and closeness between y_t and \bar{y}_t should be enforceable by the consensus step on x_t. This should be better in practice, since the consensus steps that are currently performed on y_t would also benefit x_t. A comparison with higher values of K would also have been welcome.

consensus step, decentralized accelerated proximal gradient descent, neurips paper, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback

Review for NeurIPS paper: Decentralized Accelerated Proximal Gradient Descent

Neural Information Processing SystemsFeb-6-2025, 17:21:38 GMT

The paper gives an accelerated gradient method for decentralized optimization on composite objectives. It achieves this by mimicking centralized accelerated proximal gradient descent. Slight concerns remained about the level of novelty over the Mudag algorithm, which should be expanded in the discussion more precisely, as well as the (theory) requirement of K 1 communications after every step and the not yet fully explained dependence of K on the graph parameter. We expect the authors to incorporate the feedback and improvement suggestions from the 4 reviews in the camera ready version.

decentralized accelerated proximal gradient descent, neurips paper

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)

Add feedback

Review for NeurIPS paper: An Improved Analysis of Stochastic Gradient Descent with Momentum

Neural Information Processing SystemsFeb-6-2025, 16:50:48 GMT

Weaknesses: The ideas of the paper could be interesting however the paper loses some points in terms of presentation. Also some claims are not really justified. For example the title mentioned "An Improved Analysis" but it was never really explained in detail why the proposed analysis justifies the word "improved". There are some limitations of existing papers in the Intro but this should be more clear in the main contributions of the work. In line 74, the authors mentioned: "To the best of our knowledge, this is the first convergence (and acceleration) guarantee for SGDM in the multistage setting."

improved analysis, neurips paper, stochastic gradient descent, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Add feedback

Review for NeurIPS paper: An Improved Analysis of Stochastic Gradient Descent with Momentum

Neural Information Processing SystemsFeb-6-2025, 16:50:41 GMT

The paper studies the convergence of SGD with momentum, which is of strong research interest. It shows that SGD with momentum converges as fast as SGD for smooth strongly-convex/non-convex objectives, and faster in a multi-stage scenario with learning-rate decay. While the core contribution was liked by all reviewers, Reviewer 3 brought a serious issue in the proof of Lemma 1 to our attention, which forms the foundation for the main results. After the feedback and additional clarification by the authors and longer discussions, we share the impression with the authors that the issue can be fixed by replacing E[m k] by v k throughout the paper and adjusting minor constants. We expect trust the authors to perform these changes and should any issues remain, withdraw the paper. Additionally, we hope the detailed feedback with improvement suggestions from the 4 reviews will be implemented for the camera ready version.

improved analysis, neurips paper, stochastic gradient descent, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Add feedback

Efficient Over-parameterized Matrix Sensing from Noisy Measurements via Alternating Preconditioned Gradient Descent

Liu, Zhiyu, Han, Zhi, Tang, Yandong, Zhang, Hai, Tang, Shaojie, Wang, Yao

arXiv.org Machine LearningFeb-6-2025

We consider the noisy matrix sensing problem in the over-parameterization setting, where the estimated rank $r$ is larger than the true rank $r_\star$. Specifically, our main objective is to recover a matrix $ X_\star \in \mathbb{R}^{n_1 \times n_2} $ with rank $ r_\star $ from noisy measurements using an over-parameterized factorized form $ LR^\top $, where $ L \in \mathbb{R}^{n_1 \times r}, \, R \in \mathbb{R}^{n_2 \times r} $ and $ \min\{n_1, n_2\} \ge r > r_\star $, with the true rank $ r_\star $ being unknown. Recently, preconditioning methods have been proposed to accelerate the convergence of matrix sensing problem compared to vanilla gradient descent, incorporating preconditioning terms $ (L^\top L + \lambda I)^{-1} $ and $ (R^\top R + \lambda I)^{-1} $ into the original gradient. However, these methods require careful tuning of the damping parameter $\lambda$ and are sensitive to initial points and step size. To address these limitations, we propose the alternating preconditioned gradient descent (APGD) algorithm, which alternately updates the two factor matrices, eliminating the need for the damping parameter and enabling faster convergence with larger step sizes. We theoretically prove that APGD achieves near-optimal error convergence at a linear rate, starting from arbitrary random initializations. Through extensive experiments, we validate our theoretical results and demonstrate that APGD outperforms other methods, achieving the fastest convergence rate. Notably, both our theoretical analysis and experimental results illustrate that APGD does not rely on the initialization procedure, making it more practical and versatile.

artificial intelligence, initialization, machine learning, (17 more...)

arXiv.org Machine Learning

2502.00463

Country:

Asia > China > Liaoning Province > Shenyang (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
North America > United States > New York (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)

Add feedback

Private Minimum Hellinger Distance Estimation via Hellinger Distance Differential Privacy

Deng, Fengnan, Vidyashankar, Anand N.

arXiv.org Machine LearningFeb-6-2025

Objective functions based on Hellinger distance yield robust and efficient estimators of model parameters. Motivated by privacy and regulatory requirements encountered in contemporary applications, we derive in this paper \emph{private minimum Hellinger distance estimators}. The estimators satisfy a new privacy constraint, namely, Hellinger differential privacy, while retaining the robustness and efficiency properties. We demonstrate that Hellinger differential privacy shares several features of standard differential privacy while allowing for sharper inference. Additionally, for computational purposes, we also develop Hellinger differentially private gradient descent and Newton-Raphson algorithms. We illustrate the behavior of our estimators in finite samples using numerical experiments and verify that they retain robustness properties under gross-error contamination.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

2501.14974

Country:

North America > United States > Virginia > Fairfax County > Fairfax (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Comparing privacy notions for protection against reconstruction attacks in machine learning

Biswas, Sayan, Dras, Mark, Faustini, Pedro, Fernandes, Natasha, McIver, Annabelle, Palamidessi, Catuscia, Sadeghi, Parastoo

arXiv.org Artificial IntelligenceFeb-6-2025

Within the machine learning community, reconstruction attacks are a principal concern and have been identified even in federated learning (FL), which was designed with privacy preservation in mind. In response to these threats, the privacy community recommends the use of differential privacy (DP) in the stochastic gradient descent algorithm, termed DP-SGD. However, the proliferation of variants of DP in recent years\textemdash such as metric privacy\textemdash has made it challenging to conduct a fair comparison between different mechanisms due to the different meanings of the privacy parameters $\epsilon$ and $\delta$ across different variants. Thus, interpreting the practical implications of $\epsilon$ and $\delta$ in the FL context and amongst variants of DP remains ambiguous. In this paper, we lay a foundational framework for comparing mechanisms with differing notions of privacy guarantees, namely $(\epsilon,\delta)$-DP and metric privacy. We provide two foundational means of comparison: firstly, via the well-established $(\epsilon,\delta)$-DP guarantees, made possible through the R\'enyi differential privacy framework; and secondly, via Bayes' capacity, which we identify as an appropriate measure for reconstruction threats.

artificial intelligence, machine learning, mechanism, (17 more...)

arXiv.org Artificial Intelligence

2502.04045

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
(11 more...)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Add feedback

Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints

Sung, Kyle, Kratsios, Anastasis, Forman, Noah

arXiv.org Machine LearningFeb-6-2025

We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR with a sub-linear dependence on its number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where surprisingly, we observe that networks trained with constant step size GD exhibit similar learning and regularity properties to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.

artificial intelligence, machine learning, neural network, (12 more...)

arXiv.org Machine Learning

2502.03792

Country:

North America > Canada > Ontario > Hamilton (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)

Add feedback