AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Towards Tight Communication Lower Bounds for Distributed Optimisation

Neural Information Processing SystemsOct-10-2024, 02:27:15 GMT

We consider a standard distributed optimisation setting where N machines, each holding a d -dimensional function f_i, aim to jointly minimise the sum of the functions \sum_{i 1} N f_i (x) . This problem arises naturally in large-scale distributed optimisation, where a standard solution is to apply variants of (stochastic) gradient descent. We focus on the communication complexity of this problem: our main result provides the first fully unconditional bounds on total number of bits which need to be sent and received by the N machines to solve this problem under point-to-point communication, within a given error-tolerance. Specifically, we show that \Omega( Nd \log d / N\varepsilon) total bits need to be communicated between the machines to find an additive \epsilon -approximation to the minimum of \sum_{i 1} N f_i (x) . The result holds for both deterministic and randomised algorithms, and, importantly, requires no assumptions on the algorithm structure.

communication complexity, optimisation, tight communication lower bound, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Add feedback

Predicting Training Time Without Training

Neural Information Processing SystemsOct-10-2024, 02:00:46 GMT

We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function. To do so, we leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model. This allows us to approximate the training loss and accuracy at any point during training by solving a low-dimensional Stochastic Differential Equation (SDE) in function space. Using this result, we are able to predict the time it takes for Stochastic Gradient Descent (SGD) to fine-tune a model to a given loss without having to perform any training. In our experiments, we are able to predict training time of a ResNet within a 20\% error margin on a variety of datasets and hyper-parameters, at a 30 to 45-fold reduction in cost compared to actual training.

deep network, predict training time, training time

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Add feedback

Competitive Gradient Descent

Neural Information Processing SystemsOct-10-2024, 01:35:53 GMT

We introduce a new algorithm for the numerical computation of Nash equilibria of competitive two-player games. Our method is a natural generalization of gradient descent to the two-player setting where the update is given by the Nash equilibrium of a regularized bilinear local approximation of the underlying game. It avoids oscillatory and divergent behaviors seen in alternating gradient descent. Using numerical experiments and rigorous analysis, we provide a detailed comparison to methods based on \emph{optimism} and \emph{consensus} and show that our method avoids making any unnecessary changes to the gradient dynamics while achieving exponential (local) convergence for (locally) convex-concave zero sum games. Convergence and stability properties of our method are robust to strong interactions between the players, without adapting the stepsize, which is not the case with previous methods. In our numerical experiments on non-convex-concave problems, existing methods are prone to divergence and instability due to their sensitivity to interactions among the players, whereas we never observe divergence of our algorithm.

algorithm, competitive gradient descent, interaction, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)

Add feedback

Random Reshuffling is Not Always Better

Neural Information Processing SystemsOct-10-2024, 01:33:41 GMT

Many learning algorithms, such as stochastic gradient descent, are affected by the order in which training examples are used. It is often observed that sampling the training examples without-replacement, also known as random reshuffling, causes learning algorithms to converge faster. We give a counterexample to the Operator Inequality of Noncommutative Arithmetic and Geometric Means, a longstanding conjecture that relates to the performance of random reshuffling in learning algorithms (Recht and Ré, "Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences," COLT 2012). We use this to give an example of a learning task and algorithm for which with-replacement random sampling actually outperforms random reshuffling.

algorithm, random reshuffling, training example, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

Add feedback

Optimal Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization

Neural Information Processing SystemsOct-10-2024, 00:47:58 GMT

Epoch-GD) proposed by (Hazan and Kale, 2011) was deemeda breakthrough for stochastic strongly convex minimization, which achieves theoptimal convergence rate of O(1/T) with T iterative updates for the objective gap. However, its extension to solving stochastic min-max problems with strong convexity and strong concavity still remains open, and it is still unclear whethera fast rate ofO(1/T)for theduality gapis achievable for stochastic min-max optimization under strong convexity and strong concavity. Although some re-cent studies have proposed stochastic algorithms with fast convergence rates formin-max problems, they require additional assumptions about the problem, e.g.,smoothness, bi-linear structure, etc. To the best of our knowledge, our result is the first one that shows Epoch-GDA can achieve the optimal rate ofO(1/T)for the duality gapof general SCSC min-max problems. We emphasize that such generalization of Epoch-GD for strongly convex minimization problems to Epoch-GDA for SCSC min-max problems is non-trivial and requires novel technical analysis. Moreover, we notice that the key lemma can also be used for proving the convergence of Epoch-GDA for weakly-convex strongly-concave min-max problems, leading to a nearly optimal complexity without resorting to smoothness or other structural conditions.

min-max optimization, min-max problem, stochastic gradient descent ascent method, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.96)

Add feedback

Privacy of Noisy Stochastic Gradient Descent: More Iterations without More Privacy Loss

Neural Information Processing SystemsOct-10-2024, 00:45:17 GMT

A central issue in machine learning is how to train models on sensitive user data. Industry has widely adopted a simple algorithm: Stochastic Gradient Descent with noise (a.k.a. However, foundational theoretical questions about this algorithm's privacy loss remain open---even in the seemingly simple setting of smooth convex losses over a bounded domain. Our main result resolves these questions: for a large range of parameters, we characterize the differential privacy up to a constant. This result reveals that all previous analyses for this setting have the wrong qualitative behavior.

iteration, noisy stochastic gradient descent, stochastic gradient descent, (3 more...)

Neural Information Processing Systems

Genre: Play > Prospect > Container > Reservoir (0.91)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Understanding the Role of Momentum in Stochastic Gradient Methods

Neural Information Processing SystemsOct-10-2024, 00:21:23 GMT

The use of momentum in stochastic gradient methods has become a widespread practice in machine learning. Different variants of momentum, including heavy-ball momentum, Nesterov's accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM), have demonstrated success on various tasks. Despite these empirical successes, there is a lack of clear understanding of how the momentum parameters affect convergence and various performance measures of different algorithms. In this paper, we use the general formulation of QHM to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions. In addition, by combining the results on convergence rates and stationary distributions, we obtain sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters.

momentum, stationary distribution, stochastic gradient method, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Add feedback

Functional Stochastic Gradient MCMC for Bayesian Neural Networks

Wu, Mengjing, Xuan, Junyu, Lu, Jie

arXiv.org Artificial IntelligenceOct-10-2024

Classical parameter-space Bayesian inference for Bayesian neural networks (BNNs) suffers from several unresolved prior issues, such as knowledge encoding intractability and pathological behaviours in deep networks, which can lead to improper posterior inference. To address these issues, functional Bayesian inference has recently been proposed leveraging functional priors, such as the emerging functional variational inference. In addition to variational methods, stochastic gradient Markov Chain Monte Carlo (MCMC) is another scalable and effective inference method for BNNs to asymptotically generate samples from the true posterior by simulating continuous dynamics. However, existing MCMC methods perform solely in parameter space and inherit the unresolved prior issues, while extending these dynamics to function space is a non-trivial undertaking. In this paper, we introduce novel functional MCMC schemes, including stochastic gradient versions, based on newly designed diffusion dynamics that can incorporate more informative functional priors. Moreover, we prove that the stationary measure of these functional dynamics is the target posterior over functions. Our functional MCMC schemes demonstrate improved performance in both predictive accuracy and uncertainty quantification on several tasks compared to naive parameter-space MCMC and functional variational inference.

artificial intelligence, machine learning, training data, (12 more...)

arXiv.org Artificial Intelligence

2409.16632

Country: North America (0.14)

Genre: Research Report (0.82)

Industry:

Energy > Oil & Gas > Upstream (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)

Add feedback

Towards Sharper Risk Bounds for Minimax Problems

Zhu, Bowei, Li, Shaojie, Liu, Yong

arXiv.org Machine LearningOct-10-2024

Minimax problems have achieved success in machine learning such as adversarial training, robust optimization, reinforcement learning. For theoretical analysis, current optimal excess risk bounds, which are composed by generalization error and optimization error, present 1/n-rates in strongly-convex-strongly-concave (SC-SC) settings. Existing studies mainly focus on minimax problems with specific algorithms for optimization error, with only a few studies on generalization performance, which limit better excess risk bounds. In this paper, we study the generalization bounds measured by the gradients of primal functions using uniform localized convergence. We obtain a sharper high probability generalization error bound for nonconvex-strongly-concave (NC-SC) stochastic minimax problems. Furthermore, we provide dimension-independent results under Polyak-Lojasiewicz condition for the outer layer. Based on our generalization error bound, we analyze some popular algorithms such as empirical saddle point (ESP), gradient descent ascent (GDA) and stochastic gradient descent ascent (SGDA). We derive better excess primal risk bounds with further reasonable assumptions, which, to the best of our knowledge, are n times faster than exist results in minimax problems.

artificial intelligence, machine learning, probability, (17 more...)

arXiv.org Machine Learning

doi: 10.24963/ijcai.2024/630

2410.08497

Country:

Asia > China > Beijing > Beijing (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

Add feedback

Finite Sample and Large Deviations Analysis of Stochastic Gradient Algorithm with Correlated Noise

Yin, George, Krishnamurthy, Vikram

arXiv.org Artificial IntelligenceOct-10-2024

This paper focuses on finite sample analysis for stochastic gradient algorithms. The motivation stems from a vast varieties of applications. In particular, the recent advances on stochastic optimization in conjunction with machine learning have opened up new domains. A particular emphasis of the learning community requires us taking a careful look at of the finite sample analysis. Well, it is well known that stochastic gradient algorithms or stochastic approximation algorithms are normally concentrated on dealing with asymptotic properties of the recursive algorithms. However, the learning community placed more effort for carrying out analysis of finite sample properties of the recursive algorithms; see for example,... and references therein.

artificial intelligence, machine learning, stochastic gradient algorithm, (14 more...)

arXiv.org Artificial Intelligence

2410.08449

Country:

North America > United States > New York (0.04)
North America > United States > Connecticut (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.82)

Add feedback