AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Noise is All You Need: Private Second-Order Convergence of Noisy SGD

Avdiukhin, Dmitrii, Dinitz, Michael, Fan, Chenglin, Yaroslavtsev, Grigory

arXiv.org Artificial IntelligenceOct-9-2024

Private optimization is a topic of major interest in machine learning, with differentially private stochastic gradient descent (DP-SGD) playing a key role in both theory and practice. Furthermore, DP-SGD is known to be a powerful tool in contexts beyond privacy, including robustness, machine unlearning, etc. Existing analyses of DP-SGD either make relatively strong assumptions (e.g., Lipschitz continuity of the loss function, or even convexity) or prove only first-order convergence (and thus might end at a saddle point in the non-convex setting). At the same time, there has been progress in proving second-order convergence of the non-private version of ``noisy SGD'', as well as progress in designing algorithms that are more complex than DP-SGD and do guarantee second-order convergence. We revisit DP-SGD and show that ``noise is all you need'': the noise necessary for privacy already implies second-order convergence under the standard smoothness assumptions, even for non-Lipschitz loss functions. Hence, we get second-order convergence essentially for free: DP-SGD, the workhorse of modern private optimization, under minimal assumptions can be used to find a second-order stationary point.

convergence, gradient, neural information processing system, (15 more...)

arXiv.org Artificial Intelligence

2410.06878

Country:

North America > United States > Texas > Travis County > Austin (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > New York > New York County > New York City (0.04)
(14 more...)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

Reviews: Evolutionary Stochastic Gradient Descent for Optimization of Deep Neural Networks

Neural Information Processing SystemsOct-8-2024, 21:06:19 GMT

The paper combines ES and SGD in a complementary way, not viewing as an alternative to each other. More intuitively, in the course of optimization, the author proposes to use evolutionary strategy to make optimiser adjust at different sections (geometry) of optimisation path. Since the geometry of the loss surface can differ drastically at different locations, it is reasonable to use different gradient based optimisers with different hyper parameters at different locations. Then the main question becomes how to choose a proper optimiser at different locations. The paper proposes ES for that.

esgd hyperparameter, hyperparameter, optimizer, (12 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Add feedback

Stochastic Cubic Regularization for Fast Nonconvex Optimization

Neural Information Processing SystemsOct-8-2024, 19:33:04 GMT

This paper proposes a stochastic variant of a classic algorithm---the cubic-regularized Newton method [Nesterov and Polyak]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only \mathcal{\tilde{O}}(\epsilon {-3.5}) stochastic gradient and stochastic Hessian-vector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the \mathcal{\tilde{O}}(\epsilon {-4}) rate of stochastic gradient descent. Our rate matches the best-known result for finding local minima without requiring any delicate acceleration or variance-reduction techniques.

fast nonconvex optimization, local minima, stochastic cubic regularization, (2 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.12)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Sparsified SGD with Memory

Neural Information Processing SystemsOct-8-2024, 18:51:09 GMT

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory).

algorithm, scalability, sparsified sgd

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

Add feedback

Training Deep Models Faster with Robust, Approximate Importance Sampling

Neural Information Processing SystemsOct-8-2024, 17:51:18 GMT

In practice, the cost of computing importances greatly limits the impact of importance sampling. We propose a robust, approximate importance sampling procedure (RAIS) for stochastic gradient de- scent. By approximating the ideal sampling distribution using robust optimization, RAIS provides much of the benefit of exact importance sampling with drastically reduced overhead. Empirically, we find RAIS-SGD and standard SGD follow similar learning curves, but RAIS moves faster through these paths, achieving speed-ups of at least 20% and sometimes much more.

approximate importance sampling, robust

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.54)

Add feedback

Uncertainty Sampling is Preconditioned Stochastic Gradient Descent on Zero-One Loss

Neural Information Processing SystemsOct-8-2024, 16:29:29 GMT

Uncertainty sampling, a popular active learning algorithm, is used to reduce the amount of data required to learn a classifier, but it has been observed in practice to converge to different parameters depending on the initialization and sometimes to even better parameters than standard training on all the data. In this work, we give a theoretical explanation of this phenomenon, showing that uncertainty sampling on a convex (e.g., logistic) loss can be interpreted as performing a preconditioned stochastic gradient step on the population zero-one loss. Experiments on synthetic and real datasets support this connection.

preconditioned stochastic gradient descent, uncertainty sampling, zero-one loss

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.69)

Add feedback

cpSGD: Communication-efficient and differentially-private distributed SGD

Neural Information Processing SystemsOct-8-2024, 14:57:52 GMT

Distributed stochastic gradient descent is an important subroutine in distributed learning. A setting of particular interest is when the clients are mobile devices, where two important concerns are communication efficiency and the privacy of the clients. Several recent works have focused on reducing the communication cost or introducing privacy guarantees, but none of the proposed communication efficient methods are known to be privacy preserving and none of the known privacy mechanisms are known to be communication efficient. To this end, we study algorithms that achieve both communication efficiency and differential privacy. For d variables and n \approx d clients, the proposed method uses \cO(\log \log(nd)) bits of communication per client per coordinate and ensures constant privacy.

communication-efficient and differentially-private, mechanism, privacy, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Add feedback

Implicit Bias of Gradient Descent on Linear Convolutional Networks

Neural Information Processing SystemsOct-8-2024, 14:35:45 GMT

We show that gradient descent on full-width linear convolutional networks of depth L converges to a linear predictor related to the \ell_{2/L} bridge penalty in the frequency domain. This is in contrast to linearly fully connected networks, where gradient descent converges to the hard margin linear SVM solution, regardless of depth.

gradient descent, implicit bias, linear convolutional network, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Reviews: Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo

Neural Information Processing SystemsOct-8-2024, 14:29:07 GMT

I like the fact that the method is very simple to understand and implement (see my summary), and does not require any major changes to the base SG-MCMC algorithm. Also, this seems very general and applies to a large class of SG-MCMC algorithms, and is therefore potentially very impactful to the Stochastic Gradient MCMC community. Novelty: Although Richardson-Romberg extrapolation is well known in numerical analysis, it is not widely known in the machine learning / stochastic gradient MCMC community. Clarity: The paper is well written and the presentation is clear. Comments/questions: - Can this technique be directly applied to all SG-MCMC methods?

richardson-romberg markov chain monte carlo, sg-mcmc algorithm, sgld, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.40)

Add feedback

Reviews: Gradient Descent Can Take Exponential Time to Escape Saddle Points

Neural Information Processing SystemsOct-8-2024, 13:28:15 GMT

It has recently been shown that, when all the saddle points of a non-convex function are "strict saddle", then gradient descent with a (reasonable) random initialization converges to a local minimizer with probability one. For a randomly perturbed version of gradient descent, the convergence rate can additionally be shown to be polynomial in the parameters. This article proves that such a convergence rate does not hold for the non-perturbed version: there exists reasonably smooth functions with only strict saddle points, and natural initialization schemes such that gradient descent requires a number of steps that is exponential in the dimension to find a local minimum. I liked this article very much. It answers a very natural question: gradient descent is an extremely classical, and very simple algorithm.

assumption, equation, gradient descent, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback