AITopics | Gradient Descent

6d0bf1265ea9635fb4f9d56f16d7efb2-Paper-Conference.pdf

Neural Information Processing SystemsFeb-13-2026, 18:57:52 GMT

Recent works have shown that line search methods can speed up Stochastic Gradient Descent (SGD) and Adam in modern over-parameterized settings. However, existing line searches may take steps that are smaller than necessary since they require a monotone decrease of the (mini-)batch objective function.

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Country:

Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
North America > Canada > British Columbia (0.04)
Europe > Russia (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

b865367fc4c0845c0682bd466e6ebf4c-AuthorFeedback.pdf

Neural Information Processing SystemsFeb-13-2026, 18:57:24 GMT

dasgupta, dendrogram, theoretical guarantee, (14 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.44)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.33)

Add feedback

Momentum-Based Variance Reduction in Non-Convex SGD

Ashok Cutkosky, Francesco Orabona

Neural Information Processing SystemsFeb-13-2026, 18:55:59 GMT

Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the converge rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large "mega-batches" in order to achieve their improved results.

algorithm, artificial intelligence, machine learning, (13 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.97)

Add feedback

Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization

Yuanxiang Gao, Li Chen, Baochun Li

Neural Information Processing SystemsFeb-13-2026, 18:36:25 GMT

Neural Information Processing Systems http://nips.cc/

cross-entropy minimization, placement, proximal policy optimization, (13 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Louisiana (0.04)
North America > Canada > Quebec > Montreal (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.47)

Add feedback

Feature learning via mean-field Langevin dynamics: classifying sparse parities and beyond Taiji Suzuki 1,2, Denny Wu

Neural Information Processing SystemsFeb-13-2026, 17:35:03 GMT

Langevin dynamics (MFLD) (Mei et al., 2018; Hu et al., 2019) is particularly attractive due to the MFLD arises from a noisy gradient descent update on the parameters, where Gaussian noise is injected to the gradient to encourage "exploration". Furthermore, uniform-in-time estimates of the particle discretization error have also been established (Suzuki et al., The goal of this work is to address the following question.

artificial intelligence, machine learning, neural network, (15 more...)

Neural Information Processing Systems

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

Stochastic Composite Mirror Descent: Optimal Bounds with High Probabilities

Yunwen Lei, Ke Tang

Neural Information Processing SystemsFeb-13-2026, 15:37:40 GMT

We apply the derived computational error bounds to study the generalization performance ofmulti-pass stochastic gradient descent (SGD) ina non-parametric setting.

artificial intelligence, assumption, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

ae614c557843b1df326cb29c57225459-Paper.pdf

Neural Information Processing SystemsFeb-13-2026, 14:36:03 GMT

In this work, we showthat this "lazy training" phenomenon isnot specific tooverparameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding amodel equivalenttolearning withpositive-definite kernels.

artificial intelligence, machine learning, neural network, (17 more...)

Neural Information Processing Systems

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)

Add feedback

Stochastic Chebyshev Gradient Descent for Spectral Optimization

Insu Han, Haim Avron, Jinwoo Shin

Neural Information Processing SystemsFeb-13-2026, 14:35:01 GMT

Unfortunately, computing the gradient of a spectral function is generally of cubic complexity, as such gradient descent methods are rather expensive for optimizing objectives involving the spectral function.

artificial intelligence, gradient, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Europe > Hungary > Csongrád-Csanád County > Szeged (0.05)
North America > Canada > Quebec > Montreal (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.37)

Add feedback

SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points

Zhize Li

Neural Information Processing SystemsFeb-13-2026, 13:56:58 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, gradient, stationary point, (13 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California (0.04)
North America > Canada (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

Bayesian Distributed Stochastic Gradient Descent

Michael Teng, Frank Wood

Neural Information Processing SystemsFeb-13-2026, 13:56:32 GMT

We introduce Bayesian distributed stochastic gradient descent (BDSGD), a high-throughput algorithm for training deep neural networks on parallel computing clusters. This algorithm uses amortized inference in a deep generative model to perform joint posterior predictive inference of mini-batch gradient computation times in a compute cluster specific manner. Specifically, our algorithm mitigates the straggler effect in synchronous, gradient-based optimization by choosing an optimal cutoff beyond which mini-batch gradient messages from slow workers are ignored. The principle novel contribution and finding of this work goes beyond this by demonstrating that using the predicted run-times from a generative model of cluster worker performance improves over the static-cutoff prior art, leading to higher gradient computation throughput on large compute clusters. In our experiments we show that eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but sometimes also increases the overall rate of convergence as a function of wall-clock time by virtue of eliminating idleness.

artificial intelligence, machine learning, throughput, (18 more...)

Neural Information Processing Systems

Country: