AITopics | optimizer

Evolutionary Stochastic Gradient Descent for Optimization of Deep Neural Networks

Neural Information Processing SystemsMar-16-2026, 21:27:45 GMT

We propose a population-based Evolutionary Stochastic Gradient Descent (ESGD) framework for optimizing deep neural networks. ESGD combines SGD and gradient-free evolutionary algorithms as complementary algorithms in one framework in which the optimization alternates between the SGD step and evolution step to improve the average fitness of the population. With a back-off strategy in the SGD step and an elitist strategy in the evolution step, it guarantees that the best fitness in the population will never degrade. In addition, individuals in the population optimized with various SGD-based optimizers using distinct hyper-parameters in the SGD step are considered as competing species in a coevolution setting such that the complementarity of the optimizers is also taken into account. The effectiveness of ESGD is demonstrated across multiple applications including speech recognition, image recognition and language modeling, using networks with a variety of deep architectures.

artificial intelligence, machine learning, proceedings, (10 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.65)

Add feedback

f718499c1c8cef6730f9fd03c8125cab-Supplemental.pdf

Neural Information Processing SystemsMar-14-2026, 07:00:14 GMT

computer vision, leakyrelu, power 0, (10 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)

Add feedback

c8cc6e90ccbff44c9cee23611711cdc4-AuthorFeedback.pdf

Neural Information Processing SystemsMar-13-2026, 10:50:55 GMT

configuration, generalization, reinforcement learning, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.38)

Add feedback

L4: Practical loss-based stepsize adaptation for deep learning

Neural Information Processing SystemsMar-13-2026, 06:54:55 GMT

We propose a stepsize adaptation scheme for stochastic gradient descent. It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss. We demonstrate its capabilities by conclusively improving the performance of Adam and Momentum optimizers. The enhanced optimizers with default hyperparameters consistently outperform their constant stepsize counterparts, even the best ones, without a measurable increase in computational cost. The performance is validated on multiple architectures including dense nets, CNNs, ResNets, and the recurrent Differential Neural Computer on classical datasets MNIST, fashion MNIST, CIFAR10 and others.

artificial intelligence, machine learning, proceedings, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

Add feedback

The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

Gronich, Eitan, Vardi, Gal

arXiv.org Machine LearningMar-4-2026

We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Machine Learning

2602.1634

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Add feedback

Towards Anytime-Valid Statistical Watermarking

Huang, Baihe, Xu, Eric, Ramchandran, Kannan, Jiao, Jiantao, Jordan, Michael I.

arXiv.org Machine LearningFeb-20-2026

The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach for selecting sampling distributions and the reliance on fixed-horizon hypothesis testing, which precludes valid early stopping. In this paper, we bridge this gap by developing the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Unlike traditional approaches where optional stopping invalidates Type-I error guarantees, our framework enables valid, anytime-inference by constructing a test supermartingale for the detection process. By leveraging an anchor distribution to approximate the target model, we characterize the optimal e-value with respect to the worst-case log-growth rate and derive the optimal expected stopping time. Our theoretical claims are substantiated by simulations and evaluations on established benchmarks, showing that our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Machine Learning

2602.17608

Country:

Asia > Middle East > Jordan (0.41)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > Massachusetts > Middlesex County > Burlington (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)

Genre: Research Report > Promising Solution (0.48)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Regularizing Trajectory Optimization with Denoising Autoencoders

Rinu Boney, Norman Di Palo, Mathias Berglund, Alexander Ilin, Juho Kannala, Antti Rasmus, Harri Valpola

Neural Information Processing SystemsFeb-19-2026, 10:53:55 GMT

Neural Information Processing Systems http://nips.cc/

international conference, optimization, trajectory optimization, (15 more...)

Neural Information Processing Systems

Country: North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Energy (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Large-batchOptimizationforDenseVisualPredictions

Neural Information Processing SystemsFeb-19-2026, 06:21:58 GMT

At thet-th backward propagation step, we can derive the gradient il(wt)toupdatei-th module inM. The number in the bracket represents the batch size. We see that when the batch size is small (i.e.,32), the gradientvariancesaresimilar. N and K indicate the number of FPN levels and region proposals fed into the detection head. To evaluate this assumption, as shown in Figure 1, we have three observations. As illustrated by the second figure in Figure 1, the gradient misalignment phenomenon between detection head and backbone has been reduced.

artificial intelligence, machine learning, wehave, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.95)

Add feedback