AITopics | shampoo

Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

Nayak, Nikhil, White, Julia, Zaratiana, Urchade, Zhang, Kelton, Princis, Henrijs, Atreja, Dhruv, Fawcett, Henry, Thomas, Matthew, Hurn-Maloney, George, Lewis, Ash

arXiv.org Machine Learning, May-21-2026

Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient-preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by 0.15, 0.07, and 0.11 nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.
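The two corrections named in the abstract can be illustrated with a minimal NumPy sketch for a diagonal preconditioner. This is not the paper's implementation: the function names `delta_corrected_inv_sqrt` and `cross_fitted_update`, the half-and-half microbatch split, and the use of per-microbatch squared gradients as the second-moment estimate are all assumptions made for illustration. The correction uses the standard second-order delta-method expansion E[f(v̂)] ≈ f(v) + ½ f''(v) Var(v̂) with f(v) = v^(-1/2).

```python
import numpy as np


def delta_corrected_inv_sqrt(v_hat, var_v_hat, eps=1e-12):
    """Plug-in inverse square root minus the leading delta-method bias term.

    For f(v) = v**-0.5 we have f''(v) = 0.75 * v**-2.5, so
    E[f(v_hat)] ~= f(v) + 0.5 * f''(v) * Var(v_hat)
                 = f(v) + 0.375 * Var(v_hat) * v**-2.5.
    The second term is estimated from the data and subtracted.
    """
    v = np.maximum(v_hat, eps)  # guard against division by zero
    return v ** -0.5 - 0.375 * var_v_hat * v ** -2.5


def cross_fitted_update(micro_grads, eps=1e-12):
    """Illustrative (hypothetical) cross-fitted diagonal update direction.

    micro_grads: array of shape (m, d), one gradient per microbatch,
    with m even and m >= 4 so the second group has a sample variance.
    """
    half = micro_grads.shape[0] // 2
    group_a, group_b = micro_grads[:half], micro_grads[half:]

    # Numerator: gradient estimate from the first microbatch group only.
    numerator = group_a.mean(axis=0)

    # Preconditioner: diagonal second moment from the other, independent group.
    squared = group_b ** 2
    v_hat = squared.mean(axis=0)
    # Variance of the mean v_hat, estimated from microbatch variability.
    var_v_hat = squared.var(axis=0, ddof=1) / squared.shape[0]

    return numerator * delta_corrected_inv_sqrt(v_hat, var_v_hat, eps)
```

Because the numerator and the preconditioner come from disjoint microbatch groups, they are independent whenever the microbatches are, which is what removes the coupling term in this sketch. When the estimated variance is large relative to `v_hat`, the corrected factor can become small or negative, so a practical version would presumably need clipping or damping; that choice is left open here.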

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Machine Learning

2605.20756

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Stochastic_Preconditioners-7

J Sun

Neural Information Processing Systems, Apr-30-2026, 05:58:24 GMT

artificial intelligence, machine learning, matrix, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

389cfad711d2b1e2128e931feee80230-Paper-Conference.pdf

Neural Information Processing Systems, Apr-26-2026, 16:24:55 GMT

approximation, artificial intelligence, machine learning, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

0b43289db08ed60edc6451cb2132e203-Paper-Conference.pdf

Neural Information Processing Systems, Apr-24-2026, 15:51:03 GMT

e5b4633454cb2174779d294ccda02318-Paper-Conference.pdf

Neural Information Processing Systems, Feb-18-2026, 12:14:24 GMT

matrix, preconditioner, shampoo, (17 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Singapore (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)

ef72fa6579401ffff9da246a5014f055-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-17-2026, 21:02:00 GMT

artificial intelligence, hyperparameter, machine learning, (19 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.95)

Stochastic_Preconditioners-7

J Sun

Neural Information Processing Systems, Feb-17-2026, 21:01:56 GMT

Thus, diagonal preconditioning methods remain popular.

artificial intelligence, machine learning, matrix, (18 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Denmark (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

dae3312c4c6c7000a37ecfb7b0aeb0e4-Supplemental.pdf

Neural Information Processing Systems, Feb-11-2026, 11:05:37 GMT

algorithm, matrix, shampoo, (14 more...)

Neural Information Processing Systems

Country: North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)

dae3312c4c6c7000a37ecfb7b0aeb0e4-Paper.pdf

Neural Information Processing Systems, Feb-11-2026, 11:05:33 GMT

Based on the so-called tensor normal (TN) distribution [31], we propose and analyze a brand new approximate natural gradient method, Tensor Normal Training (TNT), which, like Shampoo, only requires knowledge of the shape of the training parameters.

artificial intelligence, machine learning, matrix, (19 more...)

Neural Information Processing Systems

Technology: