AITopics | minibatch size

An Even More Optimal Stochastic Optimization Algorithm: Minibatching and Interpolation Learning

Neural Information Processing SystemsApr-25-2026, 12:56:56 GMT

We present and analyze an algorithm for optimizing smooth and convex or strongly convex objectives using minibatch stochastic gradient estimates. The algorithm is optimal with respect to its dependence on both the minibatch size and minimum expected loss simultaneously. This improves over the optimal method of Lan [17], which is insensitive to the minimum expected loss; over the optimistic acceleration of Cotter et al. [10], which has suboptimal dependence on the minibatch size; and over the algorithm of Liu and Belkin [19], which is limited to least squares problems and is also similarly suboptimal with respect to the minibatch size.

algorithm, artificial intelligence, machine learning, (14 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.38)

Add feedback

STEM: AStochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning

Neural Information Processing SystemsApr-25-2026, 08:30:09 GMT

Federated Learning (FL) refers to the paradigm where multiple worker nodes (WNs) build a joint model by using local data. Despite extensive research, for a generic non-convex FL problem, it is not clear, how to choose the WNs' and the server's update directions, the minibatch sizes, and the number of local updates, so that the WNs use the minimum number of samples and communication rounds to achieve the desired solution. This work addresses the above question and considers a class of stochastic algorithms where the WNs perform a few local updates before communication. We show that when both the WN's and the server's directions are chosen based on certain stochastic momentum estimator, the algorithm requires O( 3/2) samples and O( 1) communication rounds to compute an -stationary solution. To the best of our knowledge, this is the first FL algorithm that achieves such near-optimal sample and communication complexities simultaneously. Further, we show that there is a trade-off curve between the number of local updates and the minibatch sizes, on which the above sample and communication complexities can be maintained. Finally, we show that for the classical FedAvg (a.k.a. Local SGD, which is a momentum-less special case of the STEM), a similar trade-off curve exists, albeit with worse sample and communication complexities. Our insights on this trade-off provides guidelines for choosing the four important design elements for FL algorithms, the number of local updates, WNs' and server's update directions, and minibatch sizes to achieve the best performance.

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Research Report (0.45)

Industry: Information Technology (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

STEM: AStochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning

Neural Information Processing SystemsApr-25-2026, 08:30:05 GMT

Federated Learning (FL) refers to the paradigm where multiple worker nodes (WNs) build a joint model by using local data. Despite extensive research, for a generic non-convex FL problem, it is not clear, how to choose the WNs' and the server's update directions, the minibatch sizes, and the number of local updates, so that the WNs use the minimum number of samples and communication rounds to achieve the desired solution. This work addresses the above question and considers a class of stochastic algorithms where the WNs perform a few local updates before communication. We show that when both the WN's and the server's directions are chosen based on certain stochastic momentum estimator, the algorithm requires O( 3/2) samples and O( 1) communication rounds to compute an -stationary solution. To the best of our knowledge, this is the first FL algorithm that achieves such near-optimal sample and communication complexities simultaneously. Further, we show that there is a trade-off curve between the number of local updates and the minibatch sizes, on which the above sample and communication complexities can be maintained. Finally, we show that for the classical FedAvg (a.k.a. Local SGD, which is a momentum-less special case of the STEM), a similar trade-off curve exists, albeit with worse sample and communication complexities. Our insights on this trade-off provides guidelines for choosing the four important design elements for FL algorithms, the number of local updates, WNs' and server's update directions, and minibatch sizes to achieve the best performance.

artificial intelligence, complexity, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

SM

Neural Information Processing SystemsApr-25-2026, 06:02:59 GMT

artificial intelligence, deep learning, machine learning, (19 more...)

Neural Information Processing Systems

Country: Europe (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

STEM: A Stochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning

Neural Information Processing SystemsMar-18-2026, 12:51:54 GMT

Federated Learning (FL) refers to the paradigm where multiple worker nodes (WNs) build a joint model by using local data. Despite extensive research, for a generic non-convex FL problem, it is not clear, how to choose the WNs' and the server's update directions, the minibatch sizes, and the local update frequency, so that the WNs use the minimum number of samples and communication rounds to achieve the desired solution. This work addresses the above question and considers a class of stochastic algorithms where the WNs perform a few local updates before communication. We show that when both the WN's and the server's directions are chosen based on certain stochastic momentum estimator, the algorithm requires $\tilde{\mathcal{O}}(\epsilon^{-3/2})$ samples and $\tilde{\mathcal{O}}(\epsilon^{-1})$ communication rounds to compute an $\epsilon$-stationary solution. To the best of our knowledge, this is the first FL algorithm that achieves such {\it near-optimal} sample and communication complexities simultaneously. Further, we show that there is a trade-off curve between local update frequencies and local minibatch sizes, on which the above sample and communication complexities can be maintained.

artificial intelligence, machine learning, proceedings, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.41)

Add feedback

Wasserstein Dependency Measure for Representation Learning

Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron van den Oord, Sergey Levine, Pierre Sermanet

Neural Information Processing SystemsFeb-15-2026, 05:55:50 GMT

Neural Information Processing Systems http://nips.cc/

arxiv preprint arxiv, mutual information, representation, (10 more...)

Neural Information Processing Systems

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)

Add feedback

A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization

Zhize Li, Jian Li

Neural Information Processing SystemsFeb-15-2026, 01:52:50 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, minibatch size, pl condition, (12 more...)

Neural Information Processing Systems

Country:

Asia > China (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.52)

Add feedback

Adaptive Methods for Nonconvex Optimization

Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, Sanjiv Kumar

Neural Information Processing SystemsFeb-13-2026, 17:37:30 GMT

The first prominent algorithms in this line of research isADAGRAD [7,22], which uses a per-dimension learning rate based on squared pastgradients.ADAGRADachievedsignificant performance gainsincomparison toSGDwhenthe gradientsaresparse.

artificial intelligence, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback

4d0b954f0bef437c29dfa73fafdf3fa5-Paper.pdf

Neural Information Processing SystemsFeb-12-2026, 03:02:22 GMT

averaging, gradient descent, sgd, (14 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.58)

Add feedback

ff1418e8cc993fe8abcfe3ce2003e5c5-Supplemental.pdf

Neural Information Processing SystemsFeb-12-2026, 02:07:07 GMT

The table ( right) shows 100 epoch results using best lr and wd values found at 50 epochs. ViT's patchify stem differs from the proposed convolutional stem in the type of convolution used and We investigate these factors next. The focus of this paper is studying the large, positive impact of changing ViT's default We use AdamW for all experiments. Figure 7 shows the results. The table ( right) shows 100 epoch results using optimal lr and wd values chosen from the 50 epoch runs.

artificial intelligence, experiment, machine learning, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Filters

Collaborating Authors

minibatch size

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

An Even More Optimal Stochastic Optimization Algorithm: Minibatching and Interpolation Learning

STEM: AStochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning

STEM: AStochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning

SM

STEM: A Stochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning

Wasserstein Dependency Measure for Representation Learning

A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization

Adaptive Methods for Nonconvex Optimization

4d0b954f0bef437c29dfa73fafdf3fa5-Paper.pdf

ff1418e8cc993fe8abcfe3ce2003e5c5-Supplemental.pdf