AITopics | Statistical Learning

Collaborating Authors

Statistical Learning

News Overviews Instructional Materials AI-Alerts Classics

On the Role of Batch Size in Stochastic Conditional Gradient Methods

Islamov, Rustem, Machacek, Roman, Lucchi, Aurelien, Silveti-Falls, Antonio, Gorbunov, Eduard, Cevher, Volkan

arXiv.org Machine LearningMar-24-2026

We study the role of batch size in stochastic conditional gradient methods under a $μ$-Kurdyka-Łojasiewicz ($μ$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Machine Learning

2603.21191

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)
Asia > Middle East > UAE (0.04)
(6 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Optimal Cluster Recovery in the Labeled Stochastic Block Model

Se-Young Yun, Alexandre Proutiere

Neural Information Processing SystemsMar-23-2026, 20:54:05 GMT

We consider the problem of community detection or clustering in the labeled Stochastic Block Model (LSBM) with a finite number K of clusters of sizes linearly growing with the global population of items n. Every pair of items is labeled independently at random, and label ` appears with probability p(i,j,`) between two items in clusters indexed by iand j, respectively. The objective is to reconstruct the clusters from the observation of these random labels. Clustering under the SBM and their extensions has attracted much attention recently. Most existing work aimed at characterizing the set of parameters such that it is possible to infer clusters either positively correlated with the true clusters, or with a vanishing proportion of misclassified items, or exactly matching the true clusters. We find the set of parameters such that there exists a clustering algorithm with at most s misclassified items in average under the general LSBM and for any s = o(n), which solves one open problem raised in [2]. We further develop an algorithm, based on simple spectral methods, that achieves this fundamental performance limit within O(npolylog(n)) computations and without the a-priori knowledge of the model parameters.

algorithm, artificial intelligence, machine learning, (14 more...)

Neural Information Processing Systems

Country:

North America > United States (0.46)
Europe (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.36)

Add feedback

Differential Privacy without Sensitivity

Kentaro Minami, HItomi Arai, Issei Sato, Hiroshi Nakagawa

Neural Information Processing SystemsMar-23-2026, 20:53:12 GMT

The exponential mechanism is a general method to construct a randomized estimator that satisfies (ε,0)-differential privacy. Recently, Wang et al. showed that the Gibbs posterior, which is a data-dependent probability distribution that contains the Bayesian posterior, is essentially equivalent to the exponential mechanism under certain boundedness conditions on the loss function. While the exponential mechanism provides a way to build an (ε,0)-differential private algorithm, it requires boundedness of the loss function, which is quite stringent for some learning problems. In this paper, we focus on (ε,δ)-differential privacy of Gibbs posteriors with convex and Lipschitz loss functions. Our result extends the classical exponential mechanism, allowing the loss functions to have an unbounded sensitivity.

artificial intelligence, machine learning, privacy, (16 more...)

Neural Information Processing Systems

Country: Europe > Spain (0.14)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Add feedback

Graphons, mergeons, and so on!

Justin Eldridge, Mikhail Belkin, Yusu Wang

Neural Information Processing SystemsMar-23-2026, 20:52:16 GMT

In this work we develop a theory of hierarchical clustering for graphs. Our modeling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the "correct" clustering, give sufficient conditions in which a method is statistically consistent, and provide an explicit algorithm satisfying these properties.

artificial intelligence, graphon, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.91)

Add feedback

Scalable Adaptive Stochastic Optimization Using Random Projections

Gabriel Krummenacher, Brian McWilliams, Yannic Kilcher, Joachim M. Buhmann, Nicolai Meinshausen

Neural Information Processing SystemsMar-23-2026, 20:51:58 GMT

Adaptive stochastic gradient methods such as ADAGRAD have gained popularity in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal matrix approximation to second order information by accumulating past gradients which are used to tune the step size adaptively. In certain situations the full-matrix variant of ADAGRAD is expected to attain better performance, however in high dimensions it is computationally impractical.

artificial intelligence, deep learning, machine learning, (15 more...)

Neural Information Processing Systems

Country: Europe > Switzerland (0.14)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Learning Deep Parsimonious Representations

Renjie Liao, Alex Schwing, Richard Zemel, Raquel Urtasun

Neural Information Processing SystemsMar-23-2026, 20:51:45 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

A Credit Assignment Compiler for Joint Prediction

Kai-Wei Chang, He He, Stephane Ross, Hal Daume III, John Langford

Neural Information Processing SystemsMar-23-2026, 15:53:37 GMT

Neural Information Processing Systems http://nips.cc/

machine learning, natural language, prediction, (15 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

Nested Mini-Batch K-Means

James Newling, François Fleuret

Neural Information Processing SystemsMar-23-2026, 15:53:00 GMT

A new algorithm is proposed which accelerates the mini-batch k-means algorithm of Sculley (2010) by using the distance bounding approach of Elkan (2003). We argue that, when incorporating distance bounds into a mini-batch algorithm, already used data should preferentially be reused. To this end we propose using nested mini-batches, whereby data in a mini-batch at iteration t is automatically reused at iteration t+1. Using nested mini-batches presents two difficulties. The first is that unbalanced use of data can bias estimates, which we resolve by ensuring that each data sample contributes exactly once to centroids. The second is in choosing mini-batch sizes, which we address by balancing premature fine-tuning of centroids with redundancy induced slow-down. Experiments show that the resulting nmbatch algorithm is very effective, often arriving within 1% of the empirical minimum 100 earlier than the standard mini-batch algorithm.

artificial intelligence, centroid, machine learning, (14 more...)

Neural Information Processing Systems

Country: North America > United States (0.29)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back

Vitaly Feldman

Neural Information Processing SystemsMar-23-2026, 15:52:41 GMT

In stochastic convex optimization the goal is to minimize a convex function F(x) .

artificial intelligence, machine learning, sample complexity, (13 more...)

Neural Information Processing Systems

Country: Europe (0.28)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.95)

Add feedback

Coordinate-wise Power Method

Qi Lei, Kai Zhong, Inderjit S. Dhillon

Neural Information Processing SystemsMar-23-2026, 15:51:15 GMT

In this paper, we propose a coordinate-wise version of the power method from an optimization viewpoint. The vanilla power method simultaneously updates all the coordinates of the iterate, which is essential for its convergence analysis. However, different coordinates converge to the optimal value at different speeds. Our proposed algorithm, which we call coordinate-wise power method, is able to select and update the most important k coordinates in O(kn) time at each iteration, where n is the dimension of the matrix and k n is the size of the active set. Inspired by the "greedy" nature of our method, we further propose a greedy coordinate descent algorithm applied on a non-convex objective function specialized for symmetric matrices. We provide convergence analyses for both methods. Experimental results on both synthetic and real data show that our methods achieve up to 23 times speedup over the basic power method. Meanwhile, due to their coordinate-wise nature, our methods are very suitable for the important case when data cannot fit into memory. Finally, we introduce how the coordinatewise mechanism could be applied to other iterative methods that are used in machine learning.

artificial intelligence, machine learning, power method, (13 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Add feedback