Hybrid Dual-Batch and Cyclic Progressive Learning for Efficient Distributed Training

Lu, Kuan-Wei, Hong, Ding-Yong, Liu, Pangfeng, Wu, Jan-Jan

arXiv.org Artificial Intelligence

Distributed machine learning is critical for training deep learning models on large datasets with numerous parameters. Current research primarily focuses on leveraging additional hardware resources and powerful computing units to accelerate the training process. As a result, larger batch sizes are often employed to speed up training. However, training with large batch sizes can lead to lower accuracy due to poor generalization. To address this issue, we propose the dual-batch learning scheme, a distributed training method built on the parameter server framework. This approach maximizes training efficiency by utilizing the largest batch size that the hardware can support while incorporating a smaller batch size to enhance model generalization. By using two different batch sizes simultaneously, this method improves accuracy with minimal additional training time. Additionally, to mitigate the time overhead caused by dual-batch learning, we propose the cyclic progressive learning scheme. This technique repeatedly and gradually increases image resolution from low to high during training, thereby reducing training time. By combining cyclic progressive learning with dual-batch learning, our hybrid approach improves both model generalization and training efficiency. Experimental results with ResNet-18 demonstrate that, compared to conventional training methods, our approach improves accuracy by 3.3% while reducing training time by 10.1% on CIFAR-100, and further achieves a 34.8% reduction in training time on ImageNet.
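A minimal PyTorch-style sketch of the two ideas as described in the abstract: one update that combines a large batch (for throughput) with a small batch (for generalization), driven by a cyclic low-to-high resolution schedule. The batch sizes, resolution steps, and the way the two losses are combined are assumptions for illustration, not the authors' exact scheme.

```python
# Hypothetical sketch of dual-batch learning with a cyclic progressive
# resolution schedule, loosely following the abstract. Batch sizes,
# resolution steps, and the loss-combination rule are assumptions.
import torch
import torch.nn.functional as F
from torchvision import models

def cyclic_resolutions(num_epochs, low=64, high=224, cycle_len=10):
    """Yield an image resolution per epoch, ramping low -> high each cycle."""
    for epoch in range(num_epochs):
        t = (epoch % cycle_len) / max(cycle_len - 1, 1)
        yield int(low + t * (high - low))

def dual_batch_step(model, optimizer, large_batch, small_batch):
    """One update using both a large batch (throughput) and a small batch
    (generalization); here the two losses are simply averaged."""
    optimizer.zero_grad()
    for images, labels in (large_batch, small_batch):
        loss = F.cross_entropy(model(images), labels)
        (0.5 * loss).backward()          # accumulate gradients from both batches
    optimizer.step()

if __name__ == "__main__":
    model = models.resnet18(num_classes=100)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for res in cyclic_resolutions(num_epochs=3, low=32, high=64, cycle_len=3):
        # Dummy data at the current resolution; real code would rebuild the
        # DataLoaders (e.g. batch sizes 1024 and 128) each epoch.
        large = (torch.randn(64, 3, res, res), torch.randint(0, 100, (64,)))
        small = (torch.randn(8, 3, res, res), torch.randint(0, 100, (8,)))
        dual_batch_step(model, optimizer, large, small)
        print(f"resolution {res}: step done")
```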


Closed-Form Last Layer Optimization

Galashov, Alexandre, Da Costa, Nathaël, Xu, Liyuan, Hennig, Philipp, Gretton, Arthur

arXiv.org Machine Learning

Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal solution to the linear last layer weights is known in closed-form. We propose to leverage this during optimization, treating the last layer as a function of the backbone parameters, and optimizing solely for these parameters. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer. We adapt the method for the setting of stochastic gradient descent, by trading off the loss on the current batch against the accumulated information from previous batches. Further, we prove that, in the Neural Tangent Kernel regime, convergence of this method to an optimal solution is guaranteed. Finally, we demonstrate the effectiveness of our approach compared with standard SGD on a squared loss in several supervised tasks -- both regression and classification -- including Fourier Neural Operators and Instrumental Variable Regression.
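A minimal sketch of the alternating view described in the abstract: a closed-form ridge-regression solve for the linear last-layer weights under squared loss, followed by a gradient step on the backbone with those weights held fixed. The tiny MLP backbone, regularization strength, and learning rate are assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of alternating backbone SGD with a closed-form
# last-layer solve under squared loss, as outlined in the abstract.
# The toy backbone, ridge regularizer, and hyperparameters are assumptions.
import torch

torch.manual_seed(0)
X = torch.randn(256, 10)                      # inputs
Y = torch.sin(X.sum(dim=1, keepdim=True))     # regression targets

backbone = torch.nn.Sequential(
    torch.nn.Linear(10, 64), torch.nn.Tanh(), torch.nn.Linear(64, 32)
)
opt = torch.optim.SGD(backbone.parameters(), lr=1e-2)
ridge = 1e-3                                  # regularizer for the closed-form solve

for step in range(200):
    # 1) Closed-form last layer: W = (Phi^T Phi + ridge * I)^{-1} Phi^T Y
    with torch.no_grad():
        Phi = backbone(X)                     # features, shape (N, 32)
        A = Phi.T @ Phi + ridge * torch.eye(Phi.shape[1])
        W = torch.linalg.solve(A, Phi.T @ Y)  # shape (32, 1)

    # 2) Gradient step on the backbone, treating W as the current last layer.
    opt.zero_grad()
    pred = backbone(X) @ W
    loss = ((pred - Y) ** 2).mean()
    loss.backward()
    opt.step()

print(f"final squared loss: {loss.item():.4f}")
```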




A Proofs for Section 3

Neural Information Processing Systems

The lemma is proven in Section D. First consider an even k. This, together with (37), completes the proof of (23). C.1 Proof of Theorem 5: Recall we let a D.1 Proof of Lemma 1: We show the following more general result. The proof is a simple exercise in linear algebra.


A Beam Search Algorithm

Neural Information Processing Systems

Algorithm 1 demonstrates the step-by-step operations of our beam search algorithm (see Sec. 4.3). We consider recovering sentences in the current work and leave recovering longer paragraphs as future work. We keep 2000 examples of each dataset as the evaluation set and use the rest for training. "End-to-End optimization", "Reg" means the inclusion of a regularization term, "DR" refers to a discrete token Our approach is unique in that it does not rely on end-to-end optimization and is demonstrated on large batch sizes (i.e.
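The snippet above references the paper's Algorithm 1 without showing it; as a generic illustration only, here is a minimal beam search over token sequences under an arbitrary scoring function. The scoring function, vocabulary, beam width, and sequence length are placeholders and do not reproduce the paper's algorithm.

```python
# Generic beam search sketch; score_fn, vocab, beam width, and max_len
# are placeholders, not the paper's Algorithm 1.
from typing import Callable, List, Tuple

def beam_search(
    vocab: List[str],
    score_fn: Callable[[List[str]], float],
    beam_width: int = 3,
    max_len: int = 5,
) -> List[Tuple[List[str], float]]:
    """Keep the `beam_width` highest-scoring partial sequences at each step."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_len):
        candidates = [
            (seq + [tok], score_fn(seq + [tok]))
            for seq, _ in beams
            for tok in vocab
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Toy usage: prefer sequences containing the token "data".
if __name__ == "__main__":
    vocab = ["the", "data", "model", "train"]
    best = beam_search(vocab, lambda seq: float(seq.count("data")),
                       beam_width=2, max_len=3)
    for seq, score in best:
        print(score, " ".join(seq))
```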