Hybrid Dual-Batch and Cyclic Progressive Learning for Efficient Distributed Training

Lu, Kuan-Wei, Hong, Ding-Yong, Liu, Pangfeng, Wu, Jan-Jan

arXiv.org Artificial Intelligence

Distributed machine learning is critical for training deep learning models on large datasets with numerous parameters. Current research primarily focuses on leveraging additional hardware resources and powerful computing units to accelerate the training process. As a result, larger batch sizes are often employed to speed up training. However, training with large batch sizes can lead to lower accuracy due to poor generalization. To address this issue, we propose the dual-batch learning scheme, a distributed training method built on the parameter server framework. This approach maximizes training efficiency by utilizing the largest batch size that the hardware can support while incorporating a smaller batch size to enhance model generalization. By using two different batch sizes simultaneously, this method improves accuracy with minimal additional training time. Additionally, to mitigate the time overhead caused by dual-batch learning, we propose the cyclic progressive learning scheme. This technique repeatedly and gradually increases image resolution from low to high during training, thereby reducing training time. By combining cyclic progressive learning with dual-batch learning, our hybrid approach improves both model generalization and training efficiency. Experimental results with ResNet-18 demonstrate that, compared to conventional training methods, our approach improves accuracy by 3.3% while reducing training time by 10.1% on CIFAR-100, and further achieves a 34.8% reduction in training time on ImageNet.
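A minimal PyTorch-style sketch of the two ideas as described in the abstract: one update that combines a large batch (for throughput) with a small batch (for generalization), driven by a cyclic low-to-high resolution schedule. The batch sizes, resolution steps, and the way the two losses are combined are assumptions for illustration, not the authors' exact scheme.

```python
# Hypothetical sketch of dual-batch learning with a cyclic progressive
# resolution schedule, loosely following the abstract. Batch sizes,
# resolution steps, and the loss-combination rule are assumptions.
import torch
import torch.nn.functional as F
from torchvision import models

def cyclic_resolutions(num_epochs, low=64, high=224, cycle_len=10):
    """Yield an image resolution per epoch, ramping low -> high each cycle."""
    for epoch in range(num_epochs):
        t = (epoch % cycle_len) / max(cycle_len - 1, 1)
        yield int(low + t * (high - low))

def dual_batch_step(model, optimizer, large_batch, small_batch):
    """One update using both a large batch (throughput) and a small batch
    (generalization); here the two losses are simply averaged."""
    optimizer.zero_grad()
    for images, labels in (large_batch, small_batch):
        loss = F.cross_entropy(model(images), labels)
        (0.5 * loss).backward()          # accumulate gradients from both batches
    optimizer.step()

if __name__ == "__main__":
    model = models.resnet18(num_classes=100)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for res in cyclic_resolutions(num_epochs=3, low=32, high=64, cycle_len=3):
        # Dummy data at the current resolution; real code would rebuild the
        # DataLoaders (e.g. batch sizes 1024 and 128) each epoch.
        large = (torch.randn(64, 3, res, res), torch.randint(0, 100, (64,)))
        small = (torch.randn(8, 3, res, res), torch.randint(0, 100, (8,)))
        dual_batch_step(model, optimizer, large, small)
        print(f"resolution {res}: step done")
```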


Closed-Form Last Layer Optimization

Galashov, Alexandre, Da Costa, Nathaël, Xu, Liyuan, Hennig, Philipp, Gretton, Arthur

arXiv.org Machine Learning

Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal solution to the linear last layer weights is known in closed-form. We propose to leverage this during optimization, treating the last layer as a function of the backbone parameters, and optimizing solely for these parameters. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer. We adapt the method for the setting of stochastic gradient descent, by trading off the loss on the current batch against the accumulated information from previous batches. Further, we prove that, in the Neural Tangent Kernel regime, convergence of this method to an optimal solution is guaranteed. Finally, we demonstrate the effectiveness of our approach compared with standard SGD on a squared loss in several supervised tasks -- both regression and classification -- including Fourier Neural Operators and Instrumental Variable Regression.
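A minimal sketch of the alternating view described in the abstract: a closed-form ridge-regression solve for the linear last-layer weights under squared loss, followed by a gradient step on the backbone with those weights held fixed. The tiny MLP backbone, regularization strength, and learning rate are assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of alternating backbone SGD with a closed-form
# last-layer solve under squared loss, as outlined in the abstract.
# The toy backbone, ridge regularizer, and hyperparameters are assumptions.
import torch

torch.manual_seed(0)
X = torch.randn(256, 10)                      # inputs
Y = torch.sin(X.sum(dim=1, keepdim=True))     # regression targets

backbone = torch.nn.Sequential(
    torch.nn.Linear(10, 64), torch.nn.Tanh(), torch.nn.Linear(64, 32)
)
opt = torch.optim.SGD(backbone.parameters(), lr=1e-2)
ridge = 1e-3                                  # regularizer for the closed-form solve

for step in range(200):
    # 1) Closed-form last layer: W = (Phi^T Phi + ridge * I)^{-1} Phi^T Y
    with torch.no_grad():
        Phi = backbone(X)                     # features, shape (N, 32)
        A = Phi.T @ Phi + ridge * torch.eye(Phi.shape[1])
        W = torch.linalg.solve(A, Phi.T @ Y)  # shape (32, 1)

    # 2) Gradient step on the backbone, treating W as the current last layer.
    opt.zero_grad()
    pred = backbone(X) @ W
    loss = ((pred - Y) ** 2).mean()
    loss.backward()
    opt.step()

print(f"final squared loss: {loss.item():.4f}")
```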




A Proofs for Section 3

Neural Information Processing Systems

The lemma is proven in Section D. First consider an even k. This, together with (37), completes the proof of (23). C.1 Proof of Theorem 5: Recall we let a D.1 Proof of Lemma 1: We show the following more general result. The proof is a simple exercise in linear algebra.


A Beam Search Algorithm

Neural Information Processing Systems

Algorithm 1 demonstrates the step-by-step operations of our beam search algorithm (see Sec. 4.3). We consider recovering sentences in the current work and leave recovering longer paragraphs as future work. We keep 2000 examples of each dataset as the evaluation set and use the rest for training. "End-to-End optimization", "Reg" means the inclusion of a regularization term, "DR" refers to a discrete token Our approach is unique in that it does not rely on end-to-end optimization and is demonstrated on large batch sizes (i.e.
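The snippet above references the paper's Algorithm 1 without showing it; as a generic illustration only, here is a minimal beam search over token sequences under an arbitrary scoring function. The scoring function, vocabulary, beam width, and sequence length are placeholders and do not reproduce the paper's algorithm.

```python
# Generic beam search sketch; score_fn, vocab, beam width, and max_len
# are placeholders, not the paper's Algorithm 1.
from typing import Callable, List, Tuple

def beam_search(
    vocab: List[str],
    score_fn: Callable[[List[str]], float],
    beam_width: int = 3,
    max_len: int = 5,
) -> List[Tuple[List[str], float]]:
    """Keep the `beam_width` highest-scoring partial sequences at each step."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_len):
        candidates = [
            (seq + [tok], score_fn(seq + [tok]))
            for seq, _ in beams
            for tok in vocab
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Toy usage: prefer sequences containing the token "data".
if __name__ == "__main__":
    vocab = ["the", "data", "model", "train"]
    best = beam_search(vocab, lambda seq: float(seq.count("data")),
                       beam_width=2, max_len=3)
    for seq, score in best:
        print(score, " ".join(seq))
```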