Hybrid Dual-Batch and Cyclic Progressive Learning for Efficient Distributed Training
Lu, Kuan-Wei, Hong, Ding-Yong, Liu, Pangfeng, Wu, Jan-Jan
Distributed machine learning is critical for training deep learning models on large datasets with numerous parameters. Current research primarily focuses on leveraging additional hardware resources and powerful computing units to accelerate the training process. As a result, larger batch sizes are often employed to speed up training. However, training with large batch sizes can lead to lower accuracy due to poor generalization. To address this issue, we propose the dual-batch learning scheme, a distributed training method built on the parameter server framework. This approach maximizes training efficiency by utilizing the largest batch size that the hardware can support while incorporating a smaller batch size to enhance model generalization. By using two different batch sizes simultaneously, this method improves accuracy with minimal additional training time. Additionally, to mitigate the time overhead caused by dual-batch learning, we propose the cyclic progressive learning scheme. This technique repeatedly and gradually increases image resolution from low to high during training, thereby reducing training time. By combining cyclic progressive learning with dual-batch learning, our hybrid approach improves both model generalization and training efficiency. Experimental results with ResNet-18 demonstrate that, compared to conventional training methods, our approach improves accuracy by 3.3% while reducing training time by 10.1% on CIFAR-100, and further achieves a 34.8% reduction in training time on ImageNet.
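The dual-batch idea described above can be sketched in a few lines of NumPy. Everything below is an illustrative assumption, not the paper's exact method: a toy linear-regression task stands in for a deep model, and the gradients from the large (hardware-max) batch and the small (generalization) batch are simply averaged 50/50 each step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression task: y = X @ w_true + noise.
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(2048, 2))
y = X @ w_true + 0.01 * rng.normal(size=2048)

def grad(w, Xb, yb):
    """Mean-squared-error gradient on one mini-batch."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

def dual_batch_sgd(w, lr=0.05, steps=300, large=512, small=32):
    """Each step combines a large-batch gradient (throughput) with a
    small-batch gradient (generalization noise).  The 50/50 mixing
    weight is an illustrative choice, not the paper's rule."""
    for _ in range(steps):
        idx_l = rng.choice(len(X), size=large, replace=False)
        idx_s = rng.choice(len(X), size=small, replace=False)
        g = 0.5 * grad(w, X[idx_l], y[idx_l]) + 0.5 * grad(w, X[idx_s], y[idx_s])
        w = w - lr * g
    return w

w = dual_batch_sgd(np.zeros(2))
```

In a real parameter-server deployment the two batches would be processed by different workers and the server would merge their gradients; the single-process loop above only illustrates the combined update.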
Closed-Form Last Layer Optimization
Galashov, Alexandre, Da Costa, Nathaël, Xu, Liyuan, Hennig, Philipp, Gretton, Arthur
Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal solution to the linear last layer weights is known in closed-form. We propose to leverage this during optimization, treating the last layer as a function of the backbone parameters, and optimizing solely for these parameters. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer. We adapt the method for the setting of stochastic gradient descent, by trading off the loss on the current batch against the accumulated information from previous batches. Further, we prove that, in the Neural Tangent Kernel regime, convergence of this method to an optimal solution is guaranteed. Finally, we demonstrate the effectiveness of our approach compared with standard SGD on a squared loss in several supervised tasks -- both regression and classification -- including Fourier Neural Operators and Instrumental Variable Regression.
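The alternation the abstract describes (gradient steps on the backbone, closed-form updates on the last layer) can be sketched with a toy NumPy model. The one-tanh-layer "backbone", the ridge coefficient `lam`, and the learning rate are illustrative assumptions; the closed-form step is the standard regularized least-squares solution for a linear last layer under squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (sizes and target are illustrative).
X = rng.normal(size=(256, 3))
y = np.sin(X @ np.array([1.0, -2.0, 0.5]))[:, None]

W1 = 0.5 * rng.normal(size=(3, 16))       # backbone: one tanh layer

def features(W1):
    return np.tanh(X @ W1)

def solve_last_layer(Phi, lam=1e-3):
    """Closed-form ridge solution for the linear last layer under
    squared loss: w* = (Phi^T Phi + lam*I)^{-1} Phi^T y."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)

def eval_mse(W1):
    Phi = features(W1)
    return float(np.mean((Phi @ solve_last_layer(Phi) - y) ** 2))

mse_before = eval_mse(W1)
lr = 0.05
for _ in range(200):
    Phi = features(W1)
    w2 = solve_last_layer(Phi)             # closed-form last-layer update
    resid = Phi @ w2 - y                   # (n, 1) residuals
    dPhi = 2.0 * resid @ w2.T / len(y)     # dLoss/dPhi with w2 held fixed
    dW1 = X.T @ (dPhi * (1.0 - Phi ** 2))  # backprop through tanh
    W1 -= lr * dW1                         # gradient step on the backbone
mse_after = eval_mse(W1)
```

Note this sketch takes the plain alternating form; the paper's stochastic variant, which trades off the current batch against accumulated information from previous batches, is not modeled here.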
A Beam Search Algorithm
Algorithm 1 gives the step-by-step operations of our beam search algorithm (see Sec. 4.3). In the current work we consider recovering sentences; we leave recovering longer paragraphs as future work. We hold out 2,000 examples from each dataset as the evaluation set and use the rest for training. In the result tables, "End-to-End" refers to end-to-end optimization, "Reg" to the inclusion of a regularization term, and "DR" to a discrete token … Our approach is unique in that it does not rely on end-to-end optimization and is demonstrated on large batch sizes (i.e. …
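The paper's Algorithm 1 is not reproduced in this excerpt, so the following is only a generic beam search sketch over token sequences. The tiny vocabulary and the `next_log_probs` scorer are hypothetical stand-ins for the real model that scores candidate sentences during recovery; the beam width and length cap are arbitrary.

```python
import math

# Hypothetical toy scorer: log-probabilities for the next token given
# a prefix, rigged to favor one fixed target sentence so the search
# has something to find.
VOCAB = ["the", "cat", "sat", "mat", "<eos>"]
TARGET = ["the", "cat", "sat", "<eos>"]

def next_log_probs(prefix):
    out = []
    for tok in VOCAB:
        good = len(prefix) < len(TARGET) and tok == TARGET[len(prefix)]
        out.append(math.log(0.6 if good else 0.1))
    return out

def beam_search(beam_width=2, max_len=6):
    """Keep the beam_width highest-scoring prefixes at each step."""
    beams = [([], 0.0)]                        # (tokens, summed log-prob)
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            if toks and toks[-1] == "<eos>":   # finished hypotheses persist
                candidates.append((toks, score))
                continue
            for tok, lp in zip(VOCAB, next_log_probs(toks)):
                candidates.append((toks + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t and t[-1] == "<eos>" for t, _ in beams):
            break
    return beams[0][0]

best = beam_search()
```

With the rigged scorer the search recovers the target sequence; in the actual sentence-recovery setting the scorer would come from the attacked model rather than a fixed table.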