AITopics | batch training

Collaborating Authors

batch training

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Neural Information Processing SystemsNov-21-2025, 15:53:26 GMT

Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance - known as the generalization gap phenomenon. Identifying the origin of this gap and closing it had remained an open problem. Contributions: We examine the initial high learning rate training phase.

batch training, generalization gap, name change, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.59)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.59)

Add feedback

Reviews: Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Neural Information Processing SystemsOct-8-2024, 05:41:51 GMT

I think the paper provides some clarity on a topic that has seen a bit of attention lately, namely that of the role of noise in optimization and in particular the hypothesis of sharp minima/flat minima. From this perspective I think this datapoint is important for our collective understanding of training deep networks. I don't think the observation made by the authors come as a surprise to anyone with experience with these models, however the final conclusion might. We know that when having large minibatches we have lower variance and hence we should use a larger learning rate, etc. I think one practical issue that people have got stuck in the past is that with larger minibatches the computational cost of any given gradient increases.

generalization gap, minibatch, minima, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.51)

Add feedback

Accelerating Large Batch Training via Gradient Signal to Noise Ratio (GSNR)

Jiang, Guo-qing, Liu, Jinlong, Ding, Zixiang, Guo, Lin, Lin, Wei

arXiv.org Artificial IntelligenceSep-24-2023

As models for nature language processing (NLP), computer vision (CV) and recommendation systems (RS) require surging computation, a large number of GPUs/TPUs are paralleled as a large batch (LB) to improve training throughput. However, training such LB tasks often meets large generalization gap and downgrades final precision, which limits enlarging the batch size. In this work, we develop the variance reduced gradient descent technique (VRGD) based on the gradient signal to noise ratio (GSNR) and apply it onto popular optimizers such as SGD/Adam/LARS/LAMB. We carry out a theoretical analysis of convergence rate to explain its fast training dynamics, and a generalization analysis to demonstrate its smaller generalization gap on LB training. Comprehensive experiments demonstrate that VRGD can accelerate training ($1\sim 2 \times$), narrow generalization gap and improve final accuracy. We push the batch size limit of BERT pretraining up to 128k/64k and DLRM to 512k without noticeable accuracy loss. We improve ImageNet Top-1 accuracy at 96k by $0.52pp$ than LARS. The generalization gap of BERT and ImageNet training is significantly reduce by over $65\%$.

batch training, gradient signal, noise ratio, (1 more...)

arXiv.org Artificial Intelligence

2309.13681

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence (1.00)

Add feedback

Large Deviations for Accelerating Neural Networks Training

Guggilam, Sreelekha, Chandola, Varun, Patra, Abani

arXiv.org Artificial IntelligenceMar-1-2023

Artificial neural networks (ANNs) require tremendous amount of data to train on. However, in classification models, most data features are often similar which can lead to increase in training time without significant improvement in the performance. Thus, we hypothesize that there could be a more efficient way to train an ANN using a better representative sample. For this, we propose the LAD Improved Iterative Training (LIIT), a novel training approach for ANN using large deviations principle to generate and iteratively update training samples in a fast and efficient setting. This is exploratory work with extensive opportunities for future work. The thesis presents this ongoing research work with the following contributions from this study: (1) We propose a novel ANN training method, LIIT, based on the large deviations theory where additional dimensionality reduction is not needed to study high dimensional data. (2) The LIIT approach uses a Modified Training Sample (MTS) that is generated and iteratively updated using a LAD anomaly score based sampling strategy. (3) The MTS sample is designed to be well representative of the training data by including most anomalous of the observations in each class. This ensures distinct patterns and features are learnt with smaller samples. (4) We study the classification performance of the LIIT trained ANNs with traditional batch trained counterparts.

artificial intelligence, machine learning, training data, (18 more...)

arXiv.org Artificial Intelligence

2303.00954

Country: North America > United States > New York (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU

Zheng, Zangwei, Xu, Pengtai, Zou, Xuan, Tang, Da, Li, Zhen, Xi, Chenguang, Wu, Peng, Zou, Leqi, Zhu, Yijie, Chen, Ming, Ding, Xiangzhuo, Xue, Fuzhao, Qin, Ziheng, Cheng, Youlong, You, Yang

arXiv.org Artificial IntelligenceNov-30-2022

The click-through rate (CTR) prediction task is to predict whether a user will click on the recommended item. As mind-boggling amounts of data are produced online daily, accelerating CTR prediction model training is critical to ensuring an up-to-date model and reducing the training cost. One approach to increase the training speed is to apply large batch training. However, as shown in computer vision and natural language processing tasks, training with a large batch easily suffers from the loss of accuracy. Our experiments show that previous scaling rules fail in the training of CTR prediction neural networks. To tackle this problem, we first theoretically show that different frequencies of ids make it challenging to scale hyperparameters when scaling the batch size. To stabilize the training process in a large batch size setting, we develop the adaptive Column-wise Clipping (CowClip). It enables an easy and effective scaling rule for the embeddings, which keeps the learning rate unchanged and scales the L2 loss. We conduct extensive experiments with four CTR prediction networks on two real-world datasets and successfully scaled 128 times the original batch size without accuracy loss. In particular, for CTR prediction model DeepFM training on the Criteo dataset, our optimization framework enlarges the batch size from 1K to 128K with over 0.1% AUC improvement and reduces training time from 12 hours to 10 minutes on a single V100 GPU. Our code locates at https://github.com/bytedance/LargeBatchCTR.

artificial intelligence, batch size, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2204.0624

Country:

Asia > Singapore (0.04)
North America > Canada > British Columbia (0.04)

Genre: Research Report (0.64)

Industry: Information Technology > Services (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

A Closer Look at the Generalization Gap in Large Batch Training of Neural Networks

#artificialintelligenceSep-16-2020, 21:50:17 GMT

Deep learning architectures such as recurrent neural networks and convolutional neural networks have seen many significant improvements and have been applied in the fields of computer vision, speech recognition, natural language processing, audio recognition and more. The most commonly used optimization method for training highly complex and non-convex DNNs is stochastic gradient descent (SGD) or some variant of it. DNNs however typically have some non-convex objective functions which are a bit difficult optimize with SGD. Thus, SGD, at best, finds a local minimum of this objective function. Although the solutions of DNNs are a local minima, they have produced great end results.

artificial intelligence, batch size, machine learning, (16 more...)

#artificialintelligence

AI-Alerts: 2020 > 2020-09 > AAAI AI-Alert for Sep 22, 2020 (1.00)

Country: Asia > Middle East > Israel (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Stochastic Normalized Gradient Descent with Momentum for Large Batch Training

Zhao, Shen-Yi, Xie, Yin-Peng, Li, Wu-Jun

arXiv.org Machine LearningJul-28-2020

Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning. Compared with small batch training, SGD with large batch training can better utilize the computational power of current multi-core systems like GPUs and can reduce the number of communication rounds in distributed training. Hence, SGD with large batch training has attracted more and more attention. However, existing empirical results show that large batch training typically leads to a drop of generalization accuracy. As a result, large batch training has also become a challenging topic. In this paper, we propose a novel method, called stochastic normalized gradient descent with momentum (SNGM), for large batch training. We theoretically prove that compared to momentum SGD (MSGD) which is one of the most widely used variants of SGD, SNGM can adopt a larger batch size to converge to the $\epsilon$-stationary point with the same computation complexity (total number of gradient computation). Empirical results on deep learning also show that SNGM can achieve the state-of-the-art accuracy with a large batch size.

artificial intelligence, batch size, machine learning, (15 more...)

arXiv.org Machine Learning

2007.13985

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Hoffer, Elad, Hubara, Itay, Soudry, Daniel

Neural Information Processing SystemsFeb-14-2020, 08:42:56 GMT

Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance - known as the "generalization gap" phenomenon. Identifying the origin of this gap and closing it had remained an open problem. Contributions: We examine the initial high learning rate training phase.

batch training, generalization gap, neural network, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.61)

Add feedback

At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?

Giladi, Niv, Nacson, Mor Shpigel, Hoffer, Elad, Soudry, Daniel

arXiv.org Machine LearningSep-26-2019

Background: Recent developments have made it possible to accelerate neural networks training significantly using large batch sizes and data parallelism. Training in an asynchronous fashion, where delay occurs, can make training even more scalable. However, asynchronous training has its pitfalls, mainly a degradation in generalization, even after convergence of the algorithm. This gap remains not well understood, as theoretical analysis so far mainly focused on the convergence rate of asynchronous methods. Contributions: We examine asynchronous training from the perspective of dynamical stability. We find that the degree of delay interacts with the learning rate, to change the set of minima accessible by an asynchronous stochastic gradient descent algorithm. We derive closed-form rules on how the learning rate could be changed, while keeping the accessible set the same. Specifically, for high delay values, we find that the learning rate should be kept inversely proportional to the delay. We then extend this analysis to include momentum. We find momentum should be either turned off, or modified to improve training stability. We provide empirical experiments to validate our theoretical findings.

equation, momentum, stability, (16 more...)

arXiv.org Machine Learning

1909.1234

Country:

Asia > Middle East > Israel > Haifa District > Haifa (0.04)
Asia > India (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Reinforcement Learning Tutorial Part 3: Basic Deep Q-Learning

#artificialintelligenceMar-13-2019, 03:32:21 GMT

In part 1 we introduced Q-learning as a concept with a pen and paper example. In part 2 we implemented the example in code and demonstrated how to execute it in the cloud. In this third part, we will move our Q-learning approach from a Q-table to a deep neural net. With Q-table, your memory requirement is an array of states x actions. For the state-space of 5 and action-space of 2, the total memory consumption is 2 x 5 10.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback