How AI Training Scales

#artificialintelligence 

In the last few years AI researchers have had increasing success in speeding up neural network training through data-parallelism, which splits large batches of data across many machines. Researchers have successfully used batch sizes of tens of thousands for image classification and language modeling, and even millions for RL agents that play the game Dota 2. These large batches allow increasing amounts of compute to be efficiently poured into the training of a single model, and are an important enabler of the fast growth in AI training compute. However, batch sizes that are too large show rapidly diminishing algorithmic returns, and it's not clear why these limits are larger for some tasks and smaller for others.[1] We have found that by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients,[2] we can approximately predict the maximum useful batch size. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training).
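To make the idea of a gradient signal-to-noise statistic concrete, here is a minimal sketch in Python with NumPy. It assumes one common form of the statistic, the "simple" noise scale, defined as the trace of the per-example gradient covariance divided by the squared norm of the true gradient. The function name `simple_noise_scale`, the input layout (one flattened gradient per row), and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def simple_noise_scale(per_example_grads):
    """Estimate tr(Sigma) / |G|^2 from per-example gradients.

    per_example_grads: array-like of shape (n_examples, n_params), one
    flattened gradient per row (a hypothetical layout chosen for this sketch).
    """
    g = np.asarray(per_example_grads, dtype=float)
    n = g.shape[0]
    mean_grad = g.mean(axis=0)

    # Noise: sum of unbiased per-coordinate variances, i.e. the covariance trace.
    trace_cov = g.var(axis=0, ddof=1).sum()

    # Signal: the squared norm of the sample mean overestimates |G|^2 by
    # roughly trace_cov / n, so subtract that bias.
    signal = mean_grad @ mean_grad - trace_cov / n
    if signal <= 0:
        # The gradient is indistinguishable from noise at this sample size.
        return float("inf")
    return trace_cov / signal


# Synthetic check (assumed data): true gradient of all ones in 10 dimensions,
# so |G|^2 = 10, plus independent noise with standard deviation 2 per
# coordinate, so tr(Sigma) = 40. The true noise scale is 4 by construction.
rng = np.random.default_rng(0)
grads = 1.0 + 2.0 * rng.standard_normal((20000, 10))
estimate = simple_noise_scale(grads)
print(f"estimated noise scale: {estimate:.2f}")
```

Under this definition, a batch much smaller than the noise scale yields gradient estimates dominated by noise, so adding examples helps almost linearly; a batch much larger than it mostly re-measures a gradient that is already well estimated, which is one way to read the diminishing returns described above.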
