AI and Compute

#artificialintelligence

We're releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5-month doubling time (by comparison, Moore's Law had an 18-month doubling period). Since 2012, this metric has grown by more than 300,000x (an 18-month doubling period would yield only a 12x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it's worth preparing for the implications of systems far outside today's capabilities. The chart shows the total amount of compute, in petaflop/s-days, that was used to train selected results that are relatively well known, used a lot of compute for their time, and gave enough information to estimate the compute used. A petaflop/s-day (pfs-day) consists of performing 10^15 neural net operations per second for one day, or a total of about 10^20 operations.
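The quoted figures fit together; as a quick back-of-the-envelope sketch (not part of the original post), the pfs-day definition and the two growth rates can be checked in a few lines of Python:

```python
import math

# A petaflop/s-day: 10^15 neural-net operations per second for one day.
pfs_day_ops = 1e15 * 24 * 3600        # ~8.64e19, i.e. roughly 10^20 operations

# How long does a 3.5-month doubling time take to reach 300,000x?
doublings = math.log2(300_000)        # ~18.2 doublings
span_months = doublings * 3.5         # ~64 months, a bit over 5 years

# Over that same span, an 18-month (Moore's-Law-like) doubling time
# yields only about a 12x increase.
moore_factor = 2 ** (span_months / 18)   # ~11.6x

print(f"{pfs_day_ops:.2e} ops per pfs-day, "
      f"{span_months:.0f} months to reach 300,000x, "
      f"Moore's-Law growth over the same span: {moore_factor:.1f}x")
```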


Benchmarking TPU, GPU, and CPU Platforms for Deep Learning

arXiv.org Machine Learning

Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.
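As a rough illustration of what a parameterized benchmark generator does (a hypothetical sketch, not ParaDnn's actual interface), one can sweep layer counts and widths to produce a family of fully connected models whose training throughput is then measured on each platform:

```python
# Hypothetical sketch of parameterized FC-model generation in the spirit of
# ParaDnn; the real suite also covers CNNs and RNNs and defines its own
# parameter ranges and end-to-end training loops.
import itertools
import torch.nn as nn

def make_fc_model(num_layers: int, width: int,
                  input_dim: int = 8192, num_classes: int = 1000) -> nn.Module:
    """Build a fully connected network with the given depth and width."""
    layers, dim = [], input_dim
    for _ in range(num_layers):
        layers += [nn.Linear(dim, width), nn.ReLU()]
        dim = width
    layers.append(nn.Linear(dim, num_classes))
    return nn.Sequential(*layers)

# Sweeping hyperparameters yields a grid of models of varying size,
# which is what lets the benchmark expose platform-specific bottlenecks.
models = {(depth, width): make_fc_model(depth, width)
          for depth, width in itertools.product([4, 8, 16], [1024, 4096])}
```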


AI's compute hunger outpaces Moore's law

#artificialintelligence

Demand for compute to train artificial intelligence models has shot up enormously over the past six years and is showing no signs of slowing down. Not-for-profit research firm OpenAI - which is sponsored by Peter Thiel, Elon Musk, Microsoft and Amazon Web Services, among others - published an analysis showing that the amount of compute used for the largest AI training runs has doubled every three and a half months since 2012. This means compute amounts have grown by more than 300,000 times over the past six years, OpenAI said. In comparison, the well-known Moore's Law, which observed that the number of transistors in an integrated circuit doubles every year and a half, would yield only a twelve-fold increase over the same period. Part of the reason AI models still have enough compute is the use of massively parallel video cards, or graphics processing units (GPUs), which can have thousands of cores per unit.


AccUDNN: A GPU Memory Efficient Accelerator for Training Ultra-deep Deep Neural Networks

arXiv.org Artificial Intelligence

Typically, an ultra-deep neural network (UDNN) tends to yield a high-quality model, but its training process is usually resource-intensive and time-consuming. The scarce DRAM capacity of modern GPUs is the primary bottleneck that hinders both the trainability and the training efficiency of UDNNs. In this paper, we present "AccUDNN", an accelerator that aims to make the utmost use of finite GPU memory resources to speed up the training process of UDNNs. AccUDNN mainly includes two modules: a memory optimizer and a hyperparameter tuner. The memory optimizer develops a performance-model-guided dynamic swap-out/in strategy: by offloading appropriate data to host memory, the GPU memory footprint can be significantly reduced, overcoming the restriction on the trainability of UDNNs. After applying the memory optimization strategy, the hyperparameter tuner explores the efficiency-optimal minibatch size and the matched learning rate. Evaluations demonstrate that AccUDNN cuts down the GPU memory requirement of ResNet-152 from more than 24 GB to 8 GB. In turn, given a 12 GB GPU memory budget, the efficiency-optimal minibatch size can reach 4.2x larger than with the original Caffe. Benefiting from better utilization of a single GPU's computing resources and fewer parameter synchronizations thanks to the large minibatch size, a 7.7x speed-up is achieved on an 8-GPU cluster without any communication optimization and with no accuracy loss.
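As a loose illustration of the swap-out/in idea (not AccUDNN's Caffe-based, performance-model-guided implementation), PyTorch's save-on-CPU hook offloads saved activations to host memory during the forward pass and copies them back for the backward pass:

```python
# Illustrative only: offloading saved activations to host memory, in the
# spirit of AccUDNN's swap-out/in strategy (the paper's memory optimizer
# decides what and when to offload using a performance model inside Caffe).
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model = nn.Sequential(*[nn.Sequential(nn.Linear(2048, 2048), nn.ReLU())
                        for _ in range(32)]).to(device)
x = torch.randn(64, 2048, device=device)

# save_on_cpu moves tensors saved for backward to host memory and swaps
# them back in when gradients are computed, shrinking the GPU memory
# footprint at the cost of extra PCIe transfers.
with torch.autograd.graph.save_on_cpu(pin_memory=use_cuda):
    loss = model(x).pow(2).mean()
loss.backward()
```

With the footprint reduced this way, a larger minibatch fits in the freed GPU memory, which is what the hyperparameter tuner then exploits by pairing the bigger batch with a matched learning rate.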


How the "bigger is better" mentality damages AI research

#artificialintelligence

Something you'll hear a lot is that the increasing availability of compute resources has paved the way for important advances in artificial intelligence. With access to powerful cloud computing platforms, AI researchers have been able to train larger neural networks in shorter timespans. This has enabled AI to make inroads in many fields such as computer vision, speech recognition, and natural language processing. But what you'll hear about less are the darker implications of the current direction of AI research. Currently, advances in AI are mostly tied to scaling deep learning models and creating neural networks with more layers and parameters.