Micikevicius, Paulius
Microscaling Data Formats for Deep Learning
Rouhani, Bita Darvish, Zhao, Ritchie, More, Ankit, Hall, Mathew, Khodamoradi, Alireza, Deng, Summer, Choudhary, Dhruv, Cornea, Marius, Dellinger, Eric, Denolf, Kristof, Stosic, Dusan, Elango, Venmugil, Golub, Maximilian, Heinecke, Alexander, James-Roxby, Phil, Jani, Dharmesh, Kolhe, Gaurav, Langhammer, Martin, Li, Ada, Melnick, Levi, Mesmakhosroshahi, Maral, Rodriguez, Andres, Schulte, Michael, Shafipour, Rasoul, Shao, Lei, Siu, Michael, Dubey, Pradeep, Micikevicius, Paulius, Naumov, Maxim, Verrilli, Colin, Wittig, Ralph, Burger, Doug, Chung, Eric
Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications. This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements. MX formats balance the competing needs of hardware efficiency, model accuracy, and user friction. Empirical results on over two dozen benchmarks demonstrate the practicality of MX data formats as a drop-in replacement for baseline FP32 for AI inference and training with low user friction. We also show the first instance of training generative language models at sub-8-bit weights, activations, and gradients with minimal accuracy loss and no modifications to the training recipe.
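To make the block structure concrete, the following is a minimal sketch of per-block scaling with an INT8-like element type. It is not the MX specification: the 32-element block, the element range, the power-of-two scale-selection rule, and the function names are assumptions made purely for illustration, and MX's narrow floating-point element types are not modeled.

```python
import numpy as np

def quantize_mx_block(block, elem_max=127):
    """Hypothetical sketch of a Microscaling-style block quantizer.

    All values in the block share one power-of-two scale; each element is
    then stored as a narrow integer. (The MX specification's shared-exponent
    rule and element formats differ in detail; this is only an illustration.)
    """
    amax = np.max(np.abs(block))
    # Smallest power-of-two scale that brings the block into [-elem_max, elem_max].
    scale = 2.0 ** np.ceil(np.log2(amax / elem_max)) if amax > 0 else 1.0
    elems = np.clip(np.round(block / scale), -elem_max, elem_max).astype(np.int8)
    return scale, elems

def dequantize_mx_block(scale, elems):
    """Recover approximate values with one shared multiply per block."""
    return scale * elems.astype(np.float32)

block = np.random.randn(32).astype(np.float32)   # one 32-element block
scale, elems = quantize_mx_block(block)
recovered = dequantize_mx_block(scale, elems)
print(np.max(np.abs(block - recovered)))          # per-element quantization error
```

The key property illustrated is that only one scale is stored per block, so the per-element storage stays narrow while the shared scale preserves dynamic range.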
Accelerating Sparse Deep Neural Networks
Mishra, Asit, Latorre, Jorge Albericio, Pool, Jeff, Stosic, Darko, Stosic, Dusan, Venkatesh, Ganesh, Yu, Chong, Micikevicius, Paulius
As neural network model sizes have dramatically increased, so has the interest in various techniques to reduce their parameter counts and accelerate their execution. An active area of research in this field is sparsity - encouraging zero values in parameters that can then be discarded from storage or computations. While most research focuses on high levels of sparsity, there are challenges in universally maintaining model accuracy as well as achieving significant speedups over modern matrix-math hardware. To make sparsity adoption practical, the NVIDIA Ampere GPU architecture introduces sparsity support in its matrix-math units, Tensor Cores. We present the design and behavior of Sparse Tensor Cores, which exploit a 2:4 (50%) sparsity pattern that leads to twice the math throughput of dense matrix units. We also describe a simple workflow for training networks that both satisfy 2:4 sparsity pattern requirements and maintain accuracy, verifying it on a wide range of common tasks and model architectures. This workflow makes it easy to prepare accurate models for efficient deployment on Sparse Tensor Cores.
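As an illustration of the 2:4 pattern itself (not of NVIDIA's pruning tooling or the training workflow described above), a magnitude-based pruning sketch could look like the following; the function name and the use of NumPy are assumptions for illustration.

```python
import numpy as np

def prune_2_to_4(weights):
    """Illustrative magnitude-based 2:4 pruning.

    In every contiguous group of 4 values along the last dimension, keep
    the 2 largest-magnitude entries and zero the other 2, producing the
    50% structured sparsity pattern that Sparse Tensor Cores exploit.
    """
    w = np.asarray(weights, dtype=np.float32)
    assert w.shape[-1] % 4 == 0, "last dimension must be a multiple of 4"
    groups = w.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries in each group of four.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(8, 16).astype(np.float32)
w_sparse = prune_2_to_4(w)
# Every group of 4 now has at most 2 nonzero values.
assert (w_sparse.reshape(-1, 4) != 0).sum(axis=1).max() <= 2
```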
MLPerf Training Benchmark
Mattson, Peter, Cheng, Christine, Coleman, Cody, Diamos, Greg, Micikevicius, Paulius, Patterson, David, Tang, Hanlin, Wei, Gu-Yeon, Bailis, Peter, Bittorf, Victor, Brooks, David, Chen, Dehao, Dutta, Debojyoti, Gupta, Udit, Hazelwood, Kim, Hock, Andrew, Huang, Xinyuan, Jia, Bill, Kang, Daniel, Kanter, David, Kumar, Naveen, Liao, Jeffery, Narayanan, Deepak, Oguntebi, Tayo, Pekhimenko, Gennady, Pentecost, Lillian, Reddi, Vijay Janapa, Robie, Taylor, St. John, Tom, Wu, Carole-Jean, Xu, Lingjie, Young, Cliff, Zaharia, Matei
Machine learning is experiencing an explosion of software and hardware solutions, and needs industry-standard performance benchmarks to drive design and enable competitive evaluation. However, machine learning training presents a number of unique challenges to benchmarking that do not exist in other domains: (1) some optimizations that improve training throughput actually increase time to solution, (2) training is stochastic and time to solution has high variance, and (3) the software and hardware systems are so diverse that they cannot be fairly benchmarked with the same binary, code, or even hyperparameters. We present MLPerf, a machine learning benchmark that overcomes these challenges. We quantitatively evaluate the efficacy of MLPerf in driving community progress on performance and scalability across two rounds of results from multiple vendors.
Mixed Precision Training
Micikevicius, Paulius, Narang, Sharan, Alben, Jonah, Diamos, Gregory, Elsen, Erich, Garcia, David, Ginsburg, Boris, Houston, Michael, Kuchaiev, Oleksii, Venkatesh, Ganesh, Wu, Hao
Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase. We introduce a technique to train deep neural networks using half-precision floating point numbers. In our technique, weights, activations and gradients are stored in IEEE half-precision format. Half-precision floating point numbers have limited numerical range compared to single-precision numbers. We propose two techniques to handle this loss of information. Firstly, we recommend maintaining a single-precision copy of the weights that accumulates the gradients after each optimizer step; this copy is rounded to half-precision for the forward and backward passes. Secondly, we propose scaling the loss appropriately to handle the loss of information with half-precision gradients. We demonstrate that this approach works for a wide variety of models including convolutional neural networks, recurrent neural networks and generative adversarial networks. This technique works for large-scale models with more than 100 million parameters trained on large datasets. Using this approach, we can reduce the memory consumption of deep learning models by nearly 2x. In future processors, we can also expect a significant computation speedup using half-precision hardware units.
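A toy sketch of the two techniques, a single-precision master copy of the weights plus loss scaling, applied to a made-up linear least-squares problem, could look like the following. The model, the fixed loss-scale value, and the hyperparameters are illustrative assumptions only; the paper's actual recipe (including how the loss scale is chosen) is more involved.

```python
import numpy as np

def sgd_step_mixed_precision(master_w, x, y, loss_scale=128.0, lr=0.05):
    """Toy SGD step illustrating FP32 master weights plus loss scaling
    on a linear least-squares model (made up for illustration only)."""
    # Forward and backward passes use a half-precision copy of the weights.
    x_h = x.astype(np.float16)
    w_h = master_w.astype(np.float16)
    err = x_h @ w_h - y.astype(np.float16)

    # Scale the mean-squared-error gradient so small values stay
    # representable in FP16 (the paper scales the loss before backprop,
    # which has the same effect) ...
    grad_h = (x_h.T @ (2.0 * err) / np.float16(len(x))) * np.float16(loss_scale)

    # ... then unscale in FP32 and update the single-precision master copy.
    master_w -= lr * grad_h.astype(np.float32) / loss_scale
    return master_w

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8)).astype(np.float32)
true_w = rng.normal(size=8).astype(np.float32)
y = x @ true_w
w = np.zeros(8, dtype=np.float32)
for _ in range(300):
    w = sgd_step_mixed_precision(w, x, y)
print(np.abs(w - true_w).max())  # small residual error despite FP16 math
```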