Goto

Collaborating Authors

 Deng, Summer


Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression

arXiv.org Artificial Intelligence

DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications.

Deep Learning Recommendation Models (DLRMs) have significantly risen to prominence in both research and industry sectors in recent years. These models integrate sparse input embedding learning with neural network architectures, marking a notable advance over traditional collaborative filtering-based recommendation systems [1]. DLRMs have been successfully implemented in various industry applications, including product recommendation systems by Amazon [2], personalized [...]. As a result, they constitute a significant portion of deep learning applications across multiple industries. DLRMs are uniquely designed to process high-dimensional categorical features, typically represented by one- or multi-hot vectors matching the size of the category, which leads to significant data sparsity. [...] This setup necessitates the use of collective communication primitives for synchronization across all GPUs. Specifically, the partitioning of sparse embedding tables requires nodes to aggregate sparse embedding lookups during forward passes and their corresponding gradients during backward passes. Consequently, all-to-all communication is utilized in both forward and backward passes for synchronizing sparse lookups and gradients, while all-reduce is employed for synchronizing dense/MLP gradients during the backward pass. Exchanging these lookups and gradients across all GPUs during each minibatch iteration significantly increases communication overhead. Figure 1 shows that all-to-all communication accounts for more than 60% of the total training time for DLRM on an 8-node, 32 A100 GPU cluster (connected through a Slingshot interconnect).
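The hybrid-parallel communication pattern described above (all-to-all for sparse embedding lookups and their gradients, all-reduce for dense/MLP gradients) can be sketched with PyTorch's torch.distributed primitives. This is a minimal illustration, not the paper's implementation: the NCCL backend, torchrun-style launch, tensor shapes, and variable names are assumptions made only for the example.

# Minimal sketch of the DLRM communication pattern described above.
# Assumes one GPU per rank and a torchrun launch with the NCCL backend;
# shapes and the plain averaging are illustrative, not the paper's method.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    batch, dim = 8, 16  # illustrative local batch size and embedding dimension
    # Each rank owns a shard of the embedding tables (model parallelism) and
    # looks up the rows requested by every rank's samples.
    local_lookups = torch.randn(batch * world, dim, device=device)

    # Forward pass: all-to-all redistributes the lookups so each rank ends up
    # with the embeddings belonging to its own local batch.
    received = torch.empty_like(local_lookups)
    dist.all_to_all_single(received, local_lookups)

    # Dense/MLP layers run data-parallel; their gradients are averaged with
    # an all-reduce during the backward pass.
    mlp_grad = torch.randn(dim, dim, device=device)  # stand-in for a real gradient
    dist.all_reduce(mlp_grad, op=dist.ReduceOp.SUM)
    mlp_grad /= world

    # The sparse-embedding backward pass would use a second all-to-all in the
    # reverse direction to return gradients to the ranks owning the shards.
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Under this pattern the all-to-all payload grows with batch size, embedding dimension, and GPU count, which is why it can dominate training time (as in the Figure 1 measurement) and why the paper targets it with lossy compression.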


Microscaling Data Formats for Deep Learning

arXiv.org Artificial Intelligence

Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications. This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements. MX formats balance the competing needs of hardware efficiency, model accuracy, and user friction. Empirical results on over two dozen benchmarks demonstrate the practicality of MX data formats as a drop-in replacement for baseline FP32 for AI inference and training with low user friction. We also show the first instance of training generative language models at sub-8-bit weights, activations, and gradients with minimal accuracy loss and no modifications to the training recipe.
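The per-block scaling idea can be illustrated in a few lines of NumPy: each block of 32 elements shares a single power-of-two scale, and the individual elements are stored in a narrow type. The sketch below uses an 8-bit integer element type as a stand-in; the MX specification's actual element encodings (FP8/FP6/FP4/INT8), its E8M0 shared-scale handling, and its rounding rules differ in detail, and the function name and constants here are illustrative assumptions rather than a reference implementation.

# Simplified per-block scaled quantization in the spirit of MX formats.
# BLOCK and ELEM_BITS are illustrative; real kernels also pad partial blocks.
import numpy as np

BLOCK = 32       # elements that share one scale
ELEM_BITS = 8    # stand-in narrow integer element type

def mx_quantize_dequantize(x: np.ndarray) -> np.ndarray:
    """Quantize then dequantize: each 32-element block shares one
    power-of-two scale; elements are rounded to a narrow integer grid."""
    x = x.reshape(-1, BLOCK).astype(np.float32)
    max_abs = np.max(np.abs(x), axis=1, keepdims=True)
    qmax = 2 ** (ELEM_BITS - 1) - 1                 # 127 for 8-bit elements
    safe_max = np.where(max_abs > 0, max_abs, qmax)  # all-zero blocks get scale 1
    scale = np.exp2(np.ceil(np.log2(safe_max / qmax)))  # shared power-of-two scale
    q = np.clip(np.rint(x / scale), -qmax, qmax)    # narrow integer elements
    return (q * scale).reshape(-1)

x = np.random.randn(64).astype(np.float32)
x_hat = mx_quantize_dequantize(x)
print("max abs error:", np.max(np.abs(x - x_hat)))

Sharing one coarse scale per small block keeps per-element storage narrow while letting the scale track the local dynamic range, which is the trade-off between hardware efficiency and accuracy the abstract refers to.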