Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression

Feng, Hao, Zhang, Boyuan, Ye, Fanjiang, Si, Min, Chu, Ching-Hsiang, Tian, Jiannan, Yin, Chunxing, Deng, Summer, Hao, Yuchen, Balaji, Pavan, Geng, Tong, Tao, Dingwen

Jul-11-2024–arXiv.org Artificial Intelligence

Abstract: DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications.

Deep Learning Recommendation Models (DLRMs) have risen significantly to prominence in both research and industry sectors in recent years. These models integrate sparse input embedding learning with neural network architectures, marking a notable advance over traditional collaborative filtering-based recommendation systems [1]. DLRMs have been successfully implemented in various industry applications, including the product recommendation system by Amazon [2], personalized [...] As a result, they constitute a significant portion of deep learning applications across multiple industries. DLRMs are uniquely designed to process high-dimensional categorical features, typically represented by one- or multi-hot vectors matching the size of the category, which leads to significant data sparsity.

This setup necessitates the use of collective communication primitives for synchronization across all GPUs. Specifically, the partitioning of sparse embedding tables requires nodes to aggregate sparse embedding lookups during forward passes and their corresponding gradients during backward passes. Consequently, all-to-all communication is utilized in both forward and backward passes for synchronizing sparse lookups and gradients, while all-reduce is employed for synchronizing dense/MLP gradients during the backward pass. [...] gradients across all GPUs during each minibatch iteration significantly [...] For example, Figure 1 shows that all-to-all communication accounts for more than 60% of the total training time for DLRM on an 8-node cluster with 32 A100 GPUs (connected through a Slingshot [...])
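The communication pattern described here (all-to-all for sparse embedding lookups, all-reduce for dense/MLP gradients) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: it uses no GPUs and no communication library, and the helper names `all_to_all` and `all_reduce_sum`, the two-rank layout, and the toy tables are all hypothetical.

```python
# Single-process sketch of the two collectives; all names and data are hypothetical.

def all_to_all(send):
    """send[src][dst] is what rank src sends to rank dst; returns recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

def all_reduce_sum(per_rank):
    """Every rank ends up with the element-wise sum of all ranks' vectors."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

WORLD = 2  # number of simulated ranks (assumption: one embedding table per rank)

# Model parallelism: rank r owns embedding table r, indexed by category id.
tables = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],  # table 0, stored on rank 0
    [[10.0, 10.1], [11.0, 11.1]],          # table 1, stored on rank 1
]

# Data parallelism: batch[r][s][t] is the category id of feature t
# for sample s held by rank r.
batch = [
    [[0, 1], [2, 0]],  # rank 0 holds two samples
    [[1, 1]],          # rank 1 holds one sample
]

# Step 1: route each feature's ids to the rank that owns its table.
idx_send = [[[sample[dst] for sample in batch[src]] for dst in range(WORLD)]
            for src in range(WORLD)]
idx_recv = all_to_all(idx_send)  # idx_recv[owner][src] = ids requested by src

# Step 2: each owner performs the local embedding lookups.
emb_send = [[[tables[owner][i] for i in idx_recv[owner][src]]
             for src in range(WORLD)] for owner in range(WORLD)]

# Step 3: all-to-all returns the looked-up vectors to the ranks holding the samples.
emb_recv = all_to_all(emb_send)  # emb_recv[rank][owner] = vectors for rank's samples

# Dense/MLP gradients are replicated, so they are summed with all-reduce.
dense_grads = [[1.0, 2.0], [3.0, 4.0]]  # one toy gradient vector per rank
synced = all_reduce_sum(dense_grads)
```

In the backward pass the same all-to-all pattern would run in the opposite direction, carrying embedding gradients from the sample-holding ranks back to the table owners. Because every rank exchanges data with every other rank in each direction, this traffic grows with both batch size and embedding dimension.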

artificial intelligence, compression ratio, machine learning

Country:
- North America > United States
  - Florida (0.14)
  - Indiana (0.14)
  - Texas (0.14)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Representation & Reasoning (1.00)
