Huang, Jianyu
WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
Wang, Zheng, Cai, Anna, Xie, Xinfeng, Pan, Zaifeng, Guan, Yue, Chu, Weiwei, Wang, Jie, Li, Shikai, Huang, Jianyu, Cai, Chris, Hao, Yuchen, Ding, Yufei
In this work, we present WLB-LLM, a workLoad-balanced 4D parallelism for large language model training. We first thoroughly analyze the workload imbalance issue in LLM training and identify two primary sources of imbalance at the pipeline parallelism and context parallelism levels. Then, to address the imbalance issue, at the pipeline parallelism level, WLB-LLM incorporates a workload-aware variable-length document packing method to balance the computation and communication workload across micro-batches. Additionally, at the context parallelism level, WLB-LLM introduces a novel fine-grained per-document sharding strategy, ensuring each worker within a context parallelism group has an identical workload. Comprehensive experiments under different model scales demonstrate that WLB-LLM significantly mitigates the workload imbalance during 4D parallelism LLM training and achieves an average speedup of 1.23x when applying WLB-LLM in our internal LLM training framework.
Context Parallelism for Scalable Million-Token Inference
Yang, Amy, Yang, Jingyi, Ibrahim, Aya, Xie, Xinfeng, Tang, Bangsheng, Sizov, Grigory, Reizenstein, Jeremy, Park, Jongsoo, Huang, Jianyu
We present context parallelism for long-context large language model inference, which achieves near-linear scaling for long-context prefill latency with up to 128 H100 GPUs across 16 nodes. Particularly, our method achieves 1M context prefill with Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K context prefill in 3.8s. We develop two lossless exact ring attention variants: pass-KV and pass-Q to cover a wide range of use cases with the state-of-the-art performance: full prefill, persistent KV prefill and decode. Benchmarks on H100 GPU hosts inter-connected with RDMA and TCP both show similar scalability for long-context prefill, demonstrating that our method scales well using common commercial data center with medium-to-low inter-host bandwidth.
PVF (Parameter Vulnerability Factor): A Scalable Metric for Understanding AI Vulnerability Against SDCs in Model Parameters
Jiao, Xun, Lin, Fred, Dixit, Harish D., Coburn, Joel, Pandey, Abhinav, Wang, Han, Ramesh, Venkat, Huang, Jianyu, Xu, Wang, Moore, Daniel, Sankar, Sriram
Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults, e.g., silent data corruptions (SDC), that can potentially corrupt model parameters. When this occurs during AI inference/servicing, it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services. In light of the escalating threat, it is crucial to address key questions: How vulnerable are AI models to parameter corruptions, and how do different components (such as modules, layers) of the models exhibit varying vulnerabilities to parameter corruptions? To systematically address this question, we propose a novel quantitative metric, Parameter Vulnerability Factor (PVF), inspired by architectural vulnerability factor (AVF) in computer architecture community, aiming to standardize the quantification of AI model vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. In this paper, we present several use cases on applying PVF to three types of tasks/models during inference -- recommendation (DLRM), vision classification (CNN), and text classification (BERT), while presenting an in-depth vulnerability analysis on DLRM. PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency such as mapping vulnerable AI parameter components to well-protected hardware modules. PVF metric is applicable to any AI model and has a potential to help unify and standardize AI vulnerability/resilience evaluation practice.
High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models
Mudigere, Dheevatsa, Hao, Yuchen, Huang, Jianyu, Tulloch, Andrew, Sridharan, Srinivas, Liu, Xing, Ozdal, Mustafa, Nie, Jade, Park, Jongsoo, Luo, Liang, Yang, Jie Amy, Gao, Leon, Ivchenko, Dmytro, Basant, Aarti, Hu, Yuxi, Yang, Jiyan, Ardestani, Ehsan K., Wang, Xiaodong, Komuravelli, Rakesh, Chu, Ching-Hsiang, Yilmaz, Serhat, Li, Huayu, Qian, Jiyuan, Feng, Zhuobo, Ma, Yinbin, Yang, Junjie, Wen, Ellie, Li, Hong, Yang, Lin, Sun, Chonglin, Zhao, Whitney, Melts, Dimitry, Dhulipala, Krishna, Kishore, KR, Graf, Tyler, Eisenman, Assaf, Matam, Kiran Kumar, Gangidi, Adi, Chen, Guoqiang Jerry, Krishnan, Manoj, Nayak, Avinash, Nair, Krishnakumar, Muthiah, Bharath, khorashadi, Mahmoud, Bhattacharya, Pallab, Lapukhov, Petr, Naumov, Maxim, Qiao, Lin, Smelyanskiy, Mikhail, Jia, Bill, Rao, Vijay
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row, column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates (v) leveraging reduced precision communications, multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for the robust and efficient end-to-end training in production environments.
Mixed-Precision Embedding Using a Cache
Yang, Jie Amy, Huang, Jianyu, Park, Jongsoo, Tang, Ping Tak Peter, Tulloch, Andrew
In recommendation systems, practitioners observed that increase in the number of embedding tables and their sizes often leads to significant improvement in model performances. Given this and the business importance of these models to major internet companies, embedding tables for personalization tasks have grown to terabyte scale and continue to grow at a significant rate. Meanwhile, these large-scale models are often trained with GPUs where high-performance memory is a scarce resource, thus motivating numerous work on embedding table compression during training. We propose a novel change to embedding tables using a cache memory architecture, where the majority of rows in an embedding is trained in low precision, and the most frequently or recently accessed rows cached and trained in full precision. The proposed architectural change works in conjunction with standard precision reduction and computer arithmetic techniques such as quantization and stochastic rounding. For an open source deep learning recommendation model (DLRM) running with Criteo-Kaggle dataset, we achieve 3x memory reduction with INT8 precision embedding tables and full-precision cache whose size are 5% of the embedding tables, while maintaining accuracy. For an industrial scale model and dataset, we achieve even higher >7x memory reduction with INT4 precision and cache size 1% of embedding tables, while maintaining accuracy, and 16% end-to-end training speedup by reducing GPU-to-host data transfers.
A Study of BFLOAT16 for Deep Learning Training
Kalamkar, Dhiraj, Mudigere, Dheevatsa, Mellempudi, Naveen, Das, Dipankar, Banerjee, Kunal, Avancha, Sasikanth, Vooturi, Dharma Teja, Jammalamadaka, Nataraj, Huang, Jianyu, Yuen, Hector, Yang, Jiyan, Park, Jongsoo, Heinecke, Alexander, Georganas, Evangelos, Srinivasan, Sudarshan, Kundu, Abhisek, Smelyanskiy, Misha, Kaul, Bharat, Dubey, Pradeep
This paper presents the first comprehensive empirical study demonstrating the efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep Learning training across image classification, speech recognition, language modeling, generative networks and industrial recommendation systems. BFLOAT16 is attractive for Deep Learning training for two reasons: the range of values it can represent is the same as that of IEEE 754 floating-point format (FP32) and conversion to/from FP32 is simple. Maintaining the same range as FP32 is important to ensure that no hyper-parameter tuning is required for convergence; e.g., IEEE 754 compliant half-precision floating point (FP16) requires hyper-parameter tuning. In this paper, we discuss the flow of tensors and various key operations in mixed precision training, and delve into details of operations, such as the rounding modes for converting FP32 tensors to BFLOAT16. We have implemented a method to emulate BFLOAT16 operations in Tensorflow, Caffe2, IntelCaffe, and Neon for our experiments. Our results show that deep learning training using BFLOAT16 tensors achieves the same state-of-the-art (SOTA) results across domains as FP32 tensors in the same number of iterations and with no changes to hyper-parameters.