SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training

May-26-2025, 16:27:22 GMT–Neural Information Processing Systems

Recent years have witnessed a clear trend towards language models with ever-increasing numbers of parameters, along with growing training overhead and memory usage. Distributed training, particularly through Sharded Data Parallelism (ShardedDP), which partitions optimizer states among workers, has emerged as a crucial technique for reducing training time and memory usage. Yet a major challenge in the scalability of ShardedDP is the intensive communication of weights and gradients. While compression techniques can alleviate this issue, they often degrade accuracy. Driven by this limitation, we propose SDP4Bit (Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training), which effectively reduces the communication of weights and gradients to nearly 4 bits via two novel techniques: quantization on weight differences, and two-level gradient smooth quantization.
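
The abstract names quantization on weight differences without giving details. As a rough illustration of why that idea can help, the following is a minimal sketch, not the paper's implementation. It assumes a simple symmetric group-wise 4-bit quantizer, a group size of 32, synthetic Gaussian weights, and a receiver that already holds an exact copy of the previous weights. All names (`quantize_4bit`, `old_weights`, `new_weights`) are hypothetical.

```python
import random


def quantize_4bit(values, group_size=32):
    """Symmetric group-wise 4-bit quantization (assumed scheme, not from the paper).

    Each group is scaled so its largest magnitude maps to level 7, rounded to
    an integer level in [-7, 7], then dequantized back to floats.
    """
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        for v in group:
            level = max(-7, min(7, round(v / scale)))
            out.append(level * scale)
    return out


def mean_abs_error(approx, exact):
    return sum(abs(a - e) for a, e in zip(approx, exact)) / len(exact)


rng = random.Random(0)
old_weights = [rng.gauss(0.0, 1.0) for _ in range(1024)]
# Hypothetical optimizer step: the update is small relative to the weights.
new_weights = [w + rng.gauss(0.0, 1e-3) for w in old_weights]

# Option A: quantize the updated weights directly.
direct = quantize_4bit(new_weights)

# Option B: quantize only the difference; the receiver adds the dequantized
# difference to its own copy of the previous weights.
diffs = [n - o for n, o in zip(new_weights, old_weights)]
reconstructed = [o + d for o, d in zip(old_weights, quantize_4bit(diffs))]

err_direct = mean_abs_error(direct, new_weights)
err_diff = mean_abs_error(reconstructed, new_weights)
print(f"direct: {err_direct:.2e}  difference-based: {err_diff:.2e}")
```

Under these assumptions the difference-based path has a much smaller reconstruction error, because the quantization step size follows the range of the update rather than the range of the weights themselves. How SDP4Bit handles grouping, scaling, and the gradient path is not described in this abstract.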

large language model, machine learning, natural language

Genre:
- Research Report (0.42)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.82)
  - Natural Language > Large Language Model (0.78)