LoCo: Low-Bit Communication Adaptor for Large-scale Model Training
Xingyu Xie, Zhijie Lin, Kim-Chuan Toh, Pan Zhou
–arXiv.org Artificial Intelligence
Abstract--To efficiently train large-scale models, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones for higher gradient synchronization efficiency among GPU nodes. However, this compression can degrade training quality because of the information it discards. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo maintains a moving average of historical compensation errors to stably estimate the current compression error and then uses it to compensate the current gradient compression, yielding a less lossy compression. This mechanism makes it compatible with general optimizers like Adam and sharding strategies like FSDP. Theoretical analysis shows that integrating LoCo into full-precision optimizers like Adam and SGD does not impair their convergence speed on nonconvex problems. Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency, e.g., improving Adam's training speed by 14% to 40% without performance degradation on large language models like LLAMAs and MoE.

This progress is largely attributed to the advent of large-scale models, like the GPT and LLAMA series [1], [5]-[7], characterized by their billions of parameters and trillions of training tokens. This trend of large-scale models has expanded into various other fields, including finance [8], law [9], and medicine [10]. Despite their successes, these large-scale models necessitate extensive GPUs for parallel training, employing strategies like data parallelism [11], ...

To address the challenge of communication efficiency in large-scale model training, error-feedback compression (EFC) [17], [18] has been developed to compensate communication variables before compression, ensuring small compression errors. This technique has been utilized in gradient compression to create communication-efficient low-bit optimizers, such as 1-bit Adam [14] and 1-bit LAMB [19]. However, these low-bit optimizers face several key challenges.
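The abstract describes LoCo's mechanism only at a high level, so the Python sketch below illustrates the general idea of moving-average error compensation applied before low-bit gradient compression. The quantizer, the decay factor `beta`, and the per-parameter state layout are illustrative assumptions, not the paper's exact algorithm or implementation.

```python
import torch


def quantize_uniform(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Toy symmetric uniform quantizer (an assumption; LoCo's actual quantizer may differ)."""
    levels = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-12) / levels
    q = torch.round(x / scale).clamp(-levels, levels)
    return q * scale  # dequantized low-bit approximation of x


class MovingAverageCompensator:
    """Sketch: keep a moving average of past compression errors per parameter
    and add it to the fresh local gradient before compression, so the
    compressed gradient stays close to lossless over time."""

    def __init__(self, beta: float = 0.9, bits: int = 4):
        self.beta = beta
        self.bits = bits
        self.err_avg: dict[str, torch.Tensor] = {}

    def compress(self, name: str, grad: torch.Tensor) -> torch.Tensor:
        err = self.err_avg.get(name)
        if err is None:
            err = torch.zeros_like(grad)
        compensated = grad + err                      # compensate before compression
        low_bit = quantize_uniform(compensated, self.bits)
        step_err = compensated - low_bit              # error introduced this step
        # moving average stabilizes the estimate used to compensate future steps
        self.err_avg[name] = self.beta * err + (1.0 - self.beta) * step_err
        return low_bit                                # tensor to synchronize across nodes


# Usage sketch: compensate each parameter's gradient before (simulated) communication.
model = torch.nn.Linear(8, 4)
model(torch.randn(2, 8)).sum().backward()
comp = MovingAverageCompensator(beta=0.9, bits=4)
for name, p in model.named_parameters():
    p.grad = comp.compress(name, p.grad)
```

In an actual data-parallel or FSDP setup, the integer codes and scale, rather than the dequantized tensor returned here, would be what is exchanged among GPU nodes.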
Jul-5-2024