Rethinking Memory and Communication Costs for Efficient Data Parallel Training of Large Language Models
Neural Information Processing Systems
Recently, various strategies for distributed training of large language models (LLMs) have been proposed. By categorizing them into basic and composite strategies, we find that existing basic strategies offer limited options in certain scenarios, leaving considerable room for improvement in training speed. In this paper, we rethink the impact of memory and communication costs on the training speed of LLMs, taking into account the performance disparity between intra-group and inter-group communication, and then propose a new set of basic strategies named the Partial Redundancy Optimizer (PaRO). PaRO Data Parallelism (PaRO-DP) accelerates LLM training through refined partitioning of model states and tailored training procedures.
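The abstract contrasts fully sharded model states with a partial-redundancy layout that exploits the gap between fast intra-group links and slower inter-group links. The following minimal Python sketch illustrates the general idea only; the group sizes and function names are hypothetical and this is not the paper's implementation.

    # Minimal sketch (hypothetical, not the paper's code) contrasting a fully
    # sharded layout with a partial-redundancy layout: shard model states
    # across nodes, replicate within each node, so that gathering a shard
    # stays on the fast intra-node links.

    NUM_NODES = 4          # inter-node groups (slow links between them)
    GPUS_PER_NODE = 8      # intra-node group size (fast links within a node)
    WORLD_SIZE = NUM_NODES * GPUS_PER_NODE

    def full_sharding_owner(rank: int) -> int:
        """Fully sharded (ZeRO-3-style): each rank owns a unique
        1/WORLD_SIZE shard, so reassembling parameters must cross the
        slow inter-node links."""
        return rank

    def partial_redundancy_shard(rank: int) -> int:
        """Partial redundancy: the model state is split into NUM_NODES
        shards, and every GPU on a node holds a replica of that node's
        shard, trading extra memory for cheaper communication."""
        return rank // GPUS_PER_NODE

    if __name__ == "__main__":
        for rank in range(WORLD_SIZE):
            print(f"rank {rank:2d}: full-shard owner = "
                  f"{full_sharding_owner(rank):2d}, "
                  f"partial-redundancy shard = {partial_redundancy_shard(rank)}")

Under this layout, each GPU stores 1/NUM_NODES of the model state instead of 1/WORLD_SIZE, a memory-for-communication trade-off consistent with the abstract's framing of memory and communication costs.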