Galvatron: An Automatic Distributed System for Efficient Foundation Model Training

Xinyi Liu, Yujie Wang, Shenhan Zhu, Fangcheng Fu, Qingshuo Liu, Guangming Lin, Bin Cui

May-1-2025–arXiv.org Artificial Intelligence

Galvatron is a distributed system for efficiently training large-scale Foundation Models. It removes the burden of manually choosing a parallelism strategy by automatically identifying the most efficient hybrid strategy, which combines data, tensor, pipeline, sharded data, and sequence parallelism with recomputation. The system's architecture has three parts: a profiler that analyzes the hardware and the model, a search engine that optimizes the strategy using decision trees and dynamic programming, and a runtime that executes the chosen strategy efficiently. Benchmarks on various clusters show that Galvatron achieves higher throughput than existing frameworks. The system is open source and provides user-friendly interfaces and comprehensive documentation, making complex distributed training accessible and efficient.
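The abstract says the search engine relies on dynamic programming. As a rough illustration of that idea only, the Python sketch below picks one parallelism strategy per layer so that total time is minimized while total memory stays within a budget. It is a toy under stated assumptions, not Galvatron's implementation or API: the function name `search_strategies`, the strategy names, and all time and memory figures are hypothetical, and a real search would use profiled costs and also account for pipeline stages and communication.

```python
from typing import Dict, List, Tuple

# Hypothetical per-layer candidates: strategy name -> (time_ms, memory_gb).
# Memory is an integer so it can index the dynamic-programming table.
Candidates = Dict[str, Tuple[float, int]]


def search_strategies(
    layers: List[Candidates], memory_budget: int
) -> Tuple[float, List[str]]:
    """Choose one strategy per layer, minimizing total time under a memory budget."""
    # best[m] = (lowest total time, plan) among partial plans using exactly m memory.
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for candidates in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (time, plan) in best.items():
            for name, (t, mem) in candidates.items():
                total_mem = used + mem
                if total_mem > memory_budget:
                    continue  # this choice would exceed the budget
                option = (time + t, plan + [name])
                if total_mem not in nxt or option[0] < nxt[total_mem][0]:
                    nxt[total_mem] = option
        best = nxt
    if not best:
        raise ValueError("no plan fits in the memory budget")
    return min(best.values(), key=lambda item: item[0])


# Made-up costs for a two-layer model (illustrative values only).
toy_layers: List[Candidates] = [
    {
        "data_parallel": (10.0, 8),
        "tensor_parallel": (14.0, 4),
        "tensor_parallel+recompute": (18.0, 2),
    },
    {
        "data_parallel": (20.0, 10),
        "tensor_parallel": (26.0, 5),
    },
]

total_time, plan = search_strategies(toy_layers, memory_budget=14)
print(total_time, plan)
```

With these made-up numbers, the cheapest plan that fits in 14 GB uses tensor parallelism for the first layer and data parallelism for the second: the faster all-data-parallel plan needs 18 GB and is rejected.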

large language model, machine learning, natural language

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.95)
  - Representation & Reasoning (0.88)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
