Learning In Chaos: Efficient Autoscaling and Self-Healing for Multi-Party Distributed Training
Wenjiao Feng, Rongxing Xiao, Zonghang Li, Hongfang Yu, Gang Sun, Long Luo, Mohsen Guizani, Qirong Ho, Steve Liu
arXiv.org Artificial Intelligence
Node and link churn in multi-party, cross-region clusters over wide-area networks (WANs) often disrupts distributed training. Existing remedies fall short: checkpoint-based recovery and cloud-centric autoscaling react slowly and assume centralized control, which is misaligned with self-governed setups where institutions can freely join and leave. This paper proposes Chaos, a multi-party distributed training system with self-healing and autoscaling that enables robust, elastic training under churn. It speeds up autoscaling via multi-neighbor state replication and model sharding. We formalize sharding and assignment as a mixed-integer nonlinear program (MINLP) that captures WAN heterogeneity, and reduce it to a tractable mixed-integer linear program (MILP) by analyzing its monotonicity on a divisibility chain. By establishing an equivalence, we derive a greedy algorithm that follows the optimality rules and yields the optimal solution in polynomial time. Chaos uses a cluster monitor to track resource and topology changes, and handles scaling events through peer negotiation protocols, enabling fully self-governed autoscaling among institutions. Experiments show that Chaos achieves substantially lower scale-out delay than Pollux, Elan, and Autoscaling, and handles scale-in, connect-link, and disconnect-link events within 20 ms. It also delivers the lowest idle time, indicating superior resource use and scalability as the cluster grows.
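The greedy shard-assignment idea in the abstract can be illustrated with a minimal sketch. This is not the paper's actual algorithm (which handles WAN heterogeneity, the divisibility chain, and the MILP's optimality rules); it only assumes shards with given sizes and nodes that differ in download bandwidth, and greedily places each shard on the node where its transfer would finish earliest. All names here (`greedy_assign`, `shard_sizes`, `bandwidths`) are illustrative.

```python
def greedy_assign(shard_sizes, bandwidths):
    """Greedily assign model shards to nodes to shorten scale-out time.

    Each node n has a backlog: total transfer time already queued on it.
    A shard placed on node n finishes at backlog[n] + size / bandwidths[n];
    we pick the node minimizing that finish time, processing large shards
    first (a common greedy heuristic for makespan-style objectives).
    """
    backlog = [0.0] * len(bandwidths)
    assignment = []
    for size in sorted(shard_sizes, reverse=True):  # largest shards first
        node = min(range(len(bandwidths)),
                   key=lambda n: backlog[n] + size / bandwidths[n])
        backlog[node] += size / bandwidths[node]
        assignment.append((size, node))
    # max backlog approximates the scale-out (state replication) delay
    return assignment, max(backlog)
```

For example, with shards of size `[4, 3, 2, 1]` and two nodes with bandwidths `[2, 1]`, the heuristic spreads work across both nodes rather than saturating the fast one, yielding an estimated completion time of 3.5 time units.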
Sep-16-2025