Learning In Chaos: Efficient Autoscaling and Self-Healing for Multi-Party Distributed Training
Wenjiao Feng, Rongxing Xiao, Zonghang Li, Hongfang Yu, Gang Sun, Long Luo, Mohsen Guizani, Qirong Ho, Steve Liu
arXiv.org Artificial Intelligence
Node and link churn in multi-party, cross-region clusters over wide-area networks (WANs) often disrupts distributed training. Existing remedies fall short: checkpoint-based recovery and cloud-centric autoscaling react slowly and assume centralized control, which is misaligned with self-governed setups where institutions can freely join and leave. This paper proposes Chaos, a multi-party distributed training system with self-healing and autoscaling that enables robust, elastic training under churn. It speeds up autoscaling via multi-neighbor state replication and model sharding. We formalize sharding and assignment as a mixed-integer nonlinear program (MINLP) that captures WAN heterogeneity, and reduce it to a tractable mixed-integer linear program (MILP) by analyzing its monotonicity on a divisibility chain. By establishing an equivalence, we derive a greedy algorithm that follows the optimality rules and yields the optimal solution in polynomial time. Chaos uses a cluster monitor to track resource and topology changes, and handles scaling events through peer negotiation protocols, enabling fully self-governed autoscaling among institutions. Experiments show that Chaos achieves substantially lower scale-out delay than Pollux, Elan, and Autoscaling, and handles scale-in, connect-link, and disconnect-link events within 20 ms. It also delivers the lowest idle time, indicating superior resource use and scalability as the cluster grows.
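The greedy shard-assignment idea in the abstract can be illustrated with a minimal sketch. This is not the paper's actual algorithm (which handles WAN heterogeneity, the divisibility chain, and the MILP's optimality rules); it only assumes shards with given sizes and nodes that differ in download bandwidth, and greedily places each shard on the node where its transfer would finish earliest. All names here (`greedy_assign`, `shard_sizes`, `bandwidths`) are illustrative.

```python
def greedy_assign(shard_sizes, bandwidths):
    """Greedily assign model shards to nodes to shorten scale-out time.

    Each node n has a backlog: total transfer time already queued on it.
    A shard placed on node n finishes at backlog[n] + size / bandwidths[n];
    we pick the node minimizing that finish time, processing large shards
    first (a common greedy heuristic for makespan-style objectives).
    """
    backlog = [0.0] * len(bandwidths)
    assignment = []
    for size in sorted(shard_sizes, reverse=True):  # largest shards first
        node = min(range(len(bandwidths)),
                   key=lambda n: backlog[n] + size / bandwidths[n])
        backlog[node] += size / bandwidths[node]
        assignment.append((size, node))
    # max backlog approximates the scale-out (state replication) delay
    return assignment, max(backlog)
```

For example, with shards of size `[4, 3, 2, 1]` and two nodes with bandwidths `[2, 1]`, the heuristic spreads work across both nodes rather than saturating the fast one, yielding an estimated completion time of 3.5 time units.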
Sep-16-2025