Learning In Chaos: Efficient Autoscaling and Self-Healing for Multi-Party Distributed Training