Routing for Large ML Models

Cohen, Ofir; Yallouz, Jose; Schapira, Michael; Belkar, Shahar; Mizrahi, Tal

arXiv.org Artificial Intelligence 

The communication patterns induced by these training processes exhibit high regularity and persistence, giving rise to significant opportunities for optimizing the manner in which flows are routed across the network. We present an algorithmic framework for quantifying network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically optimizing routing with respect to this global metric.

Our aim is to devise methodologies for the online adaptation of routing configurations in ML training clusters that improve global training efficiency and fairness. Our approach builds on two characteristics of ML training and modern networking:

Traffic patterns induced by ML training tend to exhibit