How to derive ring all-reduce's mathematical property step by step
In our previous blog, Combating Software System Complexity: Appropriate Abstraction Layer, we mentioned that communication in a distributed deep learning framework relies heavily on regular collective communication operations such as all-reduce, reduce-scatter, and all-gather. It is therefore crucial to implement highly optimized collective communication and to select an ideal algorithm based on task requirements and communication topology. This article unveils the mathematical properties of collective communication operations by analyzing all-reduce, which is common in data parallelism. As illustrated in Figure 1, there are four devices, each holding one matrix (to keep things simple, each row in these matrices has only one element). All-reduce sums the values in the same row across devices and returns the result to that row on every device.
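To make the semantics concrete, here is a minimal sketch in plain Python of what all-reduce computes in the Figure 1 setup (it models only the result, not the ring algorithm; the device count and input values are illustrative assumptions, not taken from the figure):

```python
def all_reduce_sum(tensors):
    """Return, for every device, the element-wise sum over all devices."""
    # Sum the i-th element (row) across all devices' tensors.
    total = [sum(vals) for vals in zip(*tensors)]
    # Every device receives its own copy of the full result.
    return [list(total) for _ in tensors]

# Four devices, each with a 4-element vector (one element per "row").
inputs = [[i, i + 1, i + 2, i + 3] for i in range(4)]
outputs = all_reduce_sum(inputs)
print(outputs[0])  # [6, 10, 14, 18], identical on every device
```

After the operation, each device holds the same summed tensor, which is exactly the property exploited by data-parallel gradient synchronization.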
Jun-16-2022, 13:47:00 GMT