Data Parallelism and Distributed Deep Learning at production scale (part 2)

#artificialintelligence 

Lastly, our optimiser is wrapped by Horovod's implementation for distributed optimisation (which handles the all-gather and all-reduce MPI operations). We next assign training callbacks to GPU processors based on each processor's (unique) global rank. By default, rank 0 is designated as the root node. Some operations need to run on only a single node (for example, using a model checkpoint to save model weights to file). Each processor effectively runs its own training job, optionally printing training accuracy, loss, and custom metrics to CloudWatch.
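The rank-gating described above can be sketched as follows. This is a minimal, Horovod-free illustration: the `rank` argument and callback names are hypothetical stand-ins (in Horovod itself, the global rank would come from `hvd.rank()` and the optimiser wrapping from `hvd.DistributedOptimizer`).

```python
def build_callbacks(rank: int, checkpoint_path: str = "checkpoints/model.h5") -> list:
    """Assemble the training callbacks for one worker, given its global rank."""
    # Callbacks every worker runs, e.g. logging loss/accuracy metrics.
    callbacks = ["metric_logger"]
    # File-writing operations run only on the root node (rank 0), so
    # concurrent workers never write to the same checkpoint file.
    if rank == 0:
        callbacks.append(f"model_checkpoint -> {checkpoint_path}")
    return callbacks

# Every worker logs metrics; only rank 0 also checkpoints.
print(build_callbacks(rank=0))
print(build_callbacks(rank=3))
```

Gating on the global rank this way keeps the per-worker training loops identical everywhere except for the handful of root-only side effects.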