Reviews: Mesh-TensorFlow: Deep Learning for Supercomputers

Neural Information Processing Systems 

The paper introduces a new abstraction language for distributed tensor operations, with a focus on shallow, feed-forward neural networks. Specifically, it describes the concept of laid-out tensors: "sub-tensors" that can represent either weights or data and can be distributed across different processors to enable data- or model-parallelism. A new symbolic language for computations over laid-out tensors is described and shown to be convenient for expressing a wide variety of model constructs involving model and/or data parallelism. A theoretical performance analysis explains how to avoid "wasted" resources when laying out models and data in parallel. Experiments are performed on TPUs, obtaining improved results on language tasks with models running on 256 cores.
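To make the reviewed abstraction concrete, the following is a rough sketch in the style of the open-source mesh_tensorflow package: named tensor dimensions are mapped onto mesh dimensions by layout rules, so splitting "batch" yields data-parallelism while splitting "hidden" also shards the weights (model-parallelism). Module, function, and dimension names here follow the released package's README rather than the paper itself, so exact signatures should be treated as assumptions.

```python
# Sketch only: API names taken from the open-source mesh_tensorflow README;
# exact signatures are assumptions, not the paper's verbatim interface.
import tensorflow.compat.v1 as tf
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Every tensor dimension is named; the user never places shards by hand.
batch_dim = mtf.Dimension("batch", 512)
io_dim = mtf.Dimension("io", 784)
hidden_dim = mtf.Dimension("hidden", 4096)
classes_dim = mtf.Dimension("classes", 10)

# Import an ordinary TF tensor into the mesh as a laid-out tensor.
x = mtf.import_tf_tensor(
    mesh, tf.random.normal([512, 784]), shape=[batch_dim, io_dim])
w1 = mtf.get_variable(mesh, "w1", [io_dim, hidden_dim])
w2 = mtf.get_variable(mesh, "w2", [hidden_dim, classes_dim])

hidden = mtf.relu(mtf.einsum([x, w1], output_shape=[batch_dim, hidden_dim]))
logits = mtf.einsum([hidden, w2], output_shape=[batch_dim, classes_dim])

# Layout rules map tensor dimensions onto a 2x2 mesh of devices:
# "batch" is split over processor rows (data-parallel) and
# "hidden" over processor columns (model-parallel).
devices = ["gpu:0", "gpu:1", "gpu:2", "gpu:3"]
mesh_shape = [("processor_rows", 2), ("processor_cols", 2)]
layout_rules = [("batch", "processor_rows"), ("hidden", "processor_cols")]
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, devices)

# Lowering compiles the mesh graph into per-device TensorFlow operations.
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
tf_logits = lowering.export_to_tf_tensor(logits)
```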