Scalable multi-node training with TensorFlow on Amazon Web Services
We've heard from customers that successfully scaling TensorFlow training jobs to multiple nodes and GPUs is hard. TensorFlow has distributed training built in, but it can be difficult to use. Recently, we made optimizations to TensorFlow and Horovod that help AWS customers scale TensorFlow training jobs across multiple nodes and GPUs. With these improvements, any AWS customer can use an AWS Deep Learning AMI to train ResNet-50 on ImageNet in just under 15 minutes. To achieve this, we harnessed 32 Amazon EC2 instances with 8 GPUs each, a total of 256 GPUs, running TensorFlow.

All of the required software and tools for this solution ship with the latest Deep Learning AMIs (DLAMIs), so you can try it out yourself. You can train faster, implement your models faster, and get results faster than ever before. This blog post describes our results and shows you how to try out this easier and faster way to run distributed training with TensorFlow.

Figure A. ResNet-50 ImageNet model training with the latest optimized TensorFlow with Horovod on a Deep Learning AMI takes 15 minutes on 256 GPUs.
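To give a sense of what Horovod is doing under the hood when it scales training to 256 GPUs: it averages the gradients computed on every worker using a ring-allreduce collective. The following is a minimal single-process simulation of that collective for illustration only; it is not Horovod's actual implementation, which runs over MPI/NCCL and overlaps communication with backpropagation.

```python
def ring_allreduce(worker_grads):
    """Average equal-length gradient vectors across N simulated workers.

    Each worker's vector is split into N chunks. In the scatter-reduce
    phase, partial sums travel around the ring until each worker holds
    the complete sum of one chunk; in the allgather phase, the finished
    chunks circulate until every worker has the full summed vector.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Chunk i covers indices [bounds[i], bounds[i + 1]).
    bounds = [i * length // n for i in range(n + 1)]
    grads = [list(g) for g in worker_grads]  # working copies

    def chunk(r, c):
        return grads[r][bounds[c]:bounds[c + 1]]

    # Scatter-reduce: at step s, worker r sends chunk (r - s) mod n to
    # its ring neighbor, which adds it to its own copy. Snapshot the
    # outgoing chunks first, since all sends happen "simultaneously".
    for step in range(n - 1):
        outgoing = [(r, (r - step) % n, chunk(r, (r - step) % n))
                    for r in range(n)]
        for r, c, vals in outgoing:
            dst = (r + 1) % n
            for j, v in enumerate(vals):
                grads[dst][bounds[c] + j] += v

    # Allgather: circulate the finished chunks; receivers overwrite.
    for step in range(n - 1):
        outgoing = [(r, (r + 1 - step) % n, chunk(r, (r + 1 - step) % n))
                    for r in range(n)]
        for r, c, vals in outgoing:
            dst = (r + 1) % n
            grads[dst][bounds[c]:bounds[c + 1]] = vals

    # Every worker now holds the full sum; divide to get the average.
    return [[v / n for v in g] for g in grads]
```

The key property of the ring algorithm is that each worker sends and receives only 2(N-1)/N times the gradient size regardless of the number of workers N, which is why gradient exchange stays cheap even at 256 GPUs.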
Dec-17-2018, 18:45:01 GMT