Scalable multi-node training with TensorFlow on Amazon Web Services
We've heard from customers that successfully scaling TensorFlow training jobs to multiple nodes and GPUs is hard. TensorFlow has distributed training built in, but it can be difficult to use. Recently, we made optimizations to TensorFlow and Horovod that help AWS customers scale TensorFlow training jobs across multiple nodes and GPUs. With these improvements, any AWS customer can use an AWS Deep Learning AMI to train ResNet-50 on ImageNet in just under 15 minutes. To achieve this, we harnessed 32 Amazon EC2 instances with 8 GPUs each, a total of 256 GPUs, running TensorFlow.

All of the required software and tools for this solution ship with the latest Deep Learning AMIs (DLAMIs), so you can try it out yourself. You can train faster, implement your models faster, and get results faster than ever before. This blog post describes our results and shows you how to try out this easier and faster way to run distributed training with TensorFlow.

Figure A. ResNet-50 ImageNet model training with the latest optimized TensorFlow with Horovod on a Deep Learning AMI takes 15 minutes on 256 GPUs.
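To give a sense of what Horovod is doing under the hood when it scales training to 256 GPUs: it averages the gradients computed on every worker using a ring-allreduce collective. The following is a minimal single-process simulation of that collective for illustration only; it is not Horovod's actual implementation, which runs over MPI/NCCL and overlaps communication with backpropagation.

```python
def ring_allreduce(worker_grads):
    """Average equal-length gradient vectors across N simulated workers.

    Each worker's vector is split into N chunks. In the scatter-reduce
    phase, partial sums travel around the ring until each worker holds
    the complete sum of one chunk; in the allgather phase, the finished
    chunks circulate until every worker has the full summed vector.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Chunk i covers indices [bounds[i], bounds[i + 1]).
    bounds = [i * length // n for i in range(n + 1)]
    grads = [list(g) for g in worker_grads]  # working copies

    def chunk(r, c):
        return grads[r][bounds[c]:bounds[c + 1]]

    # Scatter-reduce: at step s, worker r sends chunk (r - s) mod n to
    # its ring neighbor, which adds it to its own copy. Snapshot the
    # outgoing chunks first, since all sends happen "simultaneously".
    for step in range(n - 1):
        outgoing = [(r, (r - step) % n, chunk(r, (r - step) % n))
                    for r in range(n)]
        for r, c, vals in outgoing:
            dst = (r + 1) % n
            for j, v in enumerate(vals):
                grads[dst][bounds[c] + j] += v

    # Allgather: circulate the finished chunks; receivers overwrite.
    for step in range(n - 1):
        outgoing = [(r, (r + 1 - step) % n, chunk(r, (r + 1 - step) % n))
                    for r in range(n)]
        for r, c, vals in outgoing:
            dst = (r + 1) % n
            grads[dst][bounds[c]:bounds[c + 1]] = vals

    # Every worker now holds the full sum; divide to get the average.
    return [[v / n for v in g] for g in grads]
```

The key property of the ring algorithm is that each worker sends and receives only 2(N-1)/N times the gradient size regardless of the number of workers N, which is why gradient exchange stays cheap even at 256 GPUs.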
Dec-17-2018, 18:45:01 GMT