Distributed Training on AWS SageMaker
Today we have access to enormous datasets and ever deeper, larger deep learning models, so training on a single GPU on a local machine can quickly become a bottleneck. Some models will not even fit on a single GPU, and even when they do, training can be painfully slow: with large training data and a large model, a single experiment can take weeks or months. This hampers research and development and stretches out the time needed to build proofs of concept (POCs). Fortunately, cloud compute is available, letting us provision remote machines and configure them to the requirements of the project.
Jun-20-2021, 17:20:16 GMT