Distributed Deep Learning, Part 1: An Introduction to Distributed Training of Neural Networks
Consequently, there is an equivalence between parameter averaging and update-based data parallelism, when parameters are updated synchronously (this last part is key). This equivalence also holds for multiple averaging steps and other updaters (not just simple SGD). Update-based data parallelism becomes more interesting (and arguably more useful) when we relax the synchronous update requirement. That is, by allowing the updates Wi,j to be applied to the parameter vector as soon as they are computed (instead of waiting for N 1 iterations by all workers), we obtain asynchronous stochastic gradient descent algorithm. These benefits are not without cost, however. By introducing asynchronous updates to the parameter vector, we introduce a new problem, known as the stale gradient problem.
Oct-4-2016, 07:35:42 GMT
- Technology: