Reviews: Bayesian Distributed Stochastic Gradient Descent
–Neural Information Processing Systems
Summary: This paper presents a new algorithm called Bayesian Distributed SGD to mitigate the straggler problem when training deep learning on parallel clusters. Unlike Synchronous Distributed SGD approach where a fixed cut-off (number of workers) is predefined, BDSGD uses amortized inference to predict workers' run-times and derive a straggler cut-off accordingly. BDSGD models the joint run-time behaviour of workers which are likely to be correlated due to the underlying cluster architecture. The approach is incorporated as part of the parameter server framework, deciding which sub-gradients to drop in each iteration. Strength: The proposed idea of adaptive cut-off and predicting joint worker runtime through amortized inference with variational auto encoder loss is novel and very interesting.
Neural Information Processing Systems
Oct-7-2024, 16:59:35 GMT
- Technology: