Bayesian Distributed Stochastic Gradient Descent
Neural Information Processing Systems
We introduce Bayesian distributed stochastic gradient descent (BDSGD), a high-throughput algorithm for training deep neural networks on parallel computing clusters. The algorithm uses amortized inference in a deep generative model to perform joint posterior predictive inference of mini-batch gradient computation times in a compute-cluster-specific manner. Specifically, it mitigates the straggler effect in synchronous, gradient-based optimization by choosing an optimal cutoff beyond which mini-batch gradient messages from slow workers are ignored. The principal novel contribution and finding of this work go beyond this: we demonstrate that using run-times predicted by a generative model of cluster worker performance improves over the static-cutoff prior art, leading to higher gradient computation throughput on large compute clusters. In our experiments we show that eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but sometimes also increases the overall rate of convergence as a function of wall-clock time, by virtue of eliminating idleness.
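The straggler-cutoff mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `predict_cutoff`, `worker_gradient`, and `bdsgd_style_step` are illustrative assumptions, the threading-based "cluster" only simulates heterogeneous worker run-times, and `predict_cutoff` is a simple quantile stand-in for BDSGD's amortized deep generative model of worker performance. It shows only the structural idea: wait for gradients until a predicted cutoff, aggregate what arrived, and discard the rest.

```python
# Hypothetical sketch of a cutoff-based synchronous SGD step.
# Assumption: a toy quadratic loss 0.5 * ||params||^2, so each worker's
# gradient is params plus mini-batch noise.
import queue
import threading
import time

import numpy as np


def worker_gradient(worker_id, params, out_q, seed):
    """Simulate one worker computing a mini-batch gradient with a random run-time."""
    rng = np.random.default_rng(seed)
    time.sleep(rng.gamma(shape=2.0, scale=0.05))  # straggler-prone compute time
    grad = params + rng.normal(scale=0.1, size=params.shape)  # noisy toy gradient
    out_q.put((worker_id, grad))


def predict_cutoff(runtime_history, quantile=0.8):
    """Stand-in for the learned run-time model: wait for roughly the fastest 80%."""
    if not runtime_history:
        return 0.5  # conservative default before any observations
    return float(np.quantile(runtime_history, quantile))


def bdsgd_style_step(params, n_workers, runtime_history, lr=0.1, seed=0):
    out_q = queue.Queue()
    start = time.time()
    for w in range(n_workers):
        threading.Thread(target=worker_gradient,
                         args=(w, params, out_q, seed * 1000 + w),
                         daemon=True).start()

    cutoff = predict_cutoff(runtime_history)
    grads = []
    while True:
        remaining = cutoff - (time.time() - start)
        if remaining <= 0:
            break  # cutoff reached: remaining workers are treated as stragglers
        try:
            _, g = out_q.get(timeout=remaining)
            grads.append(g)
            runtime_history.append(time.time() - start)
        except queue.Empty:
            break

    if grads:  # aggregate only the gradients that beat the cutoff
        params = params - lr * np.mean(grads, axis=0)
    return params


params = np.ones(4)
history = []
for step in range(5):
    params = bdsgd_style_step(params, n_workers=8, runtime_history=history, seed=step)
print(params)
```

In this sketch the cutoff adapts only through an empirical quantile of observed arrival times; the paper's contribution is to replace that heuristic with posterior predictive run-times from a cluster-specific generative model, so the cutoff tracks the actual straggler distribution rather than a static rule.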