 elastic averaging sgd


Deep learning with Elastic Averaging SGD

Neural Information Processing Systems

We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers) is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e., it allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to improved performance. We propose synchronous and asynchronous variants of the new algorithm.
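The elastic coupling described above can be illustrated with a minimal sketch of a synchronous round. It assumes each worker takes a noisy gradient step plus a pull toward the center variable, and that the center moves toward the workers by the same elastic term. The function name `easgd_sync`, the 1-D quadratic objective, and all hyperparameter values are illustrative assumptions and do not come from the abstract.

```python
import random


def easgd_sync(grad, x0, workers=4, steps=200, eta=0.05, rho=0.5,
               noise=0.1, seed=0):
    """Synchronous elastic-averaging sketch on a 1-D objective.

    grad  : function returning the exact gradient at a point
    noise : std of Gaussian noise added to mimic stochastic gradients
    Returns the center variable and the list of local variables.
    """
    rng = random.Random(seed)
    alpha = eta * rho  # strength of the elastic pull per step
    # Workers start at different points around x0 (assumed initialization).
    local = [x0 + rng.uniform(-1.0, 1.0) for _ in range(workers)]
    center = x0
    for _ in range(steps):
        # All elastic differences use values from the start of the round.
        diffs = [x - center for x in local]
        for i in range(workers):
            g = grad(local[i]) + rng.gauss(0.0, noise)
            # Gradient step plus pull toward the center variable.
            local[i] -= eta * g + alpha * diffs[i]
        # The center is pulled toward the workers by the same force.
        center += alpha * sum(diffs)
    return center, local


if __name__ == "__main__":
    # Toy objective f(x) = (x - 3)^2 / 2, whose gradient is x - 3.
    center, local = easgd_sync(lambda x: x - 3.0, x0=0.0, steps=500)
    print(center, local)
```

A smaller `rho` weakens the pull, so the local variables can drift further from the center, which is the sense of "exploration" used in the abstract.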


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

This paper studies general-purpose training algorithms for deep learning and proposes a family of algorithms called elastic averaging SGD. The idea is novel and the paper is of very high quality. The paper focuses on training large-scale deep learning models under communication constraints. This problem is difficult because non-convex problems such as those in deep learning have many local optima. The optimization problem is formulated as a global variable consensus problem so that local workers do not fall into different local optima, and its gradient update rules are then reinterpreted as elastic forces between the local and global parameters.
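For reference, the consensus formulation and the resulting updates can be written as follows. This is a sketch under assumed notation, none of which appears in the text above: p workers with local variables x^i, center variable \tilde{x}, penalty ρ, learning rate η, and stochastic gradient g^i_t.

```latex
% Global variable consensus objective with a quadratic penalty
\min_{x^1,\dots,x^p,\,\tilde{x}} \;
  \sum_{i=1}^{p} \mathbb{E}\!\left[ f(x^i, \xi^i) \right]
  + \frac{\rho}{2} \left\| x^i - \tilde{x} \right\|^2

% Gradient steps on this objective; \rho (x^i_t - \tilde{x}_t) is the elastic force
x^i_{t+1} = x^i_t - \eta \left( g^i_t(x^i_t) + \rho \, (x^i_t - \tilde{x}_t) \right)

\tilde{x}_{t+1} = \tilde{x}_t + \eta \sum_{i=1}^{p} \rho \, (x^i_t - \tilde{x}_t)
```

The penalty term pulls each local variable toward the center with a force proportional to their distance, and the center is pulled toward the local variables by the same force.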

