GoSGD: Distributed Optimization for Deep Learning with Gossip Exchange
Michael Blot, David Picard, Matthieu Cord
With the deep convolutional neural networks (CNNs) introduced by [1] and [2], computer vision tasks, and image classification in particular, have improved considerably in the years following [3]. CNN performance benefits greatly from large collections of annotated images such as [4] or [5]. These networks are trained by optimizing a loss function with gradient descent steps computed on random mini-batches, following [6]. This method, called stochastic gradient descent (SGD), has proved very efficient for training neural networks in general. However, current CNN architectures are extremely deep, like the 100-layer ResNet of [7], and contain a large number of parameters (around 60M for AlexNet [3] and 130M for VGG [8]). Such architectures involve heavy gradient computations, which makes training on large datasets very slow. Computation on GPUs accelerates training but requires large local memory caches. Nevertheless, mini-batch optimization seems well suited to distributing the training over several threads. Many methods have been proposed, such as [9, 10], which distribute the batches over different threads, called workers, that periodically exchange information via a central thread to synchronize their models.
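To make the centralized scheme concrete, the following is a minimal sketch of mini-batch SGD with several workers that are periodically averaged by a central node. It is an illustration under stated assumptions, not the algorithm of [9, 10]: the workers are simulated sequentially in a single process, the model is a single scalar weight fitted to synthetic data, and the names `minibatch_gradient`, `train_with_central_sync`, and `sync_every` are hypothetical.

```python
import random


def minibatch_gradient(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over one mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def train_with_central_sync(data, n_workers=4, steps=200, batch_size=8,
                            lr=0.5, sync_every=10, seed=0):
    """Simulate workers running SGD, averaged by a central node every sync_every steps."""
    rng = random.Random(seed)
    workers = [0.0] * n_workers  # one scalar model per worker (toy assumption)
    for step in range(1, steps + 1):
        # Each worker takes one SGD step on its own random mini-batch.
        for i in range(n_workers):
            batch = rng.sample(data, batch_size)
            workers[i] -= lr * minibatch_gradient(workers[i], batch)
        # Periodically, the central node averages all models and broadcasts the result.
        if step % sync_every == 0:
            central = sum(workers) / n_workers
            workers = [central] * n_workers
    return sum(workers) / n_workers


# Synthetic, noise-free data drawn from y = 3x (an assumption for illustration).
data_rng = random.Random(1)
data = [(x, 3.0 * x) for x in (data_rng.uniform(-1.0, 1.0) for _ in range(256))]
w = train_with_central_sync(data)
```

In this sketch every synchronization passes through the central average, which is the step a real deployment would implement as communication between each worker and the central thread.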