Is Local SGD Better than Minibatch SGD?
Woodworth, Blake; Patel, Kumar Kshitij; Stich, Sebastian U.; Dai, Zhen; Bullins, Brian; McMahan, H. Brendan; Shamir, Ohad; Srebro, Nathan
It is often important to leverage parallelism in order to tackle large scale stochastic optimization problems. A prime example is the task of minimizing the loss of machine learning models with millions or billions of parameters over enormous training sets. One popular distributed approach is local stochastic gradient descent (SGD) (Zinkevich et al., 2010; Coppola, 2015; Zhou and Cong, 2018; Stich, 2018), also known as "parallel SGD" or "Federated Averaging" (McMahan et al., 2016), which is commonly applied to large scale convex and non-convex stochastic optimization problems, including in data center and "Federated Learning" settings (Kairouz et al., 2019). Local SGD uses M parallel workers which, in each of R rounds, independently execute K steps of SGD starting from a common iterate, and then communicate and average their iterates to obtain the common iterate from which the next round begins. Overall, each machine computes T = KR stochastic gradients and executes KR SGD steps locally, for a total of N = KRM overall stochastic gradients computed (and so N = KRM samples used), with R rounds of communication (every K steps of computation).
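The local SGD scheme described above can be sketched in a few lines. The following is a minimal illustration, not the paper's experimental code: the quadratic objective and all hyperparameter values are assumptions chosen only to make the example self-contained. It shows M workers each running K local SGD steps per round, averaging their iterates between rounds, for R rounds, so that each worker computes T = KR gradients and N = KRM gradients are computed in total.

```python
import numpy as np

def local_sgd(grad_oracle, d, M, K, R, lr, rng):
    """Local SGD sketch: M workers, R rounds, K local steps per round."""
    x = np.zeros(d)  # common iterate shared by all workers
    for _ in range(R):
        local_iterates = []
        for _ in range(M):
            w = x.copy()
            for _ in range(K):
                w -= lr * grad_oracle(w, rng)  # one local SGD step
            local_iterates.append(w)
        x = np.mean(local_iterates, axis=0)  # communicate and average
    return x

# Toy problem (assumed for illustration): minimize E[0.5 * ||x - z||^2]
# with z ~ N(mu, I); a stochastic gradient at x is x - z.
mu = np.array([1.0, -2.0])

def grad(x, rng):
    z = mu + rng.standard_normal(x.shape)
    return x - z

rng = np.random.default_rng(0)
x_hat = local_sgd(grad, d=2, M=8, K=10, R=20, lr=0.1, rng=rng)
# x_hat should lie close to the minimizer mu, with the averaging
# across the M workers reducing the stochastic-gradient noise.
```

The natural baseline the title alludes to, minibatch SGD, would instead take KR steps, each using a minibatch of M gradients evaluated at the same shared iterate, using the same total budget of N = KRM gradients and R communication rounds.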
Feb-18-2020