Is Local SGD Better than Minibatch SGD?
Woodworth, Blake; Patel, Kumar Kshitij; Stich, Sebastian U.; Dai, Zhen; Bullins, Brian; McMahan, H. Brendan; Shamir, Ohad; Srebro, Nathan
It is often important to leverage parallelism in order to tackle large scale stochastic optimization problems. A prime example is the task of minimizing the loss of machine learning models with millions or billions of parameters over enormous training sets. One popular distributed approach is local stochastic gradient descent (SGD) (Zinkevich et al., 2010; Coppola, 2015; Zhou and Cong, 2018; Stich, 2018), also known as "parallel SGD" or "Federated Averaging" (McMahan et al., 2016), which is commonly applied to large scale convex and non-convex stochastic optimization problems, including in data center and "Federated Learning" settings (Kairouz et al., 2019). Local SGD uses M parallel workers which, in each of R rounds, independently execute K steps of SGD starting from a common iterate, and then communicate and average their iterates to obtain the common iterate from which the next round begins. Overall, each machine computes T = KR stochastic gradients and executes KR SGD steps locally, for a total of N = KRM overall stochastic gradients computed (and so N = KRM samples used), with R rounds of communication (every K steps of computation).
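The local SGD scheme described above can be sketched in a few lines. The following is a minimal illustration, not the paper's experimental code: the quadratic objective and all hyperparameter values are assumptions chosen only to make the example self-contained. It shows M workers each running K local SGD steps per round, averaging their iterates between rounds, for R rounds, so that each worker computes T = KR gradients and N = KRM gradients are computed in total.

```python
import numpy as np

def local_sgd(grad_oracle, d, M, K, R, lr, rng):
    """Local SGD sketch: M workers, R rounds, K local steps per round."""
    x = np.zeros(d)  # common iterate shared by all workers
    for _ in range(R):
        local_iterates = []
        for _ in range(M):
            w = x.copy()
            for _ in range(K):
                w -= lr * grad_oracle(w, rng)  # one local SGD step
            local_iterates.append(w)
        x = np.mean(local_iterates, axis=0)  # communicate and average
    return x

# Toy problem (assumed for illustration): minimize E[0.5 * ||x - z||^2]
# with z ~ N(mu, I); a stochastic gradient at x is x - z.
mu = np.array([1.0, -2.0])

def grad(x, rng):
    z = mu + rng.standard_normal(x.shape)
    return x - z

rng = np.random.default_rng(0)
x_hat = local_sgd(grad, d=2, M=8, K=10, R=20, lr=0.1, rng=rng)
# x_hat should lie close to the minimizer mu, with the averaging
# across the M workers reducing the stochastic-gradient noise.
```

The natural baseline the title alludes to, minibatch SGD, would instead take KR steps, each using a minibatch of M gradients evaluated at the same shared iterate, using the same total budget of N = KRM gradients and R communication rounds.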
Feb-18-2020