Escaping Saddle Points with Compressed SGD
–Neural Information Processing Systems
Stochastic Gradient Descent (SGD) and its variants are the main workhorses of modern machine learning. Distributed implementations of SGD on a cluster of machines with a central server and a large number of workers are frequently used in practice due to the massive size of the data. In distributed SGD, each machine holds a copy of the model and the computation proceeds in rounds. In every round, each worker computes a stochastic gradient based on its batch of examples; the server averages these stochastic gradients to obtain the gradient of the entire batch, makes an SGD step, and broadcasts the updated model parameters to the workers. With a large number of workers, computation parallelizes efficiently while communication becomes the main bottleneck [Chilimbi et al., 2014, Strom, 2015], since each worker needs to send its gradients to the server and receive the updated model parameters. Common solutions to this problem include: local SGD and its variants, in which each machine performs multiple local steps before communication [Stich, 2018]; decentralized architectures, which allow pairwise communication between the workers [McMahan et al., 2017]; and gradient compression, in which a compressed version of the gradient is communicated instead of the full gradient [Bernstein et al., 2018, Stich et al., 2018, Karimireddy et al., 2019]. In this work, we consider the latter approach, which we refer to as compressed SGD. Most machine learning models can be described by a d-dimensional vector of parameters x, and the model quality can be estimated as a function f(x).
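The round structure described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a top-k sparsifier as the compressor (one common choice in the cited gradient-compression literature) and simulates the workers' noisy stochastic gradients on a toy quadratic objective; the function names `topk_compress` and `compressed_sgd_round` are illustrative, not from the source.

```python
import numpy as np

def topk_compress(g, k):
    """Keep only the k largest-magnitude coordinates of g (a common compressor);
    only these k values and their indices would need to be communicated."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def compressed_sgd_round(x, worker_grads, lr=0.1, k=2):
    """One round of compressed SGD: each worker compresses its stochastic
    gradient, the server averages the compressed gradients and takes an
    SGD step on the shared model parameters x."""
    compressed = [topk_compress(g, k) for g in worker_grads]
    avg = np.mean(compressed, axis=0)
    return x - lr * avg

# Toy objective f(x) = ||x||^2 / 2, whose gradient at x is x itself;
# each of 4 workers sees the gradient perturbed by small Gaussian noise.
rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 3.0, 0.5])
for _ in range(100):
    grads = [x + 0.01 * rng.standard_normal(x.size) for _ in range(4)]
    x = compressed_sgd_round(x, grads, lr=0.1, k=2)
print(np.linalg.norm(x))
```

Each worker here transmits only k of the d coordinates per round, which is the communication saving that motivates compressed SGD; the norm of x shrinks toward the noise floor over the rounds.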