Escaping Saddle Points with Compressed SGD
Stochastic Gradient Descent (SGD) and its variants are the main workhorses of modern machine learning. Distributed implementations of SGD on a cluster of machines with a central server and a large number of workers are frequently used in practice due to the massive size of the data. In distributed SGD, each machine holds a copy of the model and the computation proceeds in rounds. In every round, each worker computes a stochastic gradient based on its batch of examples; the server averages these stochastic gradients to obtain the gradient of the entire batch, makes an SGD step, and broadcasts the updated model parameters to the workers. With a large number of workers, computation parallelizes efficiently while communication becomes the main bottleneck [Chilimbi et al., 2014, Strom, 2015], since each worker needs to send its gradients to the server and receive the updated model parameters. Common solutions to this problem include: local SGD and its variants, where each machine performs multiple local steps before communicating [Stich, 2018]; decentralized architectures, which allow pairwise communication between the workers [McMahan et al., 2017]; and gradient compression, where a compressed version of the gradient is communicated instead of the full gradient [Bernstein et al., 2018, Stich et al., 2018, Karimireddy et al., 2019]. In this work, we consider the latter approach, which we refer to as compressed SGD. Most machine learning models can be described by a d-dimensional vector of parameters x, and the model quality can be estimated as a function f(x).
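The round structure described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a generic top-k sparsifier as the compressor, and the helper names (`top_k`, `compressed_sgd_round`) are hypothetical.

```python
import numpy as np

def top_k(g, k):
    """Illustrative compressor: keep the k largest-magnitude
    coordinates of g and zero out the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def compressed_sgd_round(x, worker_grads, lr=0.1, k=2):
    """One round of compressed SGD: each worker sends a compressed
    stochastic gradient; the server averages them and takes a step."""
    compressed = [top_k(g, k) for g in worker_grads]
    avg = np.mean(compressed, axis=0)
    return x - lr * avg
```

Each worker now transmits only k coordinate/value pairs per round instead of a dense d-dimensional vector, which is the communication saving the abstract refers to.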
ErrorCompensatedX: error compensation for variance reduced algorithms
Communication cost is a major bottleneck for the scalability of distributed learning. One approach to reducing it is to compress the gradient during communication. However, directly compressing the gradient decelerates convergence, and the resulting algorithm may diverge under biased compression. Recent work addressed this problem for stochastic gradient descent by adding back the compression error from the previous step. This idea was further extended to one class of variance reduced algorithms, in which the variance of the stochastic gradient is reduced by taking a moving average over all historical gradients. However, our analysis shows that just adding back the previous step's compression error, as done in existing work, does not fully compensate for the compression error. We therefore propose ErrorCompensatedX, which uses the compression error from the previous two steps. We show that ErrorCompensatedX achieves the same asymptotic convergence rate as training without compression. Moreover, we provide a unified theoretical analysis framework for this class of variance reduced algorithms, with or without error compensation.
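The one-step error feedback that the abstract says prior work uses (and that ErrorCompensatedX improves on) can be sketched as below. This is a generic error-feedback step, not the paper's two-step scheme; the compressor and function names are illustrative assumptions.

```python
import numpy as np

def top_k(g, k):
    """Illustrative biased compressor: keep the k largest-magnitude
    coordinates and drop the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def ef_sgd_step(x, grad, error, lr=1.0, k=1):
    """One error-feedback step: compress (update + carried error),
    transmit the compressed vector, and carry forward what was lost
    so it can be added back at the next step."""
    corrected = lr * grad + error      # add back previous compression error
    transmitted = top_k(corrected, k)  # what is actually communicated
    new_error = corrected - transmitted
    return x - transmitted, new_error
```

The carried residual guarantees that coordinates dropped by the biased compressor are eventually applied rather than lost, which is what keeps the iteration from diverging; ErrorCompensatedX extends this idea by compensating with the errors of the previous two steps.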