compression
Supervised learning through the lens of compression
This work continues the study of the relationship between sample compression schemes and statistical learning, which has mostly been investigated within the framework of binary classification. We first extend the investigation to multiclass categorization: we prove that in this case learnability is equivalent to compression of logarithmic sample size, and that the uniform convergence property implies compression of constant size. We use the compressibility-learnability equivalence (i) to show that, for multiclass categorization, PAC and agnostic PAC learnability are equivalent, and (ii) to derive a compactness theorem for learnability. We then consider supervised learning under general loss functions: we show that in this case, in order to maintain the compressibility-learnability equivalence, it is necessary to consider an approximate variant of compression. We use this notion to show that PAC and agnostic PAC learnability are not equivalent, even when the loss function takes only three values.
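To make the notion of a sample compression scheme concrete, here is a standard illustrative example (not taken from the paper): one-dimensional threshold classifiers admit a compression scheme of size one. The function names and the toy sample below are purely illustrative.

```python
# Illustrative sketch (a classical example, not from the paper): a size-1
# sample compression scheme for 1-D thresholds h_t(x) = 1 if x >= t else 0.
# The compressor keeps only the positively labeled point with the smallest x;
# the reconstructor places the threshold at that point (or at +infinity if no
# positive example exists). On any sample realizable by some threshold, the
# reconstructed hypothesis is consistent with every example in the sample.

def compress(sample):
    """Keep the single (x, 1) pair with the smallest x, if any."""
    positives = [(x, y) for x, y in sample if y == 1]
    return [min(positives)] if positives else []

def reconstruct(kept):
    """Rebuild a threshold classifier from the kept sub-sample."""
    threshold = kept[0][0] if kept else float("inf")
    return lambda x: 1 if x >= threshold else 0

sample = [(0.2, 0), (1.5, 1), (0.9, 0), (2.3, 1)]
h = reconstruct(compress(sample))
assert all(h(x) == y for x, y in sample)  # consistency on a realizable sample
```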
Communication Compression for Decentralized Training
Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: {\em communication compression} for low-bandwidth networks, and {\em decentralization} for high-latency networks. In this paper, we explore a natural question: {\em can the combination of both techniques lead to a system that is robust to both bandwidth and latency?} Although the system implication of such a combination is straightforward, the underlying theoretical principle and algorithm design are challenging: unlike in centralized algorithms, simply compressing the exchanged information within a decentralized network, even in an unbiased stochastic way, accumulates error and causes divergence. In this paper, we develop a framework for quantized, decentralized training and propose two different strategies, which we call {\em extrapolation compression} and {\em difference compression}. We analyze both algorithms and prove that both converge at the rate of $O(1/\sqrt{nT})$, where $n$ is the number of workers and $T$ is the number of iterations, matching the convergence rate of full-precision, centralized training. We validate our algorithms and find that our proposed algorithms significantly outperform the best of merely decentralized and merely quantized algorithms on networks with {\em both} high latency and low bandwidth.
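To illustrate the difference-compression idea, here is a minimal NumPy sketch: workers on a ring keep a replica of their last transmitted state, broadcast only a quantized difference relative to that replica, and apply the same quantized update on both sender and receiver sides, so quantization error does not accumulate unboundedly. The ring topology, the stochastic-rounding quantizer, and all names (`quantize`, `decentralized_step`, `neighbor_weight`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize(v, levels=256):
    """Unbiased stochastic rounding to a uniform grid (assumed quantizer)."""
    scale = np.max(np.abs(v)) + 1e-12
    g = v / scale * (levels // 2)
    low = np.floor(g)
    g = low + (np.random.rand(*g.shape) < (g - low))  # randomized rounding
    return g * scale / (levels // 2)

def decentralized_step(x, replicas, grads, lr=0.1, neighbor_weight=0.5):
    """One round of quantized decentralized SGD on a ring of n workers.

    x        : (n, d) local models
    replicas : (n, d) each worker's copy of its own last transmitted state
    grads    : (n, d) local stochastic gradients
    """
    n = x.shape[0]
    # Every worker broadcasts a quantized *difference* from its replica;
    # sender and receivers apply the same update, keeping replicas in sync.
    q_diff = np.stack([quantize(x[i] - replicas[i]) for i in range(n)])
    replicas = replicas + q_diff
    x_new = np.empty_like(x)
    for i in range(n):
        left, right = (i - 1) % n, (i + 1) % n
        avg_neighbors = 0.5 * (replicas[left] + replicas[right])
        # gossip-average with neighbors, then take a local gradient step
        x_new[i] = ((1 - neighbor_weight) * replicas[i]
                    + neighbor_weight * avg_neighbors
                    - lr * grads[i])
    return x_new, replicas
```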