Sparsified SGD with Memory
Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works have proposed quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by sending only the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes show very promising performance in practice, they have so far eluded theoretical analysis. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the good scalability for distributed applications.
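A minimal sketch of the scheme the abstract describes may help: top-k sparsification of the gradient combined with an error-compensation memory that carries forward the entries that were not communicated. The oracle name `stochastic_grad`, the least-squares usage example, and all hyperparameters are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def sparsified_sgd_with_memory(stochastic_grad, x0, lr=0.1, k=5, steps=1000):
    """Illustrative sketch: SGD where only k entries per step are 'communicated'."""
    x = x0.copy()
    memory = np.zeros_like(x0)            # accumulated compression error
    for _ in range(steps):
        g = stochastic_grad(x)
        update = memory + lr * g          # add back previously dropped mass
        sparse_update = top_k(update, k)  # only these k entries would be sent
        memory = update - sparse_update   # remember what was not sent
        x -= sparse_update
    return x

# Toy usage: least squares with the full gradient standing in for the oracle.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(200, 50)), rng.normal(size=200)
grad = lambda x: A.T @ (A @ x - b) / len(b)
x_hat = sparsified_sgd_with_memory(grad, np.zeros(50), lr=0.05, k=5, steps=2000)
```

Without the `memory` term, small-but-persistent coordinates would never be transmitted; the error memory guarantees that dropped mass is eventually applied, which is what restores the vanilla SGD rate.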
Scalable Robust Matrix Factorization with Nonconvex Loss
Robust matrix factorization (RMF), which uses the $\ell_1$-loss, often outperforms standard matrix factorization using the $\ell_2$-loss, particularly when outliers are present. The state-of-the-art RMF solver is the RMF-MM algorithm, which, however, cannot utilize data sparsity. Moreover, sometimes even the (convex) $\ell_1$-loss is not robust enough. In this paper, we propose the use of nonconvex loss to enhance robustness. To address the resultant difficult optimization problem, we use majorization-minimization (MM) optimization and propose a new MM surrogate. To improve scalability, we exploit data sparsity and optimize the surrogate via its dual with the accelerated proximal gradient algorithm. The resultant algorithm has low time and space complexities and is guaranteed to converge to a critical point. Extensive experiments demonstrate its superiority over the state-of-the-art in terms of both accuracy and scalability.
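The abstract does not specify which nonconvex loss is used, so as one hedged illustration, the sketch below contrasts the $\ell_2$ and $\ell_1$ losses with the Welsch loss, a common nonconvex robust loss whose bounded value caps the influence of any single outlier on the factorization residuals; it is not the paper's actual loss or its MM surrogate.

```python
import numpy as np

def l2_loss(r):
    return r ** 2                 # grows quadratically: outliers dominate

def l1_loss(r):
    return np.abs(r)              # grows linearly: more robust

def welsch_loss(r, sigma=1.0):
    # Saturates as |r| -> inf, so one outlier contributes at most sigma^2 / 2.
    return (sigma ** 2 / 2) * (1.0 - np.exp(-(r / sigma) ** 2))

# Residuals r = X - U V^T would be fed through one of these losses elementwise.
r = np.linspace(-10, 10, 5)
print(l2_loss(r), l1_loss(r), welsch_loss(r), sep="\n")
```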
Learning Latent Process from High-Dimensional Event Sequences via Efficient Sampling
Qitian Wu, Zixuan Zhang, Xiaofeng Gao, Junchi Yan, Guihai Chen
There are plenty of previous studies targeting this problem from different aspects. For temporal point processes, a great number of works [3, 13, 15, 16, 28] attempt to model the intensity function from statistical views, and recent studies harness deep recurrent models [24], generative adversarial networks [23], and reinforcement learning [19, 18] to learn the temporal process. These studies mainly focus on one-dimensional event sequences where each event possesses the same marker.
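As a minimal sketch of what "modeling the intensity function" means here, the snippet below evaluates a classical exponential-kernel Hawkes intensity, one instance of the statistical models cited above; the parameter values are illustrative, and the cited works learn far richer intensities with recurrent, adversarial, and reinforcement-learning models.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))."""
    past = np.asarray([ti for ti in history if ti < t])
    return mu + alpha * np.exp(-beta * (t - past)).sum()

events = [1.0, 2.5, 2.7]               # observed event times
print(hawkes_intensity(3.0, events))   # self-excitation raises the rate after bursts
```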