Distributed Stochastic Optimization via Adaptive SGD
Cutkosky, Ashok, Busa-Fekete, Róbert
Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial, but the most popular algorithm, Stochastic Gradient Descent (SGD), is a serial method that is surprisingly hard to parallelize. In this paper, we propose an efficient distributed stochastic optimization method by combining adaptivity with variance reduction techniques. Our analysis yields a linear speedup in the number of machines, constant memory footprint, and only a logarithmic number of communication rounds. Critically, our approach is a black-box reduction that parallelizes any serial online learning algorithm, streamlining prior analysis and allowing us to leverage the significant progress that has been made in designing adaptive algorithms. In particular, we achieve optimal convergence rates without any prior knowledge of smoothness parameters, yielding a more robust algorithm that reduces the need for hyperparameter tuning. We implement our algorithm in the Spark distributed framework and exhibit dramatic performance gains on large-scale logistic regression problems.
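The abstract only sketches the approach, so the following is a minimal, hypothetical illustration of the general idea rather than the paper's actual algorithm: gradients are computed on data shards in parallel, combined into an SVRG-style variance-reduced estimate around an anchor point, and fed to an AdaGrad-style adaptive update. The function names (logistic_grad, distributed_full_grad, adaptive_svrg_epoch) and all parameter choices are assumptions made for illustration.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss on one data shard."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

def distributed_full_grad(w, shards):
    """Average of per-shard gradients; on a cluster this map would run in parallel."""
    return np.mean([logistic_grad(w, X, y) for X, y in shards], axis=0)

def adaptive_svrg_epoch(w, shards, inner_steps=100, eta=1.0, eps=1e-8, rng=None):
    """One outer round: a single synchronization computes the anchor gradient,
    then serial inner steps use variance-reduced gradients with AdaGrad scaling."""
    rng = rng or np.random.default_rng(0)
    w_anchor = w.copy()
    mu = distributed_full_grad(w_anchor, shards)   # the only communication in this round
    accum = np.zeros_like(w)                       # AdaGrad-style accumulator
    for _ in range(inner_steps):
        X, y = shards[rng.integers(len(shards))]
        g = logistic_grad(w, X, y) - logistic_grad(w_anchor, X, y) + mu
        accum += g * g
        w = w - eta * g / (np.sqrt(accum) + eps)
    return w
```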
Distributed Stochastic Optimization via Adaptive Stochastic Gradient Descent
Cutkosky, Ashok, Busa-Fekete, Róbert
Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial in many applications, but the most popular algorithm, Stochastic Gradient Descent (SGD), is a serial algorithm that is surprisingly hard to parallelize. In this paper, we propose an efficient distributed stochastic optimization method based on adaptive step sizes and variance reduction techniques. We achieve a linear speedup in the number of machines, a small memory footprint, and only a small number of synchronization rounds -- logarithmic in the dataset size -- in which the computation nodes communicate with each other. Critically, our approach is a general reduction that parallelizes any serial SGD algorithm, allowing us to leverage the significant progress that has been made in designing adaptive SGD algorithms. We conclude by implementing our algorithm in the Spark distributed framework and exhibiting dramatic performance gains on large-scale logistic regression problems.
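Continuing the illustrative sketch above (and not the paper's Spark implementation), a hypothetical driver loop on synthetic sharded logistic regression data could run a number of outer rounds that grows only logarithmically with the dataset size, paying one synchronization per round. The shard construction and round count below are assumptions for demonstration.

```python
import numpy as np

# Synthetic logistic regression data split into shards (stand-ins for cluster partitions).
rng = np.random.default_rng(1)
n, d, n_shards = 10_000, 20, 8
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)
shards = list(zip(np.array_split(X, n_shards), np.array_split(y, n_shards)))

# O(log n) outer rounds, each with a single synchronization step.
w = np.zeros(d)
for _ in range(int(np.log2(n))):
    w = adaptive_svrg_epoch(w, shards, inner_steps=200, rng=rng)
```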