AITopics | Gradient Descent


Gradient Descent


I wasn't getting hired as a Data Scientist. So I sought data on who is.

#artificialintelligence, Sep-10-2019, 02:46:30 GMT

At the time I'm writing this, every single trending article on my Towards Data Science home page is about applying or learning a particular skill in data science. At the top are big-picture skills such as How to Work With Stakeholders as a Data Scientist and How to Become a Data Engineer, followed by a litany of very specific skills, including technical primers on Batch Gradient Descent vs. Stochastic Gradient Descent, Multi-Class Text Classification, Faster R-CNN, et cetera. Since Towards Data Science is a dedicated Medium platform for "sharing concepts, ideas, and codes" in data science, it is not surprising that such learning resources are highly popular among its followers, who are probably navigating data-centric projects and professions. But to a novice looking to prioritize what is essential, it can quickly become daunting. Should one train to become a master Kaggler?

artificial intelligence, data scientist, machine learning, (17 more...)


Industry:

Education (1.00)
Information Technology > Services (0.70)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)


An Implicit Form of Krasulina's k-PCA Update without the Orthonormality Constraint

Amid, Ehsan, Warmuth, Manfred K.

arXiv.org Machine Learning, Sep-10-2019

We offer new insights into the two commonly used updates for the online $k$-PCA problem, namely Krasulina's and Oja's updates. We show that Krasulina's update corresponds to a projected gradient descent step on the Stiefel manifold of the orthonormal $k$-frames, while Oja's update amounts to a gradient descent step using the unprojected gradient. Following these observations, we derive a more implicit form of Krasulina's $k$-PCA update, i.e., a version that uses the information of the future gradient as much as possible. Most interestingly, our implicit Krasulina update avoids the costly QR-decomposition step by bypassing the orthonormality constraint. We show that the new update in fact corresponds to an online EM step applied to a probabilistic $k$-PCA model. The probabilistic view of the updates allows us to combine multiple models in a distributed setting. We show experimentally that the implicit Krasulina update yields superior convergence while being significantly faster. We also give strong evidence that the new update can benefit from parallelism and is more stable w.r.t. tuning of the learning rate.
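To make the contrast between the two classical updates concrete, here is a minimal sketch for the single-component case ($k = 1$) in their textbook forms. It is not the implicit update the abstract proposes. The function names, the synthetic two-dimensional Gaussian stream, and the step size are assumptions chosen for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def oja_step(w, x, eta):
    """Oja (k = 1): unprojected gradient step, then renormalize."""
    s = dot(x, w)
    w_new = [wi + eta * s * xi for wi, xi in zip(w, x)]
    norm = math.sqrt(dot(w_new, w_new))
    return [wi / norm for wi in w_new]


def krasulina_step(w, x, eta):
    """Krasulina (k = 1): step along the part of x x^T w orthogonal to w."""
    s = dot(x, w)
    scale = s / dot(w, w)
    return [wi + eta * s * (xi - scale * wi) for wi, xi in zip(w, x)]


def alignment(w, direction):
    """|cos| of the angle between w and a unit-norm direction."""
    return abs(dot(w, direction)) / math.sqrt(dot(w, w))


rng = random.Random(0)  # assumed seed, for reproducibility only
w_oja = [math.sqrt(0.5), math.sqrt(0.5)]
w_kra = list(w_oja)
for _ in range(2000):
    # Assumed data: variance 9 along e1 and 0.25 along e2, so the top
    # principal direction is e1 = (1, 0).
    x = [rng.gauss(0.0, 3.0), rng.gauss(0.0, 0.5)]
    w_oja = oja_step(w_oja, x, eta=0.01)
    w_kra = krasulina_step(w_kra, x, eta=0.01)

print(alignment(w_oja, [1.0, 0.0]), alignment(w_kra, [1.0, 0.0]))
```

Both iterates should end up closely aligned with the first coordinate axis. Note that the Krasulina step never renormalizes: its correction is orthogonal to the current vector, which is the $k = 1$ analogue of the projected gradient described above.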

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

1909.04803

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.87)


Better Communication Complexity for Local SGD

Khaled, Ahmed, Mishchenko, Konstantin, Richtárik, Peter

arXiv.org Machine Learning, Sep-10-2019

We revisit the local Stochastic Gradient Descent (local SGD) method and prove new convergence rates. We close the gap in the theory by showing that it works under unbounded gradients and extend its convergence to weakly convex functions. Furthermore, by changing the assumptions, we manage to get new bounds that explain in what regimes local SGD is faster than its non-local version. For instance, if the objective is strongly convex, we show that, up to constants, it is sufficient to synchronize $M$ times in total, where $M$ is the number of nodes. This improves upon the known requirement of Stich (2018) of $\sqrt{TM}$ synchronization times in total, where $T$ is the total number of iterations, which helps to explain the empirical success of local SGD.
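As a rough illustration of what "synchronizing" means here, the sketch below runs local SGD on a toy one-dimensional problem: each worker takes SGD steps on its own samples and the models are averaged every `sync_every` steps. The objective, the function name `local_sgd`, and all parameter values are assumptions for illustration, not the paper's experimental setup.

```python
import random


def local_sgd(num_workers, total_steps, sync_every, lr, seed=0):
    """Minimize E[(w - z)^2 / 2], z ~ N(3, 1), with periodic averaging."""
    rng = random.Random(seed)
    workers = [0.0] * num_workers
    sync_rounds = 0
    for t in range(1, total_steps + 1):
        # Each worker takes one SGD step on its own fresh sample.
        workers = [w - lr * (w - rng.gauss(3.0, 1.0)) for w in workers]
        if t % sync_every == 0:
            # Communication: replace every local model by the average.
            avg = sum(workers) / num_workers
            workers = [avg] * num_workers
            sync_rounds += 1
    return sum(workers) / num_workers, sync_rounds


w_local, rounds_local = local_sgd(4, 400, sync_every=10, lr=0.05)
w_sync, rounds_sync = local_sgd(4, 400, sync_every=1, lr=0.05)
print(w_local, rounds_local, w_sync, rounds_sync)
```

With `sync_every=1` this reduces to ordinary synchronous mini-batch SGD. The local variant uses a tenth of the communication rounds while still landing near the minimizer $w = 3$ on this toy problem.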

artificial intelligence, machine learning, tnull 2, (16 more...)

arXiv.org Machine Learning

1909.04746

Country: North America > United States (0.28)

Genre: Research Report (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)


Gradient Descent with Compressed Iterates

Khaled, Ahmed, Richtárik, Peter

arXiv.org Machine Learning, Sep-10-2019

We propose and analyze a new type of stochastic first order method: gradient descent with compressed iterates (GDCI). GDCI in each iteration first compresses the current iterate using a lossy randomized compression technique, and subsequently takes a gradient step. This method is a distillation of a key ingredient in the current practice of federated learning, where a model needs to be compressed by a mobile device before it is sent back to a server for aggregation. Our analysis provides a step towards closing the gap between the theory and practice of federated learning, and opens the possibility for many extensions.
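The compress-then-step loop described above can be sketched as follows. This is one straightforward reading of the abstract's sentence, and the paper's exact update and assumptions may differ. The compressor (unbiased stochastic rounding to a grid), the quadratic objective, and every name and parameter value are assumptions for illustration.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Unbiased lossy compressor: round x to a grid of width `step`."""
    low = math.floor(x / step) * step
    p_up = (x - low) / step  # probability of rounding up keeps E[C(x)] = x
    return low + step if rng.random() < p_up else low


def gdci(x0, target, lr, step, iters, seed=0):
    """Compress the iterate, then take a gradient step on 0.5*||x - target||^2."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        # Compressed iterate.
        y = [stochastic_round(xi, step, rng) for xi in x]
        # Gradient step taken from, and evaluated at, the compressed point.
        x = [yi - lr * (yi - ti) for yi, ti in zip(y, target)]
    return x


target = [1.234, -0.567, 3.141]
x_final = gdci([5.0, 5.0, 5.0], target, lr=0.5, step=0.1, iters=50)
errors = [abs(a - b) for a, b in zip(x_final, target)]
print(errors)
```

Because the compression is lossy, the iterates in this sketch settle into a neighborhood of the minimizer whose size is governed by the grid width rather than converging exactly.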

artificial intelligence, arxiv, machine learning, (14 more...)

arXiv.org Machine Learning

1909.04716

Country: Asia > Middle East > Saudi Arabia (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)


First Analysis of Local GD on Heterogeneous Data

Khaled, Ahmed, Mishchenko, Konstantin, Richtárik, Peter

arXiv.org Machine Learning, Sep-10-2019

We provide the first convergence analysis of local gradient descent for minimizing the average of smooth and convex but otherwise arbitrary functions. Problems of this form and local gradient descent as a solution method are of importance in federated learning, where each function is based on private data stored by a user on a mobile device, and the data of different users can be arbitrarily heterogeneous. We show that in a low accuracy regime, the method has the same communication complexity as gradient descent.
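A small sketch can show why heterogeneity matters for local gradient descent. Two workers hold different one-dimensional quadratics; each runs several exact gradient steps locally and the models are then averaged. The objectives, the function name `local_gd`, and the parameter values are assumptions for illustration and are not taken from the paper.

```python
def local_gd(curvatures, centers, lr, local_steps, rounds, w0=0.0):
    """Local GD on f_m(w) = 0.5 * a_m * (w - c_m)^2, averaged every round."""
    w = w0
    for _ in range(rounds):
        local_models = []
        for a, c in zip(curvatures, centers):
            w_m = w
            for _step in range(local_steps):
                w_m -= lr * a * (w_m - c)  # exact local gradient step
            local_models.append(w_m)
        w = sum(local_models) / len(local_models)  # one communication
    return w


a, c = [1.0, 4.0], [0.0, 1.0]
# Minimizer of the average objective (closed form for quadratics).
w_star = sum(ai * ci for ai, ci in zip(a, c)) / sum(a)
w_h1 = local_gd(a, c, lr=0.1, local_steps=1, rounds=200)
w_h10 = local_gd(a, c, lr=0.1, local_steps=10, rounds=200)
print(w_star, w_h1, w_h10)
```

With one local step per round the method coincides with plain gradient descent and reaches `w_star`. With ten local steps each worker drifts toward its own minimizer, so the averaged model stalls at a nearby but different point, which is the kind of accuracy limit that makes a "low accuracy regime" the natural setting for comparison.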

artificial intelligence, machine learning, tnull 2, (14 more...)

arXiv.org Machine Learning

1909.04715

Genre: Research Report (0.50)

Industry: Information Technology (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)


Byzantine-Resilient Stochastic Gradient Descent for Distributed Learning: A Lipschitz-Inspired Coordinate-wise Median Approach

Yang, Haibo, Zhang, Xin, Fang, Minghong, Liu, Jia

arXiv.org Machine Learning, Sep-10-2019

In this work, we consider the resilience of distributed algorithms based on stochastic gradient descent (SGD) in distributed learning with potentially Byzantine attackers, who could send arbitrary information to the parameter server to disrupt the training process. Toward this end, we propose a new Lipschitz-inspired coordinate-wise median approach (LICM-SGD) to mitigate Byzantine attacks. We show that our LICM-SGD algorithm can resist up to half of the workers being Byzantine attackers, while still converging almost surely to a stationary region in non-convex settings. Also, our LICM-SGD method does not require any information about the number of attackers and the Lipschitz constant, which makes it attractive for practical implementations. Moreover, our LICM-SGD method enjoys the optimal $O(md)$ computational time-complexity in the sense that the time-complexity is the same as that of the standard SGD under no attacks. We conduct extensive experiments to show that our LICM-SGD algorithm consistently outperforms existing methods in training multi-class logistic regression and convolutional neural networks with MNIST and CIFAR-10 datasets. In our experiments, LICM-SGD also achieves a much faster running time thanks to its low computational time-complexity.

Fueled by the rise of machine learning and big data analytics, recent years have witnessed an ever-increasing interest in solving large-scale empirical risk minimization (ERM) problems, a fundamental class of optimization problems that underpins a wide range of machine learning applications. In the post-Moore's-Law era, however, to sustain the rapidly growing computational power needs for solving large-scale ERM, the only viable solution is to exploit parallelism at and across different spatial scales. Indeed, the recent success of machine learning applications is due in large part to the use of distributed machine learning frameworks (e.g., TensorFlow [1] and others) which exploit the abundance of distributed CPU/GPU resources in large-scale computing clusters.
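The aggregation idea named in the abstract above can be illustrated with a plain coordinate-wise median. This sketch covers only that building block: the Lipschitz-inspired selection step of LICM-SGD is not reproduced, and the gradient values and function names are made up for illustration.

```python
from statistics import median


def coordinate_wise_median(gradients):
    """Aggregate worker gradients by taking the median of each coordinate."""
    return [median(column) for column in zip(*gradients)]


def coordinate_wise_mean(gradients):
    """Standard averaging, shown for comparison."""
    return [sum(column) / len(column) for column in zip(*gradients)]


honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1]]
byzantine = [[1000.0, 1000.0], [-1000.0, 500.0]]  # arbitrary attacker values
received = honest + byzantine

robust = coordinate_wise_median(received)
naive = coordinate_wise_mean(received)
print(robust, naive)
```

With two attackers among five workers, the median stays within the range of the honest gradients in every coordinate, while the mean is dragged far away by the outliers.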

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Machine Learning

1909.04532

Country: North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)


Communication-Censored Distributed Stochastic Gradient Descent

Li, Weiyu, Chen, Tianyi, Li, Liping, Ling, Qing

arXiv.org Machine Learning, Sep-9-2019

This paper develops a communication-efficient algorithm to solve the stochastic optimization problem defined over a distributed network, aiming at reducing the burdensome communication in applications such as distributed machine learning. Different from the existing works based on quantization and sparsification, we introduce a communication-censoring technique to reduce the transmissions of variables, which leads to our communication-Censored distributed Stochastic Gradient Descent (CSGD) algorithm. Specifically, in CSGD, the latest mini-batch stochastic gradient at a worker will be transmitted to the server only if it is sufficiently informative. When the latest gradient is not available, the stale one will be reused at the server. To implement this communication-censoring strategy, the batch sizes are increasing in order to alleviate the effect of gradient noise. Theoretically, CSGD enjoys the same order of convergence rate as that of SGD, but effectively reduces communication. Numerical experiments further demonstrate the sizable communication saving of CSGD.
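The censoring rule can be sketched in a few lines: a worker uploads its gradient only when it differs enough from the copy the server already holds, and otherwise the server reuses the stale copy. To keep the example reproducible it uses exact (noise-free) gradients, so the growing batch sizes mentioned above are not modelled. The geometric threshold schedule, the function name `censored_gd`, and all values are assumptions rather than the paper's choices.

```python
def censored_gd(centers, lr, rounds, tau0, decay, w0=10.0):
    """Server-side descent where workers skip uninformative gradient uploads."""
    w = w0
    last_sent = [None] * len(centers)  # gradient copies held by the server
    transmissions = 0
    for k in range(rounds):
        threshold = tau0 * decay ** k  # assumed schedule: shrinks each round
        for m, c in enumerate(centers):
            g = w - c  # gradient of 0.5 * (w - c)^2 at the current model
            if last_sent[m] is None or (g - last_sent[m]) ** 2 > threshold:
                last_sent[m] = g  # informative enough: transmit
                transmissions += 1
            # Otherwise the server keeps using the stale copy.
        w -= lr * sum(last_sent) / len(last_sent)
    return w, transmissions


w_final, sent = censored_gd([1.0, 2.0, 3.0], lr=0.2, rounds=200,
                            tau0=0.5, decay=0.9)
print(w_final, sent, 3 * 200)
```

On this toy problem the final model is close to the minimizer $w = 2$ of the average objective, while the number of uploads stays below the 600 that uncensored communication would need.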

artificial intelligence, iteration, machine learning, (18 more...)

arXiv.org Machine Learning

1909.03631

Country: Europe (0.28)

Genre: Research Report (0.50)

Industry: Law > Civil Rights & Constitutional Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)


A Stochastic Quasi-Newton Method with Nesterov's Accelerated Gradient

Indrapriyadarsini, S., Mahboubi, Shahrzad, Ninomiya, Hiroshi, Asai, Hideki

arXiv.org Machine Learning, Sep-8-2019

Incorporating second-order curvature information in gradient-based methods has been shown to improve convergence drastically despite its computational intensity. In this paper, we propose a stochastic (online) quasi-Newton method with Nesterov's accelerated gradient in both its full and limited memory forms for solving large-scale non-convex optimization problems in neural networks. The performance of the proposed algorithm is evaluated in Tensorflow on benchmark classification and regression problems. The results show improved performance compared to the classical second-order oBFGS and oLBFGS methods and popular first-order stochastic methods such as SGD and Adam. The performance with different momentum rates and batch sizes is also illustrated.

Keywords: Neural networks · stochastic method · online training · Nesterov's accelerated gradient · quasi-Newton method · limited memory · Tensorflow

1 Introduction

Neural networks have been shown to be effective in innumerable real-world applications.
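The abstract above names Nesterov's accelerated gradient as one ingredient of the proposed method. The sketch below shows only that first-order ingredient, in its common look-ahead momentum form, next to plain gradient descent on an ill-conditioned quadratic. It does not implement the quasi-Newton curvature update, and the test function, step size, and momentum value are assumptions for illustration.

```python
def grad(w):
    """Gradient of f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2)."""
    return [w[0], 100.0 * w[1]]


def loss(w):
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)


def gradient_descent(w, lr, steps):
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
    return w


def nesterov(w, lr, momentum, steps):
    v = [0.0] * len(w)
    for _ in range(steps):
        lookahead = [wi + momentum * vi for wi, vi in zip(w, v)]
        g = grad(lookahead)  # gradient evaluated at the look-ahead point
        v = [momentum * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w


start = [1.0, 1.0]
loss_gd = loss(gradient_descent(start, lr=0.009, steps=300))
loss_nag = loss(nesterov(start, lr=0.009, momentum=0.9, steps=300))
print(loss_gd, loss_nag)
```

With the same step size and iteration budget, the momentum variant reaches a far lower loss along the flat direction of this quadratic, which is the kind of acceleration the method aims to combine with curvature information.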

algorithm, nesterov, quasi-newton method, (16 more...)

arXiv.org Machine Learning

1909.03621

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Japan > Honshū > Chūbu > Shizuoka Prefecture > Shizuoka (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)

Genre: Research Report (0.70)

Industry:

Education > Educational Setting > Online (0.86)
Education > Educational Technology > Educational Software > Computer Based Training (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)


Lecture Notes: Optimization for Machine Learning

Hazan, Elad

arXiv.org Machine Learning, Sep-8-2019

Lecture notes on optimization for machine learning, derived from a course at Princeton University and tutorials given in MLSS, Buenos Aires, as well as Simons Foundation, Berkeley.

artificial intelligence, inductive learning, machine learning, (16 more...)

arXiv.org Machine Learning

1909.0355

Country:

Asia > Middle East > Israel (0.28)
North America > United States > California (0.27)
South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.24)

Genre: Instructional Material > Course Syllabus & Notes (1.00)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.93)
(3 more...)


Introduction to Online Convex Optimization

Hazan, Elad

arXiv.org Machine Learning, Sep-7-2019

This book was written as an advanced text to serve as a basis for a graduate course, and/or as a reference for the researcher diving into this fascinating world at the intersection of optimization and machine learning. Such a course was given at the Technion in the years 2010-2014 with slight variations from year to year, and later at Princeton University in the years 2015-2016. The core material in these courses is fully covered in this book, along with exercises that allow the students to complete parts of proofs, or that were found illuminating and thought-provoking. Most of the material is given with examples of applications, which are interlaced throughout different topics. These include prediction from expert advice, portfolio selection, matrix completion and recommendation systems, SVM training and more.

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Machine Learning

1909.05207

Country:

North America > United States (1.00)
Asia > Middle East > Israel (0.27)

Genre: Instructional Material > Course Syllabus & Notes (1.00)

Industry:

Leisure & Entertainment > Games (0.67)
Education > Educational Setting > Online (0.46)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.67)
(4 more...)
