AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Decentralized Frank-Wolfe Algorithm for Convex and Non-convex Problems

Wai, Hoi-To, Lafond, Jean, Scaglione, Anna, Moulines, Eric

arXiv.org Machine LearningMar-17-2017

Decentralized optimization algorithms have received much attention due to the recent advances in network information processing. However, conventional decentralized algorithms based on projected gradient descent are incapable of handling high dimensional constrained problems, as the projection step becomes computationally prohibitive to compute. To address this problem, this paper adopts a projection-free optimization approach, a.k.a.~the Frank-Wolfe (FW) or conditional gradient algorithm. We first develop a decentralized FW (DeFW) algorithm from the classical FW algorithm. The convergence of the proposed algorithm is studied by viewing the decentralized algorithm as an inexact FW algorithm. Using a diminishing step size rule and letting $t$ be the iteration number, we show that the DeFW algorithm's convergence rate is ${\cal O}(1/t)$ for convex objectives; is ${\cal O}(1/t^2)$ for strongly convex objectives with the optimal solution in the interior of the constraint set; and is ${\cal O}(1/\sqrt{t})$ towards a stationary point for smooth but non-convex objectives. We then show that a consensus-based DeFW algorithm meets the above guarantees with two communication rounds per iteration. Furthermore, we demonstrate the advantages of the proposed DeFW algorithm on low-complexity robust matrix completion and communication efficient sparse learning. Numerical results on synthetic and real data are presented to support our findings.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Machine Learning

doi: 10.1109/TAC.2017.2685559

1612.01216

Country: North America > United States (0.67)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Online Multilinear Dictionary Learning for Sequential Compressive Sensing

Variddhisaï, Thiernithi, Mandic, Danilo

arXiv.org Machine LearningMar-16-2017

A method for online tensor dictionary learning is proposed. With the assumption of separable dictionaries, tensor contraction is used to diminish a $N$-way model of $\mathcal{O}\left(L^N\right)$ into a simple matrix equation of $\mathcal{O}\left(NL^2\right)$ with a real-time capability. To avoid numerical instability due to inversion of sparse matrix, a class of stochastic gradient with memory is formulated via a least-square solution to guarantee convergence and robustness. Both gradient descent with exact line search and Newton's method are discussed and realized. Extensions onto how to deal with bad initialization and outliers are also explained in detail. Experiments on two synthetic signals confirms an impressive performance of our proposed method.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Machine Learning

1703.02492

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

Add feedback

Distribution-Specific Hardness of Learning Neural Networks

Shamir, Ohad

arXiv.org Machine LearningMar-9-2017

Although neural networks are routinely and successfully trained in practice using simple gradient-based methods, most existing theoretical results are negative, showing that learning such networks is difficult, in a worst-case sense over all data distributions. In this paper, we take a more nuanced view, and consider whether specific assumptions on the "niceness" of the input distribution, or "niceness" of the target function (e.g. in terms of smoothness, non-degeneracy, incoherence, random choice of parameters etc.), are sufficient to guarantee learnability using gradient-based methods. We provide evidence that neither class of assumptions alone is sufficient: On the one hand, for any member of a class of "nice" target functions, there are difficult input distributions. On the other hand, we identify a family of simple target functions, which are difficult to learn even if the input distribution is "nice". To prove our results, we develop some tools which may be of independent interest, such as extending Fourier-based hardness techniques developed in the context of statistical queries \cite{blum1994weakly}, from the Boolean cube to Euclidean space and to more general classes of functions.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

1609.01037

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Add feedback

Byzantine-Tolerant Machine Learning

Blanchard, Peva, Mhamdi, El Mahdi El, Guerraoui, Rachid, Stainer, Julien

arXiv.org Machine LearningMar-8-2017

The growth of data, the need for scalability and the complexity of models used in modern machine learning calls for distributed implementations. Yet, as of today, distributed machine learning frameworks have largely ignored the possibility of arbitrary (i.e., Byzantine) failures. In this paper, we study the robustness to Byzantine failures at the fundamental level of stochastic gradient descent (SGD), the heart of most machine learning algorithms. Assuming a set of $n$ workers, up to $f$ of them being Byzantine, we ask how robust can SGD be, without limiting the dimension, nor the size of the parameter space. We first show that no gradient descent update rule based on a linear combination of the vectors proposed by the workers (i.e, current approaches) tolerates a single Byzantine failure. We then formulate a resilience property of the update rule capturing the basic requirements to guarantee convergence despite $f$ Byzantine workers. We finally propose Krum, an update rule that satisfies the resilience property aforementioned. For a $d$-dimensional learning problem, the time complexity of Krum is $O(n^2 \cdot (d + \log n))$.

artificial intelligence, machine learning, vector, (16 more...)

arXiv.org Machine Learning

1703.02757

Genre: Research Report (0.40)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

Add feedback

Gradient Coding

Tandon, Rashish, Lei, Qi, Dimakis, Alexandros G., Karampatziakis, Nikos

arXiv.org Machine LearningMar-7-2017

We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for Synchronous Gradient Descent. We implement our schemes in python (using MPI) to run on Amazon EC2, and show how we compare against baseline approaches in running time and generalization error.

artificial intelligence, machine learning, straggler, (19 more...)

arXiv.org Machine Learning

1612.03301

Genre: Research Report (0.83)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

Free Data Science eBooks - March 2017

#artificialintelligenceMar-6-2017, 23:40:26 GMT

Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds.

artificial intelligence, free data science ebook, machine learning, (4 more...)

#artificialintelligence

Genre: Summary/Review (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)

Add feedback

Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data

Karl, Maximilian, Soelch, Maximilian, Bayer, Justin, van der Smagt, Patrick

arXiv.org Machine LearningMar-3-2017

We introduce Deep Variational Bayes Filters (DVBF), a new method for unsupervised learning and identification of latent Markovian state space models. Leveraging recent advances in Stochastic Gradient Variational Bayes, DVBF can overcome intractable inference distributions via variational inference. Thus, it can handle highly nonlinear input data with temporal and spatial dependencies such as image sequences without domain knowledge. Our experiments show that enabling backpropagation through transitions enforces state space assumptions and significantly improves information content of the latent embedding. This also enables realistic long-term prediction.

artificial intelligence, machine learning, transition, (18 more...)

arXiv.org Machine Learning

1605.06432

Country:

North America (0.46)
Europe > Germany (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Add feedback

How to Escape Saddle Points Efficiently

Jin, Chi, Ge, Rong, Netrapalli, Praneeth, Kakade, Sham M., Jordan, Michael I.

arXiv.org Machine LearningMar-2-2017

This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.

artificial intelligence, machine learning, stationary point, (15 more...)

arXiv.org Machine Learning

1703.00887

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Russia (0.04)
(3 more...)

Genre: Research Report > New Finding (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.93)

Add feedback

Diverse Neural Network Learns True Target Functions

Xie, Bo, Liang, Yingyu, Song, Le

arXiv.org Machine LearningMar-2-2017

Neural networks are a powerful class of functions that can be trained with simple gradient descent to achieve state-of-the-art performance on a variety of applications. Despite their practical success, there is a paucity of results that provide theoretical guarantees on why they are so effective. Lying in the center of the problem is the difficulty of analyzing the non-convex loss function with potentially numerous local minima and saddle points. Can neural networks corresponding to the stationary points of the loss function learn the true target function? If yes, what are the key factors contributing to such nice optimization properties? In this paper, we answer these questions by analyzing one-hidden-layer neural networks with ReLU activation, and show that despite the non-convexity, neural networks with diverse units have no spurious local minima. We bypass the non-convexity issue by directly analyzing the first order optimality condition, and show that the loss can be made arbitrarily small if the minimum singular value of the "extended feature matrix" is large enough. We make novel use of techniques from kernel methods and geometric discrepancy, and identify a new relation linking the smallest singular value to the spectrum of a kernel function associated with the activation function and to the diversity of the units. Our results also suggest a novel regularization function to promote unit diversity for potentially better generalization.

artificial intelligence, machine learning, neural network, (17 more...)

arXiv.org Machine Learning

1611.03131

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

Convergence rate of a simulated annealing algorithm with noisy observations

Bouttier, Clément, Gavra, Ioana

arXiv.org Machine LearningMar-1-2017

In this paper we propose a modified version of the simulated annealing algorithm for solving a stochastic global optimization problem. More precisely, we address the problem of finding a global minimizer of a function with noisy evaluations. We provide a rate of convergence and its optimized parametrization to ensure a minimal number of evaluations for a given accuracy and a confidence level close to 1. This work is completed with a set of numerical experimentations and assesses the practical performance both on benchmark test cases and on real world examples.

algorithm, artificial intelligence, machine learning, (19 more...)

arXiv.org Machine Learning

1703.00329

Country:

Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Transportation > Air (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback