AITopics

1803.09357

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)

arXiv.org Machine LearningMar-25-2018

Stochastic Variance Reduction for Policy Gradient Estimation

Xu, Tianbing, Liu, Qiang, Peng, Jian

Recent advances in policy gradient methods and deep learning have demonstrated their applicability for complex reinforcement learning problems. However, the variance of the performance gradient estimates obtained from the simulation is often excessive, leading to poor sample efficiency. In this paper, we apply the stochastic variance reduced gradient descent (SVRG) to model-free policy gradient to significantly improve the sample-efficiency. The SVRG estimation is incorporated into a trust-region Newton conjugate gradient framework for the policy optimization. On several Mujoco tasks, our method achieves significantly better performance compared to the state-of-the-art model-free policy gradient methods in robotic continuous control such as trust region policy optimization (TRPO)

artificial intelligence, iteration, machine learning, (17 more...)

1710.06034

Country: North America > United States > Illinois (0.14)

Genre: Research Report (0.70)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Chouza, Mariano, Roberts, Stephen, Zohren, Stefan

Gradient descent in Gaussian random fields as a toy model for high-dimensional optimisation in deep learning

arXiv.org Machine LearningMar-24-2018

Our aim is to study gradient descent in such loss functions or energy landscapes and compare it to results obtained from real high-dimensional optimization problems such as encountered in deep learning. In particular, we analyze the distribution of the improved loss function after a step of gradient descent, provide analytic expressions for the moments as well as prove asymptotic normality as the dimension of the parameter space becomes large. Moreover, we compare this with the expectation of the global minimum of the landscape obtained by means of the Euler characteristic of excursion sets. Besides complementing our analytical findings with numerical results from simulated Gaussian random fields, we also compare it to loss functions obtained from optimisation problems on synthetic and real data sets by proposing a "black box" random field toy-model for a deep neural network loss function.

artificial intelligence, machine learning, random field, (19 more...)

1803.09119

Genre: Research Report (0.41)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Xie, Cong, Koyejo, Oluwasanmi, Gupta, Indranil

Generalized Byzantine-tolerant SGD

We propose three new robust aggregation rules for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the workers in the parameter server (PS) architecture. We prove the Byzantine resilience properties of these aggregation rules. Empirical analysis shows that the proposed techniques outperform current approaches for realistic use cases and Byzantine attack scenarios.

artificial intelligence, dimension, machine learning, (13 more...)

1802.10116

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Multi-View Factorization Machines

Cao, Bokai, Zhou, Hucheng, Li, Guoqiang, Yu, Philip S.

For a learning task, data can usually be collected from different sources or be represented from multiple views. For example, laboratory results from different medical examinations are available for disease diagnosis, and each of them can only reflect the health state of a person from a particular aspect/view. Therefore, different views provide complementary information for learning tasks. An effective integration of the multi-view information is expected to facilitate the learning performance. In this paper, we propose a general predictor, named multi-view machines (MVMs), that can effectively include all the possible interactions between features from multiple views. A joint factorization is embedded for the full-order interaction parameters which allows parameter estimation under sparsity. Moreover, MVMs can work in conjunction with different loss functions for a variety of machine learning tasks. A stochastic gradient descent method is presented to learn the MVM model. We further illustrate the advantages of MVMs through comparison with other methods for multi-view classification, including support vector machines (SVMs), support tensor machines (STMs) and factorization machines (FMs).

artificial intelligence, interaction, machine learning, (19 more...)

doi: 10.1145/2835776.2835777

1506.0111

Country:

North America > United States (0.68)
Asia (0.46)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Alistarh, Dan, Allen-Zhu, Zeyuan, Li, Jerry

Byzantine Stochastic Gradient Descent

This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sampling complexity and time complexity.

algorithm, artificial intelligence, machine learning, (16 more...)

1803.08917

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Alistarh, Dan, De Sa, Christopher, Konstantinov, Nikola

The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory

Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks. Given the recent practical focus on distributed machine learning, significant work has been dedicated to the convergence properties of this algorithm under the inconsistent and noisy updates arising from execution in a distributed environment. However, surprisingly, the convergence properties of this classic algorithm in the standard shared-memory model are still not well-understood. In this work, we address this gap, and provide new convergence bounds for lock-free concurrent stochastic gradient descent, executing in the classic asynchronous shared memory model, against a strong adaptive adversary. Our results give improved upper and lower bounds on the "price of asynchrony" when executing the fundamental SGD algorithm in a concurrent setting. They show that this classic optimization tool can converge faster and with a wider range of parameters than previously known under asynchronous iterations. At the same time, we exhibit a fundamental trade-off between the maximum delay in the system and the rate at which SGD can converge, which governs the set of parameters under which this algorithm can still work efficiently.

algorithm, artificial intelligence, machine learning, (15 more...)

1803.08841

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Jentzen, Arnulf, von Wurstemberger, Philippe

Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates

arXiv.org Machine LearningMar-22-2018

The stochastic gradient descent (SGD) optimization algorithm plays a central role in machine learning and, in particular, deep learning applications such as image analysis and speech recognition (cf., e.g., [12, 13, 16, 23]). It is therefore important to analyze and quantify the convergence speed of the SGD method. There is a vast amount of scientific literature investigating and providing upper bounds for the SGD method and modifications of it (cf., e.g., [3, 4, 5, 6, 7, 8, 9, 10, 11, 18, 20, 21, 24] and cf., e.g., [14] for a more comprehensive review of the literature). Much less attention has been paid to proving lower error bounds for the SGD method, that is, to quantifying the best possible speed of convergence which the SGD method can achieve (cf., e.g., [2, 17, 19, 22, 25]). It is the key contribution of this paper to make a step in this direction.

artificial intelligence, lemma 4, machine learning, (20 more...)

1803.086

Country: Europe (0.28)

Genre:

Research Report (0.84)
Overview (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Maennel, Hartmut, Bousquet, Olivier, Gelly, Sylvain

Gradient Descent Quantizes ReLU Network Features

arXiv.org Machine LearningMar-22-2018

Deep neural networks are often trained in the over-parametrized regime (i.e. with far more parameters than training examples), and understanding why the training converges to solutions that generalize remains an open problem. Several studies have highlighted the fact that the training procedure, i.e. mini-batch Stochastic Gradient Descent (SGD) leads to solutions that have specific properties in the loss landscape. However, even with plain Gradient Descent (GD) the solutions found in the over-parametrized regime are pretty good and this phenomenon is poorly understood. We propose an analysis of this behavior for feedforward networks with a ReLU activation function under the assumption of small initialization and learning rate and uncover a quantization effect: The weight vectors tend to concentrate at a small number of directions determined by the input data. As a consequence, we show that for given input data there are only finitely many, "simple" functions that can be obtained, independent of the network size. This puts these functions in analogy to linear interpolations (for given input data there are finitely many triangulations, which each determine a function by linear interpolation). We ask whether this analogy extends to the generalization properties - while the usual distribution-independent generalization property does not hold, it could be that for e.g. smooth functions with bounded second derivative an approximation property holds which could "explain" generalization of networks (of unbounded size) to unseen inputs.

artificial intelligence, descent quantize relu network feature, machine learning

1803.08367

Genre: Research Report (0.40)

Industry:

Telecommunications > Networks (0.40)
Information Technology > Networks (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Machine LearningMar-22-2018

Fast Convergence for Stochastic and Distributed Gradient Descent in the Interpolation Limit

Mitra, Partha P

Modern supervised learning techniques, particularly those using so called deep nets, involve fitting high dimensional labelled data sets with functions containing very large numbers of parameters. Much of this work is empirical, and interesting phenomena have been observed that require theoretical explanations, however the non-convexity of the loss functions complicates the analysis. Recently it has been proposed that some of the success of these techniques resides in the effectiveness of the simple stochastic gradient descent algorithm in the so called interpolation limit in which all labels are fit perfectly. This analysis is made possible since the SGD algorithm reduces to a stochastic linear system near the interpolating minimum of the loss function. Here we exploit this insight by analyzing a distributed algorithm for gradient descent, also in the interpolating limit. The algorithm corresponds to gradient descent applied to a simple penalized distributed loss function, $L({\bf w}_1,...,{\bf w}_n) = \Sigma_i l_i({\bf w}_i) + \mu \sum_{}|{\bf w}_i-{\bf w}_j|^2$. Here each node is allowed its own parameter vector and $$ denotes edges of a connected graph defining the communication links between nodes. It is shown that this distributed algorithm converges linearly (ie the error reduces exponentially with iteration number), with a rate $1-\frac{\eta}{n}\lambda_{min}(H)

algorithm, artificial intelligence, machine learning, (15 more...)

1803.02922

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)