AITopics

1902.03718

Country:

North America > Canada > Ontario > Toronto (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.58)

arXiv.org Machine Learning Feb-9-2019

A stochastic version of Stein Variational Gradient Descent for efficient sampling

Li, Lei, Liu, Jian-Guo, Liu, Zibu, Lu, Jianfeng

The empirical measure with samples from some probability measure (which might be known only up to a multiplicative factor) has many applications in Bayesian inference [1, 2] and data assimilation [3]. A widely used class of sampling methods is Markov Chain Monte Carlo (MCMC), where the trajectory of a particle is given by a constructed Markov chain that leaves the desired distribution invariant. The trajectory of the particle is stochastic, and Monte Carlo methods take effect slowly for a small number of samples. Unlike MCMC, the Stein Variational Gradient Descent (SVGD) method (proposed by Liu and Wang in [4]) belongs to the particle-based variational inference sampling methods (see also [5, 6]). These methods update particles by solving optimization problems, and each iteration is expected to make progress. As a nonparametric variational inference method, SVGD gives a deterministic way to generate points that approximate the desired probability distribution by solving an ODE system.
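To make the deterministic particle update concrete, here is a minimal sketch of a plain SVGD iteration in one dimension. It is not the stochastic variant the paper proposes. The RBF kernel, the fixed bandwidth, the step size, the standard normal target, and the names `svgd_step` and `run_svgd` are all assumptions chosen for illustration.

```python
import math

def svgd_step(particles, grad_log_p, step=0.1, bandwidth=1.0):
    """One explicit Euler step of the SVGD particle ODE (1-D, RBF kernel)."""
    n = len(particles)
    h2 = bandwidth ** 2
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * h2))
            # First term: kernel-weighted score, pulls xi toward high density.
            # Second term: derivative of k with respect to xj, pushes xi away from xj.
            phi += k * grad_log_p(xj) + k * (xi - xj) / h2
        updated.append(xi + step * phi / n)
    return updated

def run_svgd(n_particles=10, n_steps=2000):
    # Assumed target: standard normal, so grad log p(x) = -x.
    particles = [2.0 + 0.2 * i for i in range(n_particles)]
    for _ in range(n_steps):
        particles = svgd_step(particles, lambda x: -x)
    return particles
```

Each step costs O(n^2) kernel evaluations because every particle interacts with every other one.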

batch size, particle system, rbm-svgd, (11 more...)

1902.03394

Country:

North America > United States > North Carolina > Durham County > Durham (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)

arXiv.org Machine Learning Feb-9-2019

On the convergence rate of stochastic proximal point algorithm without strong convexity, smoothness or bounded gradients

Patrascu, Andrei

Significant parts of the recent learning literature on stochastic optimization algorithms have focused on the theoretical and practical behaviour of stochastic first-order schemes under different convexity properties. Due to its simplicity, the traditional method of choice for most supervised machine learning problems is the stochastic gradient descent (SGD) method. Many iteration improvements and accelerations have been added to pure SGD in order to boost its convergence in various (strong) convexity settings. However, Lipschitz gradient continuity or bounded gradient assumptions are an essential requirement for most existing stochastic first-order schemes. In this paper novel convergence results are presented for the stochastic proximal point algorithm in different settings. In particular, without any strong convexity, smoothness or bounded gradients assumptions, we show that a slightly modified quadratic growth assumption is sufficient to guarantee an $\mathcal{O}\left(\frac{1}{k}\right)$ convergence rate for the stochastic proximal point algorithm, in terms of the distance to the optimal set. Furthermore, linear convergence is obtained in the interpolation setting, when the optimal set of the expected cost is included in the optimal sets of each functional component.
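As an illustration of the interpolation setting, the sketch below runs a stochastic proximal point iteration on a small consistent least-squares problem, where every component shares the same minimizer. The component form f_i(z) = 0.5 (a_i . z - b_i)^2, the constant step `mu`, the toy data, and the function names are assumptions made for this example and are not taken from the paper.

```python
import random

def prox_least_squares(x, a, b, mu):
    """Closed-form minimizer of 0.5*(a.z - b)^2 + ||z - x||^2 / (2*mu) over z."""
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    scale = mu * residual / (1.0 + mu * sum(ai * ai for ai in a))
    return [xi - scale * ai for ai, xi in zip(a, x)]

def stochastic_proximal_point(rows, targets, x0, mu=1.0, n_steps=200, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(n_steps):
        i = rng.randrange(len(rows))  # sample one component uniformly
        x = prox_least_squares(x, rows[i], targets[i], mu)
    return x

# Consistent system: every equation is satisfied by x* = (1, 2).
rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
targets = [1.0, 2.0, 3.0]
solution = stochastic_proximal_point(rows, targets, x0=(0.0, 0.0))
```

Unlike an SGD step, the proximal step stays bounded for any `mu > 0`, which is one reason no bounded-gradient assumption is needed for this toy problem.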

assumption, convergence rate, dist 2, (13 more...)

1901.08663

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.50)

Industry: Education > Curriculum (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Richemond, Pierre H., Guo, Yike

Combining learning rate decay and weight decay with complexity gradient descent - Part I

arXiv.org Machine Learning Feb-7-2019

The role of $L^2$ regularization, in the specific case of deep neural networks rather than more traditional machine learning models, is still not fully elucidated. We hypothesize that this complex interplay is due to the combination of overparameterization and high dimensional phenomena that take place during training and make it unamenable to standard convex optimization methods. Using insights from statistical physics and random fields theory, we introduce a parameter factoring in both the level of the loss function and its remaining nonconvexity: the complexity. We proceed to show that it is desirable to proceed with complexity gradient descent. We then show how to use this intuition to derive novel and efficient annealing schemes for the strength of $L^2$ regularization when performing standard stochastic gradient descent in deep neural networks.

arxiv e-print, neural network, regularization, (15 more...)

1902.02881

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.05)
Europe > United Kingdom (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Machine Learning Feb-7-2019

Mean Field Limit of the Learning Dynamics of Multilayer Neural Networks

Nguyen, Phan-Minh

Can multilayer neural networks -- typically constructed as highly complex structures with many nonlinearly activated neurons across layers -- behave in a non-trivial way that yet simplifies away a major part of their complexities? In this work, we uncover a phenomenon in which the behavior of these complex networks -- under suitable scalings and stochastic gradient descent dynamics -- becomes independent of the number of neurons as this number grows sufficiently large. We develop a formalism in which this many-neurons limiting behavior is captured by a set of equations, thereby exposing a previously unknown operating regime of these networks. While the current pursuit is mathematically non-rigorous, it is complemented with several experiments that validate the existence of this behavior.

international conference, neural network, neuron, (14 more...)

1902.0288

Country:

Asia > Middle East > Jordan (0.04)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Machine Learning Feb-7-2019

Compatible Natural Gradient Policy Search

Pajarinen, Joni, Thai, Hong Linh, Akrour, Riad, Peters, Jan, Neumann, Gerhard

Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use the KL-divergence to bound the trust region, resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.

approximation, gradient, natural gradient, (15 more...)

1902.02823

Country:

Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Lincolnshire > Lincoln (0.04)
(2 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Reisizadeh, Amirhossein, Prakash, Saurav, Pedarsani, Ramtin, Avestimehr, Amir Salman

CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning

We focus on the commonly used synchronous Gradient Descent paradigm for large-scale distributed learning, for which there has been a growing interest to develop efficient and robust gradient aggregation strategies that overcome two key bottlenecks: communication bandwidth and stragglers' delays. In particular, the Ring-AllReduce (RAR) design has been proposed to avoid a bandwidth bottleneck at any particular node by allowing each worker to communicate only with its neighbors, which are arranged in a logical ring. On the other hand, Gradient Coding (GC) has recently been proposed to mitigate stragglers in a master-worker topology by allowing carefully designed redundant allocation of the data set to the workers. We propose a joint communication topology design and data set allocation strategy, named CodedReduce (CR), that combines the best of both RAR and GC. That is, it parallelizes the communications over a tree topology, leading to efficient bandwidth utilization, and carefully designs a redundant data set allocation and coding strategy at the nodes to make the proposed gradient aggregation scheme robust to stragglers. In particular, we quantify the communication parallelization gain and resiliency of the proposed CR scheme, and prove its optimality when the communication topology is a regular tree. Furthermore, we empirically evaluate the performance of our proposed CR design over Amazon EC2 and demonstrate that it achieves speedups of up to 18.9x and 7.9x over the benchmarks GC and RAR, respectively.
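The redundancy idea behind Gradient Coding can be seen in a small three-worker example that tolerates one straggler. This sketch covers only the master-worker GC ingredient, not the tree-based CodedReduce scheme. The specific coding coefficients, the scalar gradients, and the names `encode`, `decode` and `DECODERS` are assumptions for illustration.

```python
def encode(g1, g2, g3):
    """Each worker holds two of the three data partitions and sends one coded value."""
    return {
        1: 0.5 * g1 + g2,   # worker 1 holds partitions 1 and 2
        2: g2 - g3,         # worker 2 holds partitions 2 and 3
        3: 0.5 * g1 + g3,   # worker 3 holds partitions 1 and 3
    }

# Linear combinations that recover g1 + g2 + g3 from any two workers.
DECODERS = {
    frozenset({1, 2}): {1: 2.0, 2: -1.0},
    frozenset({1, 3}): {1: 1.0, 3: 1.0},
    frozenset({2, 3}): {2: 1.0, 3: 2.0},
}

def decode(received):
    """Recover the full gradient sum from the messages of exactly two workers."""
    coeffs = DECODERS[frozenset(received)]
    return sum(coeffs[worker] * received[worker] for worker in received)

messages = encode(1.0, 2.0, 4.0)
# Suppose worker 2 is the straggler: the master decodes from workers 1 and 3.
full_gradient = decode({1: messages[1], 3: messages[3]})
```

The price of straggler tolerance here is that each worker computes gradients on two partitions instead of one.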

node, (14 more...)

1902.01981

Country: North America > United States > California > Santa Barbara County > Santa Barbara (0.04)

Genre: Research Report (0.82)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Exponentiated Gradient Meets Gradient Descent

Ghai, Udaya, Hazan, Elad, Singer, Yoram

The (stochastic) gradient descent and the multiplicative update method are probably the most popular algorithms in machine learning. We introduce and study a new regularization which provides a unification of the additive and multiplicative updates. This regularization is derived from a hyperbolic analogue of the entropy function, which we call hypentropy. It is motivated by a natural extension of the multiplicative update to negative numbers. The hypentropy has a natural spectral counterpart which we use to derive a family of matrix-based updates that bridge gradient methods and the multiplicative method for matrices. While the latter is only applicable to positive semi-definite matrices, the spectral hypentropy method can naturally be used with general rectangular matrices. We analyze the new family of updates by deriving tight regret bounds. We study empirically the applicability of the new update for settings such as multiclass learning, in which the parameters constitute a general rectangular matrix.
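The interpolation between additive and multiplicative updates can be sketched numerically. The abstract does not state the formula, so the code assumes the scalar hypentropy x*arcsinh(x/beta) - sqrt(x^2 + beta^2), whose derivative is arcsinh(x/beta). The resulting mirror-descent step, the parameter names, and the test values below follow from that assumption and are illustrative only.

```python
import math

def hypentropy_update(x, grad, eta, beta):
    """Mirror-descent step under the assumed hypentropy regularizer.

    The mirror map is arcsinh(x / beta): step in the dual space, then map back
    with beta * sinh(...).
    """
    return beta * math.sinh(math.asinh(x / beta) - eta * grad)

x, grad = 0.5, 1.0

# Small beta and positive x: behaves like the multiplicative update x * exp(-eta * grad).
small_beta_step = hypentropy_update(x, grad, eta=0.1, beta=1e-6)
multiplicative_step = x * math.exp(-0.1 * grad)

# Large beta: behaves like gradient descent with effective step size eta * beta.
large_beta_step = hypentropy_update(x, grad, eta=0.1 / 1e6, beta=1e6)
additive_step = x - 0.1 * grad
```

Because sinh is odd, the same update also handles negative coordinates, which the plain multiplicative update cannot do.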

algorithm, matrix, regularization, (14 more...)

1902.01903

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Kuzborskij, Ilja, Cesa-Bianchi, Nicolò, Szepesvári, Csaba

Distribution-Dependent Analysis of Gibbs-ERM Principle

Gibbs-ERM learning is a natural idealized model of learning with stochastic optimization algorithms (such as Stochastic Gradient Langevin Dynamics and, to some extent, Stochastic Gradient Descent), while it also arises in other contexts, including PAC-Bayesian theory and sampling mechanisms. In this work we study the excess risk suffered by a Gibbs-ERM learner that uses non-convex, regularized empirical risk with the goal of understanding the interplay between the data-generating distribution and learning in large hypothesis spaces. Our main results are distribution-dependent upper bounds on several notions of excess risk. We show that, in all cases, the distribution-dependent excess risk is essentially controlled by the effective dimension $\mathrm{tr}\left(\boldsymbol{H}^{\star} (\boldsymbol{H}^{\star} + \lambda \boldsymbol{I})^{-1}\right)$ of the problem, where $\boldsymbol{H}^{\star}$ is the Hessian matrix of the risk at a local minimum. This is a well-established notion of effective dimension appearing in several previous works, including the analyses of SGD and ridge regression, but ours is the first work that brings this dimension to the analysis of learning using Gibbs densities. The distribution-dependent view we advocate here improves upon earlier results of Raginsky et al. (2017), and can yield much tighter bounds depending on the interplay between the data-generating distribution and the loss function. The first part of our analysis focuses on the localized excess risk in the vicinity of a fixed local minimizer. This result is then extended to bounds on the global excess risk, by characterizing probabilities of local minima (and their complement) under Gibbs densities, a result which might be of independent interest.
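For a symmetric positive semi-definite Hessian, the effective dimension in the abstract reduces to a sum over eigenvalues, since tr(H (H + lambda I)^{-1}) = sum_i h_i / (h_i + lambda). The short sketch below evaluates that sum. The function name and the example spectrum are hypothetical, and it assumes the eigenvalues have already been computed.

```python
def effective_dimension(eigenvalues, lam):
    """Effective dimension tr(H (H + lam*I)^{-1}) from the eigenvalues of H.

    Assumes H is symmetric positive semi-definite and lam > 0.
    """
    return sum(h / (h + lam) for h in eigenvalues)

# Hypothetical spectrum: one stiff direction, one moderate, one nearly flat.
spectrum = [10.0, 1.0, 0.01]
d_eff = effective_dimension(spectrum, lam=1.0)
```

Directions with eigenvalue far above `lam` each contribute close to 1 and nearly flat directions contribute close to 0, so the value can be much smaller than the ambient dimension.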

distribution-dependent analysis, excess risk, probability, (16 more...)

1902.01846

Country:

Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

Total stochastic gradient algorithms and applications in reinforcement learning

Parmas, Paavo

Backpropagation and the chain rule of derivatives have been prominent; however, the total derivative rule has not enjoyed the same amount of attention. In this work we show how the total derivative rule leads to an intuitive visual framework for creating gradient estimators on graphical models. In particular, previous "policy gradient theorems" are easily derived. We derive new gradient estimators based on density estimation, as well as a likelihood ratio gradient, which "jumps" to an intermediate node, not directly to the objective function. We evaluate our methods on model-based policy gradient algorithms, achieve good performance, and present evidence towards demystifying the success of the popular PILCO algorithm.
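For readers unfamiliar with likelihood ratio gradients, the following is a minimal sketch of the standard estimator for a Gaussian with learnable mean. It is not the paper's new estimator that jumps to an intermediate node. The objective f(x) = x^2, the sample count, and the function name are assumptions chosen so that the exact answer, 2*mu, is easy to check.

```python
import random

def likelihood_ratio_gradient(f, mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of d/dmu E[f(x)] for x ~ N(mu, sigma^2).

    Uses the identity d/dmu E[f(x)] = E[f(x) * (x - mu) / sigma^2],
    where (x - mu) / sigma^2 is the score of the Gaussian with respect to mu.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu, sigma)
        total += f(x) * (x - mu) / sigma ** 2
    return total / n_samples

# E[x^2] = mu^2 + sigma^2, so the exact gradient with respect to mu is 2 * mu = 2.
estimate = likelihood_ratio_gradient(lambda x: x * x, mu=1.0, sigma=1.0)
```

The estimator needs only function values of `f`, not its derivative, at the cost of higher variance than a pathwise (reparameterized) gradient.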

estimator, gradient, gradient estimator, (15 more...)

1902.01722

Country:

Asia > Japan > Kyūshū & Okinawa > Okinawa (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)