Goto

Collaborating Authors

 Gradient Descent


End-to-end speech recognition with neon - Nervana

#artificialintelligence

Thus, given a sequence of frames corresponding to an utterance, the model is required to produce, for each frame, a probability distribution over the alphabet. During the training phase, the softmax outputs are fed into a CTC cost function (more on this shortly) which uses the actual transcripts to (i) score the model's predictions, and (ii) generate an error signal quantifying the accuracy of the model's predictions. The overall goal is to train the model to increase the overall score of its predictions relative to the actual transcripts. Training Empirically, we have found that using stochastic gradient descent with momentum paired with gradient clipping leads to the best performing models. Deeper networks (seven layers or more) also tend to perform better in general.


Expectation Propagation performs a smoothed gradient descent

arXiv.org Machine Learning

Bayesian inference is a popular method to build learning algorithms but it is hampered by the fact that its key object, the posterior probability distribution, is often uncomputable. Expectation Propagation (EP) (Minka [2001]) is a popular algorithm that solves this issue by computing a parametric approximation (e.g: Gaussian) to the density of the posterior. However, while it is known empirically to quickly compute fine approximations, EP is extremely poorly understood which prevents it from being adopted by a larger fraction of the community. The object of the present article is to shed intuitive light on EP, by relating it to other better understood methods. More precisely, we link it to using gradient descent to compute the Laplace approximation of a target probability distribution. We show that EP is exactly equivalent to performing gradient descent on a smoothed energy landscape: i.e: the original energy landscape convoluted with some smoothing kernel. This also relates EP to algorithms that compute the Gaussian approximation which minimizes the reverse KL divergence to the target distribution, a link that has been conjectured before but has not been proved rigorously yet. These results can help practitioners to get a better feel for how EP works, as well as lead to other new results on this important method. This article was submitted and accepted to the Advances in Approximate Bayesian Inference NIPS 2016 workshop (www.approximateinference.org).


Efficient Distributed Semi-Supervised Learning using Stochastic Regularization over Affinity Graphs

arXiv.org Machine Learning

We describe a computationally efficient, stochastic graph-regularization technique that can be utilized for the semi-supervised training of deep neural networks in a parallel or distributed setting. We utilize a technique, first described in [13] for the construction of mini-batches for stochastic gradient descent (SGD) based on synthesized partitions of an affinity graph that are consistent with the graph structure, but also preserve enough stochasticity for convergence of SGD to good local minima. We show how our technique allows a graph-based semi-supervised loss function to be decomposed into a sum over objectives, facilitating data parallelism for scalable training of machine learning models. Empirical results indicate that our method significantly improves classification accuracy compared to the fully-supervised case when the fraction of labeled data is low, and in the parallel case, achieves significant speed-up in terms of wall-clock time to convergence. We show the results for both sequential and distributed-memory semi-supervised DNN training on a speech corpus.


Tensor Decomposition for Signal Processing and Machine Learning

arXiv.org Machine Learning

T ensors have a rich history, stretching over almost a century, and touching upon numerous disciplines; but they have only recently become ubiquitous in signal and data analytics at the confluence of signal processing, statistics, data mining and machine learning. This overview article aims to provide a good starting point for researchers and practitioners interested in learning about and working with tensors. As such, it focuses on fundamentals and motivation (using various application examples), aiming to strike an appropriate balance of breadth and depth that will enable someone having taken first graduate courses in matrix algebra and probability to get started doing research and/or developing tensor algorithms and software. Some background in applied optimization is useful but not strictly required. The material covered includes tensor rank and rank decomposition; basic tensor factorization models and their relationships and properties (including fairly good coverage of identifiability); broad coverage of algorithms ranging from alternating optimization to stochastic gradient; statistical performance analysis; and applications ranging from source separation to collaborative filtering, mixture and topic modeling, classification, and multilinear subspace learning. Index Terms --T ensor decomposition, tensor factorization, rank, canonical polyadic decomposition (CPD), parallel factor analysis (PARAF AC), T ucker model, higher-order singular value decomposition (HOSVD), multilinear singular value decomposition (MLSVD), uniqueness, NPhard problems, alternating optimization, alternating direction method of multipliers, gradient descent, Gauss-Newton, stochastic gradient, Cram er-Rao bound, communications, source separation, harmonic retrieval, speech separation, collaborative filtering, mixture modeling, topic modeling, classification, subspace learning. N.D. Sidiropoulos, X. Fu, and K. Huang are with the ECE Department, University of Minnesota, Minneapolis, USA; email: (nikos,xfu,huang663)@umn.edu .


Parsimonious Online Learning with Kernels via Sparse Projections in Function Space

arXiv.org Machine Learning

Despite their attractiveness, popular perception is that techniques for nonparametric function approximation do not scale to streaming data due to an intractable growth in the amount of storage they require. To solve this problem in a memory-affordable way, we propose an online technique based on functional stochastic gradient descent in tandem with supervised sparsification based on greedy function subspace projections. The method, called parsimonious online learning with kernels (POLK), provides a controllable tradeoff? between its solution accuracy and the amount of memory it requires. We derive conditions under which the generated function sequence converges almost surely to the optimal function, and we establish that the memory requirement remains finite. We evaluate POLK for kernel multi-class logistic regression and kernel hinge-loss classification on three canonical data sets: a synthetic Gaussian mixture model, the MNIST hand-written digits, and the Brodatz texture database. On all three tasks, we observe a favorable tradeoff of objective function evaluation, classification performance, and complexity of the nonparametric regressor extracted the proposed method.


Stochastic Quasi-Newton Langevin Monte Carlo

arXiv.org Machine Learning

Recently, Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) methods have been proposed for scaling up Monte Carlo computations to large data problems. Whilst these approaches have proven useful in many applications, vanilla SG-MCMC might suffer from poor mixing rates when random variables exhibit strong couplings under the target densities or big scale differences. In this study, we propose a novel SG-MCMC method that takes the local geometry into account by using ideas from Quasi-Newton optimization methods. These second order methods directly approximate the inverse Hessian by using a limited history of samples and their gradients. Our method uses dense approximations of the inverse Hessian while keeping the time and memory complexities linear with the dimension of the problem. We provide a formal theoretical analysis where we show that the proposed method is asymptotically unbiased and consistent with the posterior expectations. We illustrate the effectiveness of the approach on both synthetic and real datasets. Our experiments on two challenging applications show that our method achieves fast convergence rates similar to Riemannian approaches while at the same time having low computational requirements similar to diagonal preconditioning approaches.


Asynchronous Stochastic Gradient MCMC with Elastic Coupling

arXiv.org Machine Learning

We consider parallel asynchronous Markov Chain Monte Carlo (MCMC) sampling for problems where we can leverage (stochastic) gradients to define continuous dynamics which explore the target distribution. We outline a solution strategy for this setting based on stochastic gradient Hamiltonian Monte Carlo sampling (SGHMC) which we alter to include an elastic coupling term that ties together multiple MCMC instances. The proposed strategy turns inherently sequential HMC algorithms into asynchronous parallel versions. First experiments empirically show that the resulting parallel sampler significantly speeds up exploration of the target distribution, when compared to standard SGHMC, and is less prone to the harmful effects of stale gradients than a naive parallelization approach.


Recurrent neural network training with preconditioned stochastic gradient descent

arXiv.org Machine Learning

This paper studies the performance of a recently proposed preconditioned stochastic gradient descent (PSGD) algorithm on recurrent neural network (RNN) training. PSGD adaptively estimates a preconditioner to accelerate gradient descent, and is designed to be simple, general and easy to use, as stochastic gradient descent (SGD). RNNs, especially the ones requiring extremely long term memories, are difficult to train. We have tested PSGD on a set of synthetic pathological RNN learning problems and the real world MNIST handwritten digit recognition task. Experimental results suggest that PSGD is able to achieve highly competitive performance without using any trick like preprocessing, pretraining or parameter tweaking.


Model-based Adversarial Imitation Learning

arXiv.org Machine Learning

Generative adversarial learning is a popular new approach to training generative models which has been proven successful for other related problems as well. The general idea is to maintain an oracle $D$ that discriminates between the expert's data distribution and that of the generative model $G$. The generative model is trained to capture the expert's distribution by maximizing the probability of $D$ misclassifying the data it generates. Overall, the system is \emph{differentiable} end-to-end and is trained using basic backpropagation. This type of learning was successfully applied to the problem of policy imitation in a model-free setup. However, a model-free approach does not allow the system to be differentiable, which requires the use of high-variance gradient estimations. In this paper we introduce the Model based Adversarial Imitation Learning (MAIL) algorithm. A model-based approach for the problem of adversarial imitation learning. We show how to use a forward model to make the system fully differentiable, which enables us to train policies using the (stochastic) gradient of $D$. Moreover, our approach requires relatively few environment interactions, and fewer hyper-parameters to tune. We test our method on the MuJoCo physics simulator and report initial results that surpass the current state-of-the-art.


Alternating Back-Propagation for Generator Network

arXiv.org Machine Learning

This paper proposes an alternating back-propagation algorithm for learning the generator network model. The model is a non-linear generalization of factor analysis. In this model, the mapping from the continuous latent factors to the observed signal is parametrized by a convolutional neural network. The alternating back-propagation algorithm iterates the following two steps: (1) Inferential back-propagation, which infers the latent factors by Langevin dynamics or gradient descent. (2) Learning back-propagation, which updates the parameters given the inferred latent factors by gradient descent. The gradient computations in both steps are powered by back-propagation, and they share most of their code in common. We show that the alternating back-propagation algorithm can learn realistic generator models of natural images, video sequences, and sounds. Moreover, it can also be used to learn from incomplete or indirect training data.