 Gradient Descent


Fast Non-Bayesian Poisson Factorization for Implicit-Feedback Recommendations

arXiv.org Machine Learning

This work explores non-negative matrix factorization based on regularized Poisson models for recommender systems with implicit-feedback data. The properties of Poisson likelihood allow a shortcut for very fast computation and optimization over elements with zero-value when the latent-factor matrices are non-negative, making it a more suitable approach than squared loss for very sparse inputs such as implicit-feedback data. A simple and embarrassingly parallel optimization approach based on proximal gradients is presented, which in large datasets converges 2-3 orders of magnitude faster than its Bayesian counterpart (Hierarchical Poisson Factorization) fit through variational inference techniques, and 1 order of magnitude faster than implicit-ALS fit with the Conjugate Gradient method.
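One way to see the zero-value shortcut mentioned above: with rate a_u · b_i for each user-item cell, the Poisson negative log-likelihood (dropping the constant log x! term) is the sum of all rates minus x · log(rate), and the log term vanishes wherever x = 0. The sum of all rates factorizes into a dot product of column sums, so only the non-zero cells need to be visited. The sketch below is an illustration under those assumptions, not the paper's implementation; the names `A`, `B`, `X` and the toy sizes are hypothetical.

```python
import math
import random

def poisson_nll_dense(A, B, X):
    """Negative Poisson log-likelihood (without log(x!)), visiting every cell."""
    total = 0.0
    for u, a_u in enumerate(A):
        for i, b_i in enumerate(B):
            rate = sum(p * q for p, q in zip(a_u, b_i))
            total += rate                    # every cell contributes its rate
            x = X.get((u, i), 0)
            if x > 0:
                total -= x * math.log(rate)  # the log term is zero when x == 0
    return total

def poisson_nll_sparse(A, B, X):
    """Same quantity, touching only the non-zero entries of X."""
    k = len(A[0])
    sum_a = [sum(a_u[f] for a_u in A) for f in range(k)]
    sum_b = [sum(b_i[f] for b_i in B) for f in range(k)]
    # Sum of all rates equals (sum of user factors) . (sum of item factors).
    total = sum(p * q for p, q in zip(sum_a, sum_b))
    for (u, i), x in X.items():
        rate = sum(p * q for p, q in zip(A[u], B[i]))
        total -= x * math.log(rate)
    return total

# Hypothetical toy data: strictly positive factors and a few non-zero counts.
rng = random.Random(0)
n_users, n_items, k = 30, 40, 3
A = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_users)]
B = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_items)]
X = {(rng.randrange(n_users), rng.randrange(n_items)): rng.randint(1, 5)
     for _ in range(25)}

dense = poisson_nll_dense(A, B, X)
sparse = poisson_nll_sparse(A, B, X)
print(abs(dense - sparse))
```

The dense version costs time proportional to the number of cells, while the sparse version costs time proportional to the number of non-zero entries plus the size of the factor matrices. Squared loss has no such shortcut in this direct form, because every zero cell contributes its own squared prediction.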


On exponential convergence of SGD in non-convex over-parametrized learning

arXiv.org Machine Learning

Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning. Although SGD methods are very effective in practice, most theoretical analyses of SGD suggest slower convergence than what is empirically observed. In our recent work [8] we analyzed how interpolation, common in modern over-parametrized learning, results in exponential convergence of SGD with constant step size for convex loss functions. In this note, we extend those results to a much broader non-convex function class satisfying the Polyak-Lojasiewicz (PL) condition. A number of important non-convex problems in machine learning, including some classes of neural networks, have been recently shown to satisfy the PL condition. We argue that the PL condition provides a relevant and attractive setting for many machine learning problems, particularly in the over-parametrized regime.
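For reference, a function f with minimum value f* satisfies the PL condition with constant μ > 0 when ½‖∇f(w)‖² ≥ μ (f(w) − f*) for all w. The sketch below is only the simplest illustration of the interpolation setting, not one of the non-convex cases the note targets: over-parametrized least squares (more parameters than samples) is convex, is not strongly convex, and satisfies PL. The problem sizes, seed, and step-size choice are assumptions made for the example.

```python
import random

rng = random.Random(0)
n, d = 5, 20  # fewer samples than parameters, so the data can be fit exactly
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
w_true = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = [sum(a * b for a, b in zip(row, w_true)) for row in X]

def loss(w):
    """Mean of 0.5 * (x_i . w - y_i)^2 over the training set."""
    return sum((sum(a * b for a, b in zip(row, w)) - t) ** 2
               for row, t in zip(X, y)) / (2 * n)

# Constant step size, chosen here as 1 / max_i ||x_i||^2 (an assumption).
eta = 1.0 / max(sum(v * v for v in row) for row in X)

w = [0.0] * d
history = [loss(w)]
for step in range(3000):
    i = rng.randrange(n)  # one sample per step
    residual = sum(a * b for a, b in zip(X[i], w)) - y[i]
    w = [wj - eta * residual * xj for wj, xj in zip(w, X[i])]
    if (step + 1) % 500 == 0:
        history.append(loss(w))

print(history)
```

Because every sample can be fit exactly, the stochastic gradient noise shrinks as the loss shrinks, which is why the step size does not need to decay here.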


MaSS: an Accelerated Stochastic Method for Over-parametrized Learning

arXiv.org Machine Learning

Stochastic gradient based methods are dominant in optimization for most large-scale machine learning problems, due to the simplicity of computation and their compatibility with modern parallel hardware, such as GPU. In most cases these methods use over-parametrized models allowing for interpolation, i.e., perfect fitting of the training data. While we do not yet have a full understanding of why these solutions generalize (as indicated by a wealth of empirical evidence, e.g., [22, 2]) we are beginning to recognize their desirable properties for optimization, particularly in the SGD setting [11]. In this paper, we leverage the power of the interpolated setting to propose MaSS (Momentum-added Stochastic Solver), a stochastic momentum method for efficient training of over-parametrized models. See pseudo code in Appendix A. The algorithm keeps two variables (weights) w and u.


Lifted Proximal Operator Machines

arXiv.org Artificial Intelligence

We propose a new optimization method for training feed-forward neural networks. By rewriting the activation function as an equivalent proximal operator, we approximate a feed-forward neural network by adding the proximal operators to the objective function as penalties; hence we call it the lifted proximal operator machine (LPOM). LPOM is block multi-convex in all layer-wise weights and activations. This allows us to use block coordinate descent to update the layer-wise weights and activations in parallel. Most notably, we only use the mapping of the activation function itself, rather than its derivatives, thus avoiding the gradient vanishing or blow-up issues in gradient based training methods. Our method is therefore applicable to various non-decreasing Lipschitz continuous activation functions, which can be saturating and non-differentiable. LPOM does not require more auxiliary variables than the layer-wise activations, thus using roughly the same amount of memory as stochastic gradient descent (SGD) does. We further prove the convergence of updating the layer-wise weights and activations. Experiments on MNIST and CIFAR-10 datasets testify to the advantages of LPOM.
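The abstract does not spell out LPOM's construction, so the following only illustrates the general idea of viewing an activation as a proximal operator, using a textbook identity: ReLU(x) is the minimizer of ½(z − x)² over z ≥ 0, i.e. the proximal operator of the indicator function of the non-negative half-line. The brute-force grid search and the helper name `prox_nonneg` are hypothetical and chosen only to make the identity checkable.

```python
def relu(x):
    """ReLU activation."""
    return max(x, 0.0)

def prox_nonneg(x, grid_step=1e-3, z_max=5.0):
    """Brute-force argmin over z >= 0 of 0.5 * (z - x)**2 on a grid."""
    best_z, best_val = 0.0, 0.5 * x * x  # start from the candidate z = 0
    steps = int(z_max / grid_step)
    for j in range(1, steps + 1):
        z = j * grid_step
        val = 0.5 * (z - x) ** 2
        if val < best_val:
            best_z, best_val = z, val
    return best_z

inputs = [-2.0, -0.5, 0.0, 0.7, 3.0]
# Distance between the grid minimizer and ReLU for each input.
gaps = [abs(prox_nonneg(x) - relu(x)) for x in inputs]
print(gaps)
```

The minimization uses only the value of the activation's defining problem, never its derivative, which is the property the abstract highlights.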


Nonlinear Collaborative Scheme for Deep Neural Networks

arXiv.org Machine Learning

Conventional research attributes improvements in the generalization ability of deep neural networks either to powerful optimizers or to new network designs. In contrast, this paper aims to link the generalization ability of a deep network to optimizing a new objective function. To this end, we propose a nonlinear collaborative scheme for deep network training, whose key technique is combining different loss functions in a nonlinear manner. We find that after adaptively tuning the weights of different loss functions, the proposed objective function can efficiently guide the optimization process. Moreover, we demonstrate mathematically that the nonlinear collaborative scheme can lead to (i) smaller KL divergence with respect to optimal solutions; (ii) data-driven stochastic gradient descent; (iii) a tighter PAC-Bayes bound. We also prove that its advantage can be strengthened by increasing the nonlinearity. To some extent, we bridge the gap between learning (i.e., minimizing the new objective function) and generalization (i.e., minimizing a PAC-Bayes bound) in the new scheme. We also interpret our findings through experiments on Residual Networks and DenseNet, showing that our new scheme outperforms single-loss and multi-loss schemes, with or without randomization.


Stochastic Neighbor Embedding under f-divergences

arXiv.org Machine Learning

The t-distributed Stochastic Neighbor Embedding (t-SNE) is a powerful and popular method for visualizing high-dimensional data. It minimizes the Kullback-Leibler (KL) divergence between the original and embedded data distributions. In this work, we propose extending this method to other f-divergences. We analytically and empirically evaluate the types of latent structure (manifold, cluster, and hierarchical) that are well captured by both the original KL divergence and the proposed f-divergence generalization, and find that different divergences perform better for different types of structure. A common concern with the t-SNE criterion is that it is optimized using gradient descent, and can become stuck in poor local minima. We propose optimizing the f-divergence based loss criteria by minimizing a variational bound. This typically performs better than optimizing the primal form, and our experiments show that it can improve upon the embedding results obtained from the original t-SNE criterion as well.
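As a reminder of the standard definition, an f-divergence between discrete distributions P and Q is D_f(P‖Q) = Σ_i q_i f(p_i / q_i) for a convex f with f(1) = 0. The minimal sketch below checks two well-known generators: f(t) = t log t recovers KL(P‖Q), and f(t) = −log t recovers the reverse direction KL(Q‖P). The toy vectors `p` and `q` are hypothetical stand-ins for the pairwise similarity distributions used in t-SNE, and the code is not taken from the paper.

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_i q_i * f(p_i / q_i), for strictly positive p and q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl_generator(t):
    """f(t) = t log t, which yields KL(P || Q)."""
    return t * math.log(t)

def reverse_kl_generator(t):
    """f(t) = -log t, which yields KL(Q || P)."""
    return -math.log(t)

# Hypothetical toy distributions (each sums to 1).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
reverse_kl_direct = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))

kl_via_f = f_divergence(p, q, kl_generator)
reverse_kl_via_f = f_divergence(p, q, reverse_kl_generator)
print(kl_via_f, reverse_kl_via_f)
```

Swapping the generator changes which mismatches between P and Q are penalized most, which is one way to read the abstract's finding that different divergences suit different kinds of structure.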


Learning with SGD and Random Features

arXiv.org Machine Learning

Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini batches and random features. The latter can be seen as a form of nonlinear sketching and used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as number of features, iterations, step-size and mini-batch size control the learning properties of the solutions. We do this by deriving optimal finite sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.
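A minimal sketch of this kind of estimator, under assumptions that are not from the paper: random Fourier features (cosines with Gaussian frequencies and uniform phases, which approximate a Gaussian kernel) are combined with plain mini-batch SGD on the squared loss, with no explicit penalty. The toy target sin(x), the sizes, and the step size are hypothetical choices for illustration.

```python
import math
import random

rng = random.Random(0)
n, D, batch_size, eta, n_steps = 200, 50, 10, 0.5, 2000

# Hypothetical 1-D regression data.
xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
ys = [math.sin(x) for x in xs]

# Random Fourier features: z_j(x) = sqrt(2/D) * cos(w_j * x + b_j).
freqs = [rng.gauss(0.0, 1.0) for _ in range(D)]
phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]

def features(x):
    scale = math.sqrt(2.0 / D)
    return [scale * math.cos(w * x + b) for w, b in zip(freqs, phases)]

Z = [features(x) for x in xs]  # precomputed feature vectors

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mse(theta):
    return sum((dot(z, theta) - y) ** 2 for z, y in zip(Z, ys)) / n

theta = [0.0] * D
initial_mse = mse(theta)
for _ in range(n_steps):
    batch = [rng.randrange(n) for _ in range(batch_size)]
    grad = [0.0] * D
    for i in batch:
        residual = dot(Z[i], theta) - ys[i]
        for j in range(D):
            grad[j] += residual * Z[i][j] / batch_size
    # Unpenalized update: any regularization comes only from D, eta,
    # the batch size, and the number of steps.
    theta = [t - eta * g for t, g in zip(theta, grad)]
final_mse = mse(theta)
print(initial_mse, final_mse)
```

Changing `D`, `eta`, `batch_size`, or `n_steps` is the only way to control the fit in this sketch, which mirrors the abstract's point that regularization is implicit in those parameters.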


Gradient descent, how neural networks learn | Deep learning, chapter 2

#artificialintelligence

Subscribe for more (part 3 will be on backpropagation): http://3b1b.co/subscribe Funding provided by Amplify Partners and viewers like you. His post on Neural networks and topology is particularly beautiful, but honestly all of the stuff there is great. And if you like that, you'll *love* the publications at distill: https://distill.pub/ For more videos, Welch Labs also has some great series on machine learning: https://youtu.be/i8D90DkCLhI


Neural source-filter-based waveform model for statistical parametric speech synthesis

arXiv.org Machine Learning

Xin Wang, Shinji Takaki, Junichi Yamagishi. National Institute of Informatics, Japan. wangxin@nii.ac.jp, takaki@nii.ac.jp, jyamagis@nii.ac.jp

Abstract: Neural waveform models such as the WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure. Although faster non-AR models were recently reported, they may be prohibitively complicated due to the use of a distilling training method and the blend of other disparate training criteria. This study proposes a non-AR neural source-filter waveform model that can be directly trained using spectrum-based training criteria and the stochastic gradient descent method. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform. Our experiments demonstrated that the proposed model generated waveforms at least 100 times faster than the AR WaveNet and the quality of its synthetic speech is close to that of speech generated by the AR WaveNet. Ablation test results showed that both the sine-wave excitation signal and the spectrum-based training criteria were essential to the performance of the proposed model.

Index Terms: speech synthesis, neural network, waveform modeling

1. Introduction: Text-to-speech (TTS) synthesis, a technology that converts texts into speech waveforms, has been advanced by using end-to-end architectures [1] and neural-network-based waveform models [2, 3, 4]. Among those waveform models, the WaveNet [2] directly models the distributions of waveform sampling points and has demonstrated outstanding performance.


On Why Gradient Descent is Even Needed – Daniel Burkhardt Cerigo – Medium

#artificialintelligence

Gradient descent is taught as a de facto part of machine learning, but when I was asked some questions that raised why we even use it, I realised I wasn't crystal clear on an answer, so I went and worked out why for myself. I was giving a presentation to a set of very talented young mathematicians at King's College London Mathematics School, and during that talk I showed a slide from Andrew Ng's classic Stanford MOOC, Machine Learning. It shows how the cost function J (or error, or loss) varies as we alter our model parameters θ1 and θ2, or as we "move" in parameter space, thus creating a surface. This slide is shown to visually represent and help to understand how gradient descent works. We start at the uppermost point (black x-mark), and take a short step in the direction of the gradient of the surface at that point (strictly, it's the opposite direction of the gradient, so we go "down" and not "up"), with the goal of reaching a trough or minimum of the cost function, so that our model makes predictions that are close(r) to the actual labels of our training data. We had already had a Q&A after the talk, but afterwards a few students approached me with more detailed questions.
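The stepping procedure described above can be sketched in a few lines. The cost surface here is a hypothetical bowl-shaped J(θ1, θ2) chosen so the minimum is known in advance (at θ1 = 1, θ2 = −0.5); it is not the surface from the slide, and the starting point and learning rate are likewise assumptions.

```python
def cost(theta1, theta2):
    """Hypothetical cost surface J with its minimum at (1.0, -0.5)."""
    return (theta1 - 1.0) ** 2 + 2.0 * (theta2 + 0.5) ** 2

def gradient(theta1, theta2):
    """Partial derivatives of J with respect to theta1 and theta2."""
    return 2.0 * (theta1 - 1.0), 4.0 * (theta2 + 0.5)

theta1, theta2 = -3.0, 4.0  # starting point, high up on the surface
learning_rate = 0.1         # length of each "short step"

costs = [cost(theta1, theta2)]
for _ in range(100):
    g1, g2 = gradient(theta1, theta2)
    # Step in the direction opposite to the gradient, i.e. downhill.
    theta1 -= learning_rate * g1
    theta2 -= learning_rate * g2
    costs.append(cost(theta1, theta2))

print(theta1, theta2, costs[-1])
```

For this particular bowl the minimum could be found by setting the gradient to zero and solving directly; the iterative steps matter when no such closed-form solution is available.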