Goto

Collaborating Authors

 Gradient Descent


Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks

AAAI Conferences

Effective training of deep neural networks suffers from two main issues. The first is that the parameter space of these models exhibit pathological curvature. Recent methods address this problem by using adaptive preconditioning for Stochastic Gradient Descent (SGD). These methods improve convergence by adapting to the local geometry of parameter space. A second issue is overfitting, which is typically addressed by early stopping. However, recent work has demonstrated that Bayesian model averaging mitigates this problem. The posterior can be sampled by using Stochastic Gradient Langevin Dynamics (SGLD). However, the rapidly changing curvature renders default SGLD methods inefficient. Here, we propose combining adaptive preconditioners with SGLD. In support of this idea, we give theoretical properties on asymptotic convergence and predictive risk. We also provide empirical results for Logistic Regression, Feedforward Neural Nets, and Convolutional Neural Nets, demonstrating that our preconditioned SGLD method gives state-of-the-art performance on these models.


Improving Predictive State Representations via Gradient Descent

AAAI Conferences

Predictive state representations (PSRs) model dynamical systems using appropriately chosen predictions about future observations as a representation of the current state. In contrast to the hidden states posited by HMMs or RNNs, PSR states are directly observable in the training data; this gives rise to a moment-matching spectral algorithm for learning PSRs that is computationally efficient and statistically consistent when the model complexity matches that of the true system generating the data. In practice, however, model mismatch is inevitable and while spectral learning remains appealingly fast and simple it may fail to find optimal models. To address this problem, we investigate the use of gradient methods for improving spectrally-learned PSRs. We show that only a small amount of additional gradient optimization can lead to significant performance gains, and moreover that initializing gradient methods with the spectral learning solution yields better models in significantly less time than starting from scratch.


Optimal Discrete Matrix Completion

AAAI Conferences

In recent years, matrix completion methods have been successfully applied to solve recommender system applications. Most of them focus on the matrix completion problem in real number domain, and produce continuous prediction values. However, these methods are not appropriate in some occasions where the entries of matrix are discrete values, such as movie ratings prediction, social network relation and interaction prediction, because their continuous outputs are not probabilities and uninterpretable. In this case, an additional step to process the continuous results with either heuristic threshold parameters or complicated mapping is necessary, while it is inefficient and may diverge from the optimal solution. There are a few matrix completion methods working on discrete number domain, however, they are not applicable to sparse and large-scale data set. In this paper, we propose a novel optimal discrete matrix completion model, which is able to learn optimal thresholds automatically and also guarantees an exact low-rank structure of the target matrix. We use stochastic gradient descent algorithm with momentum method to optimize the new objective function and speed up optimization. In the experiments, it is proved that our method can predict discrete values with high accuracy, very close to or even better than these values obtained by carefully tuned thresholds on Movielens and YouTube data sets. Meanwhile, our model is able to handle online data and easy to parallelize.


Learning Step Size Controllers for Robust Neural Network Training

AAAI Conferences

This paper investigates algorithms to automatically adapt the learning rate of neural networks (NNs). Starting with stochastic gradient descent, a large variety of learning methods has been proposed for the NN setting. However, these methods are usually sensitive to the initial learning rate which has to be chosen by the experimenter. We investigate several features and show how an adaptive controller can adjust the learning rate without prior knowledge of the learning problem at hand.


Factorization Ranking Model for Move Prediction in the Game of Go

AAAI Conferences

In this paper, we investigate the move prediction problem in the game of Go by proposing a new ranking model named Factorization Bradley Terry (FBT) model. This new model considers the move prediction problem as group competitions while also taking the interaction between features into account. A FBT model is able to provide a probability distribution that expresses a preference over moves. Therefore it can be easily compiled into an evaluation function and applied in a modern Go program. We propose a Stochastic Gradient Decent (SGD) algorithm to train a FBT model using expert game records, and provide two methods for fast computation of the gradient in order to speed up the training process. Experimental results show that our FBT model outperforms the state-of-the-art move prediction system of Latent Factor Ranking (LFR).


Towards Optimal Binary Code Learning via Ordinal Embedding

AAAI Conferences

Binary code learning, a.k.a., hashing, has been recently popular due to its high efficiency in large-scale similarity search and recognition. It typically maps high-dimensional data points to binary codes, where data similarity can be efficiently computed via rapid Hamming distance. Most existing unsupervised hashing schemes pursue binary codes by reducing the quantization error from an original real-valued data space to a resulting Hamming space. On the other hand, most existing supervised hashing schemes constrain binary code learning to correlate with pairwise similarity labels. However, few methods consider ordinal relations in the binary code learning process, which serve as a very significant cue to learn the optimal binary codes for similarity search. In this paper, we propose a novel hashing scheme, dubbed Ordinal Embedding Hashing (OEH), which embeds given ordinal relations among data points to learn the ranking-preserving binary codes. The core idea is to construct a directed unweighted graph to capture the ordinal relations, and then train the hash functions using this ordinal graph to preserve the permutation relations in the Hamming space. To learn such hash functions effectively, we further relax the discrete constraints and design a stochastic gradient decent algorithm to obtain the optimal solution. Experimental results on two large-scale benchmark datasets demonstrate that the proposed OEH method can achieve superior performance over the state-of-the-arts approaches.At last, the evaluation on query by humming dataset demonstrates the OEH also has good performance for music retrieval by using user's humming or singing.


Question about learning with gradient descent โ€ข /r/MachineLearning

@machinelearnbot

Once one has computed all derivatives of the cost with respect to weights and biases / errors, what, exactly, does one do? How does one update the weights/biases? I can think of a simple way of doing it for a net with no hidden layer, but I have no idea how one would do it on one with more layers, since changing any weight or bias will change the value of the error of neurons in earlier and later layers - hence my question.


Gradient Descent for Elastic net Regression โ€ข /r/MachineLearning

@machinelearnbot

I am using the from the wikipedia page to find the gradient descent. What will be gradient descent equation for this. And as for ridge regression if i am using very large data sets. Instead of calculating the inverse is there another way to calculate it.


Dropping Convexity for Faster Semi-definite Optimization

arXiv.org Machine Learning

We study the minimization of a convex function $f(X)$ over the set of $n\times n$ positive semi-definite matrices, but when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \leq n$. We study the performance of gradient descent on $g$---which we refer to as Factored Gradient Descent (FGD)---under standard assumptions on the original function $f$. We provide a rule for selecting the step size and, with this choice, show that the local convergence rate of FGD mirrors that of standard gradient descent on the original $f$: i.e., after $k$ steps, the error is $O(1/k)$ for smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly convex. In addition, we provide a procedure to initialize FGD for (restricted) strongly convex objectives and when one only has access to $f$ via a first-order oracle; for several problem instances, such proper initialization leads to global convergence guarantees. FGD and similar procedures are widely used in practice for problems that can be posed as matrix factorization. To the best of our knowledge, this is the first paper to provide precise convergence rate guarantees for general convex functions under standard convex assumptions.


A Linearly-Convergent Stochastic L-BFGS Algorithm

arXiv.org Machine Learning

We propose a new stochastic L-BFGS algorithm and prove a linear convergence rate for strongly convex and smooth functions. Our algorithm draws heavily from a recent stochastic variant of L-BFGS proposed in Byrd et al. (2014) as well as a recent approach to variance reduction for stochastic gradient descent from Johnson and Zhang (2013). We demonstrate experimentally that our algorithm performs well on large-scale convex and non-convex optimization problems, exhibiting linear convergence and rapidly solving the optimization problems to high levels of precision. Furthermore, we show that our algorithm performs well for a wide-range of step sizes, often differing by several orders of magnitude.