Gradient Descent
Nonconvex Rectangular Matrix Completion via Gradient Descent without $\ell_{2,\infty}$ Regularization
Chen, Ji, Liu, Dekai, Li, Xiaodong
The analysis of nonconvex matrix completion has recently attracted much attention in the community of machine learning thanks to its computational convenience. Existing analysis on this problem, however, usually relies on $\ell_{2,\infty}$ projection or regularization that involves unknown model parameters, although they are observed to be unnecessary in numerical simulations, see, e.g. Zheng and Lafferty [2016]. In this paper, we extend the analysis of the vanilla gradient descent for positive semidefinite matrix completion proposed in Ma et al. [2017] to the rectangular case, and more significantly, improve the required sampling complexity from $\widetilde{O}(r^3)$ to $\widetilde{O}(r^2)$. Our technical ideas and contributions are potentially useful in improving the leave-one-out analysis in other related problems.
Quasi-potential as an implicit regularizer for the loss function in the stochastic gradient descent
Hu, Wenqing, Zhu, Zhanxing, Xiong, Haoyi, Huan, Jun
We interpret the variational inference of the Stochastic Gradient Descent (SGD) as minimizing a new potential function named the \textit{quasi-potential}. We analytically construct the quasi-potential function in the case when the loss function is convex and admits only one global minimum point. We show in this case that the quasi-potential function is related to the noise covariance structure of SGD via a partial differential equation of Hamilton-Jacobi type. This relation helps us to show that anisotropic noise leads to faster escape than isotropic noise. We then consider the dynamics of SGD in the case when the loss function is non-convex and admits several different local minima. In this case, we demonstrate an example that shows how the noise covariance structure plays a role in "implicit regularization", a phenomenon in which SGD favors some particular local minimum points. This is done through the relation between the noise covariance structure and the quasi-potential function. Our analysis is based on Large Deviations Theory (LDT), and they are validated by numerical experiments.
A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks
Simsekli, Umut, Sagun, Levent, Gurbuzbalaban, Mert
The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed $\alpha$-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a L\'{e}vy motion. Such SDEs can incur `jumps', which force the SDE transition from narrow minima to wider minima, as proven by existing metastability theory. To validate the $\alpha$-stable assumption, we conduct extensive experiments on common deep learning architectures and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We further investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
Designing neural networks through neuroevolution
Much of recent machine learning has focused on deep learning, in which neural network weights are trained through variants of stochastic gradient descent. An alternative approach comes from the field of neuroevolution, which harnesses evolutionary algorithms to optimize neural networks, inspired by the fact that natural brains themselves are the products of an evolutionary process. Neuroevolution enables important capabilities that are typically unavailable to gradient-based approaches, including learning neural network building blocks (for example activation functions), hyperparameters, architectures and even the algorithms for learning themselves. Neuroevolution also differs from deep learning (and deep reinforcement learning) by maintaining a population of solutions during search, enabling extreme exploration and massive parallelization. Finally, because neuroevolution research has (until recently) developed largely in isolation from gradient-based neural network research, it has developed many unique and effective techniques that should be effective in other machine learning areas too.
Generalization in Deep Networks: The Role of Distance from Initialization
Nagarajan, Vaishnavh, Kolter, J. Zico
Why does training deep neural networks using stochastic gradient descent (SGD) result in a generalization error that does not worsen with the number of parameters in the network? To answer this question, we advocate a notion of effective model capacity that is dependent on {\em a given random initialization of the network} and not just the training algorithm and the data distribution. We provide empirical evidences that demonstrate that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of {\em the $\ell_2$ distance from the initialization}. We also provide theoretical arguments that further highlight the need for initialization-dependent notions of model capacity. We leave as open questions how and why distance from initialization is regularized, and whether it is sufficient to explain generalization.
Delta Learning Rule & Gradient Descent Neural Networks
The development of the perceptron was a big step towards the goal of creating useful connectionist networks capable of learning complex relations between inputs and outputs. In the late 1950's, the connectionist community understood that what was needed for further development of connectionist models was a mathematically-derived (and thus potentially more flexible and powerful) rule for learning. By early 1960's, the Delta Rule [also known as the Widrow & Hoff Learning rule or the Least Mean Square (LMS) rule] was invented by Widrow and Hoff. This rule is similar to the perceptron learning rule by McClelland & Rumelhart, 1988, but is also characterized by a mathematical utility and elegance missing in the perceptron and other early learning rules. The Delta Rule uses the difference between target activation (i.e., target output values) and obtained activation to drive learning.
Dynamic Online Gradient Descent with Improved Query Complexity: A Theoretical Revisit
Zhao, Yawei, Zhu, En, Liu, Xinwang, Yin, Jianping
We provide a new theoretical analysis framework to investigate online gradient descent in the dynamic environment. Comparing with the previous work, the new framework recovers the state-of-the-art dynamic regret, but does not require extra gradient queries for every iteration. Specifically, when functions are $\alpha$ strongly convex and $\beta$ smooth, to achieve the state-of-the-art dynamic regret, the previous work requires $O(\kappa)$ with $\kappa = \frac{\beta}{\alpha}$ queries of gradients at every iteration. But, our framework shows that the query complexity can be improved to be $O(1)$, which does not depend on $\kappa$. The improvement is significant for ill-conditioned problems because that their objective function usually has a large $\kappa$.
Credit Assignment Techniques in Stochastic Computation Graphs
Weber, Thรฉophane, Heess, Nicolas, Buesing, Lars, Silver, David
Stochastic computation graphs (SCGs) provide a formalism to represent structured optimization problems arising in artificial intelligence, including supervised, unsupervised, and reinforcement learning. Previous work has shown that an unbiased estimator of the gradient of the expected loss of SCGs can be derived from a single principle. However, this estimator often has high variance and requires a full model evaluation per data point, making this algorithm costly in large graphs. In this work, we address these problems by generalizing concepts from the reinforcement learning literature. We introduce the concepts of value functions, baselines and critics for arbitrary SCGs, and show how to use them to derive lower-variance gradient estimates from partial model evaluations, paving the way towards general and efficient credit assignment for gradient-based optimization. In doing so, we demonstrate how our results unify recent advances in the probabilistic inference and reinforcement learning literature.
How to chose an activation function for your network
This is the third post in the optimization series, where we are trying to give the reader a comprehensive review of optimization in deep learning. Mini Batch Gradient Descent is used to combat local minima, and saddle points. How adaptive methods like Momentum, RMSProp and Adam, augment vanilla Gradient Descent to address the problem of pathological curvature. Neural networks, unlike the machine learning methods that came before it do not rest upon any probabilistic or statistical assumptions about the data they are fed. However, one of the most, if not the most important element required to ensure that neural networks learn properly is that the data fed to the layers of a neural network exhibit certain properties. In this article, we will cover problems No. 1 and 2, and how activation functions are used to address them.
The Extended Kalman Filter is a Natural Gradient Descent in Trajectory Space
The extended Kalman filter is perhaps the most standard tool to estimate in real time the state of a dynamical system from noisy measurements of some function of the system, with extensive practical applications (such as position tracking via GPS). While the plain Kalman filter for linear systems is well-understood, the extended Kalman filter relies on linearizations which have been debated. We recover the exact extended Kalman filter equations from first principles in statistical learning: the extended Kalman filter is equal to Amari's online natural gradient, applied in the space of trajectories of the system. Namely, each possible trajectory of the dynamical system defines a probability law over possible observations. In principle this makes it possible to treat the underlying trajectory as the parameter of a statistical model of the observations. Then the parameter can be learned by gradient ascent on the log-likelihood of observations, as they become available. Using Amari's natural gradient from information geometry (a gradient descent preconditioned with the Fisher matrix, which provides parameterization-invariance) exactly recovers the extended Kalman filter. This applies only to a particular choice of process noise in the Kalman filter, namely, taking noise proportional to the posterior covariance - a canonical choice in the absence of specific model information.