Gradient Descent
Next Hit Predictor - Self-exciting Risk Modeling for Predicting Next Locations of Serial Crimes
Our goal is to predict the location of the next crime in a crime series, based on the identified previous offenses in the series. We build a predictive model called Next Hit Predictor (NHP) that finds the most likely location of the next serial crime via a carefully designed risk model. The risk model follows the paradigm of a self-exciting point process which consists of a background crime risk and triggered risks stimulated by previous offenses in the series. Thus, NHP creates a risk map for a crime series at hand. To train the risk model, we formulate a convex learning objective that considers pairwise rankings of locations and use stochastic gradient descent to learn the optimal parameters. Next Hit Predictor incorporates both spatial-temporal features and geographical characteristics of prior crime locations in the series. Next Hit Predictor has demonstrated promising results on decades' worth of serial crime data collected by the Crime Analysis Unit of the Cambridge Police Department in Massachusetts, USA.
On the potential for open-endedness in neural networks
Guttenberg, Nicholas, Virgo, Nathaniel, Penn, Alexandra
Natural evolution gives the impression of leading to an open-ended process of increasing diversity and complexity. If our goal is to produce such open-endedness artificially, this suggests an approach driven by evolutionary metaphor. On the other hand, techniques from machine learning and artificial intelligence are often considered too narrow to provide the sort of exploratory dynamics associated with evolution. In this paper, we hope to bridge that gap by reviewing common barriers to open-endedness in the evolution-inspired approach and how they are dealt with in the evolutionary case - collapse of diversity, saturation of complexity, and failure to form new kinds of individuality. We then show how these problems map onto similar issues in the machine learning approach, and discuss how the same insights and solutions which alleviated those barriers in evolutionary approaches can be ported over. At the same time, the form these issues take in the machine learning formulation suggests new ways to analyze and resolve barriers to open-endedness. Ultimately, we hope to inspire researchers to be able to interchangeably use evolutionary and gradient-descent-based machine learning methods to approach the design and creation of open-ended systems.
Gradient Descent Happens in a Tiny Subspace
Gur-Ari, Guy, Roberts, Daniel A., Dyer, Ethan
We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.
Diagnostic Visualization for Deep Neural Networks Using Stochastic Gradient Langevin Dynamics
Jiang, Biye, Chan, David M., Zhang, Tianhao, Canny, John F.
The internal states of most deep neural networks are difficult to interpret, which makes diagnosis and debugging during training challenging. Activation maximization methods are widely used, but lead to multiple optima and are hard to interpret (appear noise-like) for complex neurons. Image-based methods use maximally-activating image regions which are easier to interpret, but do not provide pixel-level insight into why the neuron responds to them. In this work we introduce an MCMC method: Langevin Dynamics Activation Maximization (LDAM), which is designed for diagnostic visualization. LDAM provides two affordances in combination: the ability to explore the set of maximally activating pre-images, and the ability to trade-off interpretability and pixel-level accuracy using a GAN-style discriminator as a regularizer. We present case studies on MNIST, CIFAR and ImageNet datasets exploring these trade-offs. Finally we show that diagnostic visualization using LDAM leads to a novel insight into the parameter averaging method for deep net training.
Theoretical Analysis of Auto Rate-Tuning by Batch Normalization
Arora, Sanjeev, Li, Zhiyuan, Lyu, Kaifeng
Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, $0.3$), gradient descent still approaches a stationary point (i.e., a solution where gradient is zero) in the rate of $T^{-1/2}$ in $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate $T^{-1/4}$ is also shown for stochastic gradient descent.
Weighted Risk Minimization & Deep Learning
Byrd, Jonathon, Lipton, Zachary C.
Importance weighting is a key ingredient in many algorithms for causal inference and related problems, such as off-policy evaluation for reinforcement learning. Recently, theorists proved that on separable data, unregularized linear networks, trained with cross-entropy loss and optimized by stochastic gradient descent converge in direction to the max margin solution. This solution depends on the location of data points but not their weights, nullifying the effect of importance weighting. This paper asks, for realistic deep networks, for which all datasets are separable, what is the effect of importance weighting? Lacking theoretical tools for analyzing modern deep (nonlinear, unregularized) networks, we investigate the question empirically on both realistic and synthetic data. Our results demonstrate that while importance weighting alters the learned model early in training, its effect diminishes to negligible with indefinite training. However, this diminishing effect does not occur in the presence of L2-regularization. These results (i) support the broader applicability of theoretical findings by Soudry et al (2018), who analyze linear networks; (ii) call into question the practice of importance weighting; and (iii) suggest that its usefulness interacts strongly with the early stopping criteria and regularization methods that interact with the loss function.
Parallel-tempered Stochastic Gradient Hamiltonian Monte Carlo for Approximate Multimodal Posterior Sampling
Luo, Rui, Zhang, Qiang, Liu, Yuanyuan
We propose a new sampler that integrates the protocol of parallel tempering with the Nos\'e-Hoover (NH) dynamics. The proposed method can efficiently draw representative samples from complex posterior distributions with multiple isolated modes in the presence of noise arising from stochastic gradient. It potentially facilitates deep Bayesian learning on large datasets where complex multimodal posteriors and mini-batch gradient are encountered.
On stochastic gradient Langevin dynamics with dependent data streams in the logconcave case
Barkhagen, M., Chau, N. H., Moulines, ร., Rรกsonyi, M., Sabanis, S., Zhang, Y.
Stochastic Gradient Langevin Dynamics (SGLD) is a combination of a Robbins-Monro type algorithm with Langevin dynamics in order to perform data-driven stochastic optimization. In this paper, the SGLD method with fixed step size $\lambda$ is considered in order to sample from a logconcave target distribution $\pi$, known up to a normalisation factor. We assume that unbiased estimates of the gradient from possibly dependent observations are available. It is shown that, for all $\varepsilon>0$, the Wasserstein-$2$ distance of the $n$th iterate of the SGLD algorithm from $\pi$ is dominated by $c_1(\varepsilon)[\lambda^{1/2 - \varepsilon}+e^{-a\lambda n}]$ with appropriate constants $c_1(\varepsilon), a>0$.
Nonlinear Conjugate Gradients For Scaling Synchronous Distributed DNN Training
Adya, Saurabh, Palakkode, Vinay, Tuzel, Oncel
Nonlinear conjugate gradient (NLCG) based optimizers have shown superior loss convergence properties compared to gradient descent based optimizers for traditional optimization problems. However, in Deep Neural Network (DNN) training, the dominant optimization algorithm of choice is still Stochastic Gradient Descent (SGD) and its variants. In this work, we propose and evaluate the stochastic preconditioned nonlinear conjugate gradient algorithm for large scale DNN training tasks. We show that a nonlinear conjugate gradient algorithm improves the convergence speed of DNN training, especially in the large mini-batch scenario, which is essential for scaling synchronous distributed DNN training to large number of workers. We show how to efficiently use second order information in the NLCG pre-conditioner for improving DNN training convergence. For the ImageNet classification task, at extremely large mini-batch sizes of greater than 65k, NLCG optimizer is able to improve top-1 accuracy by more than 10 percentage points for standard training of the Resnet-50 model for 90 epochs. For the CIFAR-100 classification task, at extremely large mini-batch sizes of greater than 16k, NLCG optimizer is able to improve top-1 accuracy by more than 15 percentage points for standard training of the Resnet-32 model for 200 epochs.
Limited Gradient Descent: Learning With Noisy Labels
Sun, Yi, Tian, Yan, Xu, Yiping
Label noise may handicap the generalization of classifiers, and the effective learning of the main pattern from samples with noisy labels is an important issue. Recent studies have shown that deep neural networks tend to prioritize the learning of simple patterns over the memorization of noise patterns. This suggests the need for a method to search for the best generalization that learns the main pattern until noise begins to be memorized. An intuitive idea is to use a supervised approach to find the stop timing of learning by, for example, employing a clean verification set. In practice, however, a clean verification set is sometimes difficult to obtain. To solve this problem, we propose an unsupervised method called limited gradient descent to estimate the best stop timing. We modified the labels of a few samples in a noisy dataset to be almost false labels, creating a reverse pattern. By monitoring the learning progresses of the noisy samples and the reverse samples, we could determine the stop timing of learning. In this paper, we also provide some sufficient conditions on learning with noisy labels. Experimental results on CIFAR-10 demonstrate that our approach has a similar generalization performance to supervised methods. For uncomplicated datasets, such as MNIST, we add a relabeling strategy to further improve generalization and achieve state-of-the-art performance.