AITopics

1812.05224

Country: North America > United States > Massachusetts (0.34)

Genre: Research Report (0.50)

Industry: Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Guttenberg, Nicholas, Virgo, Nathaniel, Penn, Alexandra

On the potential for open-endedness in neural networks

arXiv.org Artificial IntelligenceDec-12-2018

Natural evolution gives the impression of leading to an open-ended process of increasing diversity and complexity. If our goal is to produce such open-endedness artificially, this suggests an approach driven by evolutionary metaphor. On the other hand, techniques from machine learning and artificial intelligence are often considered too narrow to provide the sort of exploratory dynamics associated with evolution. In this paper, we hope to bridge that gap by reviewing common barriers to open-endedness in the evolution-inspired approach and how they are dealt with in the evolutionary case - collapse of diversity, saturation of complexity, and failure to form new kinds of individuality. We then show how these problems map onto similar issues in the machine learning approach, and discuss how the same insights and solutions which alleviated those barriers in evolutionary approaches can be ported over. At the same time, the form these issues take in the machine learning formulation suggests new ways to analyze and resolve barriers to open-endedness. Ultimately, we hope to inspire researchers to be able to interchangeably use evolutionary and gradient-descent-based machine learning methods to approach the design and creation of open-ended systems.

artificial intelligence, evolutionary algorithm, machine learning, (15 more...)

1812.04907

Country: North America > United States > New York (0.28)

Genre:

Research Report (0.64)
Overview (0.46)

Industry:

Leisure & Entertainment > Games (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Gur-Ari, Guy, Roberts, Daniel A., Dyer, Ethan

Gradient Descent Happens in a Tiny Subspace

arXiv.org Artificial IntelligenceDec-11-2018

We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.

artificial intelligence, machine learning, subspace, (16 more...)

1812.04754

Country: North America > United States (0.93)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)

arXiv.org Artificial IntelligenceDec-11-2018

Diagnostic Visualization for Deep Neural Networks Using Stochastic Gradient Langevin Dynamics

Jiang, Biye, Chan, David M., Zhang, Tianhao, Canny, John F.

The internal states of most deep neural networks are difficult to interpret, which makes diagnosis and debugging during training challenging. Activation maximization methods are widely used, but lead to multiple optima and are hard to interpret (appear noise-like) for complex neurons. Image-based methods use maximally-activating image regions which are easier to interpret, but do not provide pixel-level insight into why the neuron responds to them. In this work we introduce an MCMC method: Langevin Dynamics Activation Maximization (LDAM), which is designed for diagnostic visualization. LDAM provides two affordances in combination: the ability to explore the set of maximally activating pre-images, and the ability to trade-off interpretability and pixel-level accuracy using a GAN-style discriminator as a regularizer. We present case studies on MNIST, CIFAR and ImageNet datasets exploring these trade-offs. Finally we show that diagnostic visualization using LDAM leads to a novel insight into the parameter averaging method for deep net training.

artificial intelligence, machine learning, neuron, (16 more...)

1812.04604

Country: North America > United States > New York (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Arora, Sanjeev, Li, Zhiyuan, Lyu, Kaifeng

Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

arXiv.org Machine LearningDec-10-2018

Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, $0.3$), gradient descent still approaches a stationary point (i.e., a solution where gradient is zero) in the rate of $T^{-1/2}$ in $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate $T^{-1/4}$ is also shown for stochastic gradient descent.

artificial intelligence, gradient descent, machine learning, (16 more...)

1812.03981

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Byrd, Jonathon, Lipton, Zachary C.

Weighted Risk Minimization & Deep Learning

arXiv.org Machine LearningDec-8-2018

Importance weighting is a key ingredient in many algorithms for causal inference and related problems, such as off-policy evaluation for reinforcement learning. Recently, theorists proved that on separable data, unregularized linear networks, trained with cross-entropy loss and optimized by stochastic gradient descent converge in direction to the max margin solution. This solution depends on the location of data points but not their weights, nullifying the effect of importance weighting. This paper asks, for realistic deep networks, for which all datasets are separable, what is the effect of importance weighting? Lacking theoretical tools for analyzing modern deep (nonlinear, unregularized) networks, we investigate the question empirically on both realistic and synthetic data. Our results demonstrate that while importance weighting alters the learned model early in training, its effect diminishes to negligible with indefinite training. However, this diminishing effect does not occur in the presence of L2-regularization. These results (i) support the broader applicability of theoretical findings by Soudry et al (2018), who analyze linear networks; (ii) call into question the practice of importance weighting; and (iii) suggest that its usefulness interacts strongly with the early stopping criteria and regularization methods that interact with the loss function.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

1812.03372

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

arXiv.org Machine LearningDec-7-2018

Parallel-tempered Stochastic Gradient Hamiltonian Monte Carlo for Approximate Multimodal Posterior Sampling

Luo, Rui, Zhang, Qiang, Liu, Yuanyuan

We propose a new sampler that integrates the protocol of parallel tempering with the Nos\'e-Hoover (NH) dynamics. The proposed method can efficiently draw representative samples from complex posterior distributions with multiple isolated modes in the presence of noise arising from stochastic gradient. It potentially facilitates deep Bayesian learning on large datasets where complex multimodal posteriors and mini-batch gradient are encountered.

artificial intelligence, machine learning, replica, (11 more...)

1812.01181

Country: Oceania > Australia (0.15)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)

arXiv.org Machine LearningDec-6-2018

On stochastic gradient Langevin dynamics with dependent data streams in the logconcave case

Barkhagen, M., Chau, N. H., Moulines, É., Rásonyi, M., Sabanis, S., Zhang, Y.

Stochastic Gradient Langevin Dynamics (SGLD) is a combination of a Robbins-Monro type algorithm with Langevin dynamics in order to perform data-driven stochastic optimization. In this paper, the SGLD method with fixed step size $\lambda$ is considered in order to sample from a logconcave target distribution $\pi$, known up to a normalisation factor. We assume that unbiased estimates of the gradient from possibly dependent observations are available. It is shown that, for all $\varepsilon>0$, the Wasserstein-$2$ distance of the $n$th iterate of the SGLD algorithm from $\pi$ is dominated by $c_1(\varepsilon)[\lambda^{1/2 - \varepsilon}+e^{-a\lambda n}]$ with appropriate constants $c_1(\varepsilon), a>0$.

artificial intelligence, machine learning, sequence, (14 more...)

1812.02709

Country:

Europe > Hungary (0.04)
Asia > Middle East > Syria > Daraa Governorate > Dar'a (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Adya, Saurabh, Palakkode, Vinay, Tuzel, Oncel

Nonlinear Conjugate Gradients For Scaling Synchronous Distributed DNN Training

arXiv.org Machine LearningDec-6-2018

Nonlinear conjugate gradient (NLCG) based optimizers have shown superior loss convergence properties compared to gradient descent based optimizers for traditional optimization problems. However, in Deep Neural Network (DNN) training, the dominant optimization algorithm of choice is still Stochastic Gradient Descent (SGD) and its variants. In this work, we propose and evaluate the stochastic preconditioned nonlinear conjugate gradient algorithm for large scale DNN training tasks. We show that a nonlinear conjugate gradient algorithm improves the convergence speed of DNN training, especially in the large mini-batch scenario, which is essential for scaling synchronous distributed DNN training to large number of workers. We show how to efficiently use second order information in the NLCG pre-conditioner for improving DNN training convergence. For the ImageNet classification task, at extremely large mini-batch sizes of greater than 65k, NLCG optimizer is able to improve top-1 accuracy by more than 10 percentage points for standard training of the Resnet-50 model for 90 epochs. For the CIFAR-100 classification task, at extremely large mini-batch sizes of greater than 16k, NLCG optimizer is able to improve top-1 accuracy by more than 15 percentage points for standard training of the Resnet-32 model for 200 epochs.

artificial intelligence, machine learning, optimization problem, (17 more...)

1812.02886

Country: North America > United States (0.28)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Machine LearningDec-6-2018

Limited Gradient Descent: Learning With Noisy Labels

Sun, Yi, Tian, Yan, Xu, Yiping

Label noise may handicap the generalization of classifiers, and the effective learning of the main pattern from samples with noisy labels is an important issue. Recent studies have shown that deep neural networks tend to prioritize the learning of simple patterns over the memorization of noise patterns. This suggests the need for a method to search for the best generalization that learns the main pattern until noise begins to be memorized. An intuitive idea is to use a supervised approach to find the stop timing of learning by, for example, employing a clean verification set. In practice, however, a clean verification set is sometimes difficult to obtain. To solve this problem, we propose an unsupervised method called limited gradient descent to estimate the best stop timing. We modified the labels of a few samples in a noisy dataset to be almost false labels, creating a reverse pattern. By monitoring the learning progresses of the noisy samples and the reverse samples, we could determine the stop timing of learning. In this paper, we also provide some sufficient conditions on learning with noisy labels. Experimental results on CIFAR-10 demonstrate that our approach has a similar generalization performance to supervised methods. For uncomplicated datasets, such as MNIST, we add a relabeling strategy to further improve generalization and achieve state-of-the-art performance.

artificial intelligence, deep learning, machine learning, (16 more...)

1811.08117

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)