AITopics | Gradient Descent

Gradient Descent

Samples are not all useful: Denoising policy gradient updates using variance

arXiv.org Machine Learning, Apr-10-2019

Policy gradient algorithms in reinforcement learning rely on efficiently sampling an environment. Most sampling procedures are based solely on sampling the agent's policy. However, other measures made available through these algorithms could be used to improve the sampling prior to each policy update. Following this line of thought, we propose a method where a transition is used in the gradient update if it meets a particular criterion, and rejected otherwise. This criterion is the fraction of variance explained ($\mathcal{V}^{ex}$), a measure of the discrepancy between a model and actual samples. $\mathcal{V}^{ex}$ can be used to evaluate the impact each transition will have on the learning. This criterion refines sampling and improves the policy gradient algorithm. In this paper: (1) We introduce and explore $\mathcal{V}^{ex}$, the selection criterion used to improve the sampling procedure. (2) We conduct experiments across a variety of standard benchmark environments, including continuous control problems. Our results show better performance than if we did not use the $\mathcal{V}^{ex}$ criterion for the policy gradient update. (3) We investigate why $\mathcal{V}^{ex}$ gives a good evaluation for the selection of samples that will positively impact the learning. (4) We show how this criterion can be interpreted as a dynamic way to adjust the ratio between exploration and exploitation.
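
The abstract above does not give a formula for $\mathcal{V}^{ex}$. As a minimal sketch, assuming the commonly used definition of explained variance for a value function, 1 - Var(returns - predictions) / Var(returns), the measure could be computed as follows. The function name and the batch-level usage are illustrative assumptions; how the paper applies the criterion to individual transitions is not reproduced here.

```python
from statistics import pvariance


def explained_variance(returns, value_predictions):
    """Fraction of the variance in `returns` explained by `value_predictions`.

    Assumed definition: 1 - Var(returns - predictions) / Var(returns).
    A result of 1.0 means perfect prediction; 0.0 means the predictions
    explain no more than a constant would; negative values are worse.
    """
    if len(returns) != len(value_predictions):
        raise ValueError("returns and value_predictions must have equal length")
    var_returns = pvariance(returns)
    if var_returns == 0:
        # The ratio is undefined when the returns do not vary at all.
        return float("nan")
    residuals = [r - v for r, v in zip(returns, value_predictions)]
    return 1.0 - pvariance(residuals) / var_returns
```

A selection rule in the spirit of the abstract would compare such a score against a threshold before admitting samples to the gradient update; the threshold and its granularity would have to be taken from the paper itself.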

gradient update, policy gradient update, value function, (13 more...)

1904.04025

Country:

Asia > Middle East > Jordan (0.04)
Europe > France > Hauts-de-France > Pas-de-Calais (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

On the Adaptivity of Stochastic Gradient-Based Optimization

Lei, Lihua, Jordan, Michael I.

arXiv.org Machine Learning, Apr-9-2019

Stochastic-gradient-based optimization has been a core enabling methodology in applications to large-scale problems in machine learning and related areas. Despite the progress, the gap between theory and practice remains significant, with theoreticians pursuing mathematical optimality at the cost of obtaining specialized procedures in different regimes (e.g., modulus of strong convexity, magnitude of target accuracy, signal-to-noise ratio), and with practitioners not readily able to know which regime is appropriate to their problem, and seeking broadly applicable algorithms that are reasonably close to optimality. To bridge these perspectives it is necessary to study algorithms that are adaptive to different regimes. We present the stochastically controlled stochastic gradient (SCSG) method for composite convex finite-sum optimization problems and show that SCSG is adaptive to both strong convexity and target accuracy. The adaptivity is achieved by batch variance reduction with adaptive batch sizes and a novel technique, which we refer to as geometrization, which sets the length of each epoch as a geometric random variable. The algorithm achieves strictly better theoretical complexity than other existing adaptive algorithms, while the tuning parameters of the algorithm only depend on the smoothness parameter of the objective.
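
The two ingredients named in the abstract above, batch variance reduction and a geometrically distributed epoch length, can be illustrated on a toy problem. The following is only a sketch under stated assumptions, not the authors' algorithm: it uses a scalar least-squares objective, a fixed batch size instead of the adaptive schedule, inner steps on single samples, and no composite (proximal) term. All names and default values are hypothetical.

```python
import random


def scsg_sketch(a, y, epochs=50, batch_size=32, lr=0.05, seed=0):
    """Variance-reduced SGD with geometric epoch lengths (toy sketch).

    Minimises the average of 0.5 * (a[i] * w - y[i]) ** 2 over a scalar w.
    """
    rng = random.Random(seed)
    n = len(a)
    w = 0.0

    def grad_i(weight, i):
        # Gradient of the i-th term 0.5 * (a[i] * weight - y[i]) ** 2.
        return a[i] * (a[i] * weight - y[i])

    # Probability of ending the epoch after each step, chosen so that the
    # expected number of inner steps equals batch_size.
    stop_prob = 1.0 / (batch_size + 1)

    for _ in range(epochs):
        # Batch gradient at the anchor point (start of the epoch).
        batch = rng.sample(range(n), batch_size)
        anchor = w
        anchor_grad = sum(grad_i(anchor, i) for i in batch) / batch_size

        # Geometrization: draw the epoch length from a geometric distribution.
        n_steps = 0
        while rng.random() >= stop_prob:
            n_steps += 1

        for _ in range(n_steps):
            i = rng.randrange(n)
            # Variance-reduced direction: fresh gradient, minus the same
            # sample's gradient at the anchor, plus the batch gradient.
            direction = grad_i(w, i) - grad_i(anchor, i) + anchor_grad
            w -= lr * direction
    return w
```

On noiseless data generated as y[i] = 3 * a[i], the returned weight should approach 3. The step size here is hand-picked for standard-normal inputs and is not derived from the paper's analysis.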

algorithm, artificial intelligence, machine learning, (17 more...)

1904.0448

Country: North America > United States > California (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.91)

Perturbative estimation of stochastic gradients

Ambrogioni, Luca, van Gerven, Marcel A. J.

arXiv.org Machine Learning, Apr-8-2019

In this paper we introduce a family of stochastic gradient estimation techniques based on the perturbative expansion around the mean of the sampling distribution. We characterize the bias and variance of the resulting Taylor-corrected estimators using the Lagrange error formula. Furthermore, we introduce a family of variance reduction techniques that can be applied to other gradient estimators. Finally, we show that these new perturbative methods can be extended to discrete functions using analytic continuation. Using this technique, we derive a new gradient descent method for training stochastic networks with binary weights. In our experiments, we show that the perturbative correction improves the convergence of stochastic variational inference both in the continuous and in the discrete case.

artificial intelligence, estimator, machine learning, (17 more...)

1904.00469

Genre: Research Report > New Finding (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.93)

A Comparative Analysis of the Optimization and Generalization Property of Two-layer Neural Network and Random Feature Models Under Gradient Descent Dynamics

E, Weinan, Ma, Chao, Wu, Lei

arXiv.org Machine Learning, Apr-8-2019

A fairly comprehensive analysis is presented for the gradient descent dynamics for training two-layer neural network models in the situation when the parameters in both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels. In addition, it is proved that throughout the training process the functions represented by the neural network model are uniformly close to those of a kernel method. For general values of the network width and training data size, sharp estimates of the generalization error are established for target functions in the appropriate reproducing kernel Hilbert space. Our analysis suggests strongly that in terms of 'implicit regularization', two-layer neural network models do not outperform the kernel method.

artificial intelligence, machine learning, neural network, (16 more...)

1904.04326

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)

Policy Gradient Search: Online Planning and Expert Iteration without Search Trees

Anthony, Thomas, Nishihara, Robert, Moritz, Philipp, Salimans, Tim, Schulman, John

arXiv.org Machine Learning, Apr-7-2019

Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play. MCTS has been used by state-of-the-art programs for many problems; however, a disadvantage of MCTS is that it estimates the values of states with Monte Carlo averages, stored in a search tree, and this does not scale to games with very high branching factors. We propose an alternative simulation-based search method, Policy Gradient Search (PGS), which adapts a neural network simulation policy online via policy gradient updates, avoiding the need for a search tree. In Hex, PGS achieves comparable performance to MCTS, and an agent trained using Expert Iteration with PGS was able to defeat MoHex 2.0, the strongest open-source Hex agent, in 9x9 Hex.

algorithm, artificial intelligence, machine learning, (16 more...)

1904.03646

Country:

North America > Canada > Alberta (0.14)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > Canada > Quebec > Montreal (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks

Li, Mingchen, Soltanolkotabi, Mahdi, Oymak, Samet

arXiv.org Machine Learning, Apr-7-2019

Deep neural networks (DNNs) are ubiquitous in a growing number of domains ranging from computer vision to healthcare. State-of-the-art DNN models are typically overparameterized and contain more parameters than the size of the training dataset. It is well understood that in this overparameterized regime, DNNs are highly expressive and have the capacity to (over)fit arbitrary training datasets including pure noise [56]. Mysteriously, however, neural network models trained via simple algorithms such as stochastic gradient descent continue to predict well on yet unseen test data. In such overparameterized scenarios there may be infinitely many globally optimal network parameters consistent with the training data; the key challenge is to understand which network parameters (stochastic) gradient descent converges to and what their properties are. Indeed, a recent series of papers [16, 52, 56] suggests that solutions found by first-order methods tend to have favorable generalization properties. As DNNs begin to be deployed in safety-critical applications, the need for foundational understanding of their noise robustness and their unique prediction capabilities intensifies. This paper focuses on an intriguing phenomenon: overparameterized neural networks are surprisingly robust to label noise when first-order methods with early stopping are used to train them [25]. To observe this phenomenon, consider Figure 1, where we perform experiments on the MNIST data set.

artificial intelligence, machine learning, neural network, (15 more...)

1903.1168

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > California > Riverside County > Riverside (0.14)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Data Shapley: Equitable Valuation of Data for Machine Learning

Ghorbani, Amirata, Zou, James

arXiv.org Artificial Intelligence, Apr-5-2019

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on $n$ data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor.
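
The abstract above mentions Monte Carlo estimation of data Shapley values. A minimal sketch of one standard approach, permutation sampling, is given below; it averages each point's marginal contribution over random orderings. This is an illustration and not the authors' implementation: `value_fn` is a hypothetical callback that, in a realistic setting, would train a model on the given subset and return its performance, and the paper's efficiency refinements are not reproduced.

```python
import random


def monte_carlo_shapley(n_points, value_fn, n_permutations=200, seed=0):
    """Estimate Shapley values of n_points data points by permutation sampling.

    value_fn(subset) takes a list of point indices and returns a number,
    for example the test accuracy of a model trained on that subset.
    """
    rng = random.Random(seed)
    totals = [0.0] * n_points
    for _ in range(n_permutations):
        order = list(range(n_points))
        rng.shuffle(order)
        subset = []
        previous_value = value_fn(subset)
        for i in order:
            # Marginal contribution of point i given the points before it.
            subset.append(i)
            current_value = value_fn(subset)
            totals[i] += current_value - previous_value
            previous_value = current_value
    return [total / n_permutations for total in totals]


# Toy usage with a hypothetical additive value function: each point's
# Shapley value then equals its own weight.
weights = [1.0, 2.0, -0.5]
estimates = monte_carlo_shapley(3, lambda s: sum(weights[i] for i in s))
```

Because every permutation's marginal contributions telescope, the estimates always sum to the value of the full dataset minus the value of the empty set, which gives a cheap sanity check for any `value_fn`.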

artificial intelligence, machine learning, shapley, (15 more...)

1904.02868

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)
(3 more...)

Genre: Research Report (0.85)

Industry:

Information Technology > Security & Privacy (0.68)
Health & Medicine > Therapeutic Area > Oncology (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.30)

Multi-Preference Actor Critic

Durugkar, Ishan, Hausknecht, Matthew, Swaminathan, Adith, MacAlpine, Patrick

arXiv.org Artificial Intelligence, Apr-5-2019

Policy gradient algorithms typically combine discounted future rewards with an estimated value function, to compute the direction and magnitude of parameter updates. However, for most Reinforcement Learning tasks, humans can provide additional insight to constrain the policy learning. We introduce a general method to incorporate multiple different feedback channels into a single policy gradient loss. In our formulation, the Multi-Preference Actor Critic (M-PAC), these different types of feedback are implemented as constraints on the policy. We use a Lagrangian relaxation to satisfy these constraints using gradient descent while learning a policy that maximizes rewards. Experiments in Atari and Pendulum verify that constraints are being respected and can accelerate the learning process.
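
The Lagrangian relaxation mentioned in the abstract above can be illustrated on a scalar toy problem that has nothing to do with reinforcement learning: minimise x^2 subject to x >= 1. The sketch below alternates a gradient descent step on x with a projected gradient ascent step on the multiplier. The problem, the function name, and the step size are assumptions chosen for illustration; the M-PAC loss itself is not reproduced.

```python
def lagrangian_descent(steps=2000, lr=0.05):
    """Minimise x**2 subject to x >= 1 via the Lagrangian
    L(x, lam) = x**2 + lam * (1 - x), with lam >= 0."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Descent on x: dL/dx = 2x - lam.
        x -= lr * (2.0 * x - lam)
        # Ascent on lam: dL/dlam = 1 - x, projected onto lam >= 0.
        lam = max(0.0, lam + lr * (1.0 - x))
    return x, lam


x_opt, lam_opt = lagrangian_descent()
```

The iterates settle at the constrained optimum x = 1 with multiplier lam = 2. The multiplier grows while the constraint is violated and stops growing once it is met, which is the mechanism that lets a single gradient-based loop trade off the objective against the constraints.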

demonstration, machine learning, reinforcement learning, (14 more...)

1904.03295

Country:

North America > United States > Texas > Travis County > Austin (0.14)
North America > United States > Washington > King County > Redmond (0.04)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment (0.69)
Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Backtracking gradient descent method for general $C^1$ functions, with applications to Deep Learning

Truong, Tuyen Trung, Nguyen, Tuan Hang

arXiv.org Machine Learning, Apr-4-2019

While standard gradient descent is a very popular optimisation method, its convergence cannot be proven beyond the class of functions whose gradient is globally Lipschitz continuous. As such, it is not actually applicable to realistic applications such as Deep Neural Networks. In this paper, we prove that its backtracking variant behaves very nicely; in particular, convergence can be shown for all Morse functions. The main theoretical result of this paper is as follows. Theorem. Let $f:\mathbb{R}^k\rightarrow \mathbb{R}$ be a $C^1$ function, and $\{z_n\}$ a sequence constructed from the Backtracking gradient descent algorithm. (1) Either $\lim _{n\rightarrow\infty}||z_n||=\infty$ or $\lim _{n\rightarrow\infty}||z_{n+1}-z_n||=0$. (2) Assume that $f$ has at most countably many critical points. Then either $\lim _{n\rightarrow\infty}||z_n||=\infty$ or $\{z_n\}$ converges to a critical point of $f$. (3) More generally, assume that all connected components of the set of critical points of $f$ are compact. Then either $\lim _{n\rightarrow\infty}||z_n||=\infty$ or $\{z_n\}$ is bounded. Moreover, in the latter case the set of cluster points of $\{z_n\}$ is connected. Some generalised versions of this result, including an inexact version, are included. Another result in this paper concerns the problem of saddle points. We then present a heuristic argument to explain why the standard gradient descent method works so well, as well as modifications of the backtracking versions of GD, MMT and NAG. Experiments with the datasets CIFAR10 and CIFAR100 on various popular architectures verify the heuristic argument also for the mini-batch practice and show that our new algorithms, while automatically fine-tuning learning rates, perform better than current state-of-the-art methods such as MMT, NAG, Adagrad, Adadelta, RMSProp, Adam and Adamax.
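
Backtracking gradient descent, the subject of the theorem above, can be sketched with the classical Armijo rule: starting from an initial step size, shrink the step until the function decreases by at least a fixed fraction of the step times the squared gradient norm. The sketch below is an assumption-laden illustration rather than the paper's code; the parameter names `delta0`, `alpha` and `beta`, their defaults, and the quartic test function (whose gradient is not globally Lipschitz) are chosen here for demonstration.

```python
import math


def backtracking_gd(f, grad, z0, delta0=1.0, alpha=0.5, beta=0.5,
                    tol=1e-8, max_iters=1000, max_halvings=60):
    """Gradient descent with Armijo backtracking line search."""
    z = list(z0)
    for _ in range(max_iters):
        g = grad(z)
        g_norm_sq = sum(gi * gi for gi in g)
        if math.sqrt(g_norm_sq) < tol:
            break  # approximately at a critical point
        fz = f(z)
        delta = delta0
        for _ in range(max_halvings):
            candidate = [zi - delta * gi for zi, gi in zip(z, g)]
            # Armijo condition: sufficient decrease along -gradient.
            if f(candidate) <= fz - alpha * delta * g_norm_sq:
                break
            delta *= beta
        else:
            return z  # no acceptable step size found; stop here
        z = candidate
    return z


def quartic(z):
    # f(z) = |z|^4 / 4 + |z|^2 / 2, minimised at the origin.
    r2 = sum(zi * zi for zi in z)
    return 0.25 * r2 * r2 + 0.5 * r2


def quartic_grad(z):
    r2 = sum(zi * zi for zi in z)
    return [(r2 + 1.0) * zi for zi in z]


z_final = backtracking_gd(quartic, quartic_grad, [3.0, -2.0])
```

With a fixed step size of 1.0 the first update on this function would overshoot badly, whereas the backtracking loop shrinks the step until the Armijo condition holds, so every accepted iterate has a strictly lower function value than the one before it.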

artificial intelligence, backtracking gd, machine learning, (17 more...)

1808.0516

Country:

Europe (0.46)
North America > United States (0.28)
Asia (0.28)

Genre: Research Report > Promising Solution (0.34)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Adaptive Sequential Machine Learning

Wilson, Craig, Bu, Yuheng, Veeravalli, Venugopal

arXiv.org Machine Learning, Apr-4-2019

A framework previously introduced in [3] for solving a sequence of stochastic optimization problems with bounded changes in the minimizers is extended and applied to machine learning problems such as regression and classification. The stochastic optimization problems arising in these machine learning problems are solved using algorithms such as stochastic gradient descent (SGD). A method based on estimates of the change in the minimizers and properties of the optimization algorithm is introduced for adaptively selecting the number of samples at each time step to ensure that the excess risk, i.e., the expected gap between the loss achieved by the approximate minimizer produced by the optimization algorithm and the exact minimizer, does not exceed a target level. A bound is developed to show that the estimate of the change in the minimizers is non-trivial provided that the excess risk is small enough. Extensions relevant to the machine learning setting are considered, including a cost-based approach to select the number of samples with a cost budget over a fixed horizon, and an approach to applying cross-validation for model selection. Finally, experiments with synthetic and real data are used to validate the algorithms.

artificial intelligence, machine learning, optimization problem, (15 more...)

1904.02773

Country: North America > United States > New York (0.28)

Genre: Research Report (1.00)

Industry:

Education > Educational Setting > Online (0.46)
Education > Focused Education > Special Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
