AITopics

1811.06889

Country: North America (0.28)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)

#artificialintelligenceNov-15-2018, 18:54:48 GMT

Everything You Need to Know About – Reinforcement Learning Vinod Sharma's Blog

It learn from interaction with environment to achieve a goal or simply learns from reward and punishments. This learning is inspired by behaviourist phycology. From the best research I got the answer as it got termed in 1980's while some research study was conducted on animals behaviour. Especially how some new born baby animals learns to stand, run, and survive in the given environment. Rewards is a survival from learning and punishment can be compared with being eaten by others.

artificial intelligence, machine learning, reinforcement learning, (9 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Kumaraswamy, Raksha, Schlegel, Matthew, White, Adam, White, Martha

Context-Dependent Upper-Confidence Bounds for Directed Exploration

arXiv.org Artificial IntelligenceNov-15-2018

Directed exploration strategies for reinforcement learning are critical for learning an optimal policy in a minimal number of interactions with the environment. Many algorithms use optimism to direct exploration, either through visitation estimates or upper confidence bounds, as opposed to data-inefficient strategies like \epsilon-greedy that use random, undirected exploration. Most data-efficient exploration methods require significant computation, typically relying on a learned model to guide exploration. Least-squares methods have the potential to provide some of the data-efficiency benefits of model-based approaches -- because they summarize past interactions -- with the computation closer to that of model-free approaches. In this work, we provide a novel, computationally efficient, incremental exploration strategy, leveraging this property of least-squares temporal difference learning (LSTD). We derive upper confidence bounds on the action-values learned by LSTD, with context-dependent (or state-dependent) noise variance. Such context-dependent noise focuses exploration on a subset of variable states, and allows for reduced exploration in other states. We empirically demonstrate that our algorithm can converge more quickly than other incremental exploration strategies using confidence estimates on action-values.

artificial intelligence, exploration, upstream oil & gas, (18 more...)

1811.06629

Country: North America > Canada > Alberta (0.14)

Genre: Research Report > New Finding (0.67)

Industry: Energy > Oil & Gas > Upstream (0.74)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Journal of Artificial Intelligence ResearchNov-15-2018

Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity

Liu, Bo, Gemp, Ian, Ghavamzadeh, Mohammad, Liu, Ji, Mahadevan, Sridhar, Petrik, Marek

In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and do not provide any finite-sample analysis. We also propose an accelerated algorithm, called GTD2-MP, that uses proximal "mirror maps" to yield an improved convergence rate. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.

algorithm, finite-sample analysis, objective function, (14 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.1.11251

AI Access Foundation

11260

Journal of Artificial Intelligence Research

Country:

North America > Canada > Alberta (0.14)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(6 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Energy > Energy Storage (1.00)
Electrical Industrial Apparatus (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

arXiv.org Artificial IntelligenceNov-15-2018

The Utility of Sparse Representations for Control in Reinforcement Learning

Liu, Vincent, Kumaraswamy, Raksha, Le, Lei, White, Martha

We investigate sparse representations for control in reinforcement learning. While these representations are widely used in computer vision, their prevalence in reinforcement learning is limited to sparse coding where extracting representations for new data can be computationally intensive. Here, we begin by demonstrating that learning a control policy incrementally with a representation from a standard neural network fails in classic control domains, whereas learning with a representation obtained from a neural network that has sparsity properties enforced is effective. We provide evidence that the reason for this is that the sparse representation provides locality, and so avoids catastrophic interference, and particularly keeps consistent, stable values for bootstrapping. We then discuss how to learn such sparse representations. We explore the idea of Distributional Regularizers, where the activation of hidden nodes is encouraged to match a particular distribution that results in sparse activation across time. We identify a simple but effective way to obtain sparse representations, not afforded by previously proposed strategies, making it more practical for further investigation into sparse representations for reinforcement learning.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

1811.06626

Country:

North America > United States (0.28)
North America > Canada > Alberta (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Klawonn, Matthew, Heim, Eric, Hendler, James

Exploiting Class Learnability in Noisy Data

In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we aim to explore the classes in a given data set, and guide supervised training to spend time on a class proportional to its learnability. By focusing the training process, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with classifier and training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes.

data mining, machine learning, reinforcement learning, (23 more...)

1811.06524

Country: North America > United States (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining > Big Data (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)
(2 more...)

Russel, Reazul Hasan, Petrik, Marek

Tight Bayesian Ambiguity Sets for Robust MDPs

Robustness is important for sequential decision making in a stochastic dynamic environment with uncertain probabilistic parameters. We address the problem of using robust MDPs (RMDPs) to compute policies with provable worst-case guarantees in reinforcement learning. The quality and robustness of an RMDP solution is determined by its ambiguity set. Existing methods construct ambiguity sets that lead to impractically conservative solutions. In this paper, we propose RSVF, which achieves less conservative solutions with the same worst-case guarantees by 1) leveraging a Bayesian prior, 2) optimizing the size and location of the ambiguity set, and, most importantly, 3) relaxing the requirement that the set is a confidence interval. Our theoretical analysis shows the safety of RSVF, and the empirical results demonstrate its practical promise.

ambiguity, machine learning, reinforcement learning, (14 more...)

1811.06512

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)

Buesing, Lars, Weber, Theophane, Zwols, Yori, Racaniere, Sebastien, Guez, Arthur, Lespiau, Jean-Baptiste, Heess, Nicolas

Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search

Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, i.e. actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to de-bias model predictions. In contrast to off-policy algorithms based on Importance Sampling which re-weight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a nontrivial grid-world task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such Stochastic V alue Gradient can be interpreted as counterfactual methods. This example tries to illustrate the everyday human capacity to reason about alternate, counterfactual outcomes of past experience with the goal of "mining worlds that could have been" (Pearl & Mackenzie, 2018). Social psychologists theorize that such cognitive processes are beneficial for improving future decision making (Roese, 1997). In this paper we aim to leverage possible advantages of counterfactual reasoning for learning decision making in the reinforcement learning (RL) framework. In spite of recent success, learning policies with standard, model-free RL algorithms can be notoriously data inefficient. This issue can in principle be addressed by learning policies on data synthesized from a model.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

1811.06272

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.66)

Reward-estimation variance elimination in sequential decision processes

Pankov, Sergey

Policy gradient methods are very attractive in reinforcement learning due to their model-free nature and convergence guarantees. These methods, however, suffer from high variance in gradient estimation, resulting in poor sample efficiency. To mitigate this issue, a number of variance-reduction approaches have been proposed. Unfortunately, in the challenging problems with delayed rewards, these approaches either bring a relatively modest improvement or do reduce variance at expense of introducing a bias and undermining convergence. The unbiased methods of gradient estimation, in general, only partially reduce variance, without eliminating it completely even in the limit of exact knowledge of the value functions and problem dynamics, as one might have wished. In this work we propose an unbiased method that does completely eliminate variance under some, commonly encountered, conditions. Of practical interest is the limit of deterministic dynamics and small policy stochasticity. In the case of a quadratic value function, as in linear quadratic Gaussian models, the policy randomness need not be small. We use such a model to analyze performance of the proposed variance-elimination approach and compare it with standard variance-reduction methods. The core idea behind the approach is to use control variates at all future times down the trajectory. We present both a model-based and model-free formulations.

machine learning, reinforcement learning, variance, (13 more...)

1811.06225

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

arXiv.org Artificial IntelligenceNov-15-2018

Reward learning from human preferences and demonstrations in Atari

Ibarz, Borja, Leike, Jan, Pohlen, Tobias, Irving, Geoffrey, Legg, Shane, Amodei, Dario

To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions. Instead, we can have humans communicate an objective to the agent directly. In this work, we combine two approaches to learning from human feedback: expert demonstrations and trajectory preferences. We train a deep neural network to model the reward function and use its predicted reward to train an DQN-based deep reinforcement learning agent on 9 Atari games. Our approach beats the imitation learning baseline in 7 games and achieves strictly superhuman performance on 2 games without using game rewards. Additionally, we investigate the goodness of fit of the reward model, present some reward hacking problems, and study the effects of noise in the human labels.

demonstration, machine learning, reinforcement learning, (16 more...)

1811.06521

Genre: Research Report > New Finding (0.46)

Industry:

Education (0.93)
Leisure & Entertainment > Games > Computer Games (0.55)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)