AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

SimpleDS: A Simple Deep Reinforcement Learning Dialogue System

Cuayáhuitl, Heriberto

arXiv.org Artificial IntelligenceJan-18-2016

Almost two decades ago, the (spoken) dialogue systems community adopted the Reinforcement Learning (RL) paradigm since it offered the possibility to treat dialogue design as an optimisation problem, and because RL-based systems can improve their performance over time with experience. Although a large number of methods have been proposed for training (spoken) dialogue systems using RL, the question of "How to train dialogue policies in an efficient, scalable and effective way across domains?" still remains as an open problem. One limitation of current approaches is the fact that RL-based dialogue systems still require high-levels of human intervention (from system developers), as opposed to automating the dialogue design. Training a system of this kind requires a system developer to provide a set of features to describe the dialogue state, a set of actions to control the interaction, and a performance function to reward or penalise the action-selection process. All of these elements have to be carefully engineered in order to learn a good dialogue policy (or policies). This suggests that one way of advancing the state-of-the-art in this field is by reducing the amount of human intervention in the dialogue design process through higher degrees of automation, i.e. by moving towards truly autonomous learning.

dialogue system, reinforcement learning, simpleds, (11 more...)

arXiv.org Artificial Intelligence

1601.04574

Country: Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Learning Continuous Control Policies by Stochastic Value Gradients

Heess, Nicolas, Wayne, Gregory, Silver, David, Lillicrap, Timothy, Erez, Tom, Tassa, Yuval

Neural Information Processing SystemsDec-31-2015

We present a unified framework for learning continuous control policies usingbackpropagation. It supports stochastic control by treating stochasticity in theBellman equation as a deterministic function of exogenous noise. The productis a spectrum of general policy gradient algorithms that range from model-freemethods with value functions to model-based methods without value functions.We use learned models but only require observations from the environment insteadof observations from model-predicted trajectories, minimizing the impactof compounded model errors. We apply these algorithms first to a toy stochasticcontrol problem and then to several physics-based control problems in simulation.One of these variants, SVG(1), shows the effectiveness of learning models, valuefunctions, and policies simultaneously in continuous domains.

algorithm, svg, value function, (16 more...)

Neural Information Processing Systems

Country:

Europe > Netherlands > South Holland > Delft (0.04)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Basis refinement strategies for linear value function approximation in MDPs

Comanici, Gheorghe, Precup, Doina, Panangaden, Prakash

Neural Information Processing SystemsDec-31-2015

We provide a theoretical framework for analyzing basis function construction for linear value function approximation in Markov Decision Processes (MDPs). We show that important existing methods, such as Krylov bases and Bellman-error-based methods are a special case of the general framework we develop. We provide a general algorithmic framework for computing basis function refinements which “respect” the dynamics of the environment, and we derive approximation error bounds that apply for any algorithm respecting this general framework. We also show how, using ideas related to bisimulation metrics, one can translate basis refinement into a process of finding “prototypes” that are diverse enough to represent the given MDP.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Inverse Reinforcement Learning with Locally Consistent Reward Functions

Nguyen, Quoc Phong, Low, Bryan Kian Hsiang, Jaillet, Patrick

Neural Information Processing SystemsDec-31-2015

Existing inverse reinforcement learning (IRL) algorithms have assumed each expert’s demonstrated trajectory to be produced by only a single reward function. This paper presents a novel generalization of the IRL problem that allows each trajectory to be generated by multiple locally consistent reward functions, hence catering to more realistic and complex experts’ behaviors. Solving our generalized IRL problem thus involves not only learning these reward functions but also the stochastic transitions between them at any state (including unvisited states). By representing our IRL problem with a probabilistic graphical model, an expectation-maximization (EM) algorithm can be devised to iteratively learn the different reward functions and the stochastic transitions between them in order to jointly improve the likelihood of the expert’s demonstrated trajectories. As a result, the most likely partition of a trajectory into segments that are generated from different locally consistent reward functions selected by EM can be derived. Empirical evaluation on synthetic and real-world datasets shows that our IRL algorithm outperforms the state-of-the-art EM clustering with maximum likelihood IRL, which is, interestingly, a reduced variant of our approach.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.93)

Industry: Transportation > Ground > Road (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.89)

Add feedback

Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning

Dann, Christoph, Brunskill, Emma

Neural Information Processing SystemsDec-31-2015

Recently, there has been significant progress in understanding reinforcement learning in discounted infinite-horizon Markov decision processes (MDPs) by deriving tight sample complexity bounds. However, in many real-world applications, an interactive learning agent operates for a fixed or bounded period of time, for example tutoring students for exams or handling customer service requests. Such scenarios can often be better treated as episodic fixed-horizon MDPs, for which only looser bounds on the sample complexity exist. A natural notion of sample complexity in this setting is the number of episodes required to guarantee a certain performance with high probability (PAC guarantee). In this paper, we derive an upper PAC bound of order O(|S|²|A|H² log(1/δ)/ɛ²) and a lower PAC bound Ω(|S||A|H² log(1/(δ+c))/ɛ²) (ignoring log-terms) that match up to log-terms and an additional linear dependency on the number of states |S|. The lower bound is the first of its kind for this setting. Our upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs which have a time-horizon dependency of at least H³.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.87)

Add feedback

Policy Evaluation Using the Ω-Return

Thomas, Philip S., Niekum, Scott, Theocharous, Georgios, Konidaris, George

Neural Information Processing SystemsDec-31-2015

We propose the Ω-return as an alternative to the λ-return currently used by the TD(λ) family of algorithms. The benefit of the Ω-return is that it accounts for the correlation of different length returns. Because it is difficult to compute exactly, we suggest one way of approximating the Ω-return. We provide empirical studies that suggest that it is superior to the λ-return and γ-return for a variety of problems.

machine learning, reinforcement learning, trajectory, (17 more...)

Neural Information Processing Systems

Country: North America > United States > Massachusetts (0.28)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.72)

Add feedback

Information-Theoretic Bounded Rationality

Ortega, Pedro A., Braun, Daniel A., Dyer, Justin, Kim, Kee-Eung, Tishby, Naftali

arXiv.org Machine LearningDec-21-2015

Bounded rationality, that is, decision-making and planning under resource limitations, is widely regarded as an important open problem in artificial intelligence, reinforcement learning, computational neuroscience and economics. This paper offers a consolidated presentation of a theory of bounded rationality based on information-theoretic ideas. We provide a conceptual justification for using the free energy functional as the objective function for characterizing bounded-rational decisions. This functional possesses three crucial properties: it controls the size of the solution space; it has Monte Carlo planners that are exact, yet bypass the need for exhaustive search; and it captures model uncertainty arising from lack of evidence or from interacting with other agents having unknown intentions. We discuss the single-step decision-making case, and show how to extend it to sequential decisions using equivalence transformations. This extension yields a very general class of decision problems that encompass classical decision rules (e.g.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

1512.06789

Country:

North America > United States > Massachusetts (0.28)
North America > United States > Pennsylvania (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.68)
Education (0.48)
Energy (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
(4 more...)

Add feedback

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

Hallak, Assaf, Tamar, Aviv, Munos, Remi, Mannor, Shie

arXiv.org Machine LearningNov-27-2015

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced \emph{emphatic temporal differences} (ETD) algorithm \citep{SuttonMW15}, which encompasses the original ETD($\lambda$), as well as several other off-policy evaluation algorithms as special cases. We call this framework \ETD, where our introduced parameter $\beta$ controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying \ETD\ involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for \ETD. Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling $\beta$, our proposed generalization allows trading-off bias for variance reduction, thereby achieving a lower total error.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Machine Learning

1509.05172

Genre: Research Report > New Finding (0.55)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Reinforcement Learning with Parameterized Actions

Masson, Warwick, Ranchod, Pravesh, Konidaris, George

arXiv.org Artificial IntelligenceNov-26-2015

We introduce a model-free algorithm for learning in Markov decision processes with parameterized actions--discrete actions with continuous parameters. At each step the agent must select both which action to use and which parameters to use with that action. We introduce the Q-PAMDP algorithm for learning in these domains, show that it converges to a local optimum, and compare it to direct policy search in the goalscoring and Platform domains.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

1509.01644

Country:

North America > United States (0.46)
Africa (0.28)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Sports > Soccer (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Searching for Objects using Structure in Indoor Scenes

Nagaraja, Varun K., Morariu, Vlad I., Davis, Larry S.

arXiv.org Artificial IntelligenceNov-24-2015

To identify the location of objects of a particular class, a passive computer vision system generally processes all the regions in an image to finally output few regions. However, we can use structure in the scene to search for objects without processing the entire image. We propose a search technique that sequentially processes image regions such that the regions that are more likely to correspond to the query class object are explored earlier. We frame the problem as a Markov decision process and use an imitation learning algorithm to learn a search strategy. Since structure in the scene is essential for search, we work with indoor scene images as they contain both unary scene context information and object-object context in the scene. We perform experiments on the NYU-depth v2 dataset and show that the unary scene context features alone can achieve a significantly high average precision while processing only 20-25\% of the regions for classes like bed and sofa. By considering object-object context along with the scene context features, the performance is further improved for classes like counter, lamp, pillow and sofa.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.5244/C.29.53

1511.0771

Country: North America > United States > Maryland (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.72)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.49)

Add feedback