Reinforcement Learning
Integrating Representation Learning and Temporal Difference Learning: A Matrix Factorization Approach
White, Martha (University of Alberta)
Reinforcement learning is a general formalism for sequential decision-making, with recent algorithm development focusing on function approximation to handle large state spaces and high-dimensional, high-velocity (sensor) data. The success of function approximators, however, hinges on the quality of the data representation. In this work, we explore representation learning within least-squares temporal difference learning (LSTD), with a focus on making the assumptions on the representation explicit and making the learning problem amenable to principled optimization techniques. We reformulate LSTD as a least-squares loss plus concave regularizer, facilitating the addition of a regularized matrix factorization objective to specify the desired class of representations. The resulting joint optimization over the representation and value function parameters enables us to take advantages of recent advances in unsupervised learning and presents a general yet simple formalism for learning representations in reinforcement learning.
Feature Reinforcement Learning: State of the Art
Daswani, Mayank (Australian National University) | Sunehag, Peter (Australian National University) | Hutter, Marcus (Australian National University)
Feature reinforcement learning was introduced five years ago as a principled and practical approach to history-based learning. This paper examines the progress since its inception. We now have both model-based and model-free cost functions, most recently extended to the function approximation setting. Our current work is geared towards playing ATARI games using imitation learning, where we use Feature RL as a feature selection method for high-dimensional domains.
Quantifying Uncertainty in Batch Personalized Sequential Decision Making
Marivate, Vukosi Ntsakisi (Rutgers University) | Chemali, Jessica (Carnegie Mellon University) | Brunskill, Emma (Carnegie Mellon University) | Littman, Michael (Brown University)
As the amount of data collected from individuals increases, there are more opportunities to use it to offer personalized experiences (e.g., using electronic health records to offer personalized treatments). We advocate applying techniques from batch reinforcement learning to predict the range of effectiveness that policies might have for individuals. We identify three sources of uncertainty and present a method that addresses all of them. It handles the uncertainty caused by population mismatch by modeling the data as a latent mixture of different subpopulations of individuals, it explicitly quantifies data sparsity by accounting for the limited data available about the underlying models, and incorporates intrinsic stochasticity to yield estimated percentile ranges of the effectiveness of a policy for a particular new individual. Using this approach, we highlight some interesting variability in policy effectiveness amongst HIV patients given a prior patient treatment dataset. Our approach highlights the potential benefit of taking into account individual variability and data limitations when performing batch policy evaluation for new individuals.
Adapting Difficulty Levels in Personalized Robot-Child Tutoring Interactions
Ramachandran, Aditi (Yale University) | Scassellati, Brian (Yale University)
Social robots can be used to tutor children in one-on-one interactions. Because students have different learning needs, they consequently require complex, non-scripted teaching behaviors that adapt to the learning needs of each child. As a result of this, robot tutors are more effective given a means of adaptively customizing the pace and content of a student's curriculum. In this paper we propose a reinforcement learning-based approach that affords such capabilities to a tutoring robot, with the goals of fostering measurable learning gains and sustained engagement. We outline an architecture in which the robot uses reinforcement learning to adapt the difficulty of its exercises. Further, we describe a proposed study capable of evaluating the effectiveness of our Intelligent Tutoring System.
An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning
Ammar, Haitham Bou (University of Pennsylvania) | Eaton, Eric (University of Pennsylvania) | Taylor, Matthew E. (Washington State University) | Mocanu, Decebal Constantin (Eindhoven University of Technology) | Driessens, Kurt (Maastricht University) | Weiss, Gerhard (Maastricht University) | Tuyls, Karl (University of Liverpool)
Transfer learning can improve the reinforcement learning of a new task by allowing the agent to reuse knowledge acquired from other source tasks. Despite their success, transfer learning methods rely on having relevant source tasks; transfer from inappropriate tasks can inhibit performance on the new task. For fully autonomous transfer, it is critical to have a method for automatically choosing relevant source tasks, which requires a similarity measure between Markov Decision Processes (MDPs). This issue has received little attention, and is therefore still a largely open problem. This paper presents a data-driven automated similarity measure for MDPs. This novel measure is a significant step toward autonomous reinforcement learning transfer, allowing agents to: (1) characterize when transfer will be useful and, (2) automatically select tasks to use for transfer. The proposed measure is based on the reconstruction error of a restricted Boltzmann machine that attempts to model the behavioral dynamics of the two MDPs being compared. Empirical results illustrate that this measure is correlated with the performance of transfer and therefore can be used to identify similar source tasks for transfer learning.
Preface
Cuayรกhuitl, Heriberto (Heriot-Watt University) | Frommberger, Lutz (University of Bremen) | Dethlefs, Nina (Heriot-Watt University) | Otterlo, Martijn van (Radboud University)
This workshop contains papers with a strong relationship to interactive systems and robots in the following topics (in no particular order): robot learning from natural language interactions; robot learning from social multimodal interactions; robot learning using crowdsourcing; reinforcement learning with reward inference of conversational behaviors; reinforcement and neural learning to transfer learnt behaviors across tasks; learning from demonstration for human-robot interaction/collaboration; supervised learning for coaching physical skills; visually-aware reinforcement learning in unknown environments; Markov decision processes for adaptive interactions in video games; and Markov decision processes for grounding natural language commands.
PolicyBoost: Functional Policy Gradient with Ranking-based Reward Objective
Yu, Yang (Nanjing University) | Da, Qing (Nanjing University)
Learning policies in nonlinear representations is an important step toward real-world applications of reinforcement learning in robotics. While functional representation has been widely applied in state-of-the-art supervised learning techniques (as known as boosting approaches) to adaptively learn nonlinear functions, in reinforcement learning the boosting-style approaches have been little investigated. Only a few pieces of work explored in this direction, which however may suffer from the occurring-probability-pursuing problem. In this paper, to alleviate the problem, we propose to employ a ranking-based objective function to guide the policy search in a function space, resulting in the PolicyBoost approach. Experiment results verify the effectiveness as well as the robustness of the PolicyBoost.
Practical Kernel-Based Reinforcement Learning
Barreto, Andrรฉ M. S., Precup, Doina, Pineau, Joelle
Kernel-based reinforcement learning (KBRL) stands out among reinforcement learning algorithms for its strong theoretical guarantees. By casting the learning problem as a local kernel approximation, KBRL provides a way of computing a decision policy which is statistically consistent and converges to a unique solution. Unfortunately, the model constructed by KBRL grows with the number of sample transitions, resulting in a computational cost that precludes its application to large-scale or on-line domains. In this paper we introduce an algorithm that turns KBRL into a practical reinforcement learning tool. Kernel-based stochastic factorization (KBSF) builds on a simple idea: when a transition matrix is represented as the product of two stochastic matrices, one can swap the factors of the multiplication to obtain another transition matrix, potentially much smaller, which retains some fundamental properties of its precursor. KBSF exploits such an insight to compress the information contained in KBRL's model into an approximator of fixed size. This makes it possible to build an approximation that takes into account both the difficulty of the problem and the associated computational cost. KBSF's computational complexity is linear in the number of sample transitions, which is the best one can do without discarding data. Moreover, the algorithm's simple mechanics allow for a fully incremental implementation that makes the amount of memory used independent of the number of sample transitions. The result is a kernel-based reinforcement learning algorithm that can be applied to large-scale problems in both off-line and on-line regimes. We derive upper bounds for the distance between the value functions computed by KBRL and KBSF using the same data. We also illustrate the potential of our algorithm in an extensive empirical study in which KBSF is applied to difficult tasks based on real-world data.
Efficient Learning and Planning with Compressed Predictive States
Hamilton, William L., Fard, Mahdi Milani, Pineau, Joelle
Predictive state representations (PSRs) offer an expressive framework for modelling partially observable systems. By compactly representing systems as functions of observable quantities, the PSR learning approach avoids using local-minima prone expectation-maximization and instead employs a globally optimal moment-based algorithm. Moreover, since PSRs do not require a predetermined latent state structure as an input, they offer an attractive framework for model-based reinforcement learning when agents must plan without a priori access to a system model. Unfortunately, the expressiveness of PSRs comes with significant computational cost, and this cost is a major factor inhibiting the use of PSRs in applications. In order to alleviate this shortcoming, we introduce the notion of compressed PSRs (CPSRs). The CPSR learning approach combines recent advancements in dimensionality reduction, incremental matrix decomposition, and compressed sensing. We show how this approach provides a principled avenue for learning accurate approximations of PSRs, drastically reducing the computational costs associated with learning while also providing effective regularization. Going further, we propose a planning framework which exploits these learned models. And we show that this approach facilitates model-learning and planning in large complex partially observable domains, a task that is infeasible without the principled use of compression.
Decentralized Multi-Agent Reinforcement Learning in Average-Reward Dynamic DCOPs
Nguyen, Duc Thien (Singapore Management University) | Yeoh, William (New Mexico State University) | Lau, Hoong Chuin (Singapore Management University) | Zilberstein, Shlomo (University of Massachusetts, Amherst) | Zhang, Chongjie (Massachusetts Institute of Technology)
Researchers have introduced the Dynamic Distributed Constraint Optimization Problem (Dynamic DCOP) formulation to model dynamically changing multi-agent coordination problems, where a dynamic DCOP is a sequence of (static canonical) DCOPs, each partially different from the DCOP preceding it. Existing work typically assumes that the problem in each time step is decoupled from the problems in other time steps, which might not hold in some applications. Therefore, in this paper, we make the following contributions: (i) We introduce a new model, called Markovian Dynamic DCOPs (MD-DCOPs), where the DCOP in the next time step is a function of the value assignments in the current time step; (ii) We introduce two distributed reinforcement learning algorithms, the Distributed RVI Q-learning algorithm and the Distributed R-learning algorithm, that balance exploration and exploitation to solve MD-DCOPs in an online manner; and (iii) We empirically evaluate them against an existing multi-arm bandit DCOP algorithm on dynamic DCOPs.