Reinforcement Learning
Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
Niekum, Scott (University of Massachusetts Amherst) | Barto, Andrew G. (University of Massachusetts Amherst)
Skill discovery algorithms in reinforcement learning typically identify single states or regions in state space that correspond to potential task-specific subgoals. However, such methods do not directly address the question of how many distinct skills are appropriate for solving the tasks that the agent faces. This can be highly inefficient when many identified subgoals correspond to the same underlying skill, but are all used in- dividually as skill goals. Furthermore, skills created in this manner are often only transferable to tasks that share iden- tical state spaces, since corresponding subgoals across tasks are not merged into a single skill goal. We show that these problems can be overcome by clustering subgoal data defined in an agent-space and using the resulting clusters as templates for skill termination conditions. Clustering via a Dirichlet process mixture model is used to discover a minimal, suffi- cient collection of portable skills.
FAQ-Learning in Matrix Games: Demonstrating Convergence Near Nash Equilibria, and Bifurcation of Attractors in the Battle of Sexes
Kaisers, Michael (Maastricht University) | Tuyls, Karl (Maastricht University)
This article studies Frequency Adjusted Q-learning (FAQ-learning), a variation of Q-learning that simulates simultaneous value function updates. The main contributions are empirical and theoretical support for the convergence of FAQ-learning to attractors near Nash equilibria in two-agent two-action matrix games.The games can be divided into three types: Matching pennies, Prisoners' Dilemma and Battle of Sexes. This article shows that the Matching pennies and Prisoners' Dilemma yield one attractor of the learning dynamics, while the Battle of Sexes exhibits a supercritical pitchfork bifurcation at a critical temperature, where one attractor splits into two attractors and one repellent fixed point. Experiments illustrate that the distance between fixed points of the FAQ-learning dynamics and Nash equilibria tends to zero as the exploration parameter of FAQ-learning approaches zero.
Comparing Action-Query Strategies in Semi-Autonomous Agents
Cohn, Robert (University of Michigan, Ann Arbor) | Durfee, Edmund (University of Michigan, Ann Arbor) | Singh, Satinder (University of Michigan)
We consider settings in which a semi-autonomous agent has uncertain knowledge about its environment, but can ask what action the human operator would prefer taking in the current or in a potential future state. Asking queries can improve behavior, but if queries come at a cost (e.g., due to limited operator attention), the value of each query should be maximized. We compare two strategies for selecting action queries: 1) based on myopically maximizing expected gain in long-term value, and 2) based on myopically minimizing uncertainty in the agent's policy representation. We show empirically that the first strategy tends to select more valuable queries, and that a hybrid method can outperform either method alone in settings with limited computation.
Teaching Reinforcement Learning with Mario: An Argument and Case Study
Taylor, Matthew Edmund (Lafayette College)
Integrating games into the computer science curriculum has been gaining acceptance in recent years, particularly when used to improve student engagement in introductory courses. This paper argues that games can also be useful in upper level courses, such as general artificial intelligence and machine learning. We provide a case study of using a Mario game in a machine learning class to provide one successful data point where both content-specific and general learning outcomes were successfully achieved.
A Bayesian Reinforcement Learning framework Using Relevant Vector Machines
Tziortziotis, Nikolaos (University of Ioannina) | Blekas, Konstantinos (University of Ioannina)
In this work we present an advanced Bayesian formulation to the task of control learning that employs the Relevance Vector Machines (RVM) generative model for value function evaluation. The key aspect of the proposed method is the design of the discount return as a generalized linear model that constitutes a well-known probabilistic approach. This allows to augment the model with advantageous sparse priors provided by the RVM's regression framework. We have also taken into account the significant issue of selecting the proper parameters of the kernel design matrix. Experiments have shown that our method produces improved performance in both simulated and real test environments.
Coordinated Multi-Agent Reinforcement Learning in Networked Distributed POMDPs
Zhang, Chongjie (University of Massachusetts Amherst) | Lesser, Victor (University of Massachusetts Amherst)
In many multi-agent applications such as distributed sensor nets, a network of agents act collaboratively under uncertainty and local interactions. Networked Distributed POMDP (ND-POMDP) provides a framework to model such cooperative multi-agent decision making. Existing work on ND-POMDPs has focused on offline techniques that require accurate models, which are usually costly to obtain in practice. This paper presents a model-free, scalable learning approach that synthesizes multi-agent reinforcement learning (MARL) and distributed constraint optimization (DCOP). By exploiting structured interaction in ND-POMDPs, our approach distributes the learning of the joint policy and employs DCOP techniques to coordinate distributed learning to ensure the global learning performance. Our approach can learn a globally optimal policy for ND-POMDPs with a property called groupwise observability. Experimental results show that, with communication during learning and execution, our approach significantly outperforms the nearly-optimal non-communication policies computed offline.
Differential Eligibility Vectors for Advantage Updating and Gradient Methods
Melo, Francisco S. (Instituto Superior Técnico/INESC-ID)
In this paper we propose differential eligibility vectors (DEV) for temporal-difference (TD) learning, a new class of eligibility vectors designed to bring out the contribution of each action in the TD-error at each state. Specifically, we use DEV in TD-Q(lambda) to more accurately learn the relative value of the actions, rather than their absolute value. We identify conditions that ensure convergence w.p.1 of TD-Q(lambda) with DEV and show that this algorithm can also be used to directly approximate the advantage function associated with a given policy, without the need to compute an auxiliary function - something that, to the extent of our knowledge, was not known possible. Finally, we discuss the integration of DEV in LSTDQ and actor-critic algorithms.
Scaling Up Reinforcement Learning through Targeted Exploration
Mann, Timothy Arthur (Texas A&M University) | Choe, Yoonsuck (Texas A&M University)
Recent Reinforcement Learning (RL) algorithms, such as R-MAX, make (with high probability) only a small number of poor decisions. In practice, these algorithms do not scale well as the number of states grows because the algorithms spend too much effort exploring. We introduce an RL algorithm State TArgeted R-MAX (STAR-MAX) that explores a subset of the state space, called the exploration envelope ฮพ. When ฮพ equals the total state space, STAR-MAX behaves identically to R-MAX. When ฮพ is a subset of the state space, to keep exploration within ฮพ, a recovery rule ฮฒ is needed. We compared existing algorithms with our algorithm employing various exploration envelopes. With an appropriate choice of ฮพ, STAR-MAX scales far better than existing RL algorithms as the number of states increases. A possible drawback of our algorithm is its dependence on a good choice of ฮพ and ฮฒ. However, we show that an effective recovery rule ฮฒ can be learned on-line and ฮพ can be learned from demonstrations. We also find that even randomly sampled exploration envelopes can improve cumulative rewards compared to R-MAX. We expect these results to lead to more efficient methods for RL in large-scale problems.
Value Function Approximation in Reinforcement Learning Using the Fourier Basis
Konidaris, George (Massachusetts Institute of Technology) | Osentoski, Sarah (Brown University) | Thomas, Philip (University of Massachusetts Amherst)
We describe the Fourier basis, a linear value function approximation scheme based on the Fourier series. We empirically demonstrate that it performs well compared to radial basis functions and the polynomial basis, the two most popular fixed bases for linear value function approximation, and is competitive with learned proto-value functions.
Basis Function Discovery Using Spectral Clustering and Bisimulation Metrics
Comanici, Gheorghe (McGill University) | Precup, Doina (McGill University)
We study the problem of automatically generating features for function approximation in reinforcement learning. We build on the work of Mahadevan and his colleagues, who pioneered the use of spectral clustering methods for basis function construction. Their methods work on top of a graph that captures state adjacency. Instead, we use bisimulation metrics in order to provide state distances for spectral clustering. The advantage of these metrics is that they incorporate reward information in a natural way, in addition to the state transition information. We provide theoretical bounds on the quality of the obtained approximation, which justify the importance of incorporating reward information. We also demonstrate empirically that the approximation quality improves when bisimulation metrics are used instead of the state adjacency graph in the basis function construction process.