Reinforcement Learning
Policy Invariance under Reward Transformations for General-Sum Stochastic Games
Lu, X., Schwartz, H. M., Givigi, S. N.
We extend the potential-based shaping method from Markov decision processes to multi-player general-sum stochastic games. We prove that the Nash equilibria in a stochastic game remains unchanged after potential-based shaping is applied to the environment. The property of policy invariance provides a possible way of speeding convergence when learning to play a stochastic game.
Regret Minimization in Multiplayer Extensive Games
Gibson, Richard Geoffrey (University of Alberta) | Szafron, Duane (University of Alberta)
The counterfactual regret minimization (CFR) algorithm is state-of-the-art for computing strategies in large games and other sequential decision-making problems. Little is known, however, about CFR in games with more than 2 players. This extended abstract outlines research towards a better understanding of CFR in multiplayer games and new procedures for computing even stronger multiplayer strategies. We summarize work already completed that investigates techniques for creating "expert" strategies for playing smaller sub-games, and work that proves CFR avoids classes of undesirable strategies. In addition, we provide an outline of our future research direction. Our goals are to apply regret minimization to the problem of playing multiple games simultaneously, and augment CFR to achieve effective on-line opponent modelling of multiple opponents. The objective of this research is to build a world-class computer poker player for multiplayer Limit Texas Hold'em.
Reinforcement Learning to Adjust Robot Movements to New Situations
Kober, Jens (Max Planck Institute for Intelligent Systems) | Oztop, Erhan (Advanced Telecommunications Research Institute) | Peters, Jan (Max Planck Institute for Intelligent Systems)
Many complex robot motor skills can be represented using elementary movements, and there exist efficient techniques for learning parametrized motor plans using demonstrations and self-improvement. However with current techniques, in many cases, the robot currently needs to learn a new elementary movement even if a parametrized motor plan exists that covers a related situation. A method is needed that modulates the elementary movement through the meta-parameters of its representation. In this paper, we describe how to learn such mappings from circumstances to meta-parameters using reinforcement learning. In particular we use a kernelized version of the reward-weighted regression. We show two robot applications of the presented setup in robotic domains; the generalization of throwing movements in darts, and of hitting movements in table tennis. We demonstrate that both tasks can be learned successfully using simulated and real robots.
Integrated Learning for Goal-Driven Autonomy
Jaidee, Ulit (Lehigh University) | Munoz-Avila, Hector (Lehigh University) | Aha, David W. (Naval Research Laboratory)
This requires, for Goal-driven autonomy (GDA) is a reflective model example, experts to anticipate what discrepancies can occur, of goal reasoning that controls the focus of an identify what goals can be formulated, and define their agent's planning activities by dynamically relative priority. However, few techniques have been resolving unexpected discrepancies in the world investigated for learning this knowledge, and those that do state, which frequently arise when solving tasks in learn only goal formulation knowledge (Weber et al. 2010; complex environments. GDA agents have Powell et al. 2011). This can be problematic; while these performed well on such tasks by integrating agents may perform well in simple environments, in others a methods for discrepancy recognition, explanation, domain expert might not know the (state) expectations for goal formulation, and goal management. However, executing every action in every state, nor which goal should they require substantial domain knowledge, be pursued to resolve every possible discrepancy, or even including what constitutes a discrepancy and how the space of all possible discrepancies.
Sample Efficient On-Line Learning of Optimal Dialogue Policies with Kalman Temporal Differences
Pietquin, Olivier (SUPELEC / UMI 2958) | Geist, Matthieu (SUPELEC) | Chandramohan, Senthilkumar (SUPELEC)
Designing dialog policies for voice-enabled interfaces is a tailoring job that is most often left to natural language processing experts. This job is generally redone for every new dialog task because cross-domain transfer is not possible. For this reason, machine learning methods for dialog policy optimization have been investigated during the last 15 years. Especially, reinforcement learning (RL) is now part of the state of the art in this domain. Standard RL methods require to test more or less random changes in the policy on users to assess them as improvements or degradations. This is called on policy learning. Nevertheless, it can result in system behaviors that are not acceptable by users. Learning algorithms should ideally infer an optimal strategy by observing interactions generated by a non-optimal but acceptable strategy, that is learning off-policy. In this contribution, a sample-efficient, online and off-policy reinforcement learning algorithm is proposed to learn an optimal policy from few hundreds of dialogues generated with a very simple handcrafted policy.
Q-Error as a Selection Mechanism in Modular Reinforcement-Learning Systems
Ring, Mark B. (IDSIA) | Schaul, Tom (IDSIA)
This paper introduces a novel multi-modular method for reinforcement learning.A multi-modular system is one that partitions the learning task among a set of experts (modules), where each expert is incapable of solving the entire task by itself.There are many advantages to splitting up large tasks in this way, but existing methods face difficulties when choosing which module(s) should contribute to the agent's actions at any particular moment.We introduce a novel selection mechanism where every module, besides calculating a set of action values, also estimates its own error for the current input.The selection mechanism combines each module's estimate of long-term reward and self-error to produce a score by which the next module is chosen.As a result, the modules can use their resources effectively and efficiently divide up the task.The system is shown to learn complex tasks even when the individual modules use only linear function approximators.
Imitation Learning in Relational Domains: A Functional-Gradient Boosting Approach
Natarajan, Sriraam (Wake Forest University School of Medicine) | Joshi, Saket (Oregon State University) | Tadepalli, Prasad (Oregon State University) | Kersting, Kristian (Fraunhofer IAIS) | Shavlik, Jude (University of Wisconsin-Madiso)
Imitation learning refers to the problem of learning how to behave by observinga teacher in action. We consider imitation learning in relational domains, in which there is a varying number of objects and relations among them. In prior work, simple relational policies are learned by viewing imitation learning as supervised learning of a function from states to actions. For propositional worlds, functional gradient methods have been proved to be beneficial. They are simpler to implement than most existing methods, more efficient, more naturally satisfy common constraints on the cost function, and better represent our prior beliefs about the form of the function. Building on recent generalizations of functional gradient boosting to relational representations, we implement a functional gradient boosting approach to imitation learning in relational domains. In particular, given a set of traces from the human teacher, our system learns a policy in the form of a set of relational regression trees that additively approximate the functional gradients. The use of multiple additive trees combined with relational representation allows for learning more expressive policies than what has been done before. We demonstrate the usefulness of our approach in several different domains.
Automatic State Abstraction from Demonstration
Cobo, Luis Carlos (Georgia Institute of Technology) | Zang, Peng (Georgia Institute of Technology) | Jr., Charles Lee Isbell (Georgia Institute of Technology) | Thomaz, Andrea Lockerd (Georgia Institute of Technology)
Learning from Demonstration (LfD) is a popular technique for building decision-making agents from human help. Traditional LfD methods use demonstrations as training examples for supervised learning, but complex tasks can require more examples than is practical to obtain. We present Abstraction from Demonstration (AfD), a novel form of LfD that uses demonstrations to infer state abstractions and reinforcement learning (RL) methods in those abstract state spaces to build a policy. Empirical results show that AfD is greater than an order of magnitude more sample efficient than jus tusing demonstrations as training examples, and exponentially faster than RL alone.
Using Cases as Heuristics in Reinforcement Learning: A Transfer Learning Application
Jr., Luiz A. Celiberto (Technological Institute of Aeronautics) | Matsuura, Jackson P. (Technological Institute of Aeronautics) | Mantaras, Ramon Lopez de (Artificial Intelligence Research Institute (IIIA-CSIC)) | Bianchi, Reinaldo A. C. (Centro Universitario da FEI)
Another way to speed up a RL algorithm is by using Transfer Learning, a paradigm of machine learning that In this paper we propose to combine three AI techniques reuses knowledge accumulated in a previous task to speed up to speed up a Reinforcement Learning algorithm the learning of a novel, but related, target task [Taylor and in a Transfer Learning problem: Casebased Stone, 2009]. Reasoning, Heuristically Accelerated Reinforcement This paper investigates the use of the Case-Based Heuristically Learning and Neural Networks. To do Accelerated Reinforcement Learning (CB-HARL) algorithm so, we propose a new algorithm, called L3, which [Bianchi et al., 2009] as a means to transfer learning works in 3 stages: in the first stage, it uses Reinforcement acquired by one agent during its training in one problem to Learning to learn how to perform one another agent that has to learn how to solve a similar, but task, and stores the optimal policy for this problem more complex, problem. To do so, we propose a new algorithm, as a case-base; in the second stage, it uses a Neural called L3, which works in 3 stages: in the first stage, Network to map actions from one domain to actions it uses the Q-learning algorithm [Watkins, 1989] to learn how in the other domain and; in the third stage, it uses to perform one task, and stores the optimal policy for this the case-base learned in the first stage as heuristics problem as a case-base; in the second stage, it uses a Neural to speed up the learning performance in a related, Network to map actions from one domain to actions in but different, task. The RL algorithm used the other domain and; in the third stage, it uses the case-base in the first phase is the Q-learning and in the third learned in the first stage as heuristics in the CB-HARL algorithm, phase is the recently proposed Case-based Heuristically speeding up the learning process.
A Competitive Strategy for Function Approximation in Q-Learning
Agostini, Alejandro Gabriel (Institut de Robotica i Informatica Industrial (UPC-CSIC)) | Celaya, Enric (Institut de Robotica i Informatica Industrial (UPC-CSIC))
In this work we propose an approach for generalization in continuous domain Reinforcement Learning that, instead of using a single function approximator, tries many different function approximators in parallel, each one defined in a different region of the domain. Associated with each approximator is a relevance function that locally quantifies the quality of its approximation, so that, at each input point, the approximator with highest relevance can be selected. The relevance function is defined using parametric estimations of the variance of the q-values and the density of samples in the input space, which are used to quantify the accuracy and the confidence in the approximation, respectively. These parametric estimations are obtained from a probability density distribution represented as a Gaussian Mixture Model embedded in the input-output space of each approximator. In our experiments, the proposed approach required a lesser number of experiences for learning and produced more stable convergence profiles than when using a single function approximator.