Reinforcement Learning
Active Imitation Learning via Reduction to I.I.D. Active Learning
Judah, Kshitij, Fern, Alan, Dietterich, Thomas G.
In standard passive imitation learning, the goal is to learn a target policy by passively observing full execution trajectories of it. Unfortunately, generating such trajectories can require substantial expert effort and be impractical in some cases. In this paper, we consider active imitation learning with the goal of reducing this effort by querying the expert about the desired action at individual states, which are selected based on answers to past queries and the learner's interactions with an environment simulator. We introduce a new approach based on reducing active imitation learning to i.i.d. active learning, which can leverage progress in the i.i.d. setting. Our first contribution is to analyze reductions for both non-stationary and stationary policies, showing that the label complexity (number of queries) of active imitation learning can be substantially lower than that of passive learning. Our second contribution is to introduce a practical algorithm inspired by the reductions, which is shown to be highly effective in four test domains compared to a number of alternatives.
CLASSQ-L: A Q-Learning Algorithm for Adversarial Real-Time Strategy Games
Jaidee, Ulit (Lehigh University) | Munoz-Avila, Hector (Lehigh University)
We present CLASS Q-L (for: class Q-learning), an application of the Q-learning reinforcement learning algorithm to play complete Wargus games. Wargus is a real-time strategy game where players control armies consisting of units of different classes (e.g., archers, knights). CLASS Q-L uses a single table for each class of unit, so that each unit is controlled by, and updates, its class's Q-table. This enables rapid learning, as in Wargus there are many units of the same class. We present initial results of CLASS Q-L against a variety of opponents.
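The per-class table idea can be illustrated with a minimal sketch. It uses the standard tabular Q-learning update; the class name `ClassQTables`, the state and action labels, and the values of `alpha` and `gamma` are hypothetical and not taken from the paper, whose actual state and action representation for Wargus is not described here.

```python
from collections import defaultdict


class ClassQTables:
    """One Q-table per unit class; all units of a class share and update it.

    Illustrative sketch only: names and hyperparameters are assumptions.
    """

    def __init__(self, alpha=0.1, gamma=0.9):
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        # tables[unit_class][(state, action)] -> estimated Q-value
        self.tables = defaultdict(lambda: defaultdict(float))

    def update(self, unit_class, state, action, reward, next_state, next_actions):
        """Apply one standard Q-learning update to the table of `unit_class`."""
        q = self.tables[unit_class]
        # Value of the best action available in the next state (0 if none).
        best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
        td_target = reward + self.gamma * best_next
        q[(state, action)] += self.alpha * (td_target - q[(state, action)])
        return q[(state, action)]


learner = ClassQTables(alpha=0.5, gamma=0.9)
actions = ["attack", "retreat"]
# Two different archers report experience; both updates land in the same table.
first = learner.update("archer", "s0", "attack", 1.0, "s1", actions)
second = learner.update("archer", "s0", "attack", 1.0, "s1", actions)
```

Because every unit of a class writes to the same table, each game step yields as many updates per class as there are units of that class, which is the source of the rapid learning the abstract mentions.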
Hybrid systems modeling for gas transmission network
Noori, Amir, Menhaj, Mohammad Bagher, Shafiee, Masoud
Gas transmission networks are large-scale complex systems, and the corresponding design and control problems are challenging. In this paper, we consider the problem of control and management of these systems in crisis situations. We represent these networks within a hybrid systems framework that provides the required analysis models. Further, we discuss decision-making using computational discrete and hybrid optimization methods. In particular, several reinforcement learning methods are employed to explore the decision space and achieve the best policy in a specific crisis situation. Simulations are presented to illustrate the efficiency of the method.
Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
Fard, Mahdi Milani, Grinberg, Yuri, Farahmand, Amir-massoud, Pineau, Joelle, Precup, Doina
We address the problem of automatic generation of features for value function approximation. Bellman Error Basis Functions (BEBFs) have been shown to improve the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast and robust algorithm based on random projections to generate BEBFs for sparse feature spaces. We provide a finite sample analysis of the proposed method, and prove that projections logarithmic in the dimension of the original space are enough to guarantee contraction in the error. Empirical results demonstrate the strength of this method.
Learning to Interpret Natural Language Instructions
MacGlashan, James (University of Maryland, Baltimore County) | Babes-Vroman, Monica (Rutgers University) | Winner, Kevin (University of Maryland, Baltimore County) | Gao, Ruoyuan (Rutgers University) | Adjogah, Richard (University of Maryland, Baltimore County) | desJardins, Marie (University of Maryland, Baltimore County) | Littman, Michael (Rutgers University) | Muresan, Smaranda (Rutgers University)
We address the problem of training an artificial agent to follow verbal commands using a set of instructions paired with demonstration traces of appropriate behavior. From this data, a mapping from instructions to tasks is learned, enabling the agent to carry out new instructions in novel environments. Our system consists of three components: semantic parsing (SP), inverse reinforcement learning (IRL), and task abstraction (TA). SP parses sentences into logical form representations, but when learning begins, the domain/task specific meanings of these representations are unknown. IRL takes demonstration traces and determines the likely reward functions that gave rise to these traces, defined over a set of provided features. TA combines results from SP and IRL over a set of training instances to create abstract goal definitions of tasks. TA also provides SP domain specific meanings for its logical forms and provides IRL the set of task-relevant features.
Sequential Decision Making with Rank Dependent Utility: A Minimax Regret Approach
Jeantet, Gildas (AiRPX) | Perny, Patrice (University Pierre and Marie Curie and CNRS) | Spanjaard, Olivier (University Pierre and Marie Curie and CNRS)
This paper is devoted to sequential decision making with Rank Dependent expected Utility (RDU). This decision criterion generalizes Expected Utility and makes it possible to model a wider range of observed (rational) behaviors. In such a sequential decision setting, two conflicting objectives can be identified in the assessment of a strategy: maximizing the performance viewed from the initial state (optimality), and minimizing the incentive to deviate during implementation (deviation-proofness). In this paper, we propose a minimax regret approach taking these two aspects into account, and we provide a search procedure to determine an optimal strategy for this model. Numerical results are presented to show the interest of the proposed approach in terms of optimality, deviation-proofness and computability.
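For readers unfamiliar with the criterion, the following sketch evaluates a single lottery under the usual textbook definition of RDU: outcomes are ranked by utility, and each utility increment is weighted by a transformed probability of doing at least that well. The function name and the example weighting function are illustrative assumptions; the sketch does not implement the paper's minimax regret procedure.

```python
def rank_dependent_utility(outcomes, probs, u, phi):
    """RDU of a lottery under the standard definition.

    With outcomes sorted so that u(x_1) <= ... <= u(x_n):
        RDU = u(x_1) + sum_{i>=2} (u(x_i) - u(x_{i-1})) * phi(P(X >= x_i))
    where phi is a non-decreasing weighting function, phi(0)=0, phi(1)=1.
    """
    pairs = sorted(zip(outcomes, probs), key=lambda pair: u(pair[0]))
    utils = [u(x) for x, _ in pairs]
    ps = [p for _, p in pairs]

    value = utils[0]
    tail = 1.0  # probability of receiving at least the current outcome
    for i in range(1, len(pairs)):
        tail -= ps[i - 1]
        value += (utils[i] - utils[i - 1]) * phi(tail)
    return value


def identity(x):
    return x


# With phi the identity, RDU reduces to expected utility.
eu = rank_dependent_utility([0, 100], [0.5, 0.5], identity, identity)
# A convex phi underweights the chance of the better outcome.
rdu = rank_dependent_utility([0, 100], [0.5, 0.5], identity, lambda p: p ** 2)
```

In this example the identity weighting gives 50, the expected utility, while the convex weighting gives 25, reflecting a pessimistic attitude toward the chance of the better outcome.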
TD-DeltaPi: A Model-Free Algorithm for Efficient Exploration
Silva, Bruno C. da (University of Massachusetts Amherst) | Barto, Andrew G. (University of Massachusetts Amherst)
We study the problem of finding efficient exploration policies for the case in which an agent is momentarily not concerned with exploiting, and instead tries to compute a policy for later use. We first formally define the Optimal Exploration Problem as one of sequential sampling and show that its solutions correspond to paths of minimum expected length in the space of policies. We derive a model-free, local linear approximation to such solutions and use it to construct efficient exploration policies. We compare our model-free approach to other exploration techniques, including one with the best known PAC bounds, and show that ours is both based on a well-defined optimization problem and empirically efficient.
Improving Hybrid Vehicle Fuel Efficiency Using Inverse Reinforcement Learning
Vogel, Adam (Stanford University) | Ramachandran, Deepak (Honda Research Institute (USA) Inc.) | Gupta, Rakesh (Honda Research Institute (USA) Inc.) | Raux, Antoine (Honda Research Institute (USA) Inc.)
Deciding what mix of engine and battery power to use is critical to hybrid vehicles' fuel efficiency. Current solutions consider several factors such as the charge of the battery and how efficiently the engine operates at a given speed. Previous research has shown that by taking into account the future power requirements of the vehicle, a more efficient balance of engine vs. battery power can be attained. In this paper, we utilize a probabilistic driving route prediction system, trained using Inverse Reinforcement Learning, to optimize the hybrid control policy. Our approach considers routes that the driver is likely to be taking, computing an optimal mix of engine and battery power. This approach has the potential to increase vehicle power efficiency while not requiring any hardware modification or change in driver behavior. Our method outperforms a standard hybrid control policy, yielding an average of 1.22% fuel savings.
Scalable Inverse Reinforcement Learning via Instructed Feature Construction
Singliar, Tomas (Boeing Research and Technology) | Margineantu, Dragos D. (Boeing Research and Technology)
Inverse reinforcement learning (IRL) techniques (Ng and Russell, 2000) provide a foundation for detecting abnormal agent behavior and predicting agent intent through estimating its reward function. Unfortunately, IRL algorithms suffer from the large dimensionality of the reward function space. Meanwhile, most applications that can benefit from an IRL-based approach to assessing agent intent involve interaction with an analyst or domain expert. This paper proposes a procedure for scaling up IRL by eliciting good IRL basis functions from the domain expert. Further, we propose a new paradigm for modeling limited rationality. Unlike traditional models of limited rationality, which assume that an agent makes stochastic choices while treating the value function as known, we propose that observed irrational behavior is actually due to uncertainty about the cost of future actions. This treatment normally leads to a POMDP formulation, which is unnecessarily complicated, and we show that adding a simple noise term to the value function approximation accomplishes the same at a much smaller cost.