Goto

Collaborating Authors

 Reinforcement Learning


Nonparametric Bayesian Policy Priors for Reinforcement Learning

Neural Information Processing Systems

We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. Papers published at the Neural Information Processing Systems Conference.


Bootstrapping Apprenticeship Learning

Neural Information Processing Systems

We consider the problem of apprenticeship learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is maximizing a utility function that is a linear combination of state-action features. Most IRL algorithms use a simple Monte Carlo estimation to approximate the expected feature counts under the expert's policy. In this paper, we show that the quality of the learned policies is highly sensitive to the error in estimating the feature counts. To reduce this error, we introduce a novel approach for bootstrapping the demonstration by assuming that: (i), the expert is (near-)optimal, and (ii), the dynamics of the system is known.


Predictive State Temporal Difference Learning

Neural Information Processing Systems

We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation.


Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Neural Information Processing Systems

In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected long-term total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. Papers published at the Neural Information Processing Systems Conference.


Selecting the State-Representation in Reinforcement Learning

Neural Information Processing Systems

The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T {2/3} where T is the horizon time. Papers published at the Neural Information Processing Systems Conference.


Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

Neural Information Processing Systems

We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, an asymptotic regret bound of $\tilde{O}(\sqrt{T})$ was shown for $T \gg p$ where $p$ is the dimension of the state space. In this work we consider the case where the matrices describing the dynamic of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that for $p \gg 1$ and $T \gg \polylog(p)$ achieves a regret bound of $\tilde{O}(p \sqrt{T})$.


A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Neural Information Processing Systems

A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches.


Universal Value Density Estimation for Imitation Learning and Goal-Conditioned Reinforcement Learning

arXiv.org Machine Learning

This work considers two distinct settings: imitation learning and goal-conditioned reinforcement learning. In either case, effective solutions require the agent to reliably reach a specified state (a goal), or set of states (a demonstration). Drawing a connection between probabilistic long-term dynamics and the desired value function, this work introduces an approach which utilizes recent advances in density estimation to effectively learn to reach a given state. As our first contribution, we use this approach for goal-conditioned reinforcement learning and show that it is both efficient and does not suffer from hindsight bias in stochastic domains. As our second contribution, we extend the approach to imitation learning and show that it achieves state-of-the art demonstration sample-efficiency on standard benchmark tasks.


The Archimedean trap: Why traditional reinforcement learning will probably not yield AGI

arXiv.org Artificial Intelligence

Whenever we measure anything using a particular number system, the corresponding measurements will be constrained by the structure of that number system. If the number system has a different structure than the things we are measuring with it, then our measurements will suffer accordingly, just as if we were trying to force square pegs into round holes. For example, the natural numbers make lousy candidates for measuring lengths in a physics laboratory. Lengths in the lab have properties such as, for example, the fact that for any two distinct lengths, there is an intermediate length strictly between them. The natural numbers lack this property. Imagine the poor physicist, brought up in a world of only natural numbers, scratching his or her head upon encountering a rod with length strictly between two rods of length 1 and 2.


Maxmin Q-learning: Controlling the Estimation Bias of Q-learning

arXiv.org Artificial Intelligence

Q-learning suffers from overestimation bias, because it approximates the maximum action value using the maximum estimated action value. Algorithms have been proposed to reduce overestimation bias, but we lack an understanding of how bias interacts with performance, and the extent to which existing algorithms mitigate bias. In this paper, we 1) highlight that the effect of overestimation bias on learning efficiency is environment-dependent; 2) propose a generalization of Q-learning, called \emph{Maxmin Q-learning}, which provides a parameter to flexibly control bias; 3) show theoretically that there exists a parameter choice for Maxmin Q-learning that leads to unbiased estimation with a lower approximation variance than Q-learning; and 4) prove the convergence of our algorithm in the tabular case, as well as convergence of several previous Q-learning variants, using a novel Generalized Q-learning framework. We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems.