AITopics | Reinforcement Learning


Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.


Nonparametric Bayesian Policy Priors for Reinforcement Learning

Doshi-Velez, Finale, Wingate, David, Roy, Nicholas, Tenenbaum, Joshua B.

Neural Information Processing Systems, Feb-15-2020, 00:56:47 GMT

We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning.

agent, knowledge, reinforcement learning


Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.72)


Bootstrapping Apprenticeship Learning

Boularias, Abdeslam, Chaib-draa, Brahim

Neural Information Processing Systems, Feb-15-2020, 00:42:36 GMT

We consider the problem of apprenticeship learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is maximizing a utility function that is a linear combination of state-action features. Most IRL algorithms use a simple Monte Carlo estimation to approximate the expected feature counts under the expert's policy. In this paper, we show that the quality of the learned policies is highly sensitive to the error in estimating the feature counts. To reduce this error, we introduce a novel approach for bootstrapping the demonstration by assuming that (i) the expert is (near-)optimal, and (ii) the dynamics of the system are known.
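The "simple Monte Carlo estimation" of expected feature counts mentioned in this abstract can be sketched as follows. This is only an illustration of the baseline estimator, not the paper's bootstrapping method; the function names, the one-hot feature map `one_hot_phi`, the discount factor, and the toy demonstrations are all assumptions made for the example.

```python
import numpy as np

def mc_feature_counts(trajectories, phi, num_features, gamma=0.95):
    """Average discounted feature counts over demonstrated trajectories.

    trajectories: list of trajectories, each a list of (state, action) pairs.
    phi: hypothetical feature map, phi(state, action) -> 1-D array.
    """
    mu = np.zeros(num_features)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            # Each visited state-action pair contributes its features,
            # discounted by how late in the trajectory it occurs.
            mu += (gamma ** t) * phi(s, a)
    return mu / len(trajectories)

def one_hot_phi(state, action, num_states=3, num_actions=2):
    """Hypothetical one-hot feature map over (state, action) pairs."""
    v = np.zeros(num_states * num_actions)
    v[state * num_actions + action] = 1.0
    return v

# Two short made-up expert demonstrations.
demos = [[(0, 1), (1, 0), (2, 1)], [(0, 1), (2, 1)]]
mu_hat = mc_feature_counts(demos, one_hot_phi, num_features=6, gamma=0.5)
```

With only a few short demonstrations, many entries of `mu_hat` stay at zero, which is the kind of estimation error the abstract says learned policies are sensitive to.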

bootstrapping apprenticeship learning, demonstration, feature count


Genre: Research Report (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)


Predictive State Temporal Difference Learning

Boots, Byron, Gordon, Geoffrey J.

Neural Information Processing Systems, Feb-15-2020, 00:42:28 GMT

We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation.
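For readers unfamiliar with the linear temporal difference component that this abstract builds on, here is a minimal sketch of plain TD(0) with a linear value function. It does not include the subspace identification step and is not the paper's algorithm; the two-state chain, the one-hot features, and the step size are assumptions chosen for illustration.

```python
import numpy as np

def linear_td0(transitions, num_features, alpha=0.1, gamma=0.9, sweeps=200):
    """TD(0) for a linear value function V(s) = w . phi(s).

    transitions: list of (phi_s, reward, phi_next) tuples, where the
    feature vector of a terminal successor is all zeros.
    """
    w = np.zeros(num_features)
    for _ in range(sweeps):
        for phi_s, r, phi_next in transitions:
            # TD error: bootstrapped target minus current estimate.
            td_error = r + gamma * (w @ phi_next) - (w @ phi_s)
            w += alpha * td_error * phi_s
    return w

# Hypothetical chain: state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
phi0, phi1, terminal = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.zeros(2)
w = linear_td0([(phi0, 0.0, phi1), (phi1, 1.0, terminal)], num_features=2)
```

With one-hot features the weights are simply per-state values; the quality of the approximation in larger problems depends on how well the chosen features summarize state, which is the issue the abstract addresses.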

predictive state temporal difference learning, reinforcement, value function approximation, (4 more...)


Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)


Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Kim, Dongho, Kim, Kee-Eung, Poupart, Pascal

Neural Information Processing Systems, Feb-15-2020, 00:27:00 GMT

In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected long-term total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems.
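The constrained objective described in this abstract is commonly written in the following form. The notation is an assumption made here for illustration (reward function $R$, cost function $C$, discount factor $\gamma$, cost budget $\hat{c}$); the abstract itself does not fix these symbols or state whether the totals are discounted.

```latex
\begin{aligned}
\max_{\pi}\quad & \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)\right] \\
\text{subject to}\quad & \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, C(s_t, a_t)\right] \le \hat{c}
\end{aligned}
```

Under this reading, exploration requirements are encoded by choosing $C$ and $\hat{c}$ so that costly exploratory actions consume the budget.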

bayesian reinforcement learning, cost-sensitive exploration


Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)


Selecting the State-Representation in Reinforcement Learning

Maillard, Odalric-Ambrym, Ryabko, Daniil, Munos, Rémi

Neural Information Processing Systems, Feb-15-2020, 00:11:38 GMT

The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Knowing neither which of the models is the correct one, nor the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order $T^{2/3}$, where $T$ is the horizon time.

reinforcement learning, selecting, state-representation, (1 more...)


Industry: Education > Focused Education > Special Education (0.31)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.67)


Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

Ibrahimi, Morteza, Javanmard, Adel, Van Roy, Benjamin

Neural Information Processing Systems, Feb-15-2020, 00:11:20 GMT

We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, an asymptotic regret bound of $\tilde{O}(\sqrt{T})$ was shown for $T \gg p$, where $p$ is the dimension of the state space. In this work we consider the case where the matrices describing the dynamics of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that, for $p \gg 1$ and $T \gg \mathrm{polylog}(p)$, achieves a regret bound of $\tilde{O}(p \sqrt{T})$.
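For context, a linear quadratic system and the regret referred to in this abstract are usually set up along the following lines. This is the standard textbook formulation with notation assumed here ($A$, $B$ for the unknown dynamics matrices, $Q$, $R$ for the cost matrices, $w_t$ for noise, $J^{*}$ for the optimal average cost); the abstract does not give these definitions, so details may differ from the paper.

```latex
\begin{aligned}
x_{t+1} &= A\,x_t + B\,u_t + w_{t+1}, \\
c_t &= x_t^{\top} Q\, x_t + u_t^{\top} R\, u_t, \\
\mathrm{Regret}(T) &= \sum_{t=0}^{T-1} \bigl(c_t - J^{*}\bigr).
\end{aligned}
```

In this setup $x_t \in \mathbb{R}^{p}$, so the sparsity assumption in the abstract concerns the entries of $A$ and $B$.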

adaptive control scheme, efficient reinforcement learning, high dimensional linear quadratic system, (4 more...)


Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)


A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Wagner, Paul

Neural Information Processing Systems, Feb-15-2020, 00:10:55 GMT

A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view of this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches.

approximate policy iteration, policy oscillation phenomenon, reinterpretation, (1 more...)


Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)


Universal Value Density Estimation for Imitation Learning and Goal-Conditioned Reinforcement Learning

Schroecker, Yannick, Isbell, Charles

arXiv.org Machine Learning, Feb-15-2020

This work considers two distinct settings: imitation learning and goal-conditioned reinforcement learning. In either case, effective solutions require the agent to reliably reach a specified state (a goal), or set of states (a demonstration). Drawing a connection between probabilistic long-term dynamics and the desired value function, this work introduces an approach which utilizes recent advances in density estimation to effectively learn to reach a given state. As our first contribution, we use this approach for goal-conditioned reinforcement learning and show that it is efficient and does not suffer from hindsight bias in stochastic domains. As our second contribution, we extend the approach to imitation learning and show that it achieves state-of-the-art demonstration sample-efficiency on standard benchmark tasks.

agent, imitation learning, universal value density estimation, (7 more...)

arXiv.org Machine Learning

2002.06473

Country: North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)


The Archimedean trap: Why traditional reinforcement learning will probably not yield AGI

Alexander, Samuel Allen

arXiv.org Artificial Intelligence, Feb-15-2020

Whenever we measure anything using a particular number system, the corresponding measurements will be constrained by the structure of that number system. If the number system has a different structure than the things we are measuring with it, then our measurements will suffer accordingly, just as if we were trying to force square pegs into round holes. For example, the natural numbers make lousy candidates for measuring lengths in a physics laboratory. Lengths in the lab have properties such as, for example, the fact that for any two distinct lengths, there is an intermediate length strictly between them. The natural numbers lack this property. Imagine the poor physicist, brought up in a world of only natural numbers, scratching his or her head upon encountering a rod with length strictly between two rods of length 1 and 2.

number system, real number, reinforcement, (14 more...)

arXiv.org Artificial Intelligence

2002.10221

Country:

North America > United States > Ohio (0.04)
North America > United States > Indiana (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)


Maxmin Q-learning: Controlling the Estimation Bias of Q-learning

Lan, Qingfeng, Pan, Yangchen, Fyshe, Alona, White, Martha

arXiv.org Artificial Intelligence, Feb-15-2020

Q-learning suffers from overestimation bias, because it approximates the maximum action value using the maximum estimated action value. Algorithms have been proposed to reduce overestimation bias, but we lack an understanding of how bias interacts with performance, and the extent to which existing algorithms mitigate bias. In this paper, we 1) highlight that the effect of overestimation bias on learning efficiency is environment-dependent; 2) propose a generalization of Q-learning, called Maxmin Q-learning, which provides a parameter to flexibly control bias; 3) show theoretically that there exists a parameter choice for Maxmin Q-learning that leads to unbiased estimation with a lower approximation variance than Q-learning; and 4) prove the convergence of our algorithm in the tabular case, as well as convergence of several previous Q-learning variants, using a novel Generalized Q-learning framework. We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems.
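The overestimation effect described in this abstract can be illustrated with a small simulation. The sketch below assumes that the Maxmin target takes the minimum over several independent action-value estimates before maximizing over actions, with the number of estimators as the bias-controlling parameter; the abstract does not spell this out, so treat that construction, the function names, and the noise model as assumptions made for illustration.

```python
import numpy as np

def maxmin_target(q_estimates, reward, gamma=0.99):
    """Bootstrapped target from several estimates of the next state's values.

    q_estimates: array of shape (num_estimators, num_actions).
    Assumed construction: min over estimators, then max over actions.
    """
    q_min = q_estimates.min(axis=0)
    return reward + gamma * q_min.max()

def average_max_estimate(num_estimators, num_actions=5, num_trials=20000, seed=0):
    """Mean of max_a min_i Q_i(a) when every true action value is zero.

    Each estimate is the true value (zero) plus unit Gaussian noise, so any
    nonzero mean returned here is pure estimation bias.
    """
    rng = np.random.default_rng(seed)
    noisy = rng.normal(0.0, 1.0, size=(num_trials, num_estimators, num_actions))
    return noisy.min(axis=1).max(axis=1).mean()

bias_single = average_max_estimate(1)  # ordinary max over one noisy estimate
bias_two = average_max_estimate(2)     # min over two estimates, then max
example_target = maxmin_target(np.array([[1.0, 2.0], [3.0, 0.5]]), reward=1.0, gamma=0.5)
```

With a single estimator the averaged maximum is clearly positive even though every true value is zero; adding a second estimator pulls it down, which is the sense in which the number of estimators acts as a bias control in this sketch.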

algorithm, maxmin q-learning, q-learning, (14 more...)

arXiv.org Artificial Intelligence

2002.06487

Country:

North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
