Goto

Collaborating Authors

 Reinforcement Learning


Algorithmic and Human Teaching of Sequential Decision Tasks

AAAI Conferences

A helpful teacher can significantly improve the learning rate of a learning agent. Teaching algorithms have been formally studied within the field of Algorithmic Teaching. These give important insights into how a teacher can select the most informative examples while teachinga new concept. However the field has so far focused purely on classification tasks. In this paper we introducea novel method for optimally teaching sequential decision tasks. We present an algorithm that automatically selects the set of most informative demonstrations andevaluate it on several navigation tasks. Next, we explore the idea of using this algorithm to produce instructions for humans on how to choose examples when teaching sequential decision tasks. We present a user study that demonstrates the utility of such instructions.


Competing with Humans at Fantasy Football: Team Formation in Large Partially-Observable Domains

AAAI Conferences

We present the first real-world benchmark for sequentially-optimal team formation, working within the framework of a class of online football prediction games known as Fantasy Football. We model the problem as a Bayesian reinforcement learning one, where the action space is exponential in the number of players and where the decision maker's beliefs are over multiple characteristics of each footballer. We then exploit domain knowledge to construct computationally tractable solution techniques in order to build a competitive automated Fantasy Football manager. Thus, we are able to establish the baseline performance in this domain, even without complete information on footballers' performances (accessible to human managers), showing that our agent is able to rank at around the top percentile when pitched against 2.5M human players.


Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs

AAAI Conferences

Decentralized partially observable Markov decision processes (Dec-POMDPs) offer a powerful modeling technique for realistic multi-agent coordination problems under uncertainty. Prevalent solution techniques are centralized and assume prior knowledge of the model. We propose a distributed reinforcement learning approach, where agents take turns to learn best responses to each otherโ€™s policies. This promotes decentralization of the policy computation problem, and relaxes reliance on the full knowledge of the problem parameters. We derive the relation between the sample complexity of best response learning and error tolerance. Our key contribution is to show that sample complexity could grow exponentially with the problem horizon. We show empirically that even if the sample requirement is set lower than what theory demands, our learning approach can produce (near) optimal policies in some benchmark Dec-POMDP problems.


Approximate Policy Iteration with Linear Action Models

AAAI Conferences

In this paper we consider the problem of finding a good policy given some batch data.We propose a new approach, LAM-API, that first builds a so-called linear action model (LAM) from the data and then uses the learned model and the collected data in approximate policy iteration (API) to find a good policy.A natural choice for the policy evaluation step in this algorithm is to use least-squares temporal difference (LSTD) learning algorithm.Empirical results on three benchmark problems show that this particular instance of LAM-API performs competitively as compared with LSPI, both from the point of view of data and computational efficiency.


Context Tree Maximizing

AAAI Conferences

Recent developments in reinforcement learning for non-Markovianproblems witness a surge in history-based methods, among which weare particularly interested in two frameworks, PhiMDP and MC-AIXI-CTW. PhiMDP attempts to reduce the general RL problem, where the environment's states and dynamics are both unknown, toan MDP, while MC-AIXI-CTW incrementally learns a mixture of contexttrees as its environment model. The main idea of PhiMDP is toconnect generic reinforcement learning with classical reinforcementlearning. The first implementation of PhiMDP relies on astochastic search procedure for finding a tree that minimizes acertain cost function. This does not guarantee finding theminimizing tree, or even a good one, given limited search time. As aconsequence it appears that the approach has difficulties with largedomains. MC-AIXI-CTW is attractive in that it can incrementally andanalytically compute the internal model through interactions withthe environment. Unfortunately, it is computationally demanding dueto requiring heavy planning simulations at every single time step.We devise a novel approach called CTMRL, which analytically andefficiently finds the cost-minimizing tree. Instead of thecontext-tree weighting method that MC-AIXI-CTW is based on, we usethe closely related context-tree maximizing algorithm that selectsjust one single tree. This approach falls under the PhiMDPframework, which allows the replacement of the costly planningcomponent of MC-AIXI-CTW with simple Q-Learning. Our empiricalinvestigation show that CTMRL finds policies of quality as good as MC-AIXI-CTW's on sixdomains including a challenging Pacman domain, but in an order ofmagnitude less time.


Kernel-Based Reinforcement Learning on Representative States

AAAI Conferences

Markov decision processes (MDPs) are an established framework for solving sequential decision-making problems under uncertainty. In this work, we propose a new method for batch-mode reinforcement learning (RL) with continuous state variables. The method is an approximation to kernel-based RL on a set of k representative states. Similarly to kernel-based RL, our solution is a fixed point of a kernelized Bellman operator and can approximate the optimal solution to an arbitrary level of granularity. Unlike kernel-based RL, our method is fast. In particular, our policies can be computed in O ( n ) time, where n is the number of training examples. The time complexity of kernel-based RL is ฮฉ( n 2 ). We introduce our method, analyze its convergence, and compare it to existing work. The method is evaluated on two existing control problems with 2 to 4 continuous variables and a new problem with 64 variables. In all cases, we outperform state-of-the-art results and offer simpler solutions.


Adaptive Step-Size for Online Temporal Difference Learning

AAAI Conferences

The step-size, often denoted as ฮฑ, is a key parameter for most incremental learning algorithms. Its importance is especially pronounced when performing online temporal difference (TD) learning with function approximation. Several methods have been developed to adapt the step-size online. These range from straightforward back-off strategies to adaptive algorithms based on gradient descent. We derive an adaptive upper bound on the step-size parameter to guarantee that online TD learning with linear function approximation will not diverge. We then empirically evaluate algorithms using this upper bound as a heuristic for adapting the step-size parameter online. We compare performance with related work including HL(ฮป) and Autostep. Our results show that this adaptive upper bound heuristic out-performs all existing methods without requiring any meta-parameters. This effectively eliminates the need to tune the learning rate of temporal difference learning with linear function approximation.


Investigating Contingency Awareness Using Atari 2600 Games

AAAI Conferences

Contingency awareness is the recognition that some aspects of a future observation are under an agent's control while others are solely determined by the environment. This paper explores the idea of contingency awareness in reinforcement learning using the platform of Atari 2600 games. We introduce a technique for accurately identifying contingent regions and describe how to exploit this knowledge to generate improved features for value function approximation. We evaluate the performance of our techniques empirically, using 46 unseen, diverse, and challenging games for the Atari 2600 console. Our results suggest that contingency awareness is a generally useful concept for model-free reinforcement learning agents.


An Intelligent Battery Controller Using Bias-Corrected Q-learning

AAAI Conferences

The transition to renewables requires storage to help smooth short-term variations in energy from wind and solar sources, as well as to respond to spikes in electricity spot prices, which can easily exceed 20 times their average. Efficient operation of an energy storage device is a fundamental problem, yet classical algorithms such as $Q$-learning can diverge for millions of iterations, limiting practical applications. We have traced this behavior to the max-operator bias, which is exacerbated by high volatility in the reward function, and high discount factors due to the small time steps. We propose an elegant bias correction procedure and demonstrate its effectiveness.


Planning with Markov Decision Processes: An AI Perspective

Morgan & Claypool Publishers

Markov Decision Processes (MDPs) are widely popular in Artificial Intelligence for modeling sequential decision-making scenarios with probabilistic dynamics. They are the framework of choice when designing an intelligent agent that needs to act for long periods of time in an environment where its actions could have uncertain outcomes. MDPs are actively researched in two related subareas of AI, probabilistic planning and reinforcement learning. Probabilistic planning assumes known models for the agent's goals and domain dynamics, and focuses on determining how the agent should behave to achieve its objectives. On the other hand, reinforcement learning additionally learns these models based on the feedback the agent gets from the environment.