Goto

Collaborating Authors

 Reinforcement Learning


Open Problems in Universal Induction & Intelligence

arXiv.org Artificial Intelligence

Specialized intelligent systems can be found everywhere: finger print, handwriting, speech, and face recognition, spam filtering, chess and other game programs, robots, et al. This decade the first presumably complete mathematical theory of artificial intelligence based on universal induction-prediction-decision-action has been proposed. This information-theoretic approach solidifies the foundations of inductive inference and artificial intelligence. Getting the foundations right usually marks a significant progress and maturing of a field. The theory provides a gold standard and guidance for researchers working on intelligent algorithms. The roots of universal induction have been laid exactly half-a-century ago and the roots of universal intelligence exactly one decade ago. So it is timely to take stock of what has been achieved and what remains to be done. Since there are already good recent surveys, I describe the state-of-the-art only in passing and refer the reader to the literature. This article concentrates on the open problems in universal induction and its extension to universal intelligence.


Efficient Skill Learning Using Abstraction Selection

AAAI Conferences

We present an algorithm for selecting an appropriate abstraction when learning a new skill. We show empirically that it can consistently select an appropriate abstraction using very little sample data, and that it significantly improves skill learning performance in a reasonably large real-valued reinforcement learning domain.


Inverse Reinforcement Learning in Partially Observable Environments

AAAI Conferences

Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behaviour of an expert. Most of the existing algorithms for IRL assume that the expert's environment is modeled as a Markov decision process (MDP), although they should be able to handle partially observable settings in order to widen the applicability to more realistic scenarios. In this paper, we present an extension of the classical IRL algorithm by Ng and Russell to partially observable environments. We discuss technical issues and challenges, and present the experimental results on some of the benchmark partially observable domains.


Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning

AAAI Conferences

Appropriately designing sampling policies is highly important for obtaining better control policies in reinforcement learning. In this paper, we first show that the least-squares policy iteration (LSPI) framework allows us to employ statistical active learning methods for linear regression. Then we propose a design method of good sampling policies for efficient exploration, which is particularly useful when the sampling cost of immediate rewards is high. We demonstrate the usefulness of the proposed method, named active policy iteration (API), through simulations with a batting robot.


Feature Reinforcement Learning: Part I: Unstructured MDPs

arXiv.org Artificial Intelligence

General-purpose, intelligent, learning agents cycle through sequences of observations, actions, and rewards that are complex, uncertain, unknown, and non-Markovian. On the other hand, reinforcement learning is well-developed for small finite state Markov decision processes (MDPs). Up to now, extracting the right state representations out of bare observations, that is, reducing the general agent setup to the MDP framework, is an art that involves significant effort by designers. The primary goal of this work is to automate the reduction process and thereby significantly expand the scope of many existing reinforcement learning algorithms and the agents that employ them. Before we can think of mechanizing this search for suitable MDPs, we need a formal objective criterion. The main contribution of this article is to develop such a criterion. I also integrate the various parts into one learning algorithm. Extensions to more realistic dynamic Bayesian networks are developed in Part II [Hut09c]. The role of POMDPs is also considered there.


The Crawler, A Class Room Demonstrator for Reinforcement Learning

AAAI Conferences

We present a little crawling robot with a two DOF arm that learns to move forward within about 15 seconds in real time. Due to its small size and weight the robot is ideally suited for classroom demonstrations as well as for talks to the public. Students who want to practice their knowledge about reinforcement learning and value iteration can use a wireless connection to a PC and monitor the internal state of the robot such as the value function or the reward table. Due to its adaptivity, depending on the surface properties of the underground the robot may surprise its audience with unexpected but efficient walking policies. The GUI is open source and the robot hardware is available as a kit from the authors.


Generalizing and Categorizing Skills in Reinforcement Learning Agents Using Partial Policy Homomorphisms

AAAI Conferences

A reinforcement learning agent involved in life-long learning in a complex and dynamic environment has to have the ability to utilize control knowledge acquired in one situation in novel contexts. As part of this, it is important for the learning agent not only to be able to learn a new skill for a specific instance of a task but also to identify similar tasks, form a reusable skill and representational abstractions for the corresponding ''task type'', and to apply these abstractions in new, previously unseen contexts. This paper presents a new approach to policy generalization that derives an abstract policy for a set of similar tasks (a ''task type'') by constructing a partial policy homomorphism from a set of basic policies learned for previously seen task instances. The resulting generalized policy can then be applied in new contexts to address new instances of similar tasks. As opposed to many recent approaches in lifelong learning systems, this approach allows to identify similar tasks based on the functional characteristics of the corresponding skills and provides a means of transferring the learned knowledge to new situations without the need for complete knowledge of the state space and the system dynamics in the new environment. To illustrate the new policy generalization method and to demonstrate its ability to reuse the gained knowledge in new contexts, it is applied to a set of grid world examples.


Optimistic Simulated Exploration as an Incentive for Real Exploration

arXiv.org Artificial Intelligence

Many reinforcement learning exploration techniques are overly optimistic and try to explore every state. Such exploration is impossible in environments with the unlimited number of states. I propose to use simulated exploration with an optimistic model to discover promising paths for real exploration. This reduces the needs for the real exploration.


Optimistic Initialization and Greediness Lead to Polynomial Time Learning in Factored MDPs - Extended Version

arXiv.org Artificial Intelligence

In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm, maintains an empirical model of the FMDP in a conventional way, and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that with suitable initialization (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps when the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step costs of the algorithm are also polynomial. To our best knowledge, FOIM is the first algorithm with these properties. This extended version contains the rigorous proofs of the main theorem. A version of this paper appeared in ICML'09.


A Methodology for Learning Players' Styles from Game Records

arXiv.org Artificial Intelligence

In Chess, as in other popular strategic board games, players have different styles. For example, in Chess some players are more "positional" and other more "tactical", and this difference in style will affect their move choice in any given board position, and more generally their overall plan. The problem we tackle in this paper is that of applying machine learning to teach a computer to discriminate between players based on their style. Before we explain our methodology, we briefly review the method of temporal difference learning, which is central to our approach. Temporal difference learning [Sut88] is a machine learning technique, originating from the seminal work of Samuel [Sam59], in which learning occurs by minimising the differences between predictions and actual outcomes of a temporal sequence of observations. Samuel [Sam59] used the game of Checkers as a vehicle to study the feasibility of a computer learning from experience. Although the program written by Samuel did not achieve master strength, it was the precursor of the Checkers program Chinook [Sch97, SHJ01], which was the first computer program to win a match against a human world champion.