Goto

Collaborating Authors

 Country


Balancing Multiple Sources of Reward in Reinforcement Learning

Neural Information Processing Systems

For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.


Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

Neural Information Processing Systems

The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned online using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a cooperative multi-agent task. The product of experts model is found to perform comparably to table-based Q-Iearning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.


Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice

Neural Information Processing Systems

Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning which overcomes this difficulty and provably converges to a unique solution. By contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem.


Robust Reinforcement Learning

Neural Information Processing Systems

This paper proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both off-line learning by simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, often unwanted results. Based on the theory of H oocontrol, we consider a differential game in which a'disturbing' agent (disturber) tries to make the worst possible disturbance while a'control' agent (actor) tries to make the best control input. The problem is formulated as finding a minmax solution of a value function that takes into account the norm of the output deviation and the norm of the disturbance.


Automated State Abstraction for Options using the U-Tree Algorithm

Neural Information Processing Systems

Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.


Reinforcement Learning with Function Approximation Converges to a Region

Neural Information Processing Systems

Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(O) and V(O); the latter algorithm was used in the well-known TD-Gammon program. 1 Introduction Although there are convergent online algorithms (such as TD()') [1]) for learning the parameters of a linear approximation to the value function of a Markov process, no way is known to extend these convergence proofs to the task of online approximation of either the state-value (V*) or the action-value (Q*) function of a general Markov decision process. In fact, there are known counterexamples to many proposed algorithms.


Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes

Neural Information Processing Systems

In multi-service communications networks, such as Asynchronous Transfer Mode (ATM) networks, resource control is of crucial importance for the network operator as well as for the users. The objective is to maintain the service quality while maximizing the operator's revenue. At the call level, service quality (Grade of Service) is measured in terms of call blocking probabilities, and the key resource to be controlled is bandwidth. Network routing and call admission control (CAC) are two such resource control problems. Markov decision processes offer a framework for optimal CAC and routing [1]. By modelling the dynamics of the network with traffic and computing control policies using dynamic programming [2], resource control is optimized. A standard assumption in such models is that calls arrive according to Poisson processes. This makes the models of the dynamics relatively simple. Although the Poisson assumption is valid for most user-initiated requests in communications networks, a number of studies [3, 4, 5] indicate that many types of arrival similar.



Programmable Reinforcement Learning Agents

Neural Information Processing Systems

We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process.The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills. 1 Introduction The field of reinforcement learning has recently adopted the idea that the application of prior knowledge may allow much faster learning and may indeed be essential if realworld environments are to be addressed. For learning behaviors, the most obvious form of prior knowledge provides a partial description of desired behaviors. Several languages for partial descriptions have been proposed, including Hierarchical Abstract Machines (HAMs) [8], semi-Markov options [12], and the MAXQ framework [4]. This paper describes extensions to the HAM language that substantially increase its expressive power, using constructs borrowed from programming languages. Obviously, increasing expressiveness makes it easier for the user to supply whatever prior knowledge is available, and to do so more concisely.


Bayesian Video Shot Segmentation

Neural Information Processing Systems

Prior knowledge about video structure can be used both as a means to improve the peiformance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulations is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.