Goto

Collaborating Authors

 finite-state controller


Finite-Time Analysis of Natural Actor-Critic for POMDPs

Cayci, Semih, He, Niao, Srikant, R.

arXiv.org Artificial Intelligence

We consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) with large or even countably infinite state spaces, where the controller has access to only noisy observations of the underlying controlled Markov chain. We consider a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation. In particular, in addition to the function approximation and statistical errors that also arise in MDPs, we explicitly characterize the error due to the use of finite-state controllers. This additional error is stated in terms of the total variation distance between the traditional belief state in POMDPs and the posterior distribution of the hidden state when using a finite-state controller. Further, we show that this error can be made small in the case of sliding-block controllers by using larger block sizes.


An Improved Policy Iteration Algorithm for Partially Observable MDPs

Neural Information Processing Systems

A new policy iteration algorithm for partially observable Markov decision processes is presented that is simpler and more efficient than an earlier policy iteration algorithm of Sondik (1971,1978). The key simplification is representation of a policy as a finite-state controller. This representation makes policy evaluation straightforward. The pa(cid:173) per's contribution is to show that the dynamic-programming update used in the policy improvement step can be interpreted as the trans(cid:173) formation of a finite-state controller into an improved finite-state con(cid:173) troller. The new algorithm consistently outperforms value iteration as an approach to solving infinite-horizon problems.


Robust Finite-State Controllers for Uncertain POMDPs

Cubuktepe, Murat, Jansen, Nils, Junges, Sebastian, Marandi, Ahmadreza, Suilen, Marnix, Topcu, Ufuk

arXiv.org Artificial Intelligence

Uncertain partially observable Markov decision processes (uPOMDPs) allow the probabilistic transition and observation functions of standard POMDPs to belong to a so-called uncertainty set. Such uncertainty sets capture uncountable sets of probability distributions. We develop an algorithm to compute finite-memory policies for uPOMDPs that robustly satisfy given specifications against any admissible distribution. In general, computing such policies is both theoretically and practically intractable. We provide an efficient solution to this problem in four steps. (1) We state the underlying problem as a nonconvex optimization problem with infinitely many constraints. (2) A dedicated dualization scheme yields a dual problem that is still nonconvex but has finitely many constraints. (3) We linearize this dual problem and (4) solve the resulting finite linear program to obtain locally optimal solutions to the original problem. The resulting problem formulation is exponentially smaller than those resulting from existing methods. We demonstrate the applicability of our algorithm using large instances of an aircraft collision-avoidance scenario and a novel spacecraft motion planning case study.


Policies that Generalize: Solving Many Planning Problems with the Same Policy

Bonet, Blai (Universidad Simon Bolivar) | Geffner, Hector (ICREA and Universitat Pompeu Fabra)

AAAI Conferences

We establish conditions under which memoryless policies and finite-state controllers that solve one partially observable non-deterministic problem (PONDP) generalize to other problems; namely, problems that have a similar structure and share the same action and observation space. This is relevant to generalized planning where plans that work for many problems are sought, and to transfer learning where knowledge gained in the solution of one problem is to be used on related problems. We use a logical setting where uncertainty is represented by sets of states and the goal is to be achieved with certainty. While this gives us crisp notions of solution policies and generalization, the account also applies to probabilistic  PONDs, i.e., Goal POMDPs.


On the Inference of Turing Machines from Sample Computations A. W. Biermann

AI Classics

This paper will be concerned with the problem of obtaining this performance from the machine by giving it examples of the desired computation and having it program itself. We will be concerned with designing a trainable Turing machine although the concepts presented are applicable in a much more general context as discussed in Section 4. The Turing machine to be discussed here will have an infinite one dimensional tape and will have the capability in one move to read a symbol on the tape, print a new symbol to replace the one just read, and step right or left one increment on the tape. It will have a deterministic finite-state controller with a designated initial state which will upon receiving an input symbol read from the tape, yield the symbol to be printed and the step direction (right or left) to be made.


Solving POMDPs by Searching in Policy Space

Hansen, Eric A.

arXiv.org Artificial Intelligence

Most algorithms for solving POMDPs iteratively improve a value function that implicitly represents a policy and are said to search in value function space. This paper presents an approach to solving POMDPs that represents a policy explicitly as a finite-state controller and iteratively improves the controller by search in policy space. Two related algorithms illustrate this approach. The first is a policy iteration algorithm that can outperform value iteration in solving infinitehorizon POMDPs. It provides the foundation for a new heuristic search algorithm that promises further speedup by focusing computational effort on regions of the problem space that are reachable, or likely to be reached, from a start state.


My Brain is Full: When More Memory Helps

Lusena, Christopher, Li, Tong, Sittinger, Shelia, Wells, Chris, Goldsmith, Judy

arXiv.org Artificial Intelligence

We consider the problem of finding good finite-horizon policies for POMDPs under the expected reward metric. The policies considered are {em free finite-memory policies with limited memory}; a policy is a mapping from the space of observation-memory pairs to the space of action-memeory pairs (the policy updates the memory as it goes), and the number of possible memory states is a parameter of the input to the policy-finding algorithms. The algorithms considered here are preliminary implementations of three search heuristics: local search, simulated annealing, and genetic algorithms. We compare their outcomes to each other and to the optimal policies for each instance. We compare run times of each policy and of a dynamic programming algorithm for POMDPs developed by Hansen that iteratively improves a finite-state controller --- the previous state of the art for finite memory policies. The value of the best policy can only improve as the amount of memory increases, up to the amount needed for an optimal finite-memory policy. Our most surprising finding is that more memory helps in another way: given more memory than is needed for an optimal policy, the algorithms are more likely to converge to optimal-valued policies.


A Function Approximation Approach to Estimation of Policy Gradient for POMDP with Structured Policies

Yu, Huizhen

arXiv.org Machine Learning

We consider the estimation of the policy gradient in partially observable Markov decision processes (POMDP) with a special class of structured policies that are finite-state controllers. We show that the gradient estimation can be done in the Actor-Critic framework, by making the critic compute a "value" function that does not depend on the states of POMDP. This function is the conditional mean of the true value function that depends on the states. We show that the critic can be implemented using temporal difference (TD) methods with linear function approximations, and the analytical results on TD and Actor-Critic can be transfered to this case. Although Actor-Critic algorithms have been used extensively in Markov decision processes (MDP), up to now they have not been proposed for POMDP as an alternative to the earlier proposal GPOMDP algorithm, an actor-only method. Furthermore, we show that the same idea applies to semi-Markov problems with a subset of finite-state controllers.


Automatic Derivation of Finite-State Machines for Behavior Control

Bonet, Blai (Universidad Simon Bolivar) | Palacios, Hector (Universidad Simon Bolivar) | Geffner, Hector (Universidad Pompeu Fabra &)

AAAI Conferences

Finite-state controllers represent an effective action selection mechanisms widely used in domains such as video-games and mobile robotics. In contrast to the policies obtained from MDPs and POMDPs, finite-state controllers have two advantages: they are often extremely compact, and they are general, applying to many problems and not just one. A limitation of finite-state controllers, on the other hand, is that they are written by hand. In this paper, we address this limitation, presenting a method for deriving controllers automatically from models. The models represent a class of contingent problems where actions are deterministic and some fluents are observable. The problem of deriving a controller is converted into a conformant problem that is solved using classical planners, taking advantage of a complete translation into classical planning introduced recently. The controllers derived are ‘general’ in the sense that they do not solve the original problem only, but many variations as well, including changes in the size of the problem or in the uncertainty of the initial situation and action effects. Several experiments illustrating the automatic derivation of controllers are presented.


Automatic Derivation of Memoryless Policies and Finite-State Controllers Using Classical Planners

Bonet, Blai (Universidad Simón Bolívar) | Palacios, Héctor (Universidad Simón Bolívar) | Geffner, Héctor (ICREA and Universitat Pompeu Fabra)

AAAI Conferences

Finite-state and memoryless controllers are simple action selection mechanisms widely used in domains such as video-games and mobile robotics.  Memoryless controllers stand for functions that map observations into actions, while finite-state controllers generalize memoryless ones with a finite amount of memory.  In contrast to the policies obtained from MDPs and POMDPs, finite-state controllers have two advantages: they are often extremely compact, involving a small number of controller states or none at all, and they are general, applying to many problems and not just one. A limitation of finite-state controllers is that they must be written by hand. In this work, we address this limitation, and develop a method for deriving finite-state controllers automatically from models. These models represent a class of contingent problems where actions are deterministic and some fluents are observable.  The problem of deriving a controller from such models is converted into a conformant planning problem that is solved using classical planners, taking advantage of a complete translation introduced recently.  The controllers derived in this way are 'general' in the sense that they do not solve the original problem only, but many variations as well, including changes in the size of the problem or in the uncertainty of the initial situation and action effects.  Experiments illustrating the derivation of such controllers are presented.