Goto

Collaborating Authors

 Learning Graphical Models


Approximate Planning in Large POMDPs via Reusable Trajectories

Neural Information Processing Systems

We consider the problem of reliably choosing a near-best strategy from a restricted class of strategies TI in a partially observable Markov decision process(POMDP). We assume we are given the ability to simulate the POMDP, and study what might be called the sample complexity - that is, the amount of data one must generate in the POMDP in order to choose a good strategy. We prove upper bounds on the sample complexity showingthat, even for infinitely large and arbitrarily complex POMDPs, the amount of data needed can be finite, and depends only linearly on the complexity of the restricted strategy class TI, and exponentially onthe horizon time. This latter dependence can be eased in a variety of ways, including the application of gradient and local search algorithms.


Hierarchical Image Probability (H1P) Models

Neural Information Processing Systems

We formulate a model for probability distributions on image spaces. We show that any distribution of images can be factored exactly into conditional distributionsof feature vectors at one resolution (pyramid level) conditioned on the image information at lower resolutions. We would like to factor this over positions in the pyramid levels to make it tractable, but such factoring may miss long-range dependencies. To fix this, we introduce hiddenclass labels at each pixel in the pyramid. The result is a hierarchical mixture of conditional probabilities, similar to a hidden Markov model on a tree. The model parameters can be found with maximum likelihoodestimation using the EM algorithm. We have obtained encouraging preliminary results on the problems of detecting various objects inSAR images and target recognition in optical aerial images. 1 Introduction


Monte Carlo POMDPs

Neural Information Processing Systems

We present a Monte Carlo algorithm for learning to act in partially observable Markov decision processes (POMDPs) with real-valued state and action spaces. Our approach uses importance sampling for representing beliefs, and Monte Carlo approximation for belief propagation. A reinforcement learning algorithm, value iteration, is employed to learn value functions over belief states. Finally, a samplebased versionof nearest neighbor is used to generalize across states. Initial empirical results suggest that our approach works well in practical applications.


Learning Factored Representations for Partially Observable Markov Decision Processes

Neural Information Processing Systems

The problem of reinforcement learning in a non-Markov environment is explored using a dynamic Bayesian network, where conditional independence assumptionsbetween random variables are compactly represented by network parameters. The parameters are learned online, and approximations areused to perform inference and to compute the optimal value function. The relative effects of inference and value function approximations onthe quality of the final policy are investigated, by learning to solve a moderately difficult driving task. The two value function approximations, linearand quadratic, were found to perform similarly, but the quadratic model was more sensitive to initialization. Both performed below thelevel of human performance on the task.


Coastal Navigation with Mobile Robots

Neural Information Processing Systems

The problem that we address in this paper is how a mobile robot can plan in order to arrive at its goal with minimum uncertainty. Traditional motion planning algorithms oftenassume that a mobile robot can track its position reliably, however, in real world situations, reliable localization may not always be feasible. Partially Observable Markov Decision Processes (POMDPs) provide one way to maximize the certainty of reaching the goal state, but at the cost of computational intractability for large state spaces. The method we propose explicitly models the uncertainty of the robot's position as a state variable, and generates trajectories through the augmented pose-uncertainty space. By minimizing the positional uncertainty at the goal, the robot reduces the likelihood it becomes lost. We demonstrate experimentally that coastal navigation reduces the uncertainty at the goal, especially with degraded localization.


Reinforcement Learning Using Approximate Belief States

Neural Information Processing Systems

The problem of developing good policies for partially observable Markov decision problems (POMDPs) remains one of the most challenging areas ofresearch in stochastic planning. One line of research in this area involves the use of reinforcement learning with belief states, probability distributionsover the underlying model states. This is a promising methodfor small problems, but its application is limited by the intractability ofcomputing or representing a full belief state for large problems. Recent work shows that, in many settings, we can maintain an approximate belief state, which is fairly close to the true belief state. In particular, great success has been shown with approximate belief states that marginalize out correlations between state variables. In this paper, we investigate two methods of full belief state reinforcement learning and one novel method for reinforcement learning using factored approximate belief states. We compare the performance of these algorithms on several well-known problem from the literature. Our results demonstrate the importance ofapproximate belief state representations for large problems.


Policy Search via Density Estimation

Neural Information Processing Systems

We propose a new approach to the problem of searching a space of stochastic controllers for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP). Following several other authors, our approach is based on searching in parameterized families of policies (for example, via gradient descent) to optimize solution quality. However,rather than trying to estimate the values and derivatives of a policy directly, we do so indirectly using estimates for the probability densitiesthat the policy induces on states at the different points in time. This enables our algorithms to exploit the many techniques for efficient and robust approximate density propagation in stochastic systems. Weshow how our techniques can be applied both to deterministic propagation schemes (where the MDP's dynamics are given explicitly in compact form,) and to stochastic propagation schemes (where we have access only to a generative model, or simulator, of the MDP).


Actor-Critic Algorithms

Neural Information Processing Systems

We propose and analyze a class of actor-critic algorithms for simulation-based optimization of a Markov decision process over a parameterized family of randomized stationary policies. These are two-time-scale algorithms in which the critic uses TD learning with a linear approximation architecture and the actor is updated in an approximate gradient direction based on information provided bythe critic. We show that the features for the critic should span a subspace prescribed by the choice of parameterization of the actor. We conclude by discussing convergence properties and some open problems.


An Environment Model for Nonstationary Reinforcement Learning

Neural Information Processing Systems

Reinforcement learning in nonstationary environments is generally regarded as an important and yet difficult problem. This paper partially addresses the problem by formalizing a subclass of nonstationary environments.The environment model, called hidden-mode Markov decision process (HM-MDP), assumes that environmental changes are always confined to a small number of hidden modes.