sarsop
7f2be1b45d278ac18804b79207a24c53-AuthorFeedback.pdf
We thank the reviewers for their insightful feedback. We address reviewer comments below and begin by situating the paper's intended contribution: Why is this our goal? POMDP planners incur the complexity of full, closed-loop planning only when necessary. VoI is "contrary to the core concept of POMDPs", VoI macro-actions expand the set of problems that can be efficiently What is not our goal? The primary critique of reviewers is the limited scope of our experimental results.
Tighter Value-Function Approximations for POMDPs
Krale, Merlijn, Koops, Wietze, Junges, Sebastian, Simão, Thiago D., Jansen, Nils
Solving partially observable Markov decision processes (POMDPs) typically requires reasoning about the values of exponentially many state beliefs. Towards practical performance, state-of-the-art solvers use value bounds to guide this reasoning. However, sound upper value bounds are often computationally expensive to compute, and there is a tradeoff between the tightness of such bounds and their computational cost. This paper introduces new and provably tighter upper value bounds than the commonly used fast informed bound. Our empirical evaluation shows that, despite their additional computational overhead, the new upper bounds accelerate state-of-the-art POMDP solvers on a wide range of benchmarks.
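As background for the baseline named in this abstract, here is a minimal sketch of the standard fast informed bound (FIB) update next to the looser QMDP bound. It does not reproduce the paper's new bounds. The two-state model, the array layouts `T[s][a][s2]`, `O[a][s2][o]`, `R[s][a]`, and all function names are hypothetical and chosen only for illustration.

```python
# Sketch of two classic upper bounds on the POMDP value function.
# Assumed layouts (illustrative, not from the paper):
#   T[s][a][s2] = P(s2 | s, a), O[a][s2][o] = P(o | a, s2), R[s][a] = reward.

def qmdp(T, R, gamma, iters):
    """Value iteration on the fully observable MDP underlying the POMDP."""
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q

def fib(T, O, R, gamma, iters):
    """Fast informed bound: the max over next actions is taken per observation."""
    n_s, n_a, n_o = len(R), len(R[0]), len(O[0][0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        new_Q = [[0.0] * n_a for _ in range(n_s)]
        for s in range(n_s):
            for a in range(n_a):
                future = 0.0
                for o in range(n_o):
                    # Best next action given that observation o was received.
                    future += max(
                        sum(T[s][a][s2] * O[a][s2][o] * Q[s2][a2]
                            for s2 in range(n_s))
                        for a2 in range(n_a))
                new_Q[s][a] = R[s][a] + gamma * future
        Q = new_Q
    return Q

def upper_bound(belief, Q):
    """Bound at a belief: max over actions of the belief-weighted Q-values."""
    return max(sum(b * Q[s][a] for s, b in enumerate(belief))
               for a in range(len(Q[0])))

# Hypothetical toy model: action 0 observes noisily, action 1 acts and resets.
T = [[[1.0, 0.0], [0.5, 0.5]],
     [[0.0, 1.0], [0.5, 0.5]]]
O = [[[0.85, 0.15], [0.15, 0.85]],
     [[0.5, 0.5], [0.5, 0.5]]]
R = [[-1.0, 10.0],
     [-1.0, -20.0]]

Q_mdp = qmdp(T, R, gamma=0.95, iters=200)
Q_fib = fib(T, O, R, gamma=0.95, iters=200)
belief = [0.5, 0.5]
print(upper_bound(belief, Q_fib), upper_bound(belief, Q_mdp))
```

Both routines start from the same initial values, so the FIB values never exceed the QMDP values: FIB moves the maximization inside the sum over observations, which can only lower the result.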
Point-Based Methods for Model Checking in Partially Observable Markov Decision Processes
Bouton, Maxime, Tumova, Jana, Kochenderfer, Mykel J.
Autonomous systems are often required to operate in partially observable environments. They must reliably execute a specified objective even with incomplete information about the state of the environment. We propose a methodology to synthesize policies that satisfy a linear temporal logic formula in a partially observable Markov decision process (POMDP). By formulating a planning problem, we show how to use point-based value iteration methods to efficiently approximate the maximum probability of satisfying a desired logical formula and compute the associated belief state policy. We demonstrate that our method scales to large POMDP domains and provides strong bounds on the performance of the resulting policy.
PLEASE: Palm Leaf Search for POMDPs with Large Observation Spaces
Zhang, Zongzhang (Soochow University) | Hsu, David (National University of Singapore) | Lee, Wee Sun (National University of Singapore) | Lim, Zhan Wei (National University of Singapore) | Bai, Aijun (University of Science and Technology of China)
This paper presents a novel POMDP planning method, called Palm LEAf SEarch (PLEASE), which allows the selection of more than one outcome during forward exploration when their potential impacts are close to the highest one. Compared with existing trial-based algorithms, PLEASE needs fewer backup operations and therefore saves considerable time when propagating the bound improvements of beliefs deep in the search tree to the root belief. Experiments showed that PLEASE scales up SARSOP, one of the fastest algorithms, by orders of magnitude on some POMDP tasks with large observation spaces.
A Fast Pairwise Heuristic for Planning under Uncertainty
Khalvati, Koosha (University of British Columbia) | Mackworth, Alan (University of British Columbia)
The partially observable Markov decision process (POMDP) is a mathematical framework that models planning under uncertainty. Solving a POMDP is an intractable problem, and even state-of-the-art POMDP solvers are too computationally expensive for large domains. This is a major bottleneck. In this paper, we propose a new heuristic, called the pairwise heuristic, that can be used in a one-step greedy strategy to find a near-optimal solution for POMDP problems very quickly. This approach is a good candidate for large problems where a real-time solution is a necessity but exact optimality of the solution is not vital. The pairwise heuristic uses the optimal solutions for pairs of states. For each pair of states in the POMDP, we find the optimal sequence of actions to resolve the uncertainty and to maximize the reward, given that the agent is uncertain about which state of the pair it is in. Then we use these sequences as a heuristic and find the optimal action in each step of the greedy strategy using this heuristic. We have tested our method on the available large classical test benchmarks in various domains. The resulting total reward is close to, if not greater than, the total reward obtained by other state-of-the-art POMDP solvers, while the time required to find the solution is always much less.
Structured Parameter Elicitation
Ko, Li Ling (National University of Singapore) | Hsu, David (National University of Singapore) | Lee, Wee Sun (National University of Singapore) | Ong, Sylvie C. W. (National University of Singapore)
The behavior of a complex system often depends on parameters whose values are unknown in advance. To operate effectively, an autonomous agent must actively gather information on the parameter values while progressing towards its goal. We call this problem parameter elicitation. Partially observable Markov decision processes (POMDPs) provide a principled framework for such uncertainty planning tasks, but they suffer from high computational complexity. However, POMDPs for parameter elicitation often possess special structural properties, specifically, factorization and symmetry. This work identifies these properties and exploits them for efficient solution through a factored belief representation. The experimental results show that our new POMDP solvers outperform SARSOP and MOMDP, two of the fastest general-purpose POMDP solvers available, and can handle significantly larger problems.
PUMA: Planning Under Uncertainty with Macro-Actions
He, Ruijie (Massachusetts Institute of Technology) | Brunskill, Emma (University of California, Berkeley) | Roy, Nicholas (Massachusetts Institute of Technology)
Planning in large, partially observable domains is challenging, especially when a long-horizon lookahead is necessary to obtain a good policy. Traditional POMDP planners that plan a different potential action for each future observation can be prohibitively expensive when planning many steps ahead. An efficient solution for planning far into the future in fully observable domains is to use temporally extended sequences of actions, or "macro-actions." In this paper, we present a POMDP algorithm for planning under uncertainty with macro-actions (PUMA) that automatically constructs and evaluates open-loop macro-actions within forward-search planning, where the planner branches on observations only at the end of each macro-action. Additionally, we show how to incrementally refine the plan over time, resulting in an anytime algorithm that provably converges to an epsilon-optimal policy. In experiments on several large POMDP problems that require a long-horizon lookahead, PUMA outperforms existing state-of-the-art solvers.
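To illustrate what evaluating an open-loop macro-action involves, the sketch below pushes a belief through a fixed action sequence without conditioning on any observation and accumulates the expected discounted reward. It is a generic illustration, not PUMA's macro-action construction or its forward search. The three-state chain model and the names `predict_belief` and `evaluate_macro_action` are hypothetical.

```python
# Open-loop evaluation of a macro-action (a fixed sequence of primitive actions).
# Assumed layouts (illustrative only): T[s][a][s2] = P(s2 | s, a), R[s][a] = reward.

def predict_belief(belief, T, action):
    """Propagate the belief through the transition model, ignoring observations."""
    n_s = len(belief)
    return [sum(belief[s] * T[s][action][s2] for s in range(n_s))
            for s2 in range(n_s)]

def evaluate_macro_action(belief, T, R, macro, gamma):
    """Return (expected discounted reward, predicted belief after the macro-action)."""
    total, discount = 0.0, 1.0
    for action in macro:
        total += discount * sum(b * R[s][action] for s, b in enumerate(belief))
        belief = predict_belief(belief, T, action)
        discount *= gamma
    return total, belief

# Hypothetical chain: action 0 advances with probability 0.8, action 1 stays put.
# State 2 is an absorbing goal with zero cost; every other step costs 1.
T = [[[0.2, 0.8, 0.0], [1.0, 0.0, 0.0]],
     [[0.0, 0.2, 0.8], [0.0, 1.0, 0.0]],
     [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]]
R = [[-1.0, -1.0],
     [-1.0, -1.0],
     [0.0, 0.0]]

value, end_belief = evaluate_macro_action([1.0, 0.0, 0.0], T, R, macro=[0, 0], gamma=0.9)
print(value, end_belief)
```

A planner that branches on observations only at the end of a macro-action would apply the observation update to the predicted belief, rather than after every primitive step.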