Undirected Networks
Feng
We describe an approximate dynamic programming algorithm for partially observable Markov decision processes represented in factored form. Two complementary forms of approximation are used to simplify a piecewise linear and convex value function, where each linear facet of the function is represented compactly by an algebraic decision diagram.
Lee
We introduce a generalized dual decomposition bound for computing the maximum expected utility of influence diagrams based on the dual decomposition method generalized to $L p$ space. The main goal is to devise an approximation scheme free from translations required by existing variational approaches while exploiting the local structure of sum of utility functions as well as the conditional independence of probability functions. In this work, the generalized dual decomposition method is applied to the algebraic framework called valuation algebra for influence diagrams which handles probability and expected utility as a pair. The proposed approach allows a sequential decision problem to be decomposed as a collection of sub-decision problems of bounded complexity and the upper bound of maximum expected utility to be computed by combining the local expected utilities. Thus, it has a flexible control of space and time complexity for computing the bound. In addition, the upper bounds can be further minimized by reparameterizing the utility functions. Since the global objective function for the minimization is nonconvex, we present a gradient-based local search algorithm in which the outer loop controls the randomization of the initial configurations and the inner loop tightens the upper-bound based on block coordinate descent with gradients perturbed by a random noise. The experimental evaluation demonstrates highlights of the proposed approach on finite horizon MDP/POMDP instances.
Han
Interactive partially observable Markov decision processes (I-POMDPs) provide a principled framework for planning and acting in a partially observable, stochastic and multi-agent environment, extending POMDPs to multi-agent settings by including models of other agents in the state space and forming a hierarchical belief structure. In order to predict other agents' actions using I-POMDP, we propose an approach that effectively uses Bayesian inference and sequential Monte Carlo (SMC) sampling to learn others' intentional models which ascribe to them beliefs, preferences and rationality in action selection. Empirical results show that our algorithm accurately learns models of other agents and has superior performance when compared to other methods. Our approach serves as a generalized reinforcement learning algorithm that learns other agents' beliefs, and transition, observation and reward functions.
Zhang
Mobile robots deployed in complex real-world domains typically find it difficult to process all sensor inputs or operate without substantial domain knowledge. At the same time, humans may not have the time and expertise to provide elaborateand accurate knowledge or feedback. The architecture described in this paper combines declarative programming and probabilistic sequential decision-making to address these challenges. Specifically, Answer Set Programming (ASP), a declarative programming paradigm, is combined with hierarchical partially observable Markov decision processes (POMDPs), enabling robots to: (a) represent and reason with incomplete domain knowledge, revising existing knowledge using information extracted from sensor inputs; (b) probabilistically model the uncertainty in sensor input processing and navigation; and (c) use domain knowledge to revise probabilistic beliefs, exploiting positive and negative observations to identify situations in which the assigned task can no longer be pursued. All algorithms are evaluated in simulation and on mobile robots locating target objects in indoor domains.
Grady
The Partially-Observable Markov Decision Process (POMDP) is a general framework to determine reward-maximizing action policies under noisy action and sensing conditions. However, determining an optimal policy for POMDPs is often intractable for robotic tasks due to the PSPACE-complete nature of the computation required. Several recent solvers have been introduced that expand the size of problems that can be considered. Although these POMDP solvers can respect complex motion constraints in theory, we show that the computational cost does not provide a benefit in the eventual online execution, compared to our alternative approach that relies on a policy that ignores some of the motion constraints. We advocate using the POMDP framework where it is critical -- to find a policy that provides the optimal action given all past noisy sensor observations, while abstracting some of the motion constraints to reduce solution time. However, the actions of an abstract robot are generally not executable under its true motion constraints. The problem is addressed offline with a less-constrained POMDP, and navigation under the full system constraints is handled online with replanning. We empirically demonstrate that the policy generated using this abstracted motion model is faster to compute and achieves similar or higher reward than addressing the motion constraints for a car-like robot as used in our experiments directly in the POMDP.
Bai
In order to successfully interact with multiple humans in social situations, an intelligent robot should have the ability to track multi-humans, and understand their motion intentions. We formalize this problem as a hidden Markov model, and estimate the posterior densities by particle filtering over sets approach. Our approach avoids directly performing observation-to-target association by defining a set as a joint state. The human identification problem is then solved in an expectation-maximization way. We evaluate the effectiveness of our approach by both benchamark test and real robot experiments.
Walraven
External factors are hard to model using a Markovian state in several real-world planning domains. Although planning can be difficult in such domains, it may be possible to exploit long-term dependencies between states of the environment during planning. We introduce weighted state scenarios to model long-term sequences of states, and we use a model based on a Partially Observable Markov Decision Process to reason about scenarios during planning. Experiments show that our model outperforms other methods for decision making in two real-world domains.
Wu
Issuing and following instructions is a common task in many forms of both human-human and human-robot collaboration. With two human participants, the accuracy of instruction following increases if the collaborators can monitor the state of their partners and respond to them through conversation (Clark and Krych 2004), a process we call social feedback. Despite this benefit in human-human interaction, current human-robot collaboration systems process instructions in non-incremental batches, which can achieve good accuracy but does not allow for reactive feedback (Tellex et al. 2011; Matuszek et al. 2012; Tellex et al. 2012; Misra et al.2014). In this paper, we show that giving a robot the ability to ask the user questions results in responsive conversations and allows the robot to quickly determine the object that the user desires. This social feedback loop between person and robot allows a person to create an internal model for the robot's mental state and adapt their own behavior to better inform the robot. To close the human-robot feedback loop, we employ a Partially Observable Markov Decision Process (POMDP) to produce a policy which will lead to the determination of the object in the shortest amount of time. To test our approach, we perform user studies to measure our robot's ability to deliver common household items requested by the participant. We compare delivery speed and accuracy both with and without social feedback.
Curran
Although we would like our robots to have completely autonomous behavior, this is often not possible. Some parts of a task might be hard to automate, perhaps due to hard-to-interpret sensor information, or a complex environment. In this case, using shared autonomy or teleoperation is preferable to an error-prone autonomous approach. However, the question of which parts of a task to allocate to the human, and which to the robot can often be tricky. In this work, we introduce A3P, a risk-aware task-level reinforcement learning algorithm. A3P represents a task-level state machine as a POMDP. In this paper, we introduce A3P, a risk-aware algorithm that discovers when to hand off subtasks to a human assistant. A3P models the task as a Partially Observably Markov Decision Process (POMDP) and explicitly represents failures as additional state-action pairs. Based on the model, the algorithm allows the user to allocate subtasks the robot or the human in such a way as to manage the worst-case performance time for the overall task.