Markov Decision Processes (MDPs) are widely used for modeling and solving sequential decision-making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving a graph-based MDP usually grows exponentially in the number of variables, which limits their application. We present a new variational framework for describing and solving the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by a product of local-scope value functions and then computed by minimizing a Kullback-Leibler (KL) divergence.
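As a rough illustration of the factored-approximation idea, the sketch below (a toy construction, not the paper's algorithm; the joint value table V and its normalization are invented for the example) fits a product of local-scope factors to a two-variable value function by minimizing KL(p || q). For a fully factorized q, the KL(p || q) minimizer is simply the product of p's marginals.

```python
# Minimal sketch: approximate a joint "value function" over two binary
# variables by a product of local-scope factors via KL minimization.
import numpy as np

rng = np.random.default_rng(0)

# Toy joint value function V(x1, x2) over two binary variables.
V = rng.random((2, 2)) + 0.1
p = V / V.sum()                      # normalize so the KL is well defined

# Local-scope factors: the marginals of p minimize KL(p || q) over all
# product-form distributions q(x1, x2) = f1(x1) * f2(x2).
f1 = p.sum(axis=1)
f2 = p.sum(axis=0)
q = np.outer(f1, f2)

kl = np.sum(p * np.log(p / q))
print(f"KL(p || q) of the factored approximation: {kl:.4f}")
```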
This paper presents a new Monte-Carlo search algorithm for very large sequential decision-making problems. We apply non-linear regression within Monte-Carlo search, online, to estimate a state-action value function from the outcomes of random rollouts. This value function generalizes between related states and actions, and can therefore provide more accurate evaluations after fewer rollouts. A further significant advantage of this approach is its ability to automatically extract and leverage domain knowledge from external sources such as game manuals. We apply our algorithm to the game of Civilization II, a challenging multi-agent strategy game with an enormous state space and around 10^21 joint actions. We approximate the value function by a neural network, augmented by linguistic knowledge that is extracted automatically from the official game manual. We show that this non-linear value function is significantly more efficient than a linear value function, which is itself more efficient than Monte-Carlo tree search. Our non-linear Monte-Carlo search wins over 78% of games against the built-in AI of Civilization II.
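The sketch below illustrates the core mechanism on a toy problem (a five-state chain MDP with a hand-rolled one-hidden-layer network; the domain, features, and learning rate are all assumptions for illustration, nothing like the Civilization II setup): the returns of random rollouts are regressed, online, onto a nonlinear state-action value function, which then drives greedy action selection.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES = 5                      # chain MDP; reward 1 at the right end
ACTIONS = (-1, +1)                # step left / step right

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def rollout_return(s, a, depth=10):
    """Total reward of taking a in s, then acting randomly for `depth` steps."""
    s, g = step(s, a)[0], step(s, a)[1]
    for _ in range(depth):
        s, r = step(s, int(rng.choice(ACTIONS)))
        g += r
    return g

def features(s, a):
    x = np.zeros(N_STATES + len(ACTIONS))
    x[s] = 1.0
    x[N_STATES + ACTIONS.index(a)] = 1.0
    return x

# One-hidden-layer value network: Q(s, a) = w2 . tanh(W1 x + b1) + b2.
H = 16
W1 = rng.normal(0.0, 0.5, (H, N_STATES + len(ACTIONS)))
b1 = np.zeros(H)
w2 = rng.normal(0.0, 0.5, H)
b2 = 0.0

def sgd_update(x, target, lr=0.05):
    """One step of online nonlinear regression toward a rollout return."""
    global W1, b1, w2, b2
    h = np.tanh(W1 @ x + b1)
    dy = (h @ w2 + b2) - target          # squared-loss residual
    dh_pre = dy * w2 * (1.0 - h ** 2)    # backprop through tanh
    w2 = w2 - lr * dy * h
    b2 = b2 - lr * dy
    W1 = W1 - lr * np.outer(dh_pre, x)
    b1 = b1 - lr * dh_pre

# Monte-Carlo search from state 0: every simulated rollout's return
# becomes a regression target for the value network.
for _ in range(2000):
    a = ACTIONS[rng.integers(len(ACTIONS))]
    sgd_update(features(0, a), rollout_return(0, a))

q = {a: float(np.tanh(W1 @ features(0, a) + b1) @ w2 + b2) for a in ACTIONS}
print("Q(0, a) =", q, "-> greedy action:", max(q, key=q.get))
```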
In Reinforcement Learning, an intelligent agent has to make a sequence of decisions to accomplish a goal. If this sequence is long, then the agent has to plan over a long horizon. While learning the optimal policy and its value function is a well-studied problem in Reinforcement Learning, this paper focuses on the structure of the optimal value function and on how hard it is to represent. We show that the generalized Rademacher complexity of the hypothesis space of all optimal value functions depends on the planning horizon and is independent of the size of the state and action spaces. Further, we present bounds on the action-gaps of action value functions and show that they can collapse if a long planning horizon is used. The theoretical results are verified empirically on randomly generated MDPs and on a grid-world fruit-collection task using deep value function approximation. Our theoretical results highlight a connection between value function approximation and the Options framework, and suggest that value functions should be decomposed along bottlenecks of the MDP's transition dynamics.
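To make the gap-collapse phenomenon concrete, here is a small numerical sketch (a randomly generated MDP, with the discount factor standing in for the planning horizon via 1/(1 - gamma); the sizes and tolerances are arbitrary choices, not the paper's experiment): as gamma approaches 1, the action-gaps shrink relative to the 1/(1 - gamma) value scale.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 20, 4
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
R = rng.random((S, A))

def action_gaps(gamma, tol=1e-8):
    """Value-iterate to (near) Q*, then return per-state action-gaps."""
    Q = np.zeros((S, A))
    while True:
        Qn = R + gamma * (P @ Q.max(axis=1))
        if np.max(np.abs(Qn - Q)) < tol:
            break
        Q = Qn
    top2 = np.sort(Qn, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]           # best minus second-best action

for gamma in (0.5, 0.9, 0.99, 0.999):
    # Normalize by the value scale 1/(1 - gamma): the relative gap collapses.
    g = action_gaps(gamma) * (1.0 - gamma)
    print(f"gamma={gamma}: mean gap / value scale = {g.mean():.6f}")
```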
Planning as Satisfiability is an important approach to propositional planning. A serious drawback of the method is its limited scalability, as the instances that arise from large planning problems are often too hard for modern SAT solvers. This work tackles the problem by combining two powerful techniques that decompose a planning problem into smaller subproblems, so that the satisfiability instances that need to be solved do not grow prohibitively large. The first technique, incremental goal achievement, turns planning into a series of Boolean optimization problems, each seeking to maximize the number of goals achieved within a limited planning horizon. This is coupled with a second technique, called heuristic guidance, that directs the search towards a state satisfying all goals.
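The sketch below shows the shape of the incremental goal achievement loop on a hypothetical toy domain (invented for illustration), with brute-force search standing in for the SAT/Boolean-optimization oracle: repeatedly solve a bounded-horizon subproblem that maximizes the number of achieved goals, commit to the resulting partial plan, and repeat from the new state.

```python
from itertools import product

# Hypothetical toy domain (not from the paper): actions as
# (name, preconditions, add effects, delete effects) over sets of facts.
ACTIONS = [
    ("get_key",   frozenset(),         frozenset({"key"}),  frozenset()),
    ("open_door", frozenset({"key"}),  frozenset({"open"}), frozenset()),
    ("enter",     frozenset({"open"}), frozenset({"in"}),   frozenset()),
]
GOALS = frozenset({"open", "in"})

def apply(state, act):
    _, pre, add, dele = act
    return (state | add) - dele if pre <= state else None

def max_goals_within_horizon(state, horizon):
    """Stand-in for the Boolean optimization oracle: brute-force the
    action sequence (up to `horizon` steps) achieving the most goals."""
    best = ([], state, len(GOALS & state))
    for length in range(1, horizon + 1):
        for seq in product(ACTIONS, repeat=length):
            s = state
            for act in seq:
                s = apply(s, act)
                if s is None:
                    break
            else:
                if len(GOALS & s) > best[2]:
                    best = (list(seq), s, len(GOALS & s))
    return best

# Incremental goal achievement: solve bounded-horizon subproblems,
# committing to each partial plan, until all goals hold.
state, plan, HORIZON = frozenset(), [], 2
while not GOALS <= state:
    seq, state, achieved = max_goals_within_horizon(state, HORIZON)
    if not seq:
        break                       # no progress possible within the horizon
    plan += [act[0] for act in seq]
    print(f"achieved {achieved}/{len(GOALS)} goals; plan so far: {plan}")
```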
Temporal planning methods usually focus on the objective of minimizing makespan. Unfortunately, this misses a large class of planning problems where it is important to consider a wider variety of temporal and non-temporal preferences, making makespan a lower-order concern. In this paper we consider modeling and reasoning with plan quality metrics that are not directly correlated with plan makespan, building on the planner POPF. We begin with the preferences defined in PDDL3, and present a mixed integer programming encoding to manage the interaction between the hard temporal constraints for plan steps and the soft temporal constraints for preferences. To widen the range of metrics that can be expressed directly in PDDL, we then discuss an extension to soft deadlines with continuous cost functions, avoiding the need to approximate these with several PDDL3 discrete-cost preferences. We demonstrate the success of our new planner on the benchmark temporal planning problems with preferences, showing that it is the state of the art for such problems. We then analyze the benefits of reasoning with continuous (versus discretized) models of domains with continuous cost functions, showing the improvement in solution quality afforded by making the continuous cost function directly available to the planner.
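To illustrate the flavor of such an encoding, the toy model below (hypothetical step durations, deadline, and weights; PuLP/CBC as the solver, not POPF's actual compilation) mixes a hard ordering constraint between two plan steps with both a continuous soft-deadline cost and a PDDL3-style flat violation penalty, so the same deadline can be priced either way.

```python
import pulp

prob = pulp.LpProblem("soft_deadline", pulp.LpMinimize)

t1 = pulp.LpVariable("t_step1", lowBound=0)     # start time of step 1
t2 = pulp.LpVariable("t_step2", lowBound=0)     # start time of step 2
late = pulp.LpVariable("lateness", lowBound=0)  # max(0, t2 - deadline)
viol = pulp.LpVariable("violated", cat="Binary")

DEADLINE, BIG_M = 8.0, 100.0

# Hard temporal constraints: step 2 starts at least 5 after step 1,
# and step 1 cannot start before time 4 (e.g., it waits on a resource).
prob += t2 >= t1 + 5
prob += t1 >= 4

# Continuous-cost preference: pay 2 per unit of time past the deadline.
prob += late >= t2 - DEADLINE

# Discrete PDDL3-style preference: flat penalty 10 if the deadline is missed.
prob += t2 - DEADLINE <= BIG_M * viol

# Objective: preference costs dominate; makespan is a lower-order tiebreak.
prob += 2 * late + 10 * viol + 0.01 * t2

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("t1 =", t1.value(), "t2 =", t2.value(),
      "lateness =", late.value(), "violated =", viol.value())
```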