"Planning is the process of generating (possibly partial) representations of future behavior prior to the use of such plans to constrain or control that behavior. The outcome is usually a set of actions, with temporal and other constraints on them, for execution by some agent or agents. As a core aspect of human intelligence, planning has been studied since the earliest days of AI and cognitive science. Planning research has led to many useful tools for real-world applications, and has yielded significant insights into the organization of behavior and the nature of reasoning about actions."
– Planning entry by Austin Tate in the MIT Encyclopedia of Cognitive Science.
We develop a new algorithm for online planning in large scale sequential decision problems that improves upon the worst case efficiency of UCT. The idea is to augment Monte-Carlo Tree Search (MCTS) with maximum entropy policy optimization, evaluating each search node by softmax values back-propagated from simulation. To establish the effectiveness of this approach, we first investigate the single-step decision problem, stochastic softmax bandits, and show that softmax values can be estimated at an optimal convergence rate in terms of mean squared error. We then extend this approach to general sequential decision making by developing a general MCTS algorithm, Maximum Entropy for Tree Search (MENTS). We prove that the probability of MENTS failing to identify the best decision at the root decays exponentially, which fundamentally improves the polynomial convergence rate of UCT.
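The softmax value mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard maximum-entropy definition, in which a node's value is the temperature-scaled log-sum-exp of its children's Q estimates. The function names and the temperature parameter `tau` are illustrative assumptions, not identifiers from the paper, and the sketch omits the paper's exploration policy and tree bookkeeping.

```python
import math


def softmax_value(q_values, tau):
    """Temperature-scaled log-sum-exp of the Q estimates (hypothetical helper).

    Subtracting the maximum first keeps the exponentials numerically stable.
    """
    m = max(q_values)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))


def softmax_policy(q_values, tau):
    """Boltzmann distribution over actions induced by the softmax value."""
    v = softmax_value(q_values, tau)
    return [math.exp((q - v) / tau) for q in q_values]
```

As `tau` shrinks, `softmax_value` approaches the plain maximum of the Q estimates, while larger values of `tau` spread the policy more evenly across actions.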
This week saw the return, for a third season, of the critically acclaimed HBO series Westworld. WW's central premise in its first 2 seasons was a theme park, sometime in the near future, populated by highly realistic robots or 'hosts'. Human guests can pay exorbitant sums to interact with these robots, in a huge range of ways. In the 'western' themed area – after which the show is named – guests can choose to be white-hatted heroes or black-hatted villains. The good guys get to be brave, chivalrous, honourable and generally decent.
A planning domain, like any model, is never complete and inevitably makes assumptions about the environment's dynamics. By allowing the specification of just one domain model, the knowledge engineer is only able to make one set of assumptions, and to specify a single objective-goal. Borrowing from work in Software Engineering, we propose a multi-tier framework for planning that allows the specification of different sets of assumptions, and of different corresponding objectives. The framework aims to support the synthesis of adaptive behavior so as to mitigate the intrinsic risk in any planning modeling task. After defining the multi-tier planning task and its solution concept, we show how to solve problem instances by a succinct compilation to a form of non-deterministic planning. In doing so, our technique justifies the applicability of planning with both fair and unfair actions, and the need for more efforts in developing planning systems supporting dual fairness assumptions.
The problem of compiling general quantum algorithms for implementation on near-term quantum processors has been introduced to the AI community. Previous work demonstrated that temporal planning is an attractive approach for part of this compilation task, specifically, the routing of circuits that implement the Quantum Alternating Operator Ansatz (QAOA) applied to the MaxCut problem on a quantum processor architecture. In this paper, we extend the earlier work to route circuits that implement QAOA for Graph Coloring problems. QAOA for coloring requires execution of more, and more complex, operations on the chip, which makes routing a more challenging problem. We evaluate the approach on state-of-the-art hardware architectures from leading quantum computing companies. Additionally, we apply a planning approach to qubit initialization. Our empirical evaluation shows that temporal planning compares well to reasonable analytic upper bounds, and that solving qubit initialization with a classical planner generally helps temporal planners in finding shorter-makespan compilations for QAOA for Graph Coloring. These advances suggest that temporal planning can be an effective approach for more complex quantum computing algorithms and architectures.
FP Alpha, an AI-powered technology solution for financial advisors, was launched today by Andrew Altfest, President of Altfest Personal Wealth Management. FP Alpha is the first comprehensive wealth management platform to utilize artificial intelligence (AI). The software enables advisors to transform financial planning into comprehensive wealth management by streamlining financial planning processes and offering more services to clients. Designed to easily integrate with financial planning software on the market today, the firm's technology allows advisors to scale efficiently and decrease the burden of time-consuming spreadsheets, checklists and labor-intensive tasks – enabling them to save time in the process and add more value to client relationships. By reducing laborious, manual tasks within wealth management services and financial planning processes, FP Alpha helps advisors deploy high impact and personalized recommendations to clients in a scalable, intelligent and cost-efficient manner.
Monte-Carlo Tree Search (MCTS) has proven to be a powerful, generic planning technique for decision-making in single-agent and adversarial environments. The stochastic nature of the Monte-Carlo simulations introduces errors in the value estimates, both in terms of bias and variance. Whilst reducing bias (typically through the addition of domain knowledge) has been studied in the MCTS literature, comparatively little effort has focused on reducing variance. This is somewhat surprising, since variance reduction techniques are a well-studied area in classical statistics. In this paper, we examine the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates.
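One of the techniques named in this abstract, antithetic variates, can be sketched outside of any tree search. The example below is a generic illustration rather than the paper's MCTS integration: it estimates the mean of a function of a uniform random number, pairing each draw `u` with its mirror `1 - u`. All function names are hypothetical.

```python
import math
import random
import statistics


def plain_estimate(f, n, rng):
    """Ordinary Monte-Carlo mean of f(U), with U uniform on [0, 1)."""
    return sum(f(rng.random()) for _ in range(n)) / n


def antithetic_estimate(f, n, rng):
    """Mean of f over n // 2 antithetic pairs (u, 1 - u).

    For monotone f the two halves of each pair are negatively correlated,
    which lowers the variance of the averaged estimate.
    """
    pairs = n // 2
    total = 0.0
    for _ in range(pairs):
        u = rng.random()
        total += f(u) + f(1.0 - u)
    return total / (2 * pairs)


def compare_variance(f=math.exp, n=100, repeats=200):
    """Empirical variance of both estimators across repeated seeded runs."""
    plain = [plain_estimate(f, n, random.Random(s)) for s in range(repeats)]
    anti = [antithetic_estimate(f, n, random.Random(s)) for s in range(repeats)]
    return statistics.pvariance(plain), statistics.pvariance(anti)
```

Calling `compare_variance()` returns the two empirical variances for the same number of function evaluations. The benefit depends on `f` being monotone in `u`; for non-monotone functions the pairing may not help.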
This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent's belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions.
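The Monte-Carlo belief update described here can be sketched as a rejection-sampling particle filter that only needs a black-box simulator. This is a simplified illustration under stated assumptions, not the POMCP implementation: the simulator signature `simulator(state, action, rng) -> (next_state, observation, reward)`, the toy sensor model, and all names are hypothetical.

```python
import random


def update_belief(particles, action, observation, simulator, rng,
                  n_particles, max_tries=100_000):
    """Approximate the next belief state with particles.

    Sample a state from the current particles, push it through the simulator,
    and keep the successor only if the simulated observation matches the real
    one. May return fewer than n_particles if max_tries runs out.
    """
    new_particles = []
    tries = 0
    while len(new_particles) < n_particles and tries < max_tries:
        tries += 1
        state = rng.choice(particles)
        next_state, simulated_obs, _reward = simulator(state, action, rng)
        if simulated_obs == observation:
            new_particles.append(next_state)
    return new_particles


def noisy_sensor_simulator(state, action, rng):
    """Toy black box: a static hidden bit, observed correctly 85% of the time."""
    observation = state if rng.random() < 0.85 else 1 - state
    return state, observation, 0.0
```

Starting from an even mix of states 0 and 1 and observing `1` once, the updated particle set should contain mostly state 1, in line with the Bayesian posterior for this toy sensor. Rejection sampling can become slow when the observed outcome is unlikely under the current particles, which is why the sketch caps the number of attempts.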
Observations and actions in PDDLGym are relational, making the framework particularly well-suited for research in relational reinforcement learning and relational sequential decision-making. PDDLGym is also useful as a generic framework for rapidly building numerous, diverse benchmarks from a concise and familiar specification language. We discuss design decisions and implementation details, and also illustrate empirical variations between the 15 built-in environments in terms of planning and model-learning difficulty. We hope that PDDLGym will facilitate bridge-building between the reinforcement learning community (from which Gym emerged) and the AI planning community (which produced PDDL). We look forward to gathering feedback from all those interested and expanding the set of available environments and features accordingly.
Multi-Agent Plan Recognition (MAPR) aims to recognize dynamic team structures and team behaviors from the observed team traces (activity sequences) of a set of intelligent agents. Previous MAPR approaches required a library of team activity sequences (team plans) to be given as input. However, collecting a library of team plans to ensure adequate coverage is often difficult and costly. In this paper, we relax this constraint, so that team plans are not required to be provided beforehand. We assume instead that a set of action models is available.
Monte-Carlo Tree Search (MCTS) has been successfully applied to very large POMDPs, a standard model for stochastic sequential decision-making problems. However, many real-world problems inherently have multiple goals, where multi-objective formulations are more natural. The constrained POMDP (CPOMDP) is such a model that maximizes the reward while constraining the cost, extending the standard POMDP model. To date, solution methods for CPOMDPs assume an explicit model of the environment, and thus are hardly applicable to large-scale real-world problems. In this paper, we present CC-POMCP (Cost-Constrained POMCP), an online MCTS algorithm for large CPOMDPs that leverages the optimization of LP-induced parameters and only requires a black-box simulator of the environment.
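The constrained objective this abstract describes, maximizing reward while keeping cost within a bound, is commonly written in the following form. The notation is an assumption for illustration and is not taken from the abstract: $\pi$ is the policy, $b_0$ the initial belief, $\gamma$ the discount factor, $R$ the reward function, $C_k$ the $k$-th cost function, and $\hat{c}_k$ its budget.

```latex
\max_{\pi} \; V_R^{\pi}(b_0)
  = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, b_0 \right]
\quad \text{s.t.} \quad
V_{C_k}^{\pi}(b_0)
  = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} C_k(s_t, a_t) \,\middle|\, b_0 \right]
  \le \hat{c}_k, \qquad k = 1, \dots, K
```

With a single cost function ($K = 1$), this reduces to maximizing expected discounted reward under one expected-cost budget.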