Planning & Scheduling
TensorPlan and the Few Actions Lower Bound for Planning in MDPs under Linear Realizability of Optimal Value Functions
Weisz, Gellért, Szepesvári, Csaba, György, András
We consider the minimax query complexity of online planning with a generative model in fixed-horizon Markov decision processes (MDPs) with linear function approximation. Following recent works, we consider broad classes of problems where either (i) the optimal value function $v^\star$ or (ii) the optimal action-value function $q^\star$ lie in the linear span of some features; or (iii) both $v^\star$ and $q^\star$ lie in the linear span when restricted to the states reachable from the starting state. Recently, Weisz et al. (2021b) showed that under (ii) the minimax query complexity of any planning algorithm is at least exponential in the horizon $H$ or in the feature dimension $d$ when the size $A$ of the action set can be chosen to be exponential in $\min(d,H)$. On the other hand, for the setting (i), Weisz et al. (2021a) introduced TensorPlan, a planner whose query cost is polynomial in all relevant quantities when the number of actions is fixed. Among other things, these two works left open the question whether polynomial query complexity is possible when $A$ is subexponential in $min(d,H)$. In this paper we answer this question in the negative: we show that an exponentially large lower bound holds when $A=\Omega(\min(d^{1/4},H^{1/2}))$, under either (i), (ii) or (iii). In particular, this implies a perhaps surprising exponential separation of query complexity compared to the work of Du et al. (2021) who prove a polynomial upper bound when (iii) holds for all states. Furthermore, we show that the upper bound of TensorPlan can be extended to hold under (iii) and, for MDPs with deterministic transitions and stochastic rewards, also under (ii).
Deep Synoptic Monte Carlo Planning in Reconnaissance Blind Chess
This paper introduces deep synoptic Monte Carlo planning (DSMCP) for large imperfect information games. The algorithm constructs a belief state with an unweighted particle filter and plans via playouts that start at samples drawn from the belief state. The algorithm accounts for uncertainty by performing inference on "synopses," a novel stochastic abstraction of information states. DSMCP is the basis of the program Penumbra, which won the official 2020 reconnaissance blind chess competition versus 33 other programs. This paper also evaluates algorithm variants that incorporate caution, paranoia, and a novel bandit algorithm. Furthermore, it audits the synopsis features used in Penumbra with per-bit saliency statistics.
Multi-Agent Path Planning Using Deep Reinforcement Learning
In this paper a deep reinforcement based multi-agent path planning approach is introduced. The experiments are realized in a simulation environment and in this environment different multi-agent path planning problems are produced. The produced problems are actually similar to a vehicle routing problem and they are solved using multi-agent deep reinforcement learning. In the simulation environment, the model is trained on different consecutive problems in this way and, as the time passes, it is observed that the model's performance to solve a problem increases. Always the same simulation environment is used and only the location of target points for the agents to visit is changed. This contributes the model to learn its environment and the right attitude against a problem as the episodes pass. At the end, a model who has already learned a lot to solve a path planning or routing problem in this environment is obtained and this model can already find a nice and instant solution to a given unseen problem even without any training. In routing problems, standard mathematical modeling or heuristics seem to suffer from high computational time to find the solution and it is also difficult and critical to find an instant solution. In this paper a new solution method against these points is proposed and its efficiency is proven experimentally.
A Novel Automated Curriculum Strategy to Solve Hard Sokoban Planning Instances
Feng, Dieqiao, Gomes, Carla P., Selman, Bart
In recent years, we have witnessed tremendous progress in deep reinforcement learning (RL) for tasks such as Go, Chess, video games, and robot control. Nevertheless, other combinatorial domains, such as AI planning, still pose considerable challenges for RL approaches. The key difficulty in those domains is that a positive reward signal becomes {\em exponentially rare} as the minimal solution length increases. So, an RL approach loses its training signal. There has been promising recent progress by using a curriculum-driven learning approach that is designed to solve a single hard instance. We present a novel {\em automated} curriculum approach that dynamically selects from a pool of unlabeled training instances of varying task complexity guided by our {\em difficulty quantum momentum} strategy. We show how the smoothness of the task hardness impacts the final learning results. In particular, as the size of the instance pool increases, the ``hardness gap'' decreases, which facilitates a smoother automated curriculum based learning process. Our automated curriculum approach dramatically improves upon the previous approaches. We show our results on Sokoban, which is a traditional PSPACE-complete planning problem and presents a great challenge even for specialized solvers. Our RL agent can solve hard instances that are far out of reach for any previous state-of-the-art Sokoban solver. In particular, our approach can uncover plans that require hundreds of steps, while the best previous search methods would take many years of computing time to solve such instances. In addition, we show that we can further boost the RL performance with an intricate coupling of our automated curriculum approach with a curiosity-driven search strategy and a graph neural net representation.
Width-Based Planning and Active Learning for Atari
Ayton, Benjamin, Asai, Masataro
Width-based planning has shown promising results on Atari 2600 games using pixel input, while using substantially fewer environment interactions than reinforcement learning. Recent width-based approaches have computed feature vectors for each screen using a hand designed feature set or a variational autoencoder (VAE) trained on game screens, and prune screens that do not have novel features during the search. In this paper, we explore consideration of uncertainty in features generated by a VAE during width-based planning. Our primary contribution is the introduction of active learning to maximize the utility of screens observed during planning. Experimental results demonstrate that use of active learning strategies increases gameplay scores compared to alternative width-based approaches with equal numbers of environment interactions.
Reinforcement Learning for Classical Planning: Viewing Heuristics as Dense Reward Generators
Gehring, Clement, Asai, Masataro, Chitnis, Rohan, Silver, Tom, Kaelbling, Leslie Pack, Sohrabi, Shirin, Katz, Michael
Recent advances in reinforcement learning (RL) have led to a growing interest in applying RL to classical planning domains or applying classical planning methods to some complex RL domains. However, the long-horizon goal-based problems found in classical planning lead to sparse rewards for RL, making direct application inefficient. In this paper, we propose to leverage domain-independent heuristic functions commonly used in the classical planning literature to improve the sample efficiency of RL. These classical heuristics act as dense reward generators to alleviate the sparse-rewards issue and enable our RL agent to learn domain-specific value functions as residuals on these heuristics, making learning easier. Correct application of this technique requires consolidating the discounted metric used in RL and the non-discounted metric used in heuristics. We implement the value functions using Neural Logic Machines, a neural network architecture designed for grounded first-order logic inputs. We demonstrate on several classical planning domains that using classical heuristics for RL allows for good sample efficiency compared to sparse-reward RL. We further show that our learned value functions generalize to novel problem instances in the same domain.
Finnish artificial intelligence company Plain Complex raises €150k from angel investors
Plain Complex is a Finnish startup company founded in 2021, although the initial product development started already three years ago. As an anesthesiologist Sasu Liuhanen repeatedly witnessed the challenges of planning nurses’ shifts in his daily work; the process was slow, required a lot of manual work and the quality of the rosters left all too often a lot to desire.To develop a solution to the problem, Sasu Liuhanen, an experienced anesthesiologist and software developer, Tuomo Peltola, a seasoned professional in health-tech sales and marketing, and Stefano Campadello, a professional in business development and information technology founded a company and are now commercializing its artificial intelligence-based roster planning software. 150k from Finnish angel investors The first angel investment round of the company was completed with four renowned angel investors. Ali Omar (FiBAN business angel of the year 2019), Reima Linnanvirta (chair of the board, FiBAN), Henry Nilert (founder of IoBox, FiBAN angel investor), and Pekka Ylitalo (Dimerent) made a 150 000 € seed investment into the company. - Roster planning affects a great number of nurses and their families. Hence, the quality of the rosters has an immense effect on employees’ work-life balance and their well-being. Artificial intelligence is a true game-changer and enables finding optimal shifts for each employee, says Ali Omar. - Well-being employees are the focus of Plain Complex, but at the same time, an organization can achieve significant cost savings brought by the uniform quality and fairness of the rosters. Using artificial intelligence is a true win-win, says FiBAN’s chair of the board, Reima Linnanvirta. Artificial intelligence improves well-being at work and brings costs savings to healthcare An ongoing pilot in a large Finnish hospital has already proven that artificial intelligence can plan rosters where employees can combine shift work and personal life in a way that has not been possible before. Transforming the previously long and tedious planning process from days or even weeks into a few minutes opens up new and unseen opportunities. - A good roster needs to comply with all applicable laws, collective agreements, organizational requirements, criteria for ergonomic design, and the employees’ personal wishes and preferences. Such a puzzle is often extremely difficult to solve and frequently it is the employees’ wishes that need to give in. Artificial intelligence can change all this and solve the puzzle in a way that everyone wins. A plan that takes all the aforementioned aspects into account is ready in minutes, says Sasu Liuhanen, CEO and co-founder of the company. More information: Sasu LiuhanenCEO, Co-Founder, Plain Complex040-516 6467sasu.liuhanen@plaincomplex.complaincomplex.com / linkedin.com/company/plaincomplex Antti ViitanenDeal Flow Manager, FiBAN+358 45 2565 220antti@fiban.orgfiban.org
NICE: Robust Scheduling through Reinforcement Learning-Guided Integer Programming
Kenworthy, Luke, Nayak, Siddharth, Chin, Christopher, Balakrishnan, Hamsa
Integer programs provide a powerful abstraction for representing a wide range of real-world scheduling problems. Despite their ability to model general scheduling problems, solving large-scale integer programs (IP) remains a computational challenge in practice. The incorporation of more complex objectives such as robustness to disruptions further exacerbates the computational challenge. We present NICE (Neural network IP Coefficient Extraction), a novel technique that combines reinforcement learning and integer programming to tackle the problem of robust scheduling. More specifically, NICE uses reinforcement learning to approximately represent complex objectives in an integer programming formulation. We use NICE to determine assignments of pilots to a flight crew schedule so as to reduce the impact of disruptions. We compare NICE with (1) a baseline integer programming formulation that produces a feasible crew schedule, and (2) a robust integer programming formulation that explicitly tries to minimize the impact of disruptions. Our experiments show that, across a variety of scenarios, NICE produces schedules resulting in 33\% to 48\% fewer disruptions than the baseline formulation. Moreover, in more severely constrained scheduling scenarios in which the robust integer program fails to produce a schedule within 90 minutes, NICE is able to build robust schedules in less than 2 seconds on average.
MCTS Based Agents for Multistage Single-Player Card Game
Godlewski, Konrad, Sawicki, Bartosz
The article presents the use of Monte Carlo Tree Search algorithms for the card game Lord of the Rings. The main challenge was the complexity of the game mechanics, in which each round consists of 5 decision stages and 2 random stages. To test various decision-making algorithms, a game simulator has been implemented. The research covered an agent based on expert rules, using flat Monte-Carlo search, as well as complete MCTS-UCB. Moreover different playout strategies has been compared. As a result of experiments, an optimal (assuming a limited time) combination of algorithms were formulated. The developed MCTS based method have demonstrated a advantage over agent with expert knowledge.
A dynamic programming algorithm for informative measurements and near-optimal path-planning
Loxley, Peter N., Cheung, Ka Wai
An informative measurement is the most efficient way to gain information about an unknown state. We give a first-principles derivation of a general-purpose dynamic programming algorithm that returns a sequence of informative measurements by sequentially maximizing the entropy of possible measurement outcomes. This algorithm can be used by an autonomous agent or robot to decide where best to measure next, planning a path corresponding to an optimal sequence of informative measurements. This algorithm is applicable to states and controls that are continuous or discrete, and agent dynamics that is either stochastic or deterministic; including Markov decision processes. Recent results from approximate dynamic programming and reinforcement learning, including on-line approximations such as rollout and Monte Carlo tree search, allow an agent or robot to solve the measurement task in real-time. The resulting near-optimal solutions include non-myopic paths and measurement sequences that can generally outperform, sometimes substantially, commonly-used greedy heuristics such as maximizing the entropy of each measurement outcome. This is demonstrated for a global search problem, where on-line planning with an extended local search is found to reduce the number of measurements in the search by half.