Kaisers, Michael
Mastering Board Games by External and Internal Planning with Language Models
Schultz, John, Adamek, Jakub, Jusup, Matej, Lanctot, Marc, Kaisers, Michael, Perrin, Sarah, Hennes, Daniel, Shar, Jeremy, Lewis, Cannada, Ruoss, Anian, Zahavy, Tom, Veličković, Petar, Prince, Laurel, Singh, Satinder, Malmi, Eric, Tomašev, Nenad
While large language models perform well on a range of complex tasks (e.g., text generation, question answering, summarization), robust multi-step planning and reasoning remains a considerable challenge for them. In this paper we show that search-based planning can significantly improve LLMs' playing strength across several board games (Chess, Fischer Random / Chess960, Connect Four, and Hex). We introduce, compare and contrast two major approaches: In external search, the model guides Monte Carlo Tree Search (MCTS) rollouts and evaluations without calls to an external engine, and in internal search, the model directly generates in-context a linearized tree of potential futures and a resulting final choice. Both build on a language model pre-trained on relevant domain knowledge, capturing the transition and value functions across these games. We find that our pre-training method minimizes hallucinations, as our model is highly accurate regarding state prediction and legal moves. Additionally, both internal and external search indeed improve win-rates against state-of-the-art bots, even reaching Grandmaster-level performance in chess while operating on a similar move count search budget per decision as human Grandmasters. The way we combine search with domain knowledge is not specific to board games, suggesting direct extensions into more general language model inference and training techniques.
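To make the external-search idea concrete, here is a minimal Python sketch of model-guided MCTS: a model supplies move priors and a value estimate at each expanded node, replacing random rollouts and any calls to an external engine. The model(state) and env.next_state(state, move) interfaces, the PUCT constant, and every other detail are illustrative assumptions, not the paper's implementation.

    import math

    class Node:
        """One game state in the search tree."""
        def __init__(self, state, prior=1.0):
            self.state = state
            self.prior = prior        # move probability suggested by the model
            self.children = {}        # move -> Node
            self.visits = 0
            self.value_sum = 0.0

        def q(self):
            return self.value_sum / self.visits if self.visits else 0.0

    def puct_select(node, c=1.5):
        """Pick the child maximising a PUCT-style score (model prior + value estimate)."""
        total = sum(child.visits for child in node.children.values())
        def score(child):
            return child.q() + c * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return max(node.children.items(), key=lambda kv: score(kv[1]))

    def external_search(root_state, model, env, num_simulations=50):
        """Model-guided MCTS sketch: model(state) is assumed to return
        (move priors, value in [-1, 1]) and env.next_state(state, move) the
        successor state; both interfaces are hypothetical."""
        root = Node(root_state)
        for _ in range(num_simulations):
            node, path = root, [root]
            while node.children:                      # selection
                _, node = puct_select(node)
                path.append(node)
            priors, value = model(node.state)         # expansion + model evaluation
            for move, p in priors.items():
                node.children[move] = Node(env.next_state(node.state, move), prior=p)
            for n in reversed(path):                  # backup, alternating sign per player
                n.visits += 1
                n.value_sum += value
                value = -value
        return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

Internal search, by contrast, asks the model itself to generate the linearized tree (expansions, evaluations, and the final choice) within a single in-context generation.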
Soft Condorcet Optimization for Ranking of General Agents
Lanctot, Marc, Larson, Kate, Kaisers, Michael, Berthet, Quentin, Gemp, Ian, Diaz, Manfred, Maura-Rivero, Roberto-Rafael, Bachrach, Yoram, Koop, Anna, Precup, Doina
A common way to drive progress of AI models and agents is to compare their performance on standardized benchmarks. Comparing the performance of general agents requires aggregating their individual performances across a potentially wide variety of different tasks. In this paper, we describe a novel ranking scheme inspired by social choice frameworks, called Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data. This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground truth ranking, a solution to Condorcet's original voting system criteria. SCO ratings are maximal for Condorcet winners when they exist, which we show is not necessarily true for the classical rating system Elo. We propose three optimization algorithms to compute SCO ratings and evaluate their empirical performance. When serving as an approximation to the Kemeny-Young voting method, SCO rankings are on average 0 to 0.043 away from the optimal ranking in normalized Kendall-tau distance across 865 preference profiles from the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO achieves accurate approximations to the ground truth ranking and the best among several baselines when 59% or more of the preference data is missing. Finally, SCO ranking provides the best approximation to the optimal ranking, measured on held-out test sets, in a problem containing 52,958 human players across 31,049 games of the classic seven-player game of Diplomacy.
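The following sketch illustrates one way the general idea can be instantiated: learn one scalar rating per agent by gradient descent on a sigmoid-relaxed count of pairwise ordering mistakes over the votes. The loss shape, temperature, and function names are assumptions introduced for illustration, not the paper's exact formulation.

    import numpy as np

    def soft_condorcet_ratings(votes, num_agents, lr=0.1, temperature=1.0, steps=2000, seed=0):
        """Fit one rating per agent by minimising a sigmoid-relaxed count of pairwise
        disagreements with the votes (an illustrative formulation).

        votes: list of (winner, loser) index pairs extracted from the evaluation data.
        Returns an array of ratings; higher means ranked better."""
        rng = np.random.default_rng(seed)
        ratings = rng.normal(scale=0.01, size=num_agents)
        winners = np.array([w for w, _ in votes])
        losers = np.array([l for _, l in votes])
        for _ in range(steps):
            margin = (ratings[winners] - ratings[losers]) / temperature
            p_wrong = 1.0 / (1.0 + np.exp(margin))       # soft "mistake" for each vote
            coef = p_wrong * (1.0 - p_wrong) / temperature
            grad = np.zeros(num_agents)                   # gradient of sum(p_wrong)
            np.add.at(grad, winners, -coef)               # raising a winner lowers the loss
            np.add.at(grad, losers, coef)                 # raising a loser raises the loss
            ratings -= lr * grad
        return ratings

    # Tiny example: agent 0 beats 1, 1 beats 2, and 0 beats 2, so 0 > 1 > 2 is expected.
    print(soft_condorcet_ratings([(0, 1), (1, 2), (0, 2)], num_agents=3))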
Approximating the Core via Iterative Coalition Sampling
Gemp, Ian, Lanctot, Marc, Marris, Luke, Mao, Yiran, Duéñez-Guzmán, Edgar, Perrin, Sarah, Gyorgy, Andras, Elie, Romuald, Piliouras, Georgios, Kaisers, Michael, Hennes, Daniel, Bullard, Kalesha, Larson, Kate, Bachrach, Yoram
The core is a central solution concept in cooperative game theory, defined as the set of feasible allocations or payments such that no subset of agents has incentive to break away and form their own subgroup or coalition. However, it has long been known that the core (and approximations, such as the least-core) are hard to compute. This limits our ability to analyze cooperative games in general, and to fully embrace cooperative game theory contributions in domains such as explainable AI (XAI), where the core can complement the Shapley values to identify influential features or instances supporting predictions by black-box models. We propose novel iterative algorithms for computing variants of the core, which avoid the computational bottleneck of many other approaches, namely solving large linear programs. As such, they scale better to very large problems, as we demonstrate across different classes of cooperative games, including weighted voting games, induced subgraph games, and marginal contribution networks. We also explore our algorithms in the context of XAI, providing further evidence of the power of the core for such applications.
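As a rough illustration of the coalition-sampling idea (and explicitly not the paper's algorithm), the sketch below repeatedly samples coalitions in a weighted voting game and, whenever a sampled coalition could profitably deviate, moves payoff toward its members while keeping the allocation efficient. All update rules, constants, and function names are assumptions.

    import random

    def weighted_voting_value(coalition, weights, quota):
        """Characteristic function of a weighted voting game: 1 if the coalition meets
        the quota, else 0."""
        return 1.0 if sum(weights[i] for i in coalition) >= quota else 0.0

    def sampled_payoff(weights, quota, iterations=20000, lr=0.01, seed=0):
        """Illustrative coalition-sampling loop: sample a coalition, and if it could
        profitably deviate from the current allocation, move payoff toward its members
        and renormalise so the allocation stays efficient."""
        rng = random.Random(seed)
        n = len(weights)
        grand_value = weighted_voting_value(range(n), weights, quota)
        payoff = [grand_value / n] * n                        # start from the equal split
        for _ in range(iterations):
            coalition = [i for i in range(n) if rng.random() < 0.5]
            if not coalition or len(coalition) == n:
                continue
            deficit = weighted_voting_value(coalition, weights, quota) - sum(payoff[i] for i in coalition)
            if deficit > 0:                                    # the coalition is "unhappy"
                for i in coalition:
                    payoff[i] += lr * deficit / len(coalition)
                total = sum(payoff)
                payoff = [p * grand_value / total for p in payoff]
        return payoff

    # Example: a 4-player weighted voting game with weights (4, 2, 2, 1) and quota 7,
    # in which player 0 is a veto player.
    print([round(p, 3) for p in sampled_payoff([4, 2, 2, 1], quota=7)])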
TacticAI: an AI assistant for football tactics
Wang, Zhe, Veličković, Petar, Hennes, Daniel, Tomašev, Nenad, Prince, Laurel, Kaisers, Michael, Bachrach, Yoram, Elie, Romuald, Wenliang, Li Kevin, Piccinini, Federico, Spearman, William, Graham, Ian, Connor, Jerome, Yang, Yi, Recasens, Adrià, Khan, Mina, Beauguerlange, Nathalie, Sprechmann, Pablo, Moreno, Pol, Heess, Nicolas, Bowling, Michael, Hassabis, Demis, Tuyls, Karl
Identifying key patterns of tactics implemented by rival teams, and developing effective responses, lies at the heart of modern football. However, doing so algorithmically remains an open research challenge. To address this unmet need, we propose TacticAI, an AI football tactics assistant developed and evaluated in close collaboration with domain experts from Liverpool FC. We focus on analysing corner kicks, as they offer coaches the most direct opportunities for interventions and improvements. TacticAI incorporates both a predictive and a generative component, allowing the coaches to effectively sample and explore alternative player setups for each corner kick routine and to select those with the highest predicted likelihood of success. We validate TacticAI on a number of relevant benchmark tasks: predicting receivers and shot attempts and recommending player position adjustments. The utility of TacticAI is validated by a qualitative study conducted with football domain experts at Liverpool FC. We show that TacticAI's model suggestions are not only indistinguishable from real tactics, but also favoured over existing tactics 90% of the time, and that TacticAI offers an effective corner kick retrieval system. TacticAI achieves these results despite the limited availability of gold-standard data, achieving data efficiency through geometric deep learning.
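For intuition only, the untrained sketch below shows the kind of geometric representation such a system can operate on: each player is a node with positional features, one round of attention-weighted message passing mixes information across the fully connected player graph, and a head scores attacking players as candidate receivers. The feature set, dimensions, and random weights are purely illustrative and unrelated to TacticAI's actual architecture or data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Schematic corner-kick snapshot: 22 players with (x, y, vx, vy, is_attacker) features.
    players = rng.normal(size=(22, 5))
    players[:, 4] = (np.arange(22) < 11).astype(float)     # first 11 players attack

    d_model = 16
    W_in = rng.normal(scale=0.1, size=(5, d_model))          # node encoder
    W_msg = rng.normal(scale=0.1, size=(d_model, d_model))   # message transform
    w_out = rng.normal(scale=0.1, size=d_model)              # receiver scorer

    h = np.tanh(players @ W_in)                              # encode each player

    # One round of attention-weighted message passing over the fully connected graph.
    scores = h @ h.T / np.sqrt(d_model)
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    h = np.tanh(h + attn @ (h @ W_msg))

    # Score the receiver: softmax over attacking players only (weights are random,
    # so the output is meaningless and shown only to illustrate the data flow).
    logits = h @ w_out
    logits[players[:, 4] == 0] = -np.inf
    probs = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
    print("predicted receiver index:", int(np.argmax(probs)))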
BRExIt: On Opponent Modelling in Expert Iteration
Hernandez, Daniel, Baier, Hendrik, Kaisers, Michael
Finding a best response policy is a central objective in game theory and multi-agent learning, with modern population-based training approaches employing reinforcement learning algorithms as best-response oracles to improve play against candidate opponents (typically previously learnt policies). We propose Best Response Expert Iteration (BRExIt), which accelerates learning in games by incorporating opponent models into the state-of-the-art learning algorithm Expert Iteration (ExIt). BRExIt aims to (1) improve feature shaping in the apprentice, with a policy head predicting opponent policies as an auxiliary task, and (2) bias opponent moves in planning towards the given or learnt opponent model, to generate apprentice targets that better approximate a best response. In an empirical ablation on BRExIt's algorithmic variants against a set of fixed test agents, we provide statistical evidence that BRExIt learns better performing policies than ExIt.
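The two ingredients can be sketched in a few lines: biasing move priors at opponent nodes toward an opponent model, and adding an auxiliary cross-entropy term so the apprentice also predicts the opponent's policy. The blending scheme, loss weighting, and function names below are assumptions for illustration.

    import numpy as np

    def blend_opponent_priors(search_prior, opponent_model_prob, mix=0.5):
        """At opponent-to-move nodes, bias the planner's move priors toward an
        opponent model's predicted policy (the blending scheme is an assumption)."""
        blended = (1 - mix) * search_prior + mix * opponent_model_prob
        return blended / blended.sum()

    def apprentice_loss(policy_logits, mcts_target, opp_logits, opp_target, aux_weight=0.1):
        """Apprentice loss with an auxiliary opponent-policy head: imitate the search
        policy and, as an extra task, predict the opponent's policy."""
        def xent(logits, target):
            logp = logits - np.log(np.exp(logits).sum())
            return -(target * logp).sum()
        return xent(policy_logits, mcts_target) + aux_weight * xent(opp_logits, opp_target)

    # Toy example with three legal opponent moves: the planner's prior is uniform,
    # but the opponent model believes move 2 is very likely.
    print(blend_opponent_priors(np.full(3, 1 / 3), np.array([0.1, 0.1, 0.8]), mix=0.7))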
Online Planning in POMDPs with Self-Improving Simulators
He, Jinke, Suau, Miguel, Baier, Hendrik, Kaisers, Michael, Oliehoek, Frans A.
How can we plan efficiently in a large and complex environment when the time budget is limited? There are three main limitations of the "two-phase" paradigm, where a simulator is learned offline and then used as-is for online simulation and planning. First, no planning is possible until the offline learning phase finishes, which can take a long time. Second, the separation of learning and planning raises the question of what data collection policy should be used during training to ensure good online prediction during planning. We empirically demonstrate that when the training data is collected by a uniform random policy, the learned influence predictors can perform poorly during online planning, due to distribution shift. Third, completely replacing the original simulator with the approximate one after training implies a risk of poor planning performance in certain situations, which is hard to detect in advance. Given the original simulator of the environment, which may be computationally very demanding, we instead propose to learn online an approximate but much faster simulator that improves over time. To plan reliably and efficiently while the approximate simulator is learning, we develop a method that adaptively decides which simulator to use for every simulation, based on a statistic that measures the accuracy of the approximate simulator. This allows us to use the approximate simulator to replace the original simulator for faster simulations when it is accurate enough under the current context, thus trading off simulation speed and accuracy.
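A minimal sketch of the adaptive choice between simulators might look as follows; the agreement-rate statistic, probing scheme, and class interface are illustrative assumptions rather than the method in the paper.

    import random

    class AdaptiveSimulatorSwitch:
        """Per-simulation choice between the original simulator and a learned, faster
        approximation, driven by a running accuracy statistic (a simple empirical
        agreement rate, used purely for illustration)."""

        def __init__(self, slow_sim, fast_sim, threshold=0.9, probe_prob=0.1, rng=None):
            self.slow_sim = slow_sim        # original, computationally demanding simulator
            self.fast_sim = fast_sim        # approximate simulator being learned online
            self.threshold = threshold
            self.probe_prob = probe_prob
            self.rng = rng or random.Random(0)
            self.agreements = 0
            self.checks = 0

        def accuracy(self):
            return self.agreements / self.checks if self.checks else 0.0

        def step(self, state, action):
            # While the approximation is unproven (or on occasional probes), call both
            # simulators, score their agreement, and return the trusted result.
            if self.rng.random() < self.probe_prob or self.accuracy() < self.threshold:
                true_next = self.slow_sim(state, action)
                self.checks += 1
                self.agreements += int(self.fast_sim(state, action) == true_next)
                return true_next
            # Once accurate enough under the current context, use the fast simulator.
            return self.fast_sim(state, action)

    # Toy usage: the fast simulator disagrees with the slow one about 5% of the time.
    noise = random.Random(1)
    slow = lambda s, a: (s + a) % 10
    fast = lambda s, a: (s + a) % 10 if noise.random() > 0.05 else (s + a + 1) % 10
    switch = AdaptiveSimulatorSwitch(slow, fast)
    print([switch.step(s, 1) for s in range(10)], round(switch.accuracy(), 2))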
Robust temporal difference learning for critical domains
Klima, Richard, Bloembergen, Daan, Kaisers, Michael, Tuyls, Karl
We present a new Q-function operator for temporal difference (TD) learning methods that explicitly encodes robustness against significant rare events (SREs) in critical domains. The operator, which we call the $\kappa$-operator, allows learning a safe policy in a model-based fashion without actually observing SREs. We introduce single- and multi-agent robust TD methods using the operator $\kappa$. We prove convergence of the operator to the optimal safe Q-function with respect to the model using the theory of Generalized Markov Decision Processes. In addition, we prove convergence to the optimal Q-function of the original MDP given that the probability of SREs vanishes. Empirical evaluations demonstrate the superior performance of $\kappa$-based TD methods both in the early learning phase and in the final converged stage. In addition, we show the robustness of the proposed method to small model errors, as well as its applicability in a multi-agent context.
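In the spirit of the $\kappa$-operator, but not its exact definition, a robust tabular TD update can mix the usual greedy bootstrap value with a model-based value under an SRE, weighted by kappa, as in the sketch below; all names and constants are assumptions.

    def robust_q_update(q, state, action, reward, next_state, actions, sre_value,
                        kappa=0.1, alpha=0.1, gamma=0.95):
        """One tabular TD update with a robustness term: the bootstrap target mixes
        the usual greedy next-state value with a model-based value under a significant
        rare event, weighted by kappa. The blending used here is an illustrative
        assumption, not the operator as defined in the paper.

        q:          dict mapping (state, action) -> value estimate
        sre_value:  model-based estimate of the value if an SRE strikes in next_state
        """
        greedy_next = max(q.get((next_state, a), 0.0) for a in actions)
        target = reward + gamma * ((1 - kappa) * greedy_next + kappa * sre_value)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (target - old)

    # Toy usage on a two-action problem: an SRE in state 1 would be catastrophic (-10).
    q = {}
    robust_q_update(q, state=0, action="go", reward=1.0, next_state=1,
                    actions=["go", "stay"], sre_value=-10.0)
    print(q)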
Energy- and Cost-Efficient Pumping Station Control
Kanters, Timon V. (University of Amsterdam) | Oliehoek, Frans A. (University of Liverpool and University of Amsterdam) | Kaisers, Michael (Centrum Wiskunde & Informatica) | Bosch, Stan R. van den (Nelen & Schuurmans) | Grispen, Joep (Nelen & Schuurmans) | Hermans, Jeroen (Hoogheemraadschap Hollands Noorderkwartier)
With renewable energy becoming more common, energy prices fluctuate more depending on environmental factors such as the weather. Consuming energy without taking volatile prices into consideration can not only become expensive, but may also increase the peak load, which requires energy providers to generate additional energy using less environment-friendly methods. In the Netherlands, pumping stations that maintain the water levels of polder canals are large energy consumers, but the controller software currently used in the industry does not take real-time energy availability into account. We investigate if existing AI planning techniques have the potential to improve upon the current solutions. In particular, we propose a lightweight but realistic simulator and investigate if an online planning method (UCT) can utilise this simulator to improve the cost-efficiency of pumping station control policies. An empirical comparison with the current control algorithms indicates that substantial cost, and thus peak load, reduction can be attained.
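To illustrate the kind of lightweight simulator an online planner such as UCT can roll forward, the sketch below models a single canal level with rain inflow, an on/off pump, hourly energy prices, and a penalty for leaving the allowed level band. All dynamics, constants, and names are invented for illustration.

    def pumping_step(water_level, hour, pump_on, energy_price, inflow=0.02,
                     pump_rate=0.05, target=0.0, tolerance=0.1, level_penalty=100.0):
        """One step of an illustrative polder simulator: inflow raises the canal level,
        pumping lowers it, and the cost combines the electricity bill with a penalty
        for leaving the allowed level band."""
        next_level = water_level + inflow - (pump_rate if pump_on else 0.0)
        energy_cost = energy_price[hour] * (1.0 if pump_on else 0.0)
        violation = max(0.0, abs(next_level - target) - tolerance)
        cost = energy_cost + level_penalty * violation
        return next_level, (hour + 1) % len(energy_price), cost

    # A planner can roll this simulator forward to prefer pumping in cheap-energy hours
    # while keeping the level inside its band.
    prices = [0.30, 0.25, 0.10, 0.05, 0.08, 0.20]        # illustrative hourly prices
    level, hour, total = 0.0, 0, 0.0
    for pump in [False, False, True, True, False, False]:
        level, hour, cost = pumping_step(level, hour, pump, prices)
        total += cost
    print(round(level, 3), round(total, 3))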
FAQ-Learning in Matrix Games: Demonstrating Convergence Near Nash Equilibria, and Bifurcation of Attractors in the Battle of Sexes
Kaisers, Michael (Maastricht University) | Tuyls, Karl (Maastricht University)
This article studies Frequency Adjusted Q-learning (FAQ-learning), a variation of Q-learning that simulates simultaneous value function updates. The main contributions are empirical and theoretical support for the convergence of FAQ-learning to attractors near Nash equilibria in two-agent two-action matrix games. These games can be divided into three types: Matching Pennies, Prisoners' Dilemma and Battle of Sexes. This article shows that Matching Pennies and the Prisoners' Dilemma yield one attractor of the learning dynamics, while the Battle of Sexes exhibits a supercritical pitchfork bifurcation at a critical temperature, where one attractor splits into two attractors and one repellent fixed point. Experiments illustrate that the distance between fixed points of the FAQ-learning dynamics and Nash equilibria tends to zero as the exploration parameter of FAQ-learning approaches zero.
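For reference, a stateless frequency-adjusted update in a 2x2 game scales the learning rate of the chosen action by min(beta / x, 1), where x is that action's current selection probability under a Boltzmann policy. The constants and the toy self-play loop below are illustrative assumptions, not the article's experimental setup.

    import math
    import random

    def boltzmann(q, temperature):
        """Boltzmann (softmax) action probabilities for a two-action agent."""
        exps = [math.exp(v / temperature) for v in q]
        total = sum(exps)
        return [e / total for e in exps]

    def faq_step(q, action, reward, policy, alpha=0.05, beta=0.01):
        """Frequency Adjusted Q-learning update (stateless sketch): the learning rate
        is scaled by min(beta / x, 1), so rarely chosen actions are updated as if all
        actions were updated simultaneously."""
        scale = min(beta / policy[action], 1.0)
        q[action] += scale * alpha * (reward - q[action])

    # Toy self-play in Matching Pennies (payoffs for the row player; values illustrative).
    payoff = [[1, -1], [-1, 1]]
    q_row, q_col = [0.0, 0.0], [0.0, 0.0]
    rng = random.Random(0)
    temperature = 0.1
    for _ in range(5000):
        p_row, p_col = boltzmann(q_row, temperature), boltzmann(q_col, temperature)
        a = 0 if rng.random() < p_row[0] else 1
        b = 0 if rng.random() < p_col[0] else 1
        faq_step(q_row, a, payoff[a][b], p_row)
        faq_step(q_col, b, -payoff[a][b], p_col)
    print([round(p, 2) for p in boltzmann(q_row, temperature)])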
A Cognitive Hierarchy Model Applied to the Lemonade Game
Wunder, Michael (Rutgers University) | Littman, Michael (Rutgers University) | Kaisers, Michael (University of Maastricht) | Yaros, John Robert (Rutgers University)
One of the challenges of multiagent decision making is that the behavior needed to maximize utility can depend on what other agents choose to do: sometimes there is no "right" answer in the absence of knowledge of how opponents will act. The Nash equilibrium is a sensible choice of behavior because it represents a mutual best response. But, even when there is a unique equilibrium, other players are under no obligation to take part in it. This observation has been forcefully illustrated in the behavioral economics community, where repeated experiments have shown individuals playing Nash equilibria and performing badly as a result. In this paper, we show how to apply a tool from behavioral economics called the Cognitive Hierarchy (CH) to the design of agents in general-sum games. We attack the recently introduced "Lemonade Game" and show how the results of an open competition are well explained by CH. We believe this game, and perhaps many other similar games, boils down to predicting how deeply other agents in the game will be reasoning. An agent that does not reason enough risks being exploited by its opponents, while an agent that reasons too much may not be able to interact productively with its opponents. We demonstrate these ideas by presenting empirical results using agents from the competition and idealizations arising from a CH analysis.
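The CH model itself is easy to state: level-0 players act uniformly at random, and each level-k player best-responds to a Poisson-weighted mixture of the levels below it. The sketch below computes these strategies for a small symmetric matrix game; the payoffs and parameter values are illustrative and unrelated to the Lemonade Game.

    import math
    import numpy as np

    def cognitive_hierarchy(payoff, max_level=4, tau=1.5):
        """Cognitive Hierarchy strategies for a symmetric matrix game (a sketch of the
        Camerer-Ho-Chong model; parameter values are illustrative). Level 0 plays
        uniformly at random; level k best-responds to a Poisson-weighted mixture of
        the levels below it."""
        n_actions = payoff.shape[0]
        poisson = np.array([math.exp(-tau) * tau**k / math.factorial(k)
                            for k in range(max_level + 1)])
        strategies = [np.full(n_actions, 1.0 / n_actions)]    # level 0: uniform
        for k in range(1, max_level + 1):
            weights = poisson[:k] / poisson[:k].sum()           # beliefs over lower levels
            belief = sum(w * s for w, s in zip(weights, strategies))
            expected = payoff @ belief                          # value of each action
            best = np.zeros(n_actions)
            best[np.argmax(expected)] = 1.0                     # pure best response
            strategies.append(best)
        return strategies

    # Toy 3-action symmetric game (payoffs are illustrative, not the Lemonade Game).
    payoff = np.array([[0.0, 2.0, 1.0],
                       [1.0, 0.0, 2.0],
                       [2.0, 1.0, 0.0]])
    for level, strategy in enumerate(cognitive_hierarchy(payoff)):
        print(level, strategy)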