Goto

Collaborating Authors

 battleship



Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

Grand, Gabriel, Pepe, Valerio, Andreas, Jacob, Tenenbaum, Joshua B.

arXiv.org Artificial Intelligence

Many high-stakes applications of AI require forming data-driven hypotheses and making targeted guesses; e.g., in scientific and diagnostic settings. Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. First, we introduce a strategic decision-oriented dialogue task called Collaborative Battleship, in which a partially-informed Captain must balance exploration (asking questions) and action (taking shots), while a fully-informed Spotter must provide accurate answers under an information bottleneck. Compared to human players (N=42), we find that LM agents struggle to ground answers in context, generate informative questions, and select high-value actions. Next, to address these gaps, we develop novel Monte Carlo inference strategies for LMs based on principles from Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303-0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% -> 82% win rate) and frontier models (0% -> 67% win rate vs. GPT-5) at ~1% of GPT-5's cost. We replicate these findings on Guess Who? where our methods significantly boost accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for building rational information-seeking agents.



Asynchronous Predictive Counterfactual Regret Minimization$^+$ Algorithm in Solving Extensive-Form Games

Meng, Linjian, Zhang, Youzhi, Ge, Zhenxing, Yang, Tianpei, Gao, Yang

arXiv.org Artificial Intelligence

Counterfactual Regret Minimization (CFR) algorithms are widely used to compute a Nash equilibrium (NE) in two-player zero-sum imperfect-information extensive-form games (IIGs). Among them, Predictive CFR$^+$ (PCFR$^+$) is particularly powerful, achieving an exceptionally fast empirical convergence rate via the prediction in many games. However, the empirical convergence rate of PCFR$^+$ would significantly degrade if the prediction is inaccurate, leading to unstable performance on certain IIGs. To enhance the robustness of PCFR$^+$, we propose a novel variant, Asynchronous PCFR$^+$ (APCFR$^+$), which employs an adaptive asynchronization of step-sizes between the updates of implicit and explicit accumulated counterfactual regrets to mitigate the impact of the prediction inaccuracy on convergence. We present a theoretical analysis demonstrating why APCFR$^+$ can enhance the robustness. Finally, we propose a simplified version of APCFR$^+$ called Simple APCFR$^+$ (SAPCFR$^+$), which uses a fixed asynchronization of step-sizes to simplify the implementation that only needs a single-line modification of the original PCFR+. Interestingly, SAPCFR$^+$ achieves a constant-factor lower theoretical regret bound than PCFR$^+$ in the worst case. Experimental results demonstrate that (i) both APCFR$^+$ and SAPCFR$^+$ outperform PCFR$^+$ in most of the tested games, as well as (ii) SAPCFR$^+$ achieves a comparable empirical convergence rate with APCFR$^+$.


Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

de Carvalho, Gonçalo Hora, Pollice, Robert, Knap, Oscar

arXiv.org Artificial Intelligence

We explore the hypothesis that LLMs, such as GPT-3.5 and GPT-4, possess broader cognitive functions, particularly in non-linguistic domains. Our approach extends beyond standard linguistic benchmarks by incorporating games like Tic-Tac-Toe, Connect Four, and Battleship, encoded via ASCII, to assess strategic thinking and decision-making. To evaluate the models' ability to generalize beyond their training data, we introduce two additional games. The first game, LEGO Connect Language (LCL), tests the models' capacity to understand spatial logic and follow assembly instructions. The second game, the game of shapes, challenges the models to identify shapes represented by 1s within a matrix of zeros, further testing their spatial reasoning skills. This "show, don't tell" strategy uses games instead of simply querying the models. Our results show that despite their proficiency on standard benchmarks, GPT-3.5 and GPT-4's abilities to play and reason about fully observable games without pre-training is mediocre. Both models fail to anticipate losing moves in Tic-Tac-Toe and Connect Four, and they are unable to play Battleship correctly. While GPT-4 shows some success in the game of shapes, both models fail at the assembly tasks presented in the LCL game. These results suggest that while GPT models can emulate conversational proficiency and basic rule comprehension, their performance in strategic gameplay and spatial reasoning tasks is very limited. Importantly, this reveals a blind spot in current LLM benchmarks that we highlight with our gameplay benchmark suite ChildPlay (https://github.com/child-play-neurips/child-play). Our findings provide a cautionary tale about claims of emergent intelligence and reasoning capabilities of LLMs that are roughly the size of GPT-3.5 and GPT-4.


Loose LIPS Sink Ships: Asking Questions in Battleship with Language-Informed Program Sampling

Grand, Gabriel, Pepe, Valerio, Andreas, Jacob, Tenenbaum, Joshua B.

arXiv.org Artificial Intelligence

Questions combine our mastery of language with our remarkable facility for reasoning about uncertainty. How do people navigate vast hypothesis spaces to pose informative questions given limited cognitive resources? We study these tradeoffs in a classic grounded question-asking task based on the board game Battleship. Our language-informed program sampling (LIPS) model uses large language models (LLMs) to generate natural language questions, translate them into symbolic programs, and evaluate their expected information gain. We find that with a surprisingly modest resource budget, this simple Monte Carlo optimization strategy yields informative questions that mirror human performance across varied Battleship board scenarios. In contrast, LLM-only baselines struggle to ground questions in the board state; notably, GPT-4V provides no improvement over non-visual baselines. Our results illustrate how Bayesian models of question-asking can leverage the statistics of language to capture human priors, while highlighting some shortcomings of pure LLMs as grounded reasoners.


Block-Coordinate Methods and Restarting for Solving Extensive-Form Games

Chakrabarti, Darshan, Diakonikolas, Jelena, Kroer, Christian

arXiv.org Artificial Intelligence

Coordinate descent methods are popular in machine learning and optimization for their simple sparse updates and excellent practical performance. In the context of large-scale sequential game solving, these same properties would be attractive, but until now no such methods were known, because the strategy spaces do not satisfy the typical separable block structure exploited by such methods. We present the first cyclic coordinate-descent-like method for the polytope of sequence-form strategies, which form the strategy spaces for the players in an extensive-form game (EFG). Our method exploits the recursive structure of the proximal update induced by what are known as dilated regularizers, in order to allow for a pseudo block-wise update. We show that our method enjoys a $O(1/T)$ convergence rate to a two-player zero-sum Nash equilibrium, while avoiding the worst-case polynomial scaling with the number of blocks common to cyclic methods. We empirically show that our algorithm usually performs better than other state-of-the-art first-order methods (i.e., mirror prox), and occasionally can even beat CFR$^+$, a state-of-the-art algorithm for numerical equilibrium computation in zero-sum EFGs. We then introduce a restarting heuristic for EFG solving. We show empirically that restarting can lead to speedups, sometimes huge, both for our cyclic method, as well as for existing methods such as mirror prox and predictive CFR$^+$.


General Game Playing with Imperfect Information

Schofield, Michael, Thielscher, Michael

Journal of Artificial Intelligence Research

General Game Playing is a field which allows the researcher to investigate techniques that might eventually be used in an agent capable of Artificial General Intelligence.  Game playing presents a controlled environment in which to evaluate AI techniques, and so we have seen an increase in interest in this field of research.  Games of imperfect information offer the researcher an additional challenge in terms of complexity over games with perfect information.  In this article, we look at imperfect-information games: their expression, their complexity, and the additional demands of their players.  We consider the problems of working with imperfect information and introduce a technique called HyperPlay, for efficiently sampling very large information sets, and present a formalism together with pseudo code so that others may implement it. We examine the design choices for the technique, show its soundness and completeness then provide some experimental results and demonstrate the use of the technique in a variety of imperfect-information games, revealing its strengths, weaknesses, and its efficiency against randomly generating samples.  Improving the technique, we present HyperPlay-II, capable of correctly valuing information-gathering moves.  Again, we provide some experimental results and demonstrate the use of the new technique revealing its strengths, weaknesses and its limitations.


Causal Belief Decomposition for Planning with Sensing: Completeness Results and Practical Approximation

Bonet, Blai, Geffner, Hector

arXiv.org Artificial Intelligence

Belief tracking is a basic problem in planning with sensing. While the problem is intractable, it has been recently shown that for both deterministic and non-deterministic systems expressed in compact form, it can be done in time and space that are exponential in the problem width. The width measures the maximum number of state variables that are all relevant to a given precondition or goal. In this work, we extend this result both theoretically and practically. First, we introduce an alternative decomposition scheme and algorithm with the same time complexity but different completeness guarantees, whose space complexity is much smaller: exponential in the causal width of the problem that measures the number of state variables that are causally relevant to a given precondition, goal, or observable. Second, we introduce a fast, meaningful, and powerful approximation that trades completeness by speed, and is both time and space exponential in the problem causal width . It is then shown empirically that the algorithm combined with simple heuristics yields state-of-the-art real-time performance in domains with high widths but low causal widths such as Minesweeper, Battleship, and Wumpus.


Adaptive Thompson Sampling Stacks for Memory Bounded Open-Loop Planning

Phan, Thomy, Gabor, Thomas, Müller, Robert, Roch, Christoph, Linnhoff-Popien, Claudia

arXiv.org Artificial Intelligence

We propose Stable Yet Memory Bounded Open-Loop (SYMBOL) planning, a general memory bounded approach to partially observable open-loop planning. SYMBOL maintains an adaptive stack of Thompson Sampling bandits, whose size is bounded by the planning horizon and can be automatically adapted according to the underlying domain without any prior domain knowledge beyond a generative model. We empirically test SYMBOL in four large POMDP benchmark problems to demonstrate its effectiveness and robustness w.r.t. the choice of hyperparameters and evaluate its adaptive memory consumption. We also compare its performance with other open-loop planning algorithms and POMCP.