TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs
Wang, Haochuan, Feng, Xiachong, Li, Lei, Qin, Zhanyue, Sui, Dianbo, Kong, Lingpeng
–arXiv.org Artificial Intelligence
The rapid advancement of large language models (LLMs) has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate the strategic reasoning capabilities of LLMs, game theory, with its concise structure, has become the preferred approach for many researchers. However, current research typically focuses on a limited selection of games, resulting in low coverage of game types. Additionally, classic game scenarios carry risks of data leakage, and the benchmarks used often lack extensibility, rendering them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2×2 games, which are constructed as classic games in our benchmark. Furthermore, we employ synthetic data generation techniques to create diverse, higher-quality game scenarios through topic guidance and human inspection for each classic game, which we refer to as story-based games. Lastly, to provide a sustainable evaluation framework adaptable to increasingly powerful LLMs, we treat the aforementioned games as atomic units and organize them into more complex forms through sequential, parallel, and nested structures. We conducted a comprehensive evaluation of mainstream LLMs, covering tests of rational reasoning, reasoning robustness, Theory-of-Mind capabilities, and reasoning in complex game forms. The results reveal that LLMs still have flaws in the accuracy and consistency of their strategic reasoning processes, and their levels of mastery of Theory-of-Mind also vary.

These achievements are largely attributed to LLMs' ability to assimilate vast amounts of knowledge during training, emerging with the capacity to organize information at a coarse level and link knowledge at a fine-grained level through their internal representations (Min et al., 2023; Zhao et al., 2023).
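The figure of 144 game types can be verified with a short enumeration. The sketch below (an illustration, not code from the paper) counts strict ordinal 2×2 games: each player ranks the four outcome cells with a permutation of 1..4, and two games are the same type when they differ only by relabeling a player's two strategies.

```python
from itertools import permutations

# Cells are ordered (0,0), (0,1), (1,0), (1,1); each player's payoffs are a
# strict ordinal ranking of the four cells, i.e. a permutation of 1..4.
ranks = list(permutations((1, 2, 3, 4)))          # 24 strict orderings
games = [(r, c) for r in ranks for c in ranks]    # 24 * 24 = 576 raw games

def swap_rows(p):
    # relabel the row player's strategies: cells (0,*) <-> (1,*)
    return (p[2], p[3], p[0], p[1])

def swap_cols(p):
    # relabel the column player's strategies: cells (*,0) <-> (*,1)
    return (p[1], p[0], p[3], p[2])

def canonical(game):
    # apply each relabeling to BOTH players' rankings, keep one representative
    r, c = game
    variants = [
        (r, c),
        (swap_rows(r), swap_rows(c)),
        (swap_cols(r), swap_cols(c)),
        (swap_rows(swap_cols(r)), swap_rows(swap_cols(c))),
    ]
    return min(variants)

classes = {canonical(g) for g in games}
print(len(classes))  # 144 distinct game types
```

Because strict rankings have no ties, no game is fixed by a nontrivial relabeling, so the 576 raw games fall into exactly 576 / 4 = 144 equivalence classes, matching the Robinson-Goforth count.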
These core capabilities have driven the success of LLMs in numerous reasoning tasks, including mathematical reasoning (Hendrycks et al., 2021; Zhang et al., 2023), commonsense reasoning (Sap et al., 2019; Bisk et al., 2020), logical reasoning (Lei et al., 2023), and strategic reasoning (Lorè & Heydari). Among these, strategic reasoning has attracted considerable attention due to its multi-agent nature and close association with social intelligence (Gandhi et al., 2023). Strategic reasoning refers to the cognitive process of anticipating, planning, and responding to others' actions to achieve specific objectives within competitive or cooperative contexts (Zhang et al., 2024a).

Work done during an internship at the University of Hong Kong. The dataset and evaluation code will be available at https://github.com/PinkEx/TMGBench.
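The notion of rational play that such benchmarks test can be made concrete with a pure-strategy Nash equilibrium check. The sketch below is a minimal illustration (not the benchmark's evaluation code), using the classic Prisoner's Dilemma payoffs as an example 2×2 game:

```python
# Payoffs are (row player, column player); C = cooperate, D = defect.
# These are standard Prisoner's Dilemma values, used only for illustration.
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
actions = ("C", "D")

def is_nash(a_row, a_col):
    # a profile is a pure Nash equilibrium if neither player can gain
    # by unilaterally deviating to their other action
    u_row, u_col = payoffs[(a_row, a_col)]
    row_ok = all(payoffs[(d, a_col)][0] <= u_row for d in actions)
    col_ok = all(payoffs[(a_row, d)][1] <= u_col for d in actions)
    return row_ok and col_ok

equilibria = [(r, c) for r in actions for c in actions if is_nash(r, c)]
print(equilibria)  # [('D', 'D')] -- mutual defection is the unique equilibrium
```

A model reasoning rationally about this game should anticipate the opponent's best response and select the equilibrium action, which is the kind of behavior a strategic-reasoning benchmark probes.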
Oct-14-2024