How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael R. Lyu
arXiv.org Artificial Intelligence
Figure 1: γ-Bench enables various LLMs and humans to participate in multi-agent, multi-round games. The framework includes eight classical games from Game Theory, each categorized into one of three groups.

Decision-making, a complex task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates LLMs' decision-making capabilities through the lens of a well-established field, Game Theory. We focus specifically on games that support the simultaneous participation of more than two agents. We then introduce our framework, γ-Bench, which includes eight classical multi-agent games, and we design a scoring scheme to quantitatively assess a model's performance in these games. Through γ-Bench, we investigate LLMs' robustness, generalizability, and enhancement strategies. Results reveal that while GPT-3.5 shows satisfactory robustness, its generalizability is relatively limited; however, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we conduct evaluations across various LLMs and find that GPT-4 outperforms the other models on γ-Bench, achieving a score of 60.5.

Wenxiang Jiao is the corresponding author.

We have recently witnessed the advancements in Artificial Intelligence (AI) made by Large Language Models (LLMs), which mark a significant breakthrough in the field. Beyond the academic sphere, LLMs have entered diverse aspects of everyday life, such as education (Baidoo-Anu & Ansah, 2023), legal services (Guha et al., 2023), product design (Lanzi & Loiacono, 2023), and healthcare (Johnson et al., 2023). Given their extensive capabilities, evaluating LLMs demands more than simple, isolated tasks; a comprehensive and multifaceted approach is in high demand to assess the efficacy of these advanced models.
Apr-25-2024
- Genre:
- Research Report > New Finding (0.68)
- Industry:
- Leisure & Entertainment > Games (1.00)
- Technology: