How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

Huang, Jen-tse, Li, Eric John, Lam, Man Ho, Liang, Tian, Wang, Wenxuan, Yuan, Youliang, Jiao, Wenxiang, Wang, Xing, Tu, Zhaopeng, Lyu, Michael R.

arXiv.org Artificial Intelligence 

Figure 1: γ-Bench enables various LLMs and humans to participate in multi-agent, multi-round games. The framework includes eight classical games in Game Theory, each categorized into one of three groups.

Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates LLMs' decision-making capabilities through the lens of a well-established field, Game Theory. We focus specifically on games that support the participation of more than two agents simultaneously. We then introduce our framework, γ-Bench, which includes eight classical multi-agent games, and design a scoring scheme to assess a model's performance in these games quantitatively. Through γ-Bench, we investigate LLMs' robustness, generalizability, and enhancement strategies. Results reveal that while GPT-3.5 shows satisfactory robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we conduct evaluations across various LLMs and find that GPT-4 outperforms other models on γ-Bench, achieving a score of 60.5.

Wenxiang Jiao is the corresponding author.

Recent advances in Large Language Models (LLMs) have marked a significant breakthrough in Artificial Intelligence (AI). Beyond the academic sphere, LLMs have entered diverse aspects of our everyday life, such as education (Baidoo-Anu & Ansah, 2023), legal service (Guha et al., 2023), product design (Lanzi & Loiacono, 2023), and healthcare (Johnson et al., 2023). Given their extensive capabilities, evaluating LLMs demands more than simple, isolated tasks. A comprehensive and multifaceted approach is needed to assess the efficacy of these advanced models.
