Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena
Jiangjie Chen, Siyu Yuan, Rong Ye, Bodhisattwa Prasad Majumder, Kyle Richardson
Can Large Language Models (LLMs) simulate human behavior in complex environments? LLMs have recently been shown to exhibit advanced reasoning skills, but much of NLP evaluation still relies on static benchmarks. Answering this question requires evaluation environments that probe strategic reasoning in competitive, dynamic scenarios that involve long-term planning. We conduct several controlled simulations using state-of-the-art LLMs as bidding agents. We find that, through simple prompting, LLMs do indeed demonstrate many of the skills needed to engage effectively in auctions (e.g., managing a budget, adhering to long-term goals and priorities), and that these skills can be sharpened by explicitly encouraging models to be adaptive and to observe strategies used in past auctions. These results are significant because they show the potential of using LLM agents to model intricate social dynamics, especially in competitive settings. However, we also observe considerable variability in the capabilities of individual LLMs. Notably, even our most advanced model (GPT-4) is occasionally surpassed by heuristic baselines and human agents, highlighting the potential for further improvements in the design of LLM agents and the important role our simulation environment can play in further testing and refining agent architectures.

A long-term goal of the AI community has been the development of autonomous agents that can independently make decisions and interact freely with their environment to carry out different tasks (Steels, 1995; Franklin & Graesser, 1996). Being autonomous requires an agent to have a certain set of skills, such as the ability to perform complex reasoning and to manage risk and resources, among many others. Large Language Models (LLMs) have proven able to solve a wide range of reasoning problems, with the boundaries of what is possible being pushed every day (Wei et al., 2022a; Bubeck et al., 2023). Despite the increasing view of these models as autonomous agents (Wang et al., 2023a; Sumers et al., 2023; Xi et al., 2023), a crucial question remains: can these agents effectively perform sequential decision-making in dynamic environments to achieve their strategic objectives? While the potential is evident (Nakajima, 2023; Significant-Gravitas, 2023), these capabilities have yet to be rigorously evaluated. Traditional reasoning and planning benchmarks in NLP (Geva et al., 2021; Sakaguchi et al., 2021; Yuan et al., 2023) mostly assess agents in static contexts. Yet real-world scenarios demand that autonomous agents not merely respond to input but also be able to form long-term goals and plans, and continuously revise their decisions. To bridge this gap, one recent line of research focuses on immersing agents in simulation environments that mimic real-world scenarios (Wang et al., 2022; Park et al., 2023; Liu et al., 2023), ones that often focus on a targeted ...

Work done during Jiangjie's internship at the Allen Institute for Artificial Intelligence.
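To make the auction setting concrete, below is a minimal Python sketch of an ascending-bid auction loop with budget-constrained bidder agents. It is an illustrative assumption rather than the paper's implementation: the rule-based `decide_bid` heuristic stands in for the LLM prompting step, and the helper names mentioned in its docstring (`format_auction_state`, `query_llm`, `parse_bid`) are hypothetical.

```python
# Minimal sketch (not the paper's code) of an ascending-bid auction
# with budget-constrained bidder agents.
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    true_value: int      # value realized by the winner
    starting_price: int


@dataclass
class Bidder:
    name: str
    budget: int
    profit: int = 0
    inventory: list = field(default_factory=list)

    def decide_bid(self, item: Item, current_price: int, history: list) -> int:
        """Return a raised bid, or 0 to withdraw from this item.

        In the paper's setting this decision comes from prompting an LLM
        with the auction state (budget, current price, bidding history),
        roughly: parse_bid(query_llm(format_auction_state(...))).
        Here a simple budget-aware heuristic stands in for that call.
        """
        cap = min(self.budget, int(item.true_value * 0.9))      # never overpay
        next_bid = current_price + max(1, current_price // 10)  # ~10% raise
        return next_bid if next_bid <= cap else 0


def run_auction(item: Item, bidders: list) -> None:
    """Run an English (ascending-bid) auction over a single item."""
    price, leader, history = item.starting_price, None, []
    active = list(bidders)
    while len(active) > 1:
        still_in = []
        for bidder in active:
            if bidder is leader:             # current leader need not re-bid
                still_in.append(bidder)
                continue
            bid = bidder.decide_bid(item, price, history)
            history.append((bidder.name, bid))
            if bid > price:
                price, leader = bid, bidder
                still_in.append(bidder)
        if not still_in:                     # nobody bid at the starting price
            break
        active = still_in
    if leader is not None:                   # settle: update budget and profit
        leader.budget -= price
        leader.profit += item.true_value - price
        leader.inventory.append(item.name)
        print(f"{item.name} sold to {leader.name} for {price}")
    else:
        print(f"{item.name} went unsold")


if __name__ == "__main__":
    items = [Item("Painting", true_value=120, starting_price=20),
             Item("Vase", true_value=80, starting_price=10)]
    bidders = [Bidder("Agent-A", budget=150), Bidder("Agent-B", budget=150)]
    for item in items:
        run_auction(item, bidders)
    for bidder in bidders:
        print(bidder.name, "profit:", bidder.profit, "budget left:", bidder.budget)
```

In the experiments described above, the heuristic bidder would be replaced by a prompted LLM, and variants of the prompt (e.g., asking the model to reflect on strategies from past auctions) correspond to the adaptivity encouragement the abstract mentions.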
arXiv.org Artificial Intelligence
Oct-9-2023