Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Sun, Haochen, Zhang, Shuwen, Ren, Lei, Xu, Hao, Fu, Hao, Yuan, Caixia, Wang, Xiaojie

Feb-27-2025–arXiv.org Artificial Intelligence

Large language models (LLMs) based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-powered Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks from two novel perspectives. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments over 10 popular LLMs and show that, while the LLMs present a strong ability in goal interpretation, there is a significant discrepancy in active collaboration and continuous adaption that are critical for efficiently fulfilling complicated tasks. Notably, we highlight the strengths and weaknesses in LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-sourced benchmark. Environments, 30 open-ended tasks, and an integrated evaluation package are now publicly available at https://github.com/YusaeMeow/Collab-Overcooked.

agent, collaboration, collaborative action, (14 more...)

arXiv.org Artificial Intelligence

Feb-27-2025

arXiv.org PDF

Add feedback

Country:
- Asia > China > Beijing > Beijing (0.04)

Genre:
- Workflow (1.00)
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Representation & Reasoning > Agents
    - Agent Societies (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found