REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems
Geng, Longling; Chang, Edward Y.
This benchmark suite provides a comprehensive evaluation framework for assessing both individual LLMs and multi-agent systems in real-world planning scenarios. The suite encompasses eleven designed problems that progress from basic to highly complex, incorporating key aspects such as multi-agent coordination, inter-agent dependencies, and dynamic environmental disruptions. Each problem can be scaled along three dimensions: the number of parallel planning threads, the complexity of inter-dependencies, and the frequency of unexpected disruptions requiring real-time adaptation. The benchmark includes detailed specifications, evaluation metrics, and baseline implementations using contemporary frameworks like LangGraph, enabling rigorous testing of both single-agent and multi-agent planning capabilities. Through standardized evaluation criteria and scalable complexity, this benchmark aims to drive progress in developing more robust and adaptable AI planning systems for real-world applications.
arXiv.org Artificial Intelligence
Feb-26-2025
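The abstract names three scaling dimensions (number of parallel planning threads, complexity of inter-dependencies, and frequency of disruptions) but the listing contains no code. The sketch below is a minimal illustration of how such a problem specification and one dependency-based metric could be represented; it is not taken from the paper or its reference implementation, and the `PlanningProblem`, `Disruption`, and `dependency_violations` names, as well as the ride-sharing example values, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Disruption:
    """An unexpected event injected mid-execution that forces re-planning."""
    step: int                    # simulation step at which the disruption fires
    affected_threads: List[int]  # parallel planning threads that must adapt


@dataclass
class PlanningProblem:
    """One benchmark problem, scalable along the three dimensions above."""
    name: str
    num_threads: int                    # dimension 1: parallel planning threads
    dependencies: Dict[int, List[int]]  # dimension 2: thread -> threads it waits on
    disruptions: List[Disruption] = field(default_factory=list)  # dimension 3


def dependency_violations(problem: PlanningProblem,
                          finish_step: Dict[int, int]) -> int:
    """Count inter-agent dependency constraints a candidate plan breaks.

    finish_step[t] is the step at which thread t completes under the plan;
    a dependency t -> d is violated if t finishes before d does.
    """
    violations = 0
    for thread, deps in problem.dependencies.items():
        for dep in deps:
            if finish_step.get(thread, 0) < finish_step.get(dep, 0):
                violations += 1
    return violations


# Hypothetical 3-thread problem with one mid-run disruption.
problem = PlanningProblem(
    name="ride-sharing-toy-example",
    num_threads=3,
    dependencies={2: [0, 1]},  # thread 2 must wait for threads 0 and 1
    disruptions=[Disruption(step=5, affected_threads=[1])],
)
print(dependency_violations(problem, {0: 4, 1: 6, 2: 5}))  # -> 1
```

Under this reading, each scaling dimension maps to one field: adding threads widens `num_threads`, a denser `dependencies` map raises coordination complexity, and a longer `disruptions` list raises the frequency of forced re-planning.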
- Genre:
  - Research Report (0.41)
  - Workflow (0.47)
- Industry:
  - Banking & Finance
  - Consumer Products & Services > Travel (1.00)
  - Information Technology (1.00)
  - Leisure & Entertainment (0.68)
  - Transportation (1.00)