MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use
Lei, Fei, Yang, Yibo, Sun, Wenxiu, Lin, Dahua
–arXiv.org Artificial Intelligence
Large Language Models (LLMs) are evolving from text generators into reasoning agents, which makes their ability to use external tools a critical capability. Evaluating this skill, however, remains a significant challenge: existing benchmarks often rely on synthetic tools and severely constrained action spaces. To address these limitations, we introduce MCPVerse, an expansive, real-world benchmark for evaluating agentic tool use. MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 147k tokens, and employs outcome-based evaluation with real-time ground truth for time-sensitive tasks. We benchmark state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale) and find that while most models suffer performance degradation when confronted with larger tool sets, agentic models such as Claude-4-Sonnet can effectively leverage expanded tool spaces to improve accuracy. This finding not only exposes the limitations of state-of-the-art models in complex, real-world scenarios but also establishes MCPVerse as a critical benchmark for measuring and advancing agentic tool-use capabilities.

The ability of LLMs to interact with external tools, typically through function calling, is fundamental to their application in real-world scenarios. This capability allows them to access live data, execute code, and operate other systems, thereby moving beyond their static knowledge. Despite its importance, the evaluation of tool use is hampered by two shortcomings in current benchmarks.
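To make the idea of outcome-based evaluation with real-time ground truth more concrete, the following is a minimal sketch, not the benchmark's actual scoring code. It assumes that each task carries a callable that produces the reference answer at evaluation time, and that only the agent's final answer is judged, not its sequence of tool calls. The names `Task`, `ground_truth`, `score_outcome`, and `accuracy` are hypothetical, and the substring match is a placeholder for whatever judging procedure the benchmark really uses.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Task:
    prompt: str
    # Hypothetical: produces the reference answer at evaluation time,
    # so a time-sensitive task is not scored against a stale label.
    ground_truth: Callable[[], str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def score_outcome(task: Task, final_answer: str) -> bool:
    """Judge only the final answer; the tool-call trajectory is ignored.

    Assumption: a normalized substring match stands in for the real judge.
    """
    return normalize(task.ground_truth()) in normalize(final_answer)


def accuracy(tasks: Sequence[Task], answers: Sequence[str]) -> float:
    """Fraction of tasks whose final answer matches the fresh ground truth."""
    results = [score_outcome(t, a) for t, a in zip(tasks, answers)]
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    # In a real setting ground_truth would query a live source;
    # here it is a fixed stand-in so the example runs offline.
    task = Task(prompt="Which city hosts the event?", ground_truth=lambda: "Vienna")
    print(score_outcome(task, "The event is held in Vienna."))
```

Because the reference is computed when the task is scored, the same task can remain valid as the underlying data changes, which is the property the abstract attributes to real-time ground truth.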
Oct-14-2025
- Country:
  - Asia > Myanmar
    - Tanintharyi Region > Dawei (0.04)
  - Europe > Austria
    - Vienna (0.14)
  - North America > United States
    - Florida > Miami-Dade County
      - Miami (0.04)
    - New Mexico > Bernalillo County
      - Albuquerque (0.04)
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Information Technology (0.93)
- Technology: