ACEBench: Who Wins the Match Point in Tool Usage?
Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, Wu Liu
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs' tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficiently detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi-agent interactions that simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in depth and providing a fine-grained examination of the causes of errors across the different data types.
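To make the categorization and the LLM-free evaluation concrete, here is a minimal sketch of how such benchmark cases might be organized and scored by rule-based matching rather than an LLM judge or a live API call. All names here (EvalCategory, ToolCall, score_call, get_weather) are illustrative assumptions, not identifiers from the paper.

```python
# Hypothetical sketch of an ACEBench-style case layout and rule-based scorer.
# Names and structure are assumptions for illustration, not the paper's code.
from dataclasses import dataclass, field
from enum import Enum


class EvalCategory(Enum):
    NORMAL = "normal"    # basic tool-usage scenarios
    SPECIAL = "special"  # ambiguous or incomplete instructions
    AGENT = "agent"      # multi-agent, multi-turn dialogue simulation


@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)


def score_call(predicted: ToolCall, gold: ToolCall) -> bool:
    """Rule-based check: exact match on the tool name and arguments.

    Because the gold call is compared directly, no LLM judge and no
    real API execution are needed, avoiding the overhead the paper
    attributes to prior benchmarks.
    """
    return (predicted.name == gold.name
            and predicted.arguments == gold.arguments)


if __name__ == "__main__":
    gold = ToolCall("get_weather", {"city": "Beijing", "unit": "celsius"})
    pred = ToolCall("get_weather", {"city": "Beijing", "unit": "celsius"})
    print(EvalCategory.NORMAL.value, score_call(pred, gold))  # normal True
```

Under this reading, the three categories differ in how the gold trajectory is produced (single call, underspecified instruction, or simulated multi-turn dialogue), while the scoring itself stays cheap and deterministic.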
Feb-13-2025