VoiceAgentEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Voice-Agent Evaluation of Xbench's Professional-Aligned Series
Xu, Pengyu, Li, Shijia, Sun, Ao, Zhang, Feng, Li, Yahan, Wu, Bo, Ma, Zhanyu, Li, Jiguo, Xu, Jun, Gao, Jiuchong, Hao, Jinghua, He, Renqing, Wang, Rui, Liu, Yang, Hu, Xiaobo, Yang, Fan, Zheng, Jia, Yao, Guanghua
arXiv.org Artificial Intelligence
We propose OutboundEval, a comprehensive benchmark for evaluating large language models (LLMs) in expert-level intelligent outbound calling scenarios. Existing methods suffer from three key limitations: insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics. OutboundEval addresses these issues through a structured framework. First, we design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. Second, we develop a large-model-driven User Simulator that generates diverse, persona-rich virtual users with realistic behaviors, emotional variability, and communication styles, providing a controlled yet authentic testing environment. Third, we introduce a dynamic evaluation method that adapts to task variations and integrates automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality. Experiments on 12 state-of-the-art LLMs reveal distinct trade-offs between expert-level task completion and interaction fluency, offering practical insights for building reliable, human-like outbound AI systems. OutboundEval establishes a practical, extensible, and domain-oriented standard for benchmarking LLMs in professional applications.
Nov-17-2025
- Country:
  - Asia
    - China
      - Beijing > Beijing (0.04)
      - Guangdong Province > Shenzhen (0.04)
      - Hong Kong (0.04)
    - Middle East > Jordan (0.04)
- Genre:
  - Research Report (0.82)
  - Workflow (1.00)
- Industry:
  - Banking & Finance (1.00)
  - Education > Educational Setting (0.67)
  - Health & Medicine
    - Consumer Health (1.00)
    - Therapeutic Area > Psychiatry/Psychology
      - Mental Health (0.94)
  - Information Technology > Security & Privacy (1.00)
  - Marketing (0.68)
- Technology: