VoiceAgentEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Voice-Agent Evaluation of Xbench's Professional-Aligned Series

Xu, Pengyu, Li, Shijia, Sun, Ao, Zhang, Feng, Li, Yahan, Wu, Bo, Ma, Zhanyu, Li, Jiguo, Xu, Jun, Gao, Jiuchong, Hao, Jinghua, He, Renqing, Wang, Rui, Liu, Yang, Hu, Xiaobo, Yang, Fan, Zheng, Jia, Yao, Guanghua

Nov-17-2025, arXiv.org Artificial Intelligence

We propose OutboundEval, a comprehensive benchmark for evaluating large language models (LLMs) in expert-level intelligent outbound calling scenarios. Existing methods suffer from three key limitations: insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics. OutboundEval addresses these limitations through a structured framework. First, we design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. Second, we develop a large-model-driven User Simulator that generates diverse, persona-rich virtual users with realistic behaviors, emotional variability, and communication styles, providing a controlled yet authentic testing environment. Third, we introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality. Experiments on 12 state-of-the-art LLMs reveal distinct trade-offs between expert-level task completion and interaction fluency, offering practical insights for building reliable, human-like outbound AI systems. OutboundEval establishes a practical, extensible, and domain-oriented standard for benchmarking LLMs in professional applications.

large language model, machine learning, natural language

Country:
- Asia > China (0.28)

Genre:
- Workflow (1.00)
- Research Report (0.82)

Industry:
- Information Technology > Security & Privacy (1.00)
- Banking & Finance (1.00)
- Marketing (0.68)
- Education > Educational Setting (0.67)
- Health & Medicine
  - Consumer Health (1.00)
  - Therapeutic Area > Psychiatry/Psychology
    - Mental Health (0.94)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
