A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support
Nguyen, Long S. T., Hua, Truong P., Nguyen, Thanh M., Pham, Toan Q., Ngo, Nam K., Nguyen, An X., Pham, Nghi D. M., Nguyen, Nghia H., Quan, Tho T.
arXiv.org Artificial Intelligence
With the rapid advancement of Artificial Intelligence, Large Language Models (LLMs) have become indispensable in Question Answering (QA) systems, improving response efficiency and reducing human workload, particularly in customer service. The rise of Vietnamese LLMs (ViLLMs) has made lightweight open-source models the preferred choice, owing to their efficiency, accuracy, and privacy advantages. However, systematic evaluations of their performance in domain-specific contexts remain scarce, making it difficult for enterprises to identify the most suitable LLM for customer support applications, especially given the lack of benchmark datasets that reflect real-world customer interactions. To bridge this gap, we introduce the Customer Support Conversations Dataset (CSConDa), a high-quality benchmark of over 9,000 QA pairs meticulously curated from customer interactions with human advisors at a large-scale Vietnamese software company. Covering diverse service-related topics, including pricing inquiries, product availability, and technical troubleshooting, CSConDa serves as a representative dataset for evaluating ViLLMs in real-world scenarios. We further present a comprehensive evaluation framework, benchmarking 11 lightweight open-source ViLLMs on CSConDa with well-suited automatic metrics as well as an in-depth syntactic analysis that uncovers their strengths, weaknesses, and underlying linguistic patterns. This analysis provides insight into model behavior, explains performance variations, and identifies critical areas for improvement, guiding future advancements in ViLLM development. By establishing a robust benchmark for LLM-driven customer service applications, our work provides a quantitative evaluation dataset and a comprehensive ViLLM performance comparison, offering key insight into intrinsic model performance, including accuracy, fluency, and consistency, while enabling informed decision-making for next-generation QA systems.
Our dataset is publicly available on Hugging Face.
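The abstract mentions benchmarking model answers against curated QA pairs with automatic metrics, though it does not name which ones. As a minimal sketch, one common lexical-overlap metric for this kind of evaluation is unigram ROUGE-1 F1 between a model's answer and the human advisor's reference answer; the example QA strings below are hypothetical, not drawn from CSConDa.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 (ROUGE-1) between a reference and a candidate answer."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each shared token counts at most min(ref, cand) times.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical customer-support QA pair: human advisor answer vs. model answer.
reference = "Phần mềm hỗ trợ cả Windows và macOS"
candidate = "Phần mềm hỗ trợ Windows và macOS"
score = rouge1_f1(reference, candidate)
```

Whitespace tokenization is a simplifying assumption here; production evaluation of Vietnamese text would typically use a dedicated word segmenter, since Vietnamese multi-syllable words are space-separated.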
Jul-31-2025