DialogBench: Evaluating LLMs as Human-like Dialogue Systems
Jiao Ou, Junda Lu, Che Liu, Yihong Tang, Fuzheng Zhang, Di Zhang, Zhongyuan Wang, Kun Gai
arXiv.org Artificial Intelligence
Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities, refreshing humans' impressions of dialogue systems. The long-standing goal of dialogue systems is to be human-like enough to establish long-term connections with users by satisfying their needs for communication, affection, and social belonging. There is therefore an urgent need to evaluate LLMs as human-like dialogue systems. In this paper, we propose DialogBench, a dialogue evaluation benchmark that currently contains 12 dialogue tasks assessing the capabilities that a human-like dialogue system should have. Specifically, we prompt GPT-4 to generate evaluation instances for each task. We first design a basic prompt based on widely used design principles and further mitigate existing biases to generate higher-quality evaluation instances. Our extensive test of 28 LLMs (both pre-trained and supervised instruction-tuned) shows that instruction fine-tuning improves the human likeness of LLMs to a certain extent, but most LLMs still have considerable room for improvement as human-like dialogue systems. Experimental results also indicate that LLMs perform differently across the various abilities that a human-like dialogue system should have. We will publicly release DialogBench, along with the associated evaluation code, for the broader research community.
Nov-2-2023
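
The abstract describes prompting GPT-4 to generate an evaluation instance per dialogue task. Below is a minimal, hypothetical sketch of what such a generation step could look like using the OpenAI Python client; the task name, prompt wording, and output schema are illustrative assumptions and are not taken from the paper's actual prompts or released code.

```python
# Hypothetical sketch of GPT-4-based evaluation-instance generation, in the
# spirit of DialogBench's pipeline. Prompt text and JSON schema are assumed.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASK = "emotion detection"  # placeholder for one of the 12 dialogue tasks

GENERATION_PROMPT = f"""You are constructing a benchmark for evaluating dialogue systems.
Generate one multi-turn dialogue and a multiple-choice question that tests {TASK}.
Return JSON with keys: "dialogue" (list of utterances), "question",
"options" (4 strings), and "answer" (the correct option letter)."""

def generate_instance() -> dict:
    """Request a single evaluation instance from GPT-4 and parse it as JSON."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GENERATION_PROMPT}],
        temperature=0.7,
    )
    # Assumes the model returns bare JSON; a robust pipeline would validate
    # and retry on malformed output.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    instance = generate_instance()
    print(instance["question"])
    for letter, option in zip("ABCD", instance["options"]):
        print(f"{letter}. {option}")
```

A candidate LLM could then be scored by how often it selects the labeled answer across the generated instances for all 12 tasks, though the exact scoring protocol used in the paper is not detailed in this abstract.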