DialogBench: Evaluating LLMs as Human-like Dialogue Systems

Jiao Ou, Junda Lu, Che Liu, Yihong Tang, Fuzheng Zhang, Di Zhang, Zhongyuan Wang, Kun Gai

Nov-2-2023, arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities, reshaping people's impressions of dialogue systems. The long-standing goal of dialogue systems is to be human-like enough to establish long-term connections with users by satisfying their needs for communication, affection, and social belonging. There is therefore an urgent need to evaluate LLMs as human-like dialogue systems. In this paper, we propose DialogBench, a dialogue evaluation benchmark that currently contains 12 dialogue tasks to assess the capabilities that LLMs should have as human-like dialogue systems. Specifically, we prompt GPT-4 to generate evaluation instances for each task. We first design a basic prompt based on widely used design principles and then mitigate existing biases to generate higher-quality evaluation instances. Our extensive tests on 28 LLMs (including pre-trained and supervised instruction-tuned models) show that instruction fine-tuning improves the human likeness of LLMs to a certain extent, but most LLMs still have much room for improvement as human-like dialogue systems. In addition, experimental results indicate that LLMs perform differently across the various abilities that human-like dialogue systems should have. We will publicly release DialogBench, along with the associated evaluation code, for the broader research community.

large language model, machine learning, natural language


