Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

Huynh, Jessica, Jiao, Cathy, Gupta, Prakhar, Mehri, Shikib, Bajaj, Payal, Chaudhary, Vishrav, Eskenazi, Maxine

arXiv.org Artificial Intelligence 

In recent years, language models such as GPT-3 [5] have grown larger, and their performance on downstream natural language processing (NLP) tasks has significantly improved in low-resource settings where only a few instances per task are available (few-shot). The larger these models are, the better they tend to perform on tasks such as language generation and evaluation [39]. They can generate coherent, fluent and interesting responses. However, they can also produce responses that are repetitive and unengaging [29], in addition to being hard to control. Dialog evaluation is the task of assessing the quality of responses generated by dialog models in terms of properties like those mentioned above. However, one significant impediment for open-domain dialog generation research is the lack of meaningful automatic metrics for open-domain dialog evaluation. Standard language generation metrics have been shown to be ineffective for dialog evaluation [11], in large part because a given conversation can be followed by multiple valid responses.
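This one-to-many property is the core reason reference-based metrics break down: a response can be perfectly appropriate yet share almost no words with the single gold reference. Below is a minimal sketch (not from the paper; the `unigram_overlap` helper is a hypothetical stand-in for word-overlap metrics such as BLEU) illustrating the problem.

```python
# Minimal sketch: a toy unigram-overlap score, standing in for
# reference-based metrics like BLEU, to show why scoring against a
# single gold reference penalizes perfectly valid dialog responses.

def unigram_overlap(hypothesis: str, reference: str) -> float:
    """Fraction of hypothesis tokens that also appear in the reference."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = set(reference.lower().split())
    if not hyp_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in hyp_tokens) / len(hyp_tokens)

context = "What did you do over the weekend?"
reference = "I went hiking with my dog on Saturday."

# Both candidates are coherent, relevant replies to the context,
# but only the one that happens to share words with the reference scores well.
candidates = [
    "I went hiking with my dog.",            # lexically close to the reference
    "Mostly stayed home and read a novel.",  # equally valid, no word overlap
]

for cand in candidates:
    print(f"{cand!r}: overlap = {unigram_overlap(cand, reference):.2f}")
```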
