Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

Huynh, Jessica, Jiao, Cathy, Gupta, Prakhar, Mehri, Shikib, Bajaj, Payal, Chaudhary, Vishrav, Eskenazi, Maxine

arXiv.org Artificial Intelligence 

In recent years, language models such as GPT-3 [5] have grown larger, and their performance on downstream natural language processing (NLP) tasks has significantly improved in low-resource settings where only a few instances per task are available (few-shot). The larger these models are, the better they tend to perform on tasks such as language generation and evaluation [39]. They can generate coherent, fluent and interesting responses. However, they can also produce responses that are repetitive and unengaging [29], in addition to being hard to control. Dialog evaluation is the task of assessing the quality of responses generated by dialog models in terms of properties like those mentioned above. However, one significant impediment for open-domain dialog generation research is the lack of meaningful automatic metrics for open-domain dialog evaluation. Standard language generation metrics have been shown to be ineffective for dialog evaluation [11], in large part because a given conversation can be followed by multiple valid responses.
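This one-to-many property is the core reason reference-based metrics break down: a response can be perfectly appropriate yet share almost no words with the single gold reference. Below is a minimal sketch (not from the paper; the `unigram_overlap` helper is a hypothetical stand-in for word-overlap metrics such as BLEU) illustrating the problem.

```python
# Minimal sketch: a toy unigram-overlap score, standing in for
# reference-based metrics like BLEU, to show why scoring against a
# single gold reference penalizes perfectly valid dialog responses.

def unigram_overlap(hypothesis: str, reference: str) -> float:
    """Fraction of hypothesis tokens that also appear in the reference."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = set(reference.lower().split())
    if not hyp_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in hyp_tokens) / len(hyp_tokens)

context = "What did you do over the weekend?"
reference = "I went hiking with my dog on Saturday."

# Both candidates are coherent, relevant replies to the context,
# but only the one that happens to share words with the reference scores well.
candidates = [
    "I went hiking with my dog.",            # lexically close to the reference
    "Mostly stayed home and read a novel.",  # equally valid, no word overlap
]

for cand in candidates:
    print(f"{cand!r}: overlap = {unigram_overlap(cand, reference):.2f}")
```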
