How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation

Khayrallah, Huda, Akhtar, Zuhaib, Cohen, Edward, Sedoc, João

May-23-2023 – arXiv.org Artificial Intelligence

We release MMSMR, a Massively Multi-System MultiReference dataset, to enable future work on metrics and evaluation for dialog. Automatic metrics for dialog evaluation should be robust proxies for human judgments; however, verification of their robustness is currently far from satisfactory. To quantify how robustly metrics correlate with human judgments, and to understand what is necessary in a test set, we create and release an 8-reference dialog dataset by extending single-reference evaluation sets, and we introduce this new language-learning conversation dataset. We then train 1750 systems and evaluate them on our novel test set and the DailyDialog dataset. We release the novel test set, along with model hyperparameters, inference outputs, and metric scores for each system on a variety of datasets.
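As a rough illustration of the two ideas in the abstract (scoring a response against several references, and checking how well a metric's system ranking agrees with human judgments), here is a minimal Python sketch. It is not the authors' pipeline: the unigram-F1 metric, the max-over-references aggregation, and all function names are assumptions chosen for brevity, and any inputs passed to these functions would be toy data rather than values from MMSMR.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Token-overlap F1 between one hypothesis and one reference (assumed metric)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis: str, references: list[str]) -> float:
    """Best score over all references (one common aggregation; an assumption here)."""
    return max(unigram_f1(hypothesis, ref) for ref in references)


def ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores: list[float], human_scores: list[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(metric_scores), ranks(human_scores))
```

In use, one would compute a per-system average of `multi_reference_score` over a test set and pass those averages, together with per-system human ratings, to `spearman`. With more references per prompt, a valid but differently worded response is less likely to be penalized, which is the motivation for a multi-reference test set.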

computational linguistic, machine learning, natural language

Country:
- North America
  - United States
    - Texas (0.04)
    - Pennsylvania (0.04)
    - Michigan (0.04)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
- Europe
  - Sweden > Stockholm
    - Stockholm (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - Taiwan > Taiwan Province
    - Taipei (0.04)
  - China
    - Hong Kong (0.04)
    - Beijing > Beijing (0.04)

Genre:
- Research Report (0.64)

Industry:
- Education > Curriculum > Subject-Specific Education (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language
    - Machine Translation (0.68)
    - Chatbot (0.50)
