Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

Deriu, Jan, von Däniken, Pius, Tuggener, Don, Cieliebak, Mark

Jun-6-2023–arXiv.org Artificial Intelligence

A major challenge in the field of Text Generation is evaluation: human evaluations are costly, and automated metrics often disagree considerably with human judgments. In this paper, we propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics when they are used to generate preference rankings between system outputs. We show that existing automated metrics are generally over-confident in assigning significant differences between systems in this setting. However, our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics. We show that, using this combination, we require only about 50% of the human annotations typically used in evaluations to arrive at robust and statistically significant results, while obtaining the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of our approach for three text generation tasks: dialogue systems, machine translation, and text summarization.
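The general idea of correcting an error-prone metric with a small set of human ratings can be illustrated with a minimal sketch. This is not the statistical model proposed in the paper, which the abstract does not spell out; it is a simple stand-in, the classical misclassification correction for a binary outcome. It assumes preferences are binary (system A wins or system B wins, with ties ignored) and that a subset of samples has both a human and a metric rating. The function names `estimate_error_rates` and `corrected_win_rate` are hypothetical.

```python
from typing import Iterable, Tuple


def estimate_error_rates(paired: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Estimate the metric's agreement rates from doubly-rated samples.

    Each pair is (human_prefers_a, metric_prefers_a).
    Returns (tpr, fpr):
      tpr = P(metric prefers A | human prefers A)
      fpr = P(metric prefers A | human prefers B)
    """
    paired = list(paired)
    metric_when_human_a = [m for h, m in paired if h]
    metric_when_human_b = [m for h, m in paired if not h]
    if not metric_when_human_a or not metric_when_human_b:
        raise ValueError("need human-rated samples for both outcomes")
    tpr = sum(metric_when_human_a) / len(metric_when_human_a)
    fpr = sum(metric_when_human_b) / len(metric_when_human_b)
    return tpr, fpr


def corrected_win_rate(observed: float, tpr: float, fpr: float) -> float:
    """Invert observed = p * tpr + (1 - p) * fpr to recover p.

    `observed` is the share of samples on which the metric prefers A;
    the result estimates the share on which humans would prefer A.
    """
    if tpr <= fpr:
        raise ValueError("metric is no better than chance; correction undefined")
    estimate = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion


# Illustrative, made-up data: 20 samples rated by both humans and the metric.
paired_ratings = (
    [(True, True)] * 8 + [(True, False)] * 2      # humans prefer A
    + [(False, True)] * 3 + [(False, False)] * 7  # humans prefer B
)
tpr, fpr = estimate_error_rates(paired_ratings)
win_rate = corrected_win_rate(observed=0.65, tpr=tpr, fpr=fpr)
```

With these made-up numbers the metric prefers system A on 65% of samples, but after accounting for its estimated error rates the corrected win rate for A is about 70%. A point estimate like this carries no uncertainty; deciding whether a difference between systems is significant would additionally require modelling the uncertainty in the error rates themselves.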

large language model, machine learning, natural language

Country:
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America
  - United States
    - Pennsylvania (0.04)
    - New Mexico > Santa Fe County
      - Santa Fe (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Georgia > Fulton County
      - Atlanta (0.04)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
- Europe
  - Germany > Berlin (0.04)
  - United Kingdom > Scotland
    - City of Aberdeen > Aberdeen (0.04)
  - Switzerland > Zürich
    - Zürich (0.04)
  - Spain
    - Community of Madrid > Madrid (0.04)
    - Catalonia > Barcelona Province
      - Barcelona (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
- Asia
  - Middle East > Jordan (0.04)
  - China > Hong Kong (0.04)

Genre:
- Research Report > Experimental Study (0.88)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Uncertainty
    - Bayesian Inference (0.46)
  - Natural Language
    - Machine Translation (0.67)
    - Generation (0.46)
    - Large Language Model (0.46)
    - Discourse & Dialogue (0.46)
  - Machine Learning > Learning Graphical Models
    - Directed Networks > Bayesian Learning (0.46)
