Correction of Errors in Preference Ratings from Automated Metrics for Text Generation
Deriu, Jan, von Däniken, Pius, Tuggener, Don, Cieliebak, Mark
arXiv.org Artificial Intelligence
A major challenge in the field of Text Generation is evaluation: human evaluations are cost-intensive, and automated metrics often display considerable disagreement with human judgments. In this paper, we propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics when used to generate preference rankings between system outputs. We show that existing automated metrics are generally over-confident in assigning significant differences between systems in this setting. However, our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics. We show that, using this combination, we require only about 50% of the human annotations typically used in evaluations to arrive at robust and statistically significant results, while yielding the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of our approach for three text generation tasks: dialogue systems, machine translation, and text summarization.
Jun-6-2023
- Country:
- Oceania > Australia
- North America
- United States
- Pennsylvania (0.04)
- New Mexico > Santa Fe County
- Santa Fe (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Georgia > Fulton County
- Atlanta (0.04)
- Canada > British Columbia
- Europe
- Germany > Berlin (0.04)
- United Kingdom > Scotland
- City of Aberdeen > Aberdeen (0.04)
- Switzerland > Zürich
- Zürich (0.04)
- Spain
- Galicia > Madrid (0.04)
- Catalonia > Barcelona Province
- Barcelona (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Asia
- Middle East > Jordan (0.04)
- China > Hong Kong (0.04)
- Genre:
- Research Report > Experimental Study (0.88)
- Technology: