How to Select Datapoints for Efficient Human Evaluation of NLG Models?

Zouhar, Vilém, Cui, Peng, Sachan, Mrinmaya

Jan-30-2025–arXiv.org Artificial Intelligence

Human evaluation is the gold standard for evaluating text generation models. It is also expensive, and to fit budgetary constraints, a random subset of the test data is often chosen in practice. The randomly selected data may not accurately represent test performance, making this approach economically inefficient for model comparison. Thus, in this work, we develop a suite of selectors to get the most informative datapoints for human evaluation while taking the evaluation costs into account. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. We further develop an approach to distill these selectors to the scenario where the model outputs are not yet available. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts. We demonstrate the efficacy of our selectors in two common NLG tasks, machine translation and summarization.

Figure 1: Output-based variant of our informative subset selection approach. Given model outputs and automated metrics, we select items to be human-evaluated on which the final model ranking can be computed.
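
As a rough illustration of the variance-based idea mentioned in the abstract, the sketch below ranks test items by how much an automated metric disagrees across models and keeps the top few for human evaluation. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `select_by_metric_variance`, the `scores[i][m]` layout, the fixed item `budget`, and the toy scores are all hypothetical, and per-item annotation cost is ignored.

```python
from statistics import pvariance


def select_by_metric_variance(scores, budget):
    """Pick the `budget` items whose metric scores vary most across models.

    scores[i][m] is the automated metric score of model m on item i
    (hypothetical layout, assumed for this sketch).
    Returns the selected item indices in ascending order.
    """
    # Population variance of each item's scores across all models.
    variances = [pvariance(item_scores) for item_scores in scores]
    # Rank items by variance, highest first; ties go to the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-variances[i], i))
    return sorted(ranked[:budget])


# Toy data: 4 items scored for 3 models (made-up numbers).
scores = [
    [0.80, 0.81, 0.79],  # models nearly tie on this item
    [0.20, 0.90, 0.55],  # models disagree strongly
    [0.50, 0.50, 0.50],  # identical scores, zero variance
    [0.30, 0.60, 0.45],  # moderate disagreement
]
print(select_by_metric_variance(scores, budget=2))  # prints [1, 3]
```

Items on which all models receive similar metric scores say little about which model is better, so under this heuristic they are the first to be dropped when the annotation budget is tight.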

artificial intelligence, machine learning, natural language

Country:
- North America > United States
  - Florida > Miami-Dade County > Miami (0.04)
- Europe
  - Switzerland > Zürich
    - Zürich (0.04)
  - France > Hauts-de-France
    - Nord > Lille (0.04)
- Asia
  - Middle East > Jordan (0.04)
  - Thailand > Bangkok
    - Bangkok (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language > Machine Translation (0.51)
