MT-Ranker: Reference-free machine translation evaluation by inter-system ranking

Moosa, Ibraheem Muhammad, Zhang, Rui, Yin, Wenpeng

Jan-30-2024–arXiv.org Artificial Intelligence

Traditionally, machine translation (MT) evaluation has been treated as a regression problem: producing an absolute translation-quality score. This approach has two limitations: (i) the scores lack interpretability, and human annotators struggle to give consistent scores; (ii) most scoring methods are based on (reference, translation) pairs, which limits their applicability in real-world scenarios where references are absent. In practice, we often care about whether a new MT system is better or worse than its competitors. In addition, reference-free MT evaluation is increasingly practical and necessary. Unfortunately, these two practical considerations have yet to be jointly explored. In this work, we formulate reference-free MT evaluation as a pairwise ranking problem. Given the source sentence and a pair of translations, our system predicts which translation is better. In addition to proposing this new formulation, we further show that this paradigm achieves superior correlation with human judgments using only indirect supervision from natural language inference and weak supervision from our synthetic data. In the context of reference-free evaluation, MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21. On a more challenging benchmark, ACES, which contains fine-grained evaluation criteria such as addition, omission, and mistranslation errors, MT-Ranker achieves state-of-the-art results against reference-free as well as reference-based baselines. Automatic MT evaluation is crucial for measuring the progress of MT systems. Compared to human evaluation, automatic evaluation is much cheaper and less subjective.
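The pairwise formulation described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `PairwiseJudge` signature, the `toy_judge` length heuristic, and the win-count aggregation in `rank_systems` are hypothetical and are not the MT-Ranker model or the authors' procedure. The sketch only shows the shape of the problem: a judge sees a source sentence and two candidate translations, with no reference, and says which one is better.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical interface: given (source, translation_a, translation_b),
# return 0 if translation_a is judged better and 1 if translation_b is.
PairwiseJudge = Callable[[str, str, str], int]


def toy_judge(source: str, translation_a: str, translation_b: str) -> int:
    """Placeholder judge, NOT the MT-Ranker model.

    It prefers the translation whose token count is closer to the source's
    token count; ties go to translation_a. A real system would use a
    trained classifier here.
    """
    source_len = len(source.split())
    gap_a = abs(len(translation_a.split()) - source_len)
    gap_b = abs(len(translation_b.split()) - source_len)
    return 0 if gap_a <= gap_b else 1


def rank_systems(
    sources: List[str],
    outputs: Dict[str, List[str]],
    judge: PairwiseJudge,
) -> List[str]:
    """Order systems by how many pairwise comparisons they win.

    `outputs` maps a system name to its translations, aligned with `sources`.
    Counting wins is one simple aggregation, assumed here for illustration.
    """
    wins = Counter({name: 0 for name in outputs})
    for index, source in enumerate(sources):
        for system_a, system_b in combinations(outputs, 2):
            verdict = judge(source, outputs[system_a][index], outputs[system_b][index])
            wins[system_a if verdict == 0 else system_b] += 1
    return sorted(outputs, key=lambda name: wins[name], reverse=True)


sources = ["Der Hund schläft auf dem Sofa ."]
outputs = {
    "system_x": ["The dog sleeps on the sofa ."],
    "system_y": ["Dog sleeps ."],
}
ranking = rank_systems(sources, outputs, toy_judge)
```

Because the judge never sees a reference translation, the same loop applies wherever only source sentences and system outputs are available.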

computational linguistic, evaluation, translation

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)