A Critical Study of Automatic Evaluation in Sign Language Translation
Yazdani, Shakib, Hamidullah, Yasser, España-Bonet, Cristina, Avramidis, Eleftherios, van Genabith, Josef
arXiv.org Artificial Intelligence
Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are purely text-based, and it remains unclear how reliably such metrics capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation by analyzing six metrics: on the one hand, BLEU, chrF, ROUGE, and BLEURT; on the other, large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and shows that LLM-based evaluators better capture semantic equivalence that conventional metrics often miss, but can also be biased toward LLM-paraphrased translations. Moreover, although all metrics can detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. These findings motivate multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
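As a rough illustration of the paraphrasing condition, the sketch below scores a meaning-preserving paraphrase with a toy clipped unigram-overlap F1. This is not BLEU, chrF, or ROUGE as used in the paper; the function `overlap_f1` and both example sentences are hypothetical and chosen only to show why surface overlap can undervalue a valid rewording.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_f1(hypothesis, reference, n=1):
    """Toy clipped n-gram overlap F1 on whitespace tokens (assumed stand-in
    for a lexical overlap metric; not an implementation of BLEU/chrF/ROUGE)."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    matched = sum((hyp & ref).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences, not taken from any SLT dataset.
reference = "tomorrow it will rain in the north"
paraphrase = "rain is expected up north tomorrow"

identical_score = overlap_f1(reference, reference)
paraphrase_score = overlap_f1(paraphrase, reference)

print(f"identical:  {identical_score:.2f}")   # 1.00
print(f"paraphrase: {paraphrase_score:.2f}")  # 0.46
```

Only three of the paraphrase's six tokens appear in the reference, so the score drops to about 0.46 even though the meaning is unchanged, which is the kind of gap a semantic evaluator is meant to close.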
Nov-17-2025