A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation
Betala, Siddharth, Raj, Kushan, Betala, Vipul, Saswade, Rohan
arXiv.org Artificial Intelligence
In this paper, we describe our system, submitted under the team name BLEU Monday, for the English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the text-only translation tasks for the English-Hindi, English-Bengali, English-Malayalam, and English-Odia language pairs. We present a two-stage approach that addresses quality issues in the training data through automated error detection and correction, followed by parameter-efficient model fine-tuning. Our methodology introduces a vision-augmented judge-corrector pipeline that leverages multimodal language models to systematically identify and correct translation errors in the training data. The judge component classifies each translation into one of three categories: correct, visually ambiguous (requiring image context), or mistranslated (poor translation quality). Identified errors are routed to specialized correctors: GPT-4o-mini regenerates captions that require visual disambiguation, while IndicTrans2 retranslates cases with pure translation quality issues. This automated pipeline processes 28,928 training examples across four languages, correcting an average of 17.1% of captions per language. We then apply Low-Rank Adaptation (LoRA) to fine-tune the IndicTrans2 en-indic 200M distilled model on both the original and the corrected datasets.
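The judge-then-route step described in the abstract can be illustrated with a minimal Python sketch. Everything named here (`Verdict`, `Example`, `correct_dataset`, and the three callables) is hypothetical and chosen for illustration; the abstract does not specify the authors' code, prompts, or data schema. The judge and the two correctors are passed in as plain functions, so in practice they would wrap a multimodal model, GPT-4o-mini, and IndicTrans2 respectively, but no real API calls are shown.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Verdict(Enum):
    """The three judge categories named in the abstract."""
    CORRECT = "correct"
    VISUALLY_AMBIGUOUS = "visually_ambiguous"  # needs image context
    MISTRANSLATED = "mistranslated"            # poor translation quality


@dataclass(frozen=True)
class Example:
    # Hypothetical schema: an image reference, the English caption,
    # and the existing target-language caption.
    image_id: str
    source: str
    target: str


Judge = Callable[[Example], Verdict]
Corrector = Callable[[Example], str]


def correct_dataset(
    examples: Iterable[Example],
    judge: Judge,
    vision_corrector: Corrector,  # assumed to wrap a multimodal model
    text_corrector: Corrector,    # assumed to wrap a text-only MT model
) -> Tuple[List[Example], float]:
    """Route each example by the judge's verdict.

    Returns the corrected examples and the fraction that were rewritten.
    """
    corrected: List[Example] = []
    changed = 0
    for ex in examples:
        verdict = judge(ex)
        if verdict is Verdict.VISUALLY_AMBIGUOUS:
            ex = replace(ex, target=vision_corrector(ex))
            changed += 1
        elif verdict is Verdict.MISTRANSLATED:
            ex = replace(ex, target=text_corrector(ex))
            changed += 1
        # Verdict.CORRECT: keep the original caption unchanged.
        corrected.append(ex)
    rate = changed / len(corrected) if corrected else 0.0
    return corrected, rate
```

Run once per language pair, the returned rate corresponds to the kind of per-language correction percentage the abstract reports. Injecting the judge and correctors as callables keeps the routing logic testable with stubs, independent of any particular model backend.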
Nov-11-2025
- Country:
- Europe > Italy
- North America
- Canada > British Columbia
- United States
- California > San Francisco County
- San Francisco (0.14)
- Florida > Miami-Dade County
- Miami (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- New Mexico > Santa Fe County
- Santa Fe (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Genre:
- Research Report (1.00)
- Technology: