Machine Translation
proposed idea to be impactful (all reviewers), clear (all reviewers), novel (R1,R2), principled (R3,R4) and applicable to
We thank all reviewers for their thorough reviews and insightful feedback! We will incorporate all suggested improvements in the final version. We did not compare to Zhang et al. (2019) because (1) our method is independent of We missed Zhang et al. (2020) since it was published at ACL '20 which is one month after our But we will include both and relevant multilingual MT references within it in the final version. It is the standard error after running with different seeds. In Table 4, we compared 12/100 (24.16 BLEU) to 12/24 (23.7 BLEU) so as to isolate the effect from increased encoder depths.
TASER: Translation Assessment via Systematic Evaluation and Reasoning
Maheswaran, Monishwaran, Carini, Marco, Federmann, Christian, Diaz, Tony
We introduce TASER (Translation Assessment via Systematic Evaluation and Reasoning), a metric that uses Large Reasoning Models (LRMs) for automated translation quality assessment. TASER harnesses the explicit reasoning capabilities of LRMs to conduct systematic, step-by-step evaluation of translation quality. We evaluate TASER on the WMT24 Metrics Shared Task across both reference-based and reference-free scenarios, demonstrating state-of-the-art performance. In system-level evaluation, TASER achieves the highest soft pairwise accuracy in both reference-based and reference-free settings, outperforming all existing metrics. At the segment level, TASER maintains competitive performance with our reference-free variant ranking as the top-performing metric among all reference-free approaches. Our experiments reveal that structured prompting templates yield superior results with LRMs compared to the open-ended approaches that proved optimal for traditional LLMs. We evaluate o3, a large reasoning model from OpenAI, with varying reasoning efforts, providing insights into the relationship between reasoning depth and evaluation quality. The explicit reasoning process in LRMs offers interpretability and visibility, addressing a key limitation of existing automated metrics. Our results demonstrate that Large Reasoning Models show a measurable advancement in translation quality assessment, combining improved accuracy with transparent evaluation across diverse language pairs.
Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data
Bouthors, Maxime, Crego, Josep, Yvon, François
Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many settings, monolingual corpora in the target language are available. This work explores ways to take advantage of such resources by directly retrieving relevant target-language segments, based on a source-side query. For this, we design improved cross-lingual retrieval systems, trained with both sentence-level and word-level matching objectives. In our experiments with three RANMT architectures, we assess such cross-lingual objectives in a controlled setting, reaching performance that matches that of standard TM-based models. We also showcase our method in a real-world setting, using much larger monolingual corpora, and observe strong improvements over both the baseline setting and general-purpose cross-lingual retrievers.
06964dce9addb1c5cb5d6e3d9838f733-AuthorFeedback.pdf
We thank the reviewers for their feedback. We will reflect the reviewers' comments and our responses in the revision. Reviewers expressed concern about the novelty and the accuracy. DA is more effective when the task is more challenging. On the other hand, we find DA effective as well when the amount of labeled data is small.
Searching for Difficult-to-Translate Test Examples at Scale
Xu, Wenda, Zouhar, Vilém, Riley, Parker, Finkelstein, Mara, Freitag, Markus, Deutsch, Daniel
NLP models require test data that are sufficiently challenging. The difficulty of an example is linked to the topic it originates from ("seed topic"). The relationship between the topic and the difficulty of its instances is stochastic in nature: an example about a difficult topic can happen to be easy, and vice versa. At the scale of the Internet, there are tens of thousands of potential topics, and finding the most difficult one by drawing and evaluating a large number of examples across all topics is computationally infeasible. We formalize this task and treat it as a multi-armed bandit problem. In this framework, each topic is an "arm," and pulling an arm (at a cost) involves drawing a single example, evaluating it, and measuring its difficulty. The goal is to efficiently identify the most difficult topics within a fixed computational budget. We illustrate the bandit problem setup of finding difficult examples for the task of machine translation. We find that various bandit strategies vastly outperform baseline methods such as brute-force search for the most challenging topics.
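The bandit framing described in this abstract can be illustrated with a minimal sketch. The abstract does not say which bandit strategies were evaluated, so the standard UCB1 rule is used here purely as a stand-in. The function name `ucb1_hardest_topic`, the `draw_difficulty` callback, and the simulated difficulty values are all hypothetical and not taken from the paper.

```python
import math
import random


def ucb1_hardest_topic(draw_difficulty, num_topics, budget, exploration=2.0):
    """Spend `budget` pulls across topics and return (best, means, counts).

    draw_difficulty(topic) is a hypothetical callback that draws one example
    from the topic, evaluates it, and returns a difficulty score in [0, 1].
    """
    if budget < num_topics:
        raise ValueError("budget must allow at least one pull per topic")
    counts = [0] * num_topics    # pulls spent on each topic (arm)
    totals = [0.0] * num_topics  # summed observed difficulty per topic

    def upper_bound(topic, step):
        # Empirical mean difficulty plus the UCB1 exploration bonus.
        mean = totals[topic] / counts[topic]
        return mean + math.sqrt(exploration * math.log(step) / counts[topic])

    for step in range(budget):
        if step < num_topics:
            topic = step  # pull every arm once before using the bound
        else:
            topic = max(range(num_topics), key=lambda t: upper_bound(t, step))
        totals[topic] += draw_difficulty(topic)
        counts[topic] += 1

    means = [totals[t] / counts[t] for t in range(num_topics)]
    best = max(range(num_topics), key=lambda t: means[t])
    return best, means, counts


# Simulated environment (assumption): each topic has a fixed true difficulty
# and every draw adds bounded uniform noise.
rng = random.Random(0)
true_difficulty = [0.2, 0.5, 0.8, 0.4]
best, means, counts = ucb1_hardest_topic(
    lambda t: true_difficulty[t] + rng.uniform(-0.1, 0.1),
    num_topics=len(true_difficulty),
    budget=400,
)
print(best, counts)
```

In this toy setup most of the budget ends up on the hardest simulated topic, whereas a uniform sweep would spend the same number of pulls on every topic regardless of what has been observed so far.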
The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models
Conti, Lina, Fucci, Dennis, Gaido, Marco, Negri, Matteo, Wisniewski, Guillaume, Bentivogli, Luisa
Contrastive explanations, which indicate why an AI system produced one output (the target) instead of another (the foil), are widely regarded in explainable AI as more informative and interpretable than standard explanations. However, obtaining such explanations for speech-to-text (S2T) generative models remains an open challenge. Drawing from feature attribution techniques, we propose the first method to obtain contrastive explanations in S2T by analyzing how parts of the input spectrogram influence the choice between alternative outputs. Through a case study on gender assignment in speech translation, we show that our method accurately identifies the audio features that drive the selection of one gender over another. By extending the scope of contrastive explanations to S2T, our work provides a foundation for better understanding S2T models.
Don't Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation
DiIanni, Colten, Deutsch, Daniel
This paper introduces Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric for Machine Translation (MT) that addresses limitations in previous Pearson's $\rho$-based and Kendall's $\tau$-based meta-evaluation approaches. PDP is a correlation-based metric that utilizes pairwise differences rather than raw scores. It draws on information from all segments for a more robust understanding of score distributions and uses segment-wise pairwise differences to refine Global Pearson to intra-segment score comparisons. Analysis on the WMT'24 shared task shows PDP properly ranks sentinel evaluation metrics and better aligns with human error weightings than previous work. Noise injection analysis demonstrates PDP's robustness to random noise, segment bias, and system bias while highlighting its sensitivity to extreme outliers.
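One plausible reading of "pairwise differences rather than raw scores" can be sketched as follows. The abstract does not give the exact PDP definition, so this is an assumption-laden illustration and not the paper's formula: for each segment, take the score difference of every pair of system outputs under both the metric and the human annotation, pool those differences over all segments, and compute a single Pearson correlation. The function names and the `scores[segment][system]` layout are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def pairwise_difference_pearson(metric_scores, human_scores):
    """Correlate within-segment pairwise score differences (assumed reading).

    Both arguments are indexed as scores[segment][system].
    """
    metric_diffs, human_diffs = [], []
    for metric_row, human_row in zip(metric_scores, human_scores):
        # Differences are only taken between systems on the same segment.
        for i, j in combinations(range(len(metric_row)), 2):
            metric_diffs.append(metric_row[i] - metric_row[j])
            human_diffs.append(human_row[i] - human_row[j])
    return pearson(metric_diffs, human_diffs)


# Toy data (assumption): the metric equals the human score plus a different
# constant offset on each segment.
human = [[1.0, 2.0, 4.0], [0.0, 3.0, 5.0]]
metric = [[11.0, 12.0, 14.0], [-5.0, -2.0, 0.0]]
print(pairwise_difference_pearson(metric, human))
```

Under this reading, a constant per-segment offset cancels out in every within-segment difference, which is consistent with the robustness to segment bias mentioned in the abstract.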
DocHPLT: A Massively Multilingual Document-Level Translation Dataset
O'Brien, Dayyán, Malik, Bhavitvya, de Gibert, Ona, Chen, Pinzhen, Haddow, Barry, Tiedemann, Jörg
Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences. By adding pivoted alignments, practitioners can obtain 2500 additional pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content, including unaligned portions. After our preliminary experiments identify the optimal training context strategy for document-level translation, we demonstrate that LLMs fine-tuned on DocHPLT substantially outperform off-the-shelf instruction-tuned baselines, with particularly dramatic improvements for under-resourced languages. We open-source the dataset under a permissive license, providing essential infrastructure for advancing multilingual document-level translation.
A Culturally-diverse Multilingual Multimodal Video Benchmark & Model
Shafique, Bhuiyan Sanjid, Vayani, Ashmal, Maaz, Muhammad, Rasheed, Hanoona Abdul, Dissanayake, Dinura, Kurpath, Mohammed Irfan, Hmaiti, Yahya, Inoue, Go, Lahoud, Jean, Rashid, Md. Safirur, Quasem, Shadid Intisar, Fatima, Maheen, Vidal, Franco, Maslych, Mykola, More, Ketan Pravin, Baliah, Sanoojan, Watawana, Hasindri, Li, Yuhao, Farestam, Fabian, Schaller, Leon, Tymtsiv, Roman, Weber, Simon, Cholakkal, Hisham, Laptev, Ivan, Satoh, Shin'ichi, Felsberg, Michael, Shah, Mubarak, Khan, Salman, Khan, Fahad Shahbaz
Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs target the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we also introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high- and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM, along with a large-scale multilingual video training set, will help ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.