wmt22
How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation
Chopra, Muskaan, Sparrenberg, Lorenz, Khanna, Sarthak, Sifa, Rafet
Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you go while still detecting meaning-altering translation errors? Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC = 0.77 with F1-ERR = 0.98 on SynCED-EnDe 2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). In contrast, ultra-small models (< 0.6 B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs, augmented with lightweight calibration and small-sample supervision, can deliver trustworthy, on-device critical error detection (CED) for MT, enabling private, low-cost error screening in real-world translation pipelines.
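As a rough illustration of two ingredients named in the abstract above, the sketch below shows majority voting over several sampled labels per segment and the Matthews correlation coefficient (MCC) over the resulting predictions. It is not the authors' implementation: the label strings `"ERR"`/`"NOT"` follow the metric names in the abstract, while the function names, the tie-breaking rule (ties resolve to `"ERR"`), and the toy data are assumptions made only for this example.

```python
import math
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label among sampled votes.

    Assumption for this sketch: a tie is resolved in favour of "ERR",
    so that ambiguous segments are flagged rather than passed.
    """
    counts = Counter(votes)
    top = max(counts.values())
    winners = [label for label, c in counts.items() if c == top]
    return "ERR" if "ERR" in winners else winners[0]


def mcc(gold, pred, positive="ERR"):
    """Matthews correlation coefficient for binary labels."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy data (hypothetical): three sampled labels per segment.
sampled = [
    ["ERR", "ERR", "NOT"],
    ["NOT", "NOT", "ERR"],
    ["ERR", "NOT", "ERR"],
    ["NOT", "NOT", "NOT"],
]
gold = ["ERR", "NOT", "NOT", "NOT"]

pred = [majority_vote(v) for v in sampled]
score = mcc(gold, pred)
print(pred, round(score, 3))
```

MCC is a common choice for this kind of task because it stays informative when one class is much rarer than the other, which is typical for critical errors.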
Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning
Hu, Tianxiang, Zhang, Pei, Yang, Baosong, Xie, Jun, Wong, Derek F., Wang, Rui
Achieving consistently high-quality machine translation (MT) across diverse domains remains a significant challenge, primarily because parallel training data is limited and imbalanced across domains. While large language models (LLMs) have demonstrated impressive general understanding and generation abilities, their potential in multi-domain MT is under-explored. We establish a comprehensive benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets respectively covering 15 domains. Our evaluation of prominent LLMs reveals a discernible performance gap against traditional MT systems, highlighting domain overfitting and catastrophic forgetting issues after fine-tuning on domain-limited corpora. To mitigate this, we propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance. This method encourages the LLM to perceive domain information from the source text, which then serves as a helpful hint to guide the translation process. Despite being trained on a small dataset of four domains, our CoT fine-tuning approach achieves notably better translation accuracy and domain robustness than traditional fine-tuning, as evidenced by an average increase of 1.53 BLEU across more than 20 distinct German$\rightarrow$English out-of-domain tests.
A Large-Scale Automatic Evaluation of Machine Translation
Like every year since 2006, the Conference on Machine Translation (WMT) organized extensive machine translation shared tasks. Numerous participants from all over the world submitted their machine translation (MT) outputs to demonstrate their recent advances in the field. WMT is generally recognized as the event of reference to observe and evaluate the state of the art in MT. The 2022 edition replaced the original news translation task with a "general" translation task covering various domains, including news, social, conversational, and ecommerce, among others. This task alone received 185 submissions for the 21 translation directions prepared by the organizers: Czech↔English (cs-en), Czech↔Ukrainian (cs-uk), German↔English (de-en), French↔German (fr-de), English→Croatian (en-hr), English↔Japanese (en-ja), English↔Livonian (en-liv), English↔Russian (en-ru), Russian↔Yakut (ru-sah), English↔Ukrainian (en-uk), and English↔Chinese (en-zh).