LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens
Zebaze, Armel, Bawden, Rachel, Sagot, Benoît
arXiv.org Artificial Intelligence
Large reasoning models (LRMs) have led to new possibilities in terms of problem-solving, through the devising of a natural language thought process prior to answering a query. While their capabilities are well known across mathematics and coding tasks, their impact on the task of machine translation (MT) remains under-explored. In this work, we explore the benefits of the generation of intermediate tokens when performing MT across multiple language pairs of different levels of resourcedness and multiple setups. We find that "thinking tokens" do not help LRMs better perform MT. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators' practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step-by-step does not outperform standard input-output fine-tuning. Our findings underscore that the contribution of intermediate tokens during fine-tuning highly depends on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into "thinking" MT models.

Large Language Models (LLMs) are general-purpose problem solvers (Touvron et al., 2023; OpenAI et al., 2024; Dubey et al., 2024; Kimi Team et al., 2025). Their instruction-following capabilities help them carry out a wide variety of requests provided by users via text. Research on alignment, typically through Reinforcement Learning from Human Feedback (RLHF) (Askell et al., 2021; Bai et al., 2022; Ouyang et al., 2022; Rafailov et al., 2023; Lambert et al., 2025), has greatly contributed to improving the quality of LLMs' outputs. Recently, a new paradigm has emerged: to train LLMs to "think" in natural language before answering a query. OpenAI o1 and o3 (OpenAI, 2024), DeepSeek-R1 (DeepSeek-AI et al., 2025), Qwen3 (Yang et al., 2025), Claude 4 (Anthropic, 2025) and Gemini 2.5 (Gemini Team et al., 2025), inter alia, are instances of these Reasoning Models (RM) or Thinking Models (TM).
Oct-15-2025