A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models

Haoran Xu, Young Jin Kim, Amr Sharaf, Hany Hassan Awadalla

Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. However, these advances have not been reflected in the translation task, especially for LLMs of moderate size (i.e., 7B or 13B parameters), which still lag behind conventional supervised encoder-decoder translation models. Previous studies have attempted to improve the translation capabilities of such LLMs, but their gains have been limited. In this study, we propose a novel fine-tuning approach for LLMs that is specifically designed for the translation task, eliminating the need for the abundant parallel data that traditional translation models usually depend on. Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data, followed by fine-tuning on a small set of high-quality parallel data. We introduce the LLM developed through this strategy as Advanced Language Model-based trAnslator (ALMA). Using LLaMA-2 (Touvron et al., 2023b) as the underlying model, ALMA achieves an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance across 10 translation directions from the WMT'21 (2 directions) and WMT'22 (8 directions) test datasets. This performance is significantly better than all prior work and is even superior to the NLLB-54B model (NLLB TEAM et al., 2022) and GPT-3.5-text-davinci-003, despite ALMA having only 7B or 13B parameters. The method establishes the foundation for a novel training paradigm in machine translation.

Generative (decoder-only) large language models (LLMs) such as GPT models (Brown et al., 2020; OpenAI, 2023), PaLM (Chowdhery et al., 2022), OPT (Zhang et al., 2022), BLOOM (Scao et al., 2022), LLaMA (Touvron et al., 2023a;b), and others have exhibited remarkable capabilities across various NLP tasks. However, for the translation task, only very large models such as GPT-3.5 and GPT-4 can rival the supervised encoder-decoder state-of-the-art (SoTA) models like NLLB (NLLB TEAM et al., 2022), and even they still fall short when translating low-resource languages (Hendy et al., 2023; Jiao et al., 2023). The discrepancy becomes more evident when comparing other LLMs with traditional translation models (Zhu et al., 2023a). The gap is even larger for smaller LLMs; for example, XGLM (Lin et al., 2021), with a parameter size of 7B, lags behind the NLLB-1.3B