X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale

Xu, Haoran, Murray, Kenton, Koehn, Philipp, Hoang, Hieu, Eriguchi, Akiko, Khayrallah, Huda

arXiv.org Artificial Intelligence 

Large language models (LLMs) have achieved remarkable success across various NLP tasks, yet their focus has predominantly been on English due to English-centric pre-training and limited multilingual data. While some multilingual LLMs claim to support hundreds of languages, these models often fail to provide high-quality responses for mid- and low-resource languages, leading to imbalanced performance heavily skewed in favor of high-resource languages like English and Chinese. We prioritize quality over scaling the number of languages, with a focus on the multilingual machine translation task, and introduce X-ALMA, a model designed to ensure top-tier performance across 50 diverse languages, regardless of their resource levels. This is achieved by a plug-and-play language-specific module architecture that prevents language conflicts during training and a carefully designed training regimen with novel optimization methods to maximize translation performance. At the final stage of the training regimen, our proposed Adaptive-Rejection Preference Optimization (ARPO) surpasses existing preference optimization methods in translation tasks.

Large language models (LLMs) such as the GPT series (Brown et al., 2020; OpenAI, 2023), Mistral (Jiang et al., 2023), the LLaMA series (Touvron et al., 2023a;b; Dubey et al., 2024), and the Gemma series (Team et al., 2024a;b), among others, have demonstrated impressive performance across various NLP tasks. However, the efficacy of LLMs has primarily been evaluated on English tasks, with their multilingual capabilities receiving less attention because the models are predominantly pre-trained on English and multilingual data is scarce. Recently, there has been a shift towards multilingual studies in LLMs. For instance, LLaMA 3 and 3.1 (Dubey et al., 2024) expand the vocabulary from 32K to 128K and pre-train on multilingual texts; Üstün et al. (2024) have introduced Aya-101, a multilingual generative model supporting 101 languages; and BigTranslate (Yang et al., 2023) and LLaMAX (Lu et al., 2024) scale LLM-based multilingual translation models to over 100 languages. Despite the increased language support in LLMs, their performance across most languages falls short of practical application expectations, especially for mid- and low-resource languages (weakness 1).

Work done during an internship at Microsoft.