Farajian, Amin
EuroLLM: Multilingual Language Models for Europe
Martins, Pedro Henrique, Fernandes, Patrick, Alves, João, Guerreiro, Nuno M., Rei, Ricardo, Alves, Duarte M., Pombal, José, Farajian, Amin, Faysse, Manuel, Klimaszewski, Mateusz, Colombo, Pierre, Haddow, Barry, de Souza, José G. C., Birch, Alexandra, Martins, André F. T.
The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on general multilingual benchmarks and on machine translation.
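As a quick illustration of using the released checkpoints, the sketch below loads the instruct model with Hugging Face transformers and asks for a translation into an EU language. The repository id and the presence of a chat template are assumptions based on typical open-weight releases, not details given in the abstract.

```python
# Minimal sketch of querying the released instruct model with Hugging Face
# transformers. The repo id below is an assumption about the release name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-1.7B-Instruct"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Ask for a translation into one of the covered EU languages.
messages = [{"role": "user",
             "content": "Translate to Portuguese: The weather is nice today."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```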
Is Context Helpful for Chat Translation Evaluation?
Agrawal, Sweta, Farajian, Amin, Fernandes, Patrick, Rei, Ricardo, Martins, André F. T.
Despite the recent success of automatic metrics for assessing translation quality, their application to evaluating the quality of machine-translated chats has been limited. Unlike more structured texts such as news, chat conversations are often unstructured, short, and heavily reliant on contextual information. This raises questions about the reliability of existing sentence-level metrics in this domain, as well as about the role of context in assessing translation quality. Motivated by this, we conduct a meta-evaluation of existing sentence-level automatic metrics, primarily designed for structured domains such as news, to assess the quality of machine-translated chats. We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings. We then investigate how incorporating conversational contextual information into these metrics affects their performance. Our findings show that augmenting learned neural metrics with contextual information helps improve correlation with human judgments in the reference-free scenario and when evaluating translations in out-of-English settings. Finally, we propose a new evaluation metric, Context-MQM, which utilizes bilingual context with a large language model (LLM), and further validate that adding context helps even for LLM-based evaluation metrics.
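To make the idea concrete, here is a hypothetical sketch of how preceding bilingual turns could be folded into an LLM prompt for MQM-style scoring, in the spirit of Context-MQM. The prompt wording, the score scale, and the query_llm hook are illustrative assumptions, not the paper's actual template.

```python
# Hypothetical sketch of a context-augmented, MQM-style evaluation prompt.
# The prompt text and the query_llm callable are illustrative assumptions.
from typing import Callable

def build_context_mqm_prompt(context: list[tuple[str, str]],
                             source: str, translation: str) -> str:
    """Prepend preceding (source, target) turns as bilingual context."""
    lines = ["You are evaluating the translation of one turn in a chat.",
             "Conversation so far (source ||| target):"]
    lines += [f"{src} ||| {tgt}" for src, tgt in context]
    lines += [f"Source turn: {source}",
              f"Translated turn: {translation}",
              "List MQM errors (category, severity) and give a 0-100 score."]
    return "\n".join(lines)

def score_turn(query_llm: Callable[[str], str],
               context: list[tuple[str, str]],
               source: str, translation: str) -> str:
    # query_llm is any function that sends a prompt to an LLM and
    # returns its text response.
    return query_llm(build_context_mqm_prompt(context, source, translation))
```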
Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
Alves, Duarte M., Pombal, José, Guerreiro, Nuno M., Martins, Pedro H., Alves, João, Farajian, Amin, Peters, Ben, Rei, Ricardo, Fernandes, Patrick, Agrawal, Sweta, Colombo, Pierre, de Souza, José G. C., Martins, André F. T.
Many important tasks within multilingual NLP, such as quality estimation, automatic post-editing, or grammatical error correction, involve analyzing, generating, or operating with text in multiple languages, and are relevant to various translation workflows -- we call these translation-related tasks. Recently, general-purpose large language models (LLMs) have challenged the paradigm of per-task dedicated systems, achieving state-of-the-art performance on several recent WMT shared tasks (Kocmi et al., 2023; Freitag et al., 2023; Neves et al., 2023). Unfortunately, strong capabilities across multiple translation-related tasks have so far been exhibited only by closed LLMs (Hendy et al., 2023; Kocmi & Federmann, 2023; Fernandes et al., 2023; Raunak et al., 2023). Perhaps because most open LLMs are English-centric, approaches leveraging these models still lag behind, having thus far achieved competitive results only when specializing in a single task (Xu et al., 2024a; 2023; Iyer et al., 2023). In this paper, we bridge this gap with a detailed recipe to develop an LLM for multiple translation-related tasks. Our approach, illustrated in Figure 1 and inspired by Xu et al. (2024a), comprises two steps: continued pretraining on a mixture of monolingual and parallel data, followed by finetuning on instructions relevant for translation-related tasks.
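For context, a minimal sketch of prompting an instruction-tuned Tower checkpoint for translation with the transformers pipeline API follows. The repository id Unbabel/TowerInstruct-7B-v0.2 and the zero-shot prompt format are assumptions about the public release rather than details from this excerpt.

```python
# Minimal sketch: zero-shot translation with an instruction-tuned Tower
# checkpoint via the transformers pipeline API. The repo id below is an
# assumption about the released model name.
import torch
from transformers import pipeline

pipe = pipeline("text-generation",
                model="Unbabel/TowerInstruct-7B-v0.2",  # assumed repo id
                torch_dtype=torch.bfloat16, device_map="auto")

messages = [{
    "role": "user",
    "content": ("Translate the following text from English into German.\n"
                "English: Open models are catching up on translation tasks.\n"
                "German:"),
}]
# Render the model's chat template, then generate greedily.
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False,
                                            add_generation_prompt=True)
out = pipe(prompt, max_new_tokens=128, do_sample=False, return_full_text=False)
print(out[0]["generated_text"])
```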