Fernandes, Patrick
EuroBERT: Scaling Multilingual Encoders for European Languages
Boizard, Nicolas, Gisserot-Boukhlef, Hippolyte, Alves, Duarte M., Martins, André, Hammal, Ayoub, Corro, Caio, Hudelot, Céline, Malherbe, Emmanuel, Malaboeuf, Etienne, Jourdan, Fanny, Hautreux, Gabriel, Alves, João, El-Haddad, Kevin, Faysse, Manuel, Peyrard, Maxime, Guerreiro, Nuno M., Fernandes, Patrick, Rei, Ricardo, Colombo, Pierre
Many important tasks in Natural Language Processing (NLP), including information retrieval, classification, or regression, are built upon general-purpose vector representations. These representations are traditionally obtained from bidirectional encoder models, which aggregate information from the left and right contexts of each token (Devlin et al., 2019; Conneau et al., 2020; He et al., 2023). In contrast, recent advances in generative modeling have shifted the research community's attention towards unidirectional architectures (Bai et al., 2023; Llama Team, 2024; OLMo et al., 2025). Notably, these efforts have identified several key performance drivers that span architectural advances, data improvements, and increased scale. Yet, despite no apparent barrier to transferring these insights to bidirectional architectures, little effort has been devoted towards this objective, forcing practitioners to depend on outdated models. In this paper, we introduce a refreshed recipe for training general-purpose multilingual encoders, resulting in the EuroBERT family. Drawing inspiration from recent progress in decoder models, our models feature an updated architecture (§2.1), and are trained on a 5T-token multilingual dataset, covering widely spoken European and global languages…
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
Liu, Emmy, Bertsch, Amanda, Sutawika, Lintang, Tjuatja, Lindia, Fernandes, Patrick, Marinov, Lara, Chen, Michael, Singhal, Shreya, Lawrence, Carolin, Raghunathan, Aditi, Gashteovski, Kiril, Neubig, Graham
Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the better performance of some architectural decisions, such as choosing rotary over learned positional embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.
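To make the idea of predicting downstream performance from design features concrete, here is a minimal sketch (not the paper's code) that fits a regressor on scale-only features versus scale plus design features and compares cross-validated fit. The feature names, values, and model choice are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Each row describes one pretrained model; columns are hypothetical features:
# [log10(params), log10(training tokens), code fraction, uses rotary embeddings]
X_full = np.array([
    [9.1, 11.5, 0.15, 1],
    [9.8, 12.0, 0.05, 1],
    [8.9, 11.2, 0.25, 0],
    [10.1, 12.3, 0.10, 1],
    [9.4, 11.8, 0.00, 0],
    [9.6, 12.1, 0.20, 1],
])
y = np.array([0.42, 0.55, 0.40, 0.63, 0.45, 0.58])  # toy downstream accuracies

X_scale_only = X_full[:, :2]  # parameters and tokens only

for name, X in [("scale only", X_scale_only), ("scale + design features", X_full)]:
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    r2 = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"{name}: mean cross-validated R^2 = {r2:.3f}")

With real data, the comparison between the two feature sets is what quantifies how much design decisions add beyond scale.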
A Context-aware Framework for Translation-mediated Conversations
Pombal, José, Agrawal, Sweta, Fernandes, Patrick, Zaranis, Emmanouil, Martins, André F. T.
Effective communication is fundamental to any interaction, yet challenges arise when participants do not share a common language. Automatic translation systems offer a powerful solution to bridge language barriers in such scenarios, but they introduce errors that can lead to misunderstandings and conversation breakdown. A key issue is that current systems fail to incorporate the rich contextual information necessary to resolve ambiguities and omitted details, resulting in literal, inappropriate, or misaligned translations. In this work, we present a framework to improve large language model-based translation systems by incorporating contextual information in bilingual conversational settings. During training, we leverage context-augmented parallel data, which allows the model to generate translations sensitive to conversational history. During inference, we perform quality-aware decoding with context-aware metrics to select the optimal translation from a pool of candidates. We validate both components of our framework on two task-oriented domains: customer chat and user-assistant interaction. Across both settings, our framework consistently results in better translations than state-of-the-art systems like GPT-4o and TowerInstruct, as measured by multiple automatic translation quality metrics on several language pairs. We also show that the resulting model leverages context in an intended and interpretable way, improving consistency between the conveyed message and the generated translations.
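The two components described above, context-augmented inputs at training/inference time and quality-aware selection among candidates, can be sketched as follows. This is an illustrative outline under assumed interfaces: translate and score_with_context are hypothetical stand-ins for an LLM translation call and a context-aware quality metric, not the framework's actual API.

from typing import Callable, List

def build_context_augmented_source(history: List[str], source: str) -> str:
    """Prepend the conversational history to the current source turn."""
    context = "\n".join(history)
    return f"Context:\n{context}\n\nTranslate the next turn:\n{source}"

def quality_aware_decode(
    source: str,
    history: List[str],
    translate: Callable[[str], List[str]],                 # returns N candidate translations
    score_with_context: Callable[[str, str, List[str]], float],  # (source, candidate, history) -> score
) -> str:
    """Generate candidates from the context-augmented input, then return the
    candidate that the context-aware metric scores highest."""
    prompt = build_context_augmented_source(history, source)
    candidates = translate(prompt)
    return max(candidates, key=lambda c: score_with_context(source, c, history))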
Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings
Ramos, Miguel Moura, Almeida, Tomás, Vareta, Daniel, Azevedo, Filipe, Agrawal, Sweta, Fernandes, Patrick, Martins, André F. T.
Reinforcement learning (RL) has been proven to be an effective and robust method for training neural machine translation systems, especially when paired with powerful reward models that accurately assess translation quality. However, most research has focused on RL methods that use sentence-level feedback, which leads to inefficient learning signals due to the reward sparsity problem -- the model receives a single score for the entire sentence. To address this, we introduce a novel approach that leverages fine-grained token-level reward mechanisms with RL methods. We use xCOMET, a state-of-the-art quality estimation system, as our token-level reward model. xCOMET provides detailed feedback by predicting fine-grained error spans and their severity given source-translation pairs. We conduct experiments on small and large translation datasets to compare the impact of sentence-level versus fine-grained reward signals on translation quality. Our results show that training with token-level rewards improves translation quality across language pairs over baselines according to automatic and human evaluation. Furthermore, token-level reward optimization also improves training stability, evidenced by a steady increase in mean rewards over training epochs.
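As a rough illustration of how predicted error spans can be turned into a dense, token-level reward signal, consider the sketch below. The severity-to-penalty mapping, the span format, and the base reward are assumptions for illustration, not the exact scheme used in the paper.

from typing import List, Tuple

SEVERITY_PENALTY = {"minor": -0.1, "major": -0.5, "critical": -1.0}  # assumed values

def token_rewards(
    tokens: List[str],
    error_spans: List[Tuple[int, int, str]],  # (start_token, end_token, severity)
    base_reward: float = 0.1,
) -> List[float]:
    """Assign a small positive reward to every token, then penalize tokens that
    fall inside a predicted error span according to the span's severity."""
    rewards = [base_reward] * len(tokens)
    for start, end, severity in error_spans:
        penalty = SEVERITY_PENALTY.get(severity, 0.0)
        for i in range(start, min(end, len(tokens))):
            rewards[i] = penalty
    return rewards

# Example: the third and fourth tokens fall inside a "major" error span.
print(token_rewards(["the", "cat", "sat", "down", "."], [(2, 4, "major")]))

Each token then carries its own reward during policy optimization, rather than the whole sentence sharing one scalar.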
Better Instruction-Following Through Minimum Bayes Risk
Wu, Ian, Fernandes, Patrick, Bertsch, Amanda, Kim, Seungone, Pakazad, Sina, Neubig, Graham
General-purpose LLM judges capable of human-level evaluation provide not only a scalable and accurate way of evaluating instruction-following LLMs but also new avenues for supervising and improving their performance. One promising way of leveraging LLM judges for supervision is through Minimum Bayes Risk (MBR) decoding, which uses a reference-based evaluator to select a high-quality output from amongst a set of candidate outputs. In the first part of this work, we explore using MBR decoding as a method for improving the test-time performance of instruction-following LLMs. We find that MBR decoding with reference-based LLM judges substantially improves over greedy decoding, best-of-N decoding with reference-free judges, and MBR decoding with lexical and embedding-based metrics on AlpacaEval and MT-Bench. These gains are consistent across LLMs with up to 70B parameters, demonstrating that smaller LLM judges can be used to supervise much larger LLMs. Then, seeking to retain the improvements from MBR decoding while mitigating additional test-time costs, we explore iterative self-training on MBR-decoded outputs. We find that self-training using Direct Preference Optimisation leads to significant performance gains, such that the self-trained models with greedy decoding generally match and sometimes exceed the performance of their base models with MBR decoding.
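A minimal sketch of MBR decoding with a judge as the utility function is shown below. judge_score(candidate, reference) is a hypothetical callable returning a scalar quality score when reference is treated as a pseudo-reference; in the work above this role is played by a reference-based LLM judge.

from typing import Callable, List

def mbr_select(candidates: List[str], judge_score: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest expected utility, estimated by
    averaging its judge score against every other candidate as pseudo-reference."""
    def expected_utility(cand: str) -> float:
        others = [ref for ref in candidates if ref is not cand]
        return sum(judge_score(cand, ref) for ref in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

The self-training stage described above can then reuse the selected outputs (or chosen/rejected pairs derived from the utility ranking) as preference data.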
Modeling User Preferences with Automatic Metrics: Creating a High-Quality Preference Dataset for Machine Translation
Agrawal, Sweta, de Souza, José G. C., Rei, Ricardo, Farinhas, António, Faria, Gonçalo, Fernandes, Patrick, Guerreiro, Nuno M., Martins, André
Alignment with human preferences is an important step in developing accurate and safe large language models. Machine translation (MT) is no exception: better handling of language nuances and context-specific variations leads to improved quality. However, preference data based on human feedback can be very expensive to obtain and curate at a large scale. Automatic metrics, on the other hand, can induce preferences, but they might not match human expectations perfectly. In this paper, we propose an approach that leverages the best of both worlds. We first collect sentence-level quality assessments from professional linguists on translations generated by multiple high-quality MT systems and evaluate the ability of current automatic metrics to recover these preferences. We then use this analysis to curate a new dataset, MT-Pref (metric-induced translation preference), which comprises 18k instances covering 18 language directions, using texts sourced from multiple domains post-2022. We show that aligning TOWER models on MT-Pref significantly improves translation quality on the WMT23 and FLORES benchmarks.
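One way to picture metric-induced preference construction is the sketch below, which derives a chosen/rejected pair from metric scores over several systems' translations of the same source. metric_score is a hypothetical callable (e.g. wrapping a learned quality metric), and the margin threshold is an illustrative assumption rather than the dataset's actual filtering rule.

from typing import Callable, Dict, List, Optional, Tuple

def induce_preference_pair(
    source: str,
    translations: Dict[str, str],               # system name -> translation
    metric_score: Callable[[str, str], float],  # (source, translation) -> score
    min_margin: float = 0.05,                   # assumed margin required to keep a pair
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) translations if the metric separates the best
    and worst candidates by at least min_margin; otherwise return None."""
    scored: List[Tuple[float, str]] = sorted(
        (metric_score(source, t), t) for t in translations.values()
    )
    worst_score, worst = scored[0]
    best_score, best = scored[-1]
    if best_score - worst_score < min_margin:
        return None  # metric cannot confidently distinguish the candidates
    return best, worst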
EuroLLM: Multilingual Language Models for Europe
Martins, Pedro Henrique, Fernandes, Patrick, Alves, João, Guerreiro, Nuno M., Rei, Ricardo, Alves, Duarte M., Pombal, José, Farajian, Amin, Faysse, Manuel, Klimaszewski, Mateusz, Colombo, Pierre, Haddow, Barry, de Souza, José G. C., Birch, Alexandra, Martins, André F. T.
Open-weight LLMs have seen significant quality improvements, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation.
Is Context Helpful for Chat Translation Evaluation?
Agrawal, Sweta, Farajian, Amin, Fernandes, Patrick, Rei, Ricardo, Martins, André F. T.
Despite the recent success of automatic metrics for assessing translation quality, their application in evaluating the quality of machine-translated chats has been limited. Unlike more structured texts like news, chat conversations are often unstructured, short, and heavily reliant on contextual information. This raises questions about the reliability of existing sentence-level metrics in this domain, as well as about the role of context in assessing translation quality. Motivated by this, we conduct a meta-evaluation of existing sentence-level automatic metrics, primarily designed for structured domains such as news, to assess the quality of machine-translated chats. We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings. We then investigate how incorporating conversational contextual information in these metrics affects their performance. Our findings show that augmenting neural learned metrics with contextual information helps improve correlation with human judgments in the reference-free scenario and when evaluating translations in out-of-English settings. Finally, we propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model (LLM) and further validate that adding context helps even for LLM-based evaluation metrics.
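To illustrate what supplying bilingual context to an LLM-based, MQM-style evaluator might look like, here is a rough sketch in the spirit of Context-MQM. The prompt wording and the llm callable are assumptions for illustration, not the template from the paper.

from typing import Callable, List, Tuple

def context_mqm_prompt(
    context: List[Tuple[str, str]],   # previous (source, translation) turns
    source: str,
    translation: str,
) -> str:
    """Build an evaluation prompt that includes the bilingual conversation history."""
    history = "\n".join(f"SRC: {s}\nTGT: {t}" for s, t in context)
    return (
        "You are evaluating a chat translation.\n"
        f"Conversation so far (bilingual):\n{history}\n\n"
        f"Current source: {source}\n"
        f"Current translation: {translation}\n"
        "List MQM errors (category, severity, span) and give a 0-100 quality score."
    )

def context_mqm_score(llm: Callable[[str], str], context, source, translation) -> str:
    """Send the context-augmented prompt to an LLM judge and return its raw output."""
    return llm(context_mqm_prompt(context, source, translation))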
Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
Alves, Duarte M., Pombal, José, Guerreiro, Nuno M., Martins, Pedro H., Alves, João, Farajian, Amin, Peters, Ben, Rei, Ricardo, Fernandes, Patrick, Agrawal, Sweta, Colombo, Pierre, de Souza, José G. C., Martins, André F. T.
Many important tasks within multilingual NLP, such as quality estimation, automatic post-editing, or grammatical error correction, involve analyzing, generating, or operating with text in multiple languages, and are relevant to various translation workflows -- we call these translation-related tasks. Recently, general-purpose large language models (LLMs) challenged the paradigm of per-task dedicated systems, achieving state-of-the-art performance on several recent WMT shared tasks (Kocmi et al., 2023; Freitag et al., 2023; Neves et al., 2023). Unfortunately, strong capabilities for multiple translation-related tasks have so far been exhibited by closed LLMs only (Hendy et al., 2023; Kocmi & Federmann, 2023; Fernandes et al., 2023; Raunak et al., 2023). Perhaps because most open LLMs are English-centric, approaches leveraging these models still lag behind, having thus far achieved competitive results only when specializing on a single task (Xu et al., 2024a; 2023; Iyer et al., 2023). In this paper, we bridge this gap with a detailed recipe to develop an LLM for multiple translation-related tasks. Our approach, illustrated in Figure 1 and inspired by Xu et al. …
CroissantLLM: A Truly Bilingual French-English Language Model
Faysse, Manuel, Fernandes, Patrick, Guerreiro, Nuno M., Loison, António, Alves, Duarte M., Corro, Caio, Boizard, Nicolas, Alves, João, Rei, Ricardo, Martins, Pedro H., Casademunt, Antoni Bigata, Yvon, François, Martins, André F. T., Viaud, Gautier, Hudelot, Céline, Colombo, Pierre
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models and strong translation models. We evaluate our model through the FMTI framework, and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.
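A toy sketch of what a 1:1 English-to-French pretraining mix can look like at the data-sampling level is given below: batches are drawn so that, in expectation, half the documents come from each language. The dataset handles and sampling interface are hypothetical, not CroissantLLM's actual pipeline.

import random

def sample_mixed_batch(en_docs, fr_docs, batch_size, en_ratio=0.5, seed=None):
    """Draw a batch whose documents are English with probability en_ratio
    (0.5 gives the 1:1 English-to-French ratio described above)."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        pool = en_docs if rng.random() < en_ratio else fr_docs
        batch.append(rng.choice(pool))
    return batch

# Example with placeholder documents:
en = ["An English document.", "Another English document."]
fr = ["Un document en français.", "Un autre document en français."]
print(sample_mixed_batch(en, fr, batch_size=4, seed=0))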