EuroBERT: Scaling Multilingual Encoders for European Languages

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo

arXiv.org Artificial Intelligence 

Many important tasks in Natural Language Processing (NLP), including information retrieval, classification, and regression, are built upon general-purpose vector representations. These representations are traditionally obtained from bidirectional encoder models, which aggregate information from the left and right contexts of each token (Devlin et al., 2019; Conneau et al., 2020; He et al., 2023). In contrast, recent advances in generative modeling have shifted the research community's attention towards unidirectional architectures (Bai et al., 2023; Llama Team, 2024; OLMo et al., 2025). Notably, these efforts have identified several key performance drivers spanning architectural advances, data improvements, and increased scale. Yet, despite there being no apparent barrier to transferring these insights to bidirectional architectures, little effort has been devoted to this objective, forcing practitioners to depend on outdated models. In this paper, we introduce a refreshed recipe for training general-purpose multilingual encoders, resulting in the EuroBERT family. Drawing inspiration from recent progress in decoder models, our models feature an updated architecture (§2.1) and are trained on a 5T-token multilingual dataset covering widely spoken European and global languages,
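As a concrete illustration (not taken from the paper) of how such general-purpose vector representations are obtained from a bidirectional encoder, the sketch below mean-pools the token-level hidden states of a pretrained multilingual encoder using the Hugging Face transformers library. The choice of model checkpoint (xlm-roberta-base, i.e. the encoder of Conneau et al., 2020) and of mean pooling are illustrative assumptions, not the EuroBERT recipe itself.

```python
# Minimal sketch: deriving a sentence-level vector from a bidirectional encoder.
# Assumes the Hugging Face `transformers` library and PyTorch; the checkpoint
# below is an illustrative choice, not an EuroBERT model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

sentences = [
    "EuroBERT scales multilingual encoders.",
    "Les encodeurs bidirectionnels agrègent le contexte gauche et droit.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # Each token's hidden state mixes left and right context, since the
    # encoder attends bidirectionally over the whole sequence.
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)

# Mean-pool over real tokens only, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float() # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)  # e.g. torch.Size([2, 768]) for xlm-roberta-base
```

Pooled vectors of this kind are what downstream retrieval, classification, or regression heads consume, which is why the quality of the underlying encoder matters for all of these tasks.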