EuroLLM-9B: Technical Report

Martins, Pedro Henrique; Alves, João; Fernandes, Patrick; Guerreiro, Nuno M.; Rei, Ricardo; Farajian, Amin; Klimaszewski, Mateusz; Alves, Duarte M.; Pombal, José; Boizard, Nicolas; Faysse, Manuel; Colombo, Pierre; Yvon, François; Haddow, Barry; de Souza, José G. C.; Birch, Alexandra; Martins, André F. T.

arXiv.org Artificial Intelligence 

This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.

Large language models (LLMs) have emerged as key drivers of progress in natural language processing (NLP) and artificial intelligence (AI), with notable examples including OpenAI's GPT series (OpenAI et al., 2024), Anthropic's Claude (Anthropic, 2023) or Google's Gemini (Google et al., 2025). LLMs are first pre-trained on vast amounts of unlabelled data relying on a self-supervised task (e.g., next word prediction or missing word prediction). This process enables the model to acquire knowledge, to develop strong language understanding and generation skills, and to perform various downstream tasks, often leveraging in-context learning techniques.
