An Expanded Massive Multilingual Dataset for High-Performance Language Technologies
Burchell, Laurie, de Gibert, Ona, Arefyev, Nikolay, Aulamo, Mikko, Bañón, Marta, Chen, Pinzhen, Fedorova, Mariia, Guillou, Liane, Haddow, Barry, Hajič, Jan, Helcl, Jindřich, Henriksson, Erik, Klimaszewski, Mateusz, Komulainen, Ville, Kutuzov, Andrey, Kytöniemi, Joona, Laippala, Veronika, Mæhlum, Petter, Malik, Bhavitvya, Mehryary, Farrokh, Mikhailov, Vladislav, Moghe, Nikita, Myntti, Amanda, O'Brien, Dayyán, Oepen, Stephan, Pal, Proyag, Piha, Jousia, Pyysalo, Sampo, Ramírez-Sánchez, Gema, Samuel, David, Stepachev, Pavel, Tiedemann, Jörg, Variš, Dušan, Vojtěchová, Tereza, Zaragoza-Bernabeu, Jaume
arXiv.org Artificial Intelligence
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
Mar-14-2025
- Country:
- Africa (0.04)
- Asia
- Afghanistan > Parwan Province
- Charikar (0.04)
- Malaysia (0.04)
- Kazakhstan (0.04)
- Azerbaijan (0.04)
- Vietnam (0.04)
- Japan (0.04)
- Middle East
- Iran (0.04)
- Israel (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Russia (0.04)
- China (0.04)
- South Korea (0.04)
- Kyrgyzstan (0.04)
- Sri Lanka (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Singapore (0.04)
- India (0.04)
- Europe
- Belarus (0.04)
- North Macedonia (0.04)
- Estonia > Tartu County
- Tartu (0.04)
- Hungary (0.04)
- Portugal (0.04)
- Moldova (0.04)
- Ireland (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Faroe Islands > Streymoy
- Tórshavn (0.04)
- Czechia (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Ukraine (0.04)
- Romania (0.04)
- Norway > Eastern Norway
- Oslo (0.04)
- Lithuania (0.04)
- Latvia (0.04)
- Middle East
- Malta (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Russia (0.04)
- Italy
- Calabria > Catanzaro Province
- Catanzaro (0.04)
- Tuscany > Florence (0.04)
- Serbia (0.04)
- Slovenia (0.04)
- Slovakia (0.04)
- United Kingdom > England
- South Yorkshire > Sheffield (0.04)
- Finland
- Southwest Finland > Turku (0.04)
- Uusimaa > Helsinki (0.04)
- Switzerland (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Iceland (0.04)
- Netherlands (0.04)
- Spain (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Germany (0.04)
- Poland (0.04)
- Bulgaria (0.04)
- Austria (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- Florida > Miami-Dade County
- Miami (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Education (0.46)
- Information Technology (0.67)
- Leisure & Entertainment > Games (0.45)
- Media > News (0.46)
- Technology: