An Expanded Massive Multilingual Dataset for High-Performance Language Technologies
Burchell, Laurie; de Gibert, Ona; Arefyev, Nikolay; Aulamo, Mikko; Bañón, Marta; Chen, Pinzhen; Fedorova, Mariia; Guillou, Liane; Haddow, Barry; Hajič, Jan; Helcl, Jindřich; Henriksson, Erik; Klimaszewski, Mateusz; Komulainen, Ville; Kutuzov, Andrey; Kytöniemi, Joona; Laippala, Veronika; Mæhlum, Petter; Malik, Bhavitvya; Mehryary, Farrokh; Mikhailov, Vladislav; Moghe, Nikita; Myntti, Amanda; O'Brien, Dayyán; Oepen, Stephan; Pal, Proyag; Piha, Jousia; Pyysalo, Sampo; Ramírez-Sánchez, Gema; Samuel, David; Stepachev, Pavel; Tiedemann, Jörg; Variš, Dušan; Vojtěchová, Tereza; Zaragoza-Bernabeu, Jaume
arXiv.org Artificial Intelligence
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
Mar-14-2025
- Country:
- Africa (0.04)
- South America
- Oceania
- North America
- United States
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Florida > Miami-Dade County
- Miami (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- Canada > Ontario
- Toronto (0.04)
- Europe
- Russia (0.04)
- Switzerland (0.04)
- Serbia (0.04)
- Latvia (0.04)
- Iceland (0.04)
- Slovakia (0.04)
- Lithuania (0.04)
- Czechia (0.04)
- North Macedonia (0.04)
- Austria (0.04)
- Bulgaria (0.04)
- Poland (0.04)
- Germany (0.04)
- Spain (0.04)
- Netherlands (0.04)
- Slovenia (0.04)
- Romania (0.04)
- Ukraine (0.04)
- Ireland (0.04)
- Moldova (0.04)
- Portugal (0.04)
- Hungary (0.04)
- Belarus (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Finland
- Uusimaa > Helsinki (0.04)
- Southwest Finland > Turku (0.04)
- United Kingdom > England
- South Yorkshire > Sheffield (0.04)
- Italy
- Tuscany > Florence (0.04)
- Calabria > Catanzaro Province
- Catanzaro (0.04)
- Middle East
- Malta (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Norway > Eastern Norway
- Oslo (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Faroe Islands > Streymoy
- Tórshavn (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Estonia > Tartu County
- Tartu (0.04)
- Asia
- Russia (0.04)
- India (0.04)
- Kazakhstan (0.04)
- Singapore (0.04)
- Sri Lanka (0.04)
- Kyrgyzstan (0.04)
- South Korea (0.04)
- China (0.04)
- Japan (0.04)
- Vietnam (0.04)
- Azerbaijan (0.04)
- Malaysia (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Middle East
- Iran (0.04)
- Israel (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Afghanistan > Parwan Province
- Charikar (0.04)
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Information Technology (0.67)
- Education (0.46)
- Media > News (0.46)
- Leisure & Entertainment > Games (0.45)
- Technology: