Oepen, Stephan
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies
Burchell, Laurie, de Gibert, Ona, Arefyev, Nikolay, Aulamo, Mikko, Bañón, Marta, Chen, Pinzhen, Fedorova, Mariia, Guillou, Liane, Haddow, Barry, Hajič, Jan, Helcl, Jindřich, Henriksson, Erik, Klimaszewski, Mateusz, Komulainen, Ville, Kutuzov, Andrey, Kytöniemi, Joona, Laippala, Veronika, Mæhlum, Petter, Malik, Bhavitvya, Mehryary, Farrokh, Mikhailov, Vladislav, Moghe, Nikita, Myntti, Amanda, O'Brien, Dayyán, Oepen, Stephan, Pal, Proyag, Piha, Jousia, Pyysalo, Sampo, Ramírez-Sánchez, Gema, Samuel, David, Stepachev, Pavel, Tiedemann, Jörg, Variš, Dušan, Vojtěchová, Tereza, Zaragoza-Bernabeu, Jaume
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual corpora, both monolingual and parallel. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
de la Rosa, Javier, Mikhailov, Vladislav, Zhang, Lemei, Wetjen, Freddy, Samuel, David, Liu, Peng, Braaten, Rolv-Arild, Mæhlum, Petter, Birkenes, Magnus Breder, Kutuzov, Andrey, Enstad, Tita, Brygfjeld, Svein Arne, Gulla, Jon Atle, Oepen, Stephan, Velldal, Erik, Østgulen, Wilfred, Øvrelid, Lilja, Myhre, Aslak Sira
The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian, along with the results of applying it. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works possibly lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.
A New Massive Multilingual Dataset for High-Performance Language Technologies
de Gibert, Ona, Nail, Graeme, Arefyev, Nikolay, Bañón, Marta, van der Linde, Jelmer, Ji, Shaoxiong, Zaragoza-Bernabeu, Jaume, Aulamo, Mikko, Ramírez-Sánchez, Gema, Kutuzov, Andrey, Pyysalo, Sampo, Oepen, Stephan, Tiedemann, Jörg
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated at the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a valuable resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.