Stepachev, Pavel
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies
Burchell, Laurie, de Gibert, Ona, Arefyev, Nikolay, Aulamo, Mikko, Bañón, Marta, Chen, Pinzhen, Fedorova, Mariia, Guillou, Liane, Haddow, Barry, Hajič, Jan, Helcl, Jindřich, Henriksson, Erik, Klimaszewski, Mateusz, Komulainen, Ville, Kutuzov, Andrey, Kytöniemi, Joona, Laippala, Veronika, Mæhlum, Petter, Malik, Bhavitvya, Mehryary, Farrokh, Mikhailov, Vladislav, Moghe, Nikita, Myntti, Amanda, O'Brien, Dayyán, Oepen, Stephan, Pal, Proyag, Piha, Jousia, Pyysalo, Sampo, Ramírez-Sánchez, Gema, Samuel, David, Stepachev, Pavel, Tiedemann, Jörg, Variš, Dušan, Vojtěchová, Tereza, Zaragoza-Bernabeu, Jaume
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality monolingual and parallel corpora spanning many languages. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models
Stepachev, Pavel, Chen, Pinzhen, Haddow, Barry
Large language models (LLMs) have started to play a vital role in modelling speech and text. To explore the best use of context and multiple systems' outputs for post-ASR speech emotion prediction, we study LLM prompting on a recent task named GenSEC. Our techniques include ASR transcript ranking, variable conversation context, and system output fusion. We show that the conversation context has diminishing returns and the metric used to select the transcript for prediction is crucial.
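The techniques named in the abstract (transcript ranking, variable conversation context, and system output fusion) can be sketched as prompt construction. This is a minimal illustration, not the paper's implementation: the scoring metric, turn counts, and prompt wording below are all assumptions for the sake of the example.

```python
# Hedged sketch of post-ASR emotion-prediction prompting.
# The metric and prompt template are illustrative assumptions,
# not the exact setup used in the GenSEC submission.

def rank_transcripts(hypotheses, score):
    """Rank ASR hypotheses for one utterance by a metric (higher is better)."""
    return sorted(hypotheses, key=score, reverse=True)

def build_prompt(history, hypotheses, score, n_context=3, n_fuse=2):
    """Fuse the last n_context conversation turns with the top n_fuse ASR outputs."""
    ranked = rank_transcripts(hypotheses, score)
    lines = ["Conversation so far:"]
    lines += [f"- {turn}" for turn in history[-n_context:]]
    lines.append("Candidate transcripts of the current utterance:")
    lines += [f"{i + 1}. {t}" for i, t in enumerate(ranked[:n_fuse])]
    lines.append("Predict the speaker's emotion in one word:")
    return "\n".join(lines)

# Toy metric: prefer longer (presumably more complete) transcripts.
prompt = build_prompt(
    history=["A: How was your day?", "B: Honestly, pretty rough."],
    hypotheses=["i just can't believe it", "i just cant believe", "believe it"],
    score=len,
)
print(prompt)
```

Varying `n_context` corresponds to the variable conversation context studied in the paper, and swapping `score` corresponds to the choice of transcript-selection metric that the abstract reports as crucial.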