Self-supervised learning of speech representations with Dutch archival data

Vaessen, Nik, Ordelman, Roeland, van Leeuwen, David A.

Jul-9-2025–arXiv.org Artificial Intelligence

This paper explores the use of Dutch archival television broadcast data for self-supervised learning of speech foundation models, specifically wav2vec 2.0. First, we study data quality assumptions for pre-training and show how music, noise, and speaker overlap affect SSL convergence and downstream fine-tuning performance. Second, we explore effective pre-processing strategies, using Whisper and WhisperX, to convert the noisy broadcast dataset into a high-quality dataset for pre-training. Third, we compare monolingual and multilingual pre-training with equivalent amounts of data and show that monolingual pre-training is more robust to out-of-domain data. Lastly, we achieve a state-of-the-art LARGE wav2vec 2.0 model for the Dutch language by continuing pre-training from a wav2vec 2.0 XLS-R model checkpoint with our 55k-hour archival dataset.

artificial intelligence, inductive learning, machine learning, (15 more...)

Country:
- Asia > Japan (0.04)
- Europe
  - Netherlands (0.05)
  - Hungary > Budapest
    - Budapest (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)

Genre:
- Research Report > Experimental Study (0.34)

Industry:
- Media (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (1.00)
  - Machine Learning > Inductive Learning (0.71)
