Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

Rangappa, Pradeep; Carofilis, Andres; Prakash, Jeena; Kumar, Shashi; Burdisso, Sergio; Madikeri, Srikanth; Villatoro-Tello, Esau; Sharma, Bidisha; Motlicek, Petr; Hacioglu, Kadri; Venkatesan, Shankar; Vyas, Saurabh; Stolcke, Andreas

Oct-6-2025, arXiv.org Artificial Intelligence

Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
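The abstract names CER analysis as one of the selection stages but gives no implementation details. As a rough illustration only, the sketch below keeps a segment when two ASR hypotheses for it agree closely at the character level. The `Segment` class, the `select_segments` helper, the 0.1 threshold, and the choice of treating the first hypothesis as the reference are all assumptions made for this example, not the authors' pipeline.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` measured against `reference`."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


@dataclass
class Segment:
    # Hypothetical container: one audio segment with two pseudo-labels.
    segment_id: str
    hyp_a: str  # e.g. output of an encoder-decoder model
    hyp_b: str  # e.g. output of a transducer model


def select_segments(segments, max_cer=0.1):
    """Keep segments whose two hypotheses disagree by at most `max_cer`.

    Assumption: hyp_a plays the role of the reference; the threshold is
    illustrative and would need tuning on held-out data.
    """
    return [s for s in segments if cer(s.hyp_a, s.hyp_b) <= max_cer]
```

Close agreement between independent systems is only a proxy for label quality. The abstract describes combining CER analysis with WER prediction and NER, so a stage like this would be one filter among several rather than a complete selection method.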

artificial intelligence, machine learning, natural language

Country:
- Europe (0.68)
- North America > United States
  - Minnesota (0.28)

Genre:
- Research Report (0.64)

Industry:
- Transportation (0.95)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (0.73)
  - Natural Language > Text Processing (0.70)
  - Machine Learning
    - Performance Analysis > Accuracy (0.54)
    - Neural Networks (0.47)
