Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling

Huang, Shao-Syuan, Huang, Kuan-Po, Liu, Andy T., Lee, Hung-yi

Dec-20-2024–arXiv.org Artificial Intelligence

Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles with unseen languages, those not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose methods that exploit these relationships to enhance ASR performance on unseen languages. Specifically, we introduce a weighted sum method, which computes a weighted sum of the embeddings of language tags, using Whisper's predicted language probabilities. In addition, we develop a predictor-based approach that refines the weighted sum embedding to more closely approximate the true embedding for unseen languages. Experimental results demonstrate substantial improvements in ASR performance, both in zero-shot and fine-tuning settings. Our proposed methods outperform baseline approaches, providing an effective solution for addressing unseen languages in multilingual ASR.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

Dec-20-2024

arXiv.org PDF

Add feedback

Country:
- Asia > Taiwan (0.16)

Genre:
- Research Report > New Finding (0.88)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language (1.00)
  - Speech > Speech Recognition (1.00)