Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages
–arXiv.org Artificial Intelligence
Code-switching (CS), common in multilingual settings, poses challenges for ASR because transcribed CS data is scarce and costly to obtain, owing to its linguistic complexity. This study investigates building CS-ASR systems using synthetic CS data. We propose a phrase-level mixing method to generate synthetic CS data that mimics natural patterns. We then fine-tune large pretrained ASR models (Whisper, MMS, SeamlessM4T) on monolingual data augmented with the synthetic phrase-mixed CS data. This paper focuses on three under-resourced Southeast Asian language pairs: Malay-English (BM-EN), Mandarin-Malay (ZH-BM), and Tamil-English (TA-EN), and establishes a new comprehensive benchmark for CS-ASR to evaluate the performance of leading ASR models. Experimental results show that the proposed training strategy improves ASR performance on both monolingual and CS tests, with BM-EN showing the highest gains, followed by TA-EN and ZH-BM. This finding offers a cost-effective approach to CS-ASR development, benefiting both research and industry.
Jun-18-2025
- Country:
  - Asia
    - East Asia (0.04)
    - Singapore > Central Region
      - Singapore (0.04)
  - Europe
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
  - North America > United States
    - Georgia > Fulton County > Atlanta (0.04)
- Genre:
  - Research Report > New Finding (0.88)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning (1.00)
    - Natural Language (1.00)
    - Speech > Speech Recognition (0.71)