Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies
Mena, Carlos; Serra, Pol; Romero, Jacobo; Messaoudi, Abir; Giraldo, Jose; Armentano-Oller, Carme; Zevallos, Rodolfo; Meza, Ivan; Hernando, Javier
arXiv.org Artificial Intelligence
The lack of dedicated code-switching (CS) datasets limits automatic speech recognition (ASR) performance, because most models are trained on monolingual or mixed-language corpora that fail to reflect real-world CS patterns. This issue is critical in multilingual societies where CS occurs in both informal and formal settings. A key example is Catalan-Spanish CS, which is widely used in media and parliamentary speeches. In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. We extract CS data from Catalan speech corpora and fine-tune OpenAI's Whisper models, making them available on Hugging Face. Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.
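As an illustration of strategy (2), the sketch below joins monolingual utterances into one pseudo code-switched training example and tags it with a dominant language. This is a minimal sketch, not the authors' pipeline: the `Utterance` type, the `concatenate` helper, the silence gap, and the rule that the dominant language is the one contributing the most audio are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    """Hypothetical container for one training example."""
    samples: List[float]  # mono PCM samples
    sample_rate: int      # samples per second
    text: str             # reference transcript
    lang: str             # language tag, e.g. "ca" or "es"


def concatenate(utts: List[Utterance], gap_seconds: float = 0.1) -> Utterance:
    """Join monolingual utterances into one pseudo code-switched example."""
    if not utts:
        raise ValueError("need at least one utterance")
    rate = utts[0].sample_rate
    if any(u.sample_rate != rate for u in utts):
        raise ValueError("all utterances must share one sample rate")

    # Short silence between segments (assumed here; the abstract does not say).
    gap = [0.0] * int(round(gap_seconds * rate))

    samples: List[float] = []
    audio_per_lang: Dict[str, int] = {}
    for i, u in enumerate(utts):
        if i > 0:
            samples.extend(gap)
        samples.extend(u.samples)
        audio_per_lang[u.lang] = audio_per_lang.get(u.lang, 0) + len(u.samples)

    # Assumption: the dominant language is the one with the most audio.
    dominant = max(audio_per_lang, key=audio_per_lang.get)
    text = " ".join(u.text for u in utts)
    return Utterance(samples, rate, text, dominant)


ca = Utterance([0.1] * 16000, 16000, "bon dia", "ca")
es = Utterance([0.2] * 8000, 16000, "buenos días", "es")
mixed = concatenate([ca, es])
print(len(mixed.samples), mixed.lang, mixed.text)
```

The `lang` field of the result could then be mapped to a language token when preparing a fine-tuning example; how that token is chosen and passed to Whisper in the paper is not stated in the abstract.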
Jul-21-2025
- Country:
- Africa > Middle East
- Asia > Middle East
- Europe
- Middle East
- Cyprus > Mediterranean Sea (0.40)
- Malta > Mediterranean Sea (0.40)
- Spain > Catalonia (0.05)
- Middle East
- North America > United States
- Minnesota > Hennepin County > Minneapolis (0.14)
- Genre:
- Research Report > New Finding (0.34)
- Industry:
- Government (0.46)
- Media (0.47)
- Technology: