Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies

Carlos Mena, Pol Serra, Jacobo Romero, Abir Messaoudi, Jose Giraldo, Carme Armentano-Oller, Rodolfo Zevallos, Ivan Meza, Javier Hernando

arXiv.org Artificial Intelligence 

The lack of dedicated code-switching (CS) datasets limits automatic speech recognition (ASR) performance, as most models rely on monolingual or mixed-language corpora that fail to reflect real-world CS patterns. This issue is critical in multilingual societies where CS occurs in both informal and formal settings. A key example is Catalan-Spanish CS, which is widely used in media and parliamentary speeches. In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. We extract CS data from Catalan speech corpora and fine-tune OpenAI's Whisper models, which we make available on Hugging Face. Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.
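
As a rough illustration of strategy (2), the sketch below joins one Catalan and one Spanish utterance into a single pseudo-CS training sample. The abstract does not describe the actual concatenation procedure, so the function name `concat_monolingual`, the silence gap, and the 16 kHz default are all assumptions made for this example, not the authors' method.

```python
# Hypothetical sketch: build a pseudo code-switched sample by concatenating
# two monolingual utterances. Waveforms are plain lists of float samples.

def concat_monolingual(ca_audio, ca_text, es_audio, es_text,
                       sample_rate=16000, gap_seconds=0.2):
    """Return (audio, text) for a Catalan utterance followed by a Spanish one.

    sample_rate and gap_seconds are assumed values, not taken from the paper.
    """
    # Short stretch of silence between the two utterances (assumption).
    gap = [0.0] * int(round(gap_seconds * sample_rate))
    audio = list(ca_audio) + gap + list(es_audio)
    # The reference transcript is the two transcripts joined by a space.
    text = f"{ca_text.strip()} {es_text.strip()}"
    return audio, text


if __name__ == "__main__":
    ca = [0.1] * 16000   # 1 s of dummy Catalan audio
    es = [0.2] * 8000    # 0.5 s of dummy Spanish audio
    audio, text = concat_monolingual(ca, "bon dia", es, "buenos días",
                                     gap_seconds=0.5)
    print(len(audio), text)
```

Samples built this way switch language only at utterance boundaries, so they would not capture intra-sentence switching; that limitation may be one reason to compare this strategy against synthetic and real CS data.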