CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

Yan, Brian, Hamed, Injy, Shimizu, Shuichiro, Lodagala, Vasista, Chen, William, Iakovenko, Olga, Talafha, Bashar, Hussein, Amir, Polok, Alexander, Chang, Kalvin, Klement, Dominik, Althubaiti, Sara, Peng, Puyuan, Wiesner, Matthew, Solorio, Thamar, Ali, Ahmed, Khudanpur, Sanjeev, Watanabe, Shinji, Chen, Chih-Chen, Wu, Zhen, Benharrak, Karim, Diwan, Anuj, Cornell, Samuele, Yeo, Eunjung, Choi, Kwanghee, Carvalho, Carlos, Rosero, Karen

Sep-18-2025–arXiv.org Artificial Intelligence

CS-FLEURS consists of four test sets that together cover 113 unique code-switched language pairs across 52 languages: 1) a 14-pair X-English set with real voices reading synthetically generated code-switched sentences, 2) a 16-pair X-English set with generative text-to-speech, 3) a 60-pair {Arabic, Mandarin, Hindi, Spanish}-X set with generative text-to-speech, and 4) a 45-pair X-English lower-resourced set with concatenative text-to-speech. In addition to the four test sets, CS-FLEURS provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. We hope that CS-FLEURS helps broaden the scope of future code-switched speech research.
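
The per-set pair counts in the abstract add up to more than the 113 unique pairs, which suggests that some language pairs appear in more than one test set. The sketch below only restates that arithmetic using the numbers given above; the `EvalSet` class, its field names, and the variable names are illustrative assumptions and are not part of any CS-FLEURS release or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSet:
    """Hypothetical record for one CS-FLEURS test set (names are illustrative)."""
    pair_scheme: str    # which languages are paired
    num_pairs: int      # number of language pairs, as stated in the abstract
    speech_source: str  # how the audio was produced


# Counts and descriptions copied from the abstract.
EVAL_SETS = [
    EvalSet("X-English", 14, "real voices reading synthetic text"),
    EvalSet("X-English", 16, "generative text-to-speech"),
    EvalSet("{Arabic, Mandarin, Hindi, Spanish}-X", 60, "generative text-to-speech"),
    EvalSet("X-English (lower-resourced)", 45, "concatenative text-to-speech"),
]

UNIQUE_PAIRS = 113  # total unique pairs stated in the abstract

# Sum of per-set counts, and how many of those counts are repeats of a pair
# already counted in another set (assuming no repeats within a single set).
total_listed = sum(s.num_pairs for s in EVAL_SETS)
repeated_across_sets = total_listed - UNIQUE_PAIRS

print(f"pairs listed per set, summed: {total_listed}")
print(f"unique pairs: {UNIQUE_PAIRS}")
print(f"counts shared across sets: {repeated_across_sets}")
```

Which specific pairs are shared between test sets is not stated in the abstract, so the sketch reports only the size of the overlap.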

Country:
- Asia (0.68)
- Europe (0.46)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Speech > Speech Recognition (0.47)
  - Natural Language
    - Large Language Model (0.48)
    - Machine Translation (0.46)
