CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition

Zhou, Jiaming, Guo, Yujie, Zhao, Shiwan, Sun, Haoqin, Wang, Hui, He, Jiabei, Kong, Aobo, Wang, Shiyao, Yang, Xi, Wang, Yequan, Lin, Yonghua, Qin, Yong

Mar-11-2025–arXiv.org Artificial Intelligence

Code-switching (CS), the alternation between two or more languages within a single conversation, presents significant challenges for automatic speech recognition (ASR) systems. Existing Mandarin-English code-switching datasets are often limited in size and spontaneity and lack full-length dialogue recordings with transcriptions, hindering the development of robust ASR models for real-world conversational scenarios. This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers. Unlike previous datasets, CS-Dialogue provides full-length dialogue recordings with complete transcriptions, capturing naturalistic code-switching patterns in continuous speech. We describe the data collection and annotation processes, present detailed statistics of the dataset, and establish benchmark ASR performance using state-of-the-art models. Our experiments with Transformer, Conformer, and Branchformer models demonstrate the challenges of code-switching ASR and show that existing pre-trained models such as Whisper still have room for improvement. The CS-Dialogue dataset will be made freely available for all academic purposes.

dataset, error rate, transcription

Country:
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > France
  - Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Asia
  - Taiwan (0.04)
  - Singapore (0.04)
  - Mongolia (0.04)
  - Malaysia (0.04)
  - Macao (0.04)
  - East Asia (0.04)
  - China
    - Beijing > Beijing (0.05)
    - Tianjin Province > Tianjin (0.04)
    - Inner Mongolia (0.04)
    - Hong Kong (0.04)
    - Chongqing Province > Chongqing (0.04)

Genre:
- Research Report (1.00)

Industry:
- Education (1.00)

Technology:
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
