Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning

Lau, Mingfei; Chen, Qian; Fang, Yeming; Xu, Tingting; Chen, Tongzhou; Golik, Pavel

Jul-1-2025, arXiv.org Artificial Intelligence

Our quality audit of three widely used public multilingual speech datasets (Mozilla Common Voice 17.0, FLEURS, and Vox Populi) shows that, in some languages, these datasets suffer from significant quality issues, which may obscure downstream evaluation results while creating an illusion of success. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g., orthography prescriptions, dialect boundary definition) and enhanced data quality control in the dataset creation process. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of sociolinguistic awareness and language planning principles. Furthermore, we encourage research into how the dataset creation process itself can be leveraged as a tool for community-led language planning and revitalization.

Country:
- North America > United States (1.00)
- Europe (1.00)
- Africa (1.00)
- Asia > Middle East
  - UAE (0.28)

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology
  - Data Science > Data Quality (1.00)
  - Communications (1.00)
  - Artificial Intelligence
    - Machine Learning (1.00)
    - Speech > Speech Recognition (0.69)
    - Natural Language > Large Language Model (0.46)
