Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning
Lau, Mingfei, Chen, Qian, Fang, Yeming, Xu, Tingting, Chen, Tongzhou, Golik, Pavel
arXiv.org Artificial Intelligence
Our quality audit of three widely used public multilingual speech datasets (Mozilla Common Voice 17.0, FLEURS, and VoxPopuli) shows that, in some languages, these datasets suffer from significant quality issues, which may obfuscate downstream evaluation results while creating an illusion of success. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g., orthography prescriptions, dialect boundary definition) and enhanced data quality control in the dataset creation process. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of sociolinguistic awareness and language planning principles. Furthermore, we encourage research into how the dataset creation process itself can be leveraged as a tool for community-led language planning and revitalization.
Jul-1-2025
- Country:
- Africa
- Ethiopia (0.04)
- Kenya (0.04)
- South Africa (0.04)
- Nigeria (0.04)
- Middle East
- Egypt (0.04)
- Morocco > Casablanca-Settat Region
- Casablanca (0.04)
- Somalia (0.04)
- Zimbabwe (0.04)
- Uganda (0.04)
- Malawi (0.04)
- Angola (0.04)
- Senegal (0.04)
- Asia
- Pakistan (0.04)
- Cambodia (0.04)
- Nepal (0.04)
- Mongolia (0.04)
- Malaysia (0.04)
- Kazakhstan (0.04)
- South Korea > Incheon
- Incheon (0.04)
- Uzbekistan (0.04)
- Azerbaijan (0.04)
- Vietnam (0.04)
- Japan (0.04)
- Indonesia (0.04)
- Laos (0.04)
- Middle East
- Iran (0.04)
- Iraq (0.04)
- Israel (0.04)
- Jordan (0.04)
- Republic of Türkiye (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.14)
- Philippines (0.04)
- Russia (0.04)
- China
- Hong Kong (0.05)
- Tibet Autonomous Region (0.04)
- Armenia (0.04)
- Taiwan (0.04)
- Myanmar (0.04)
- Kyrgyzstan (0.04)
- Thailand (0.04)
- Singapore (0.04)
- Tajikistan (0.04)
- India (0.05)
- Afghanistan (0.04)
- Bangladesh > Dhaka Division
- Dhaka District > Dhaka (0.04)
- Europe
- Belarus (0.04)
- North Macedonia (0.04)
- Hungary (0.04)
- United Kingdom (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Sweden (0.04)
- Ukraine (0.04)
- Netherlands > South Holland
- Leiden (0.04)
- Romania (0.04)
- Estonia (0.04)
- Lithuania (0.04)
- Latvia (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Greece (0.04)
- Russia (0.04)
- Italy (0.04)
- Czechia > South Moravian Region
- Brno (0.04)
- Serbia (0.04)
- Slovenia (0.04)
- Slovakia (0.04)
- Norway (0.04)
- Finland (0.04)
- Denmark (0.04)
- Iceland (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Germany (0.04)
- Poland (0.04)
- Middle East > Malta (0.04)
- Bulgaria (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- Central America (0.04)
- United States
- Florida > Miami-Dade County
- Miami (0.04)
- Hawaii > Honolulu County
- Honolulu (0.04)
- Pennsylvania (0.04)
- Texas > Travis County
- Austin (0.04)
- Oceania > New Zealand (0.04)
- South America > Brazil (0.04)
- Genre:
- Research Report > New Finding (0.46)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning (1.00)
- Natural Language > Large Language Model (0.46)
- Speech > Speech Recognition (0.69)
- Communications (1.00)
- Data Science > Data Quality (0.66)