Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges
Van Dinh, Nguyen; Dang, Thanh Chi; Nguyen, Luan Thanh; Van Nguyen, Kiet
arXiv.org Artificial Intelligence
Vietnamese, a low-resource language, is typically categorized into three primary dialect groups corresponding to Northern, Central, and Southern Vietnam. However, each province within these regions exhibits its own distinct pronunciation variations. Although various speech recognition datasets exist, none provides a fine-grained classification of the 63 dialects specific to the individual provinces of Vietnam. To address this gap, we introduce the Vietnamese Multi-Dialect (ViMD) dataset, a novel, comprehensive dataset capturing the rich diversity of the 63 provincial dialects spoken across Vietnam. Our dataset comprises 102.56 hours of audio in approximately 19,000 utterances, and the associated transcripts contain over 1.2 million words. To provide benchmarks and to demonstrate the challenges of our dataset, we fine-tune state-of-the-art pre-trained models on two downstream tasks: (1) dialect identification and (2) speech recognition. The empirical results suggest two implications: geographical factors influence dialects, and current approaches are limited on speech recognition tasks involving multi-dialect speech data. Our dataset is available for research purposes.
Oct-4-2024
- Country:
- Asia
- Laos (0.04)
- Russia (0.04)
- Thailand
- Vietnam
- Hà Giang Province > Hà Giang (0.04)
- Bắc Giang Province > Bắc Giang (0.04)
- Bắc Ninh Province > Bắc Ninh (0.04)
- Thanh Hóa Province > Thanh Hóa (0.04)
- Bà Rịa-Vũng Tàu Province > Bà Rịa (0.04)
- Yên Bái Province > Yên Bái (0.04)
- Lâm Đồng Province (0.04)
- Quảng Nam Province (0.04)
- Bắc Kạn Province > Bắc Kạn (0.04)
- Nam Định Province > Nam Định (0.04)
- An Giang Province (0.04)
- Hòa Bình Province > Hòa Bình (0.04)
- Đồng Nai Province (0.04)
- Lai Châu Province > Lai Châu (0.04)
- Ninh Thuận Province (0.04)
- Quảng Trị Province (0.04)
- Bình Định Province (0.04)
- Trà Vinh Province > Trà Vinh (0.04)
- Kon Tum Province > Kon Tum (0.04)
- Đắk Lắk Province (0.04)
- Quảng Ninh Province (0.04)
- Nghệ An Province (0.04)
- Bình Phước Province (0.04)
- Hậu Giang Province (0.04)
- Tiền Giang Province (0.04)
- Đồng Tháp Province (0.04)
- Quảng Bình Province (0.04)
- Đắk Nông Province (0.04)
- Cà Mau Province > Cà Mau (0.04)
- Da Nang > Da Nang (0.04)
- Quảng Ngãi Province > Quảng Ngãi (0.04)
- Kiên Giang Province (0.04)
- Hanoi > Hanoi (0.14)
- Cao Bằng Province > Cao Bằng (0.04)
- Haiphong > Haiphong (0.04)
- Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- Hà Nam Province (0.04)
- Sóc Trăng Province > Sóc Trăng (0.04)
- Bến Tre Province > Bến Tre (0.04)
- Gia Lai Province (0.04)
- Hải Dương Province > Hải Dương (0.04)
- Lào Cai Province > Lào Cai (0.04)
- Vĩnh Phúc Province (0.04)
- Hưng Yên Province > Hưng Yên (0.04)
- Phú Yên Province (0.04)
- Phú Thọ Province (0.04)
- Thái Bình Province > Thái Bình (0.04)
- Thái Nguyên Province > Thái Nguyên (0.04)
- Khánh Hòa Province (0.04)
- Bình Dương Province (0.04)
- Thừa Thiên-Huế Province (0.04)
- Tây Ninh Province > Tây Ninh (0.04)
- Vĩnh Long Province > Vĩnh Long (0.04)
- Hà Tĩnh Province > Hà Tĩnh (0.04)
- Bình Thuận Province (0.04)
- Lạng Sơn Province > Lạng Sơn (0.04)
- Bạc Liêu Province > Bạc Liêu (0.04)
- Tuyên Quang Province > Tuyên Quang (0.04)
- Long An Province (0.04)
- Europe
- North America > Canada
- Oceania > Australia
- Australian Capital Territory > Canberra (0.04)
- Genre:
- Research Report > New Finding (0.66)
- Industry:
- Transportation > Ground > Road (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Speech > Speech Recognition (1.00)