Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus
Chou, Yi-Hui; Chang, Kalvin; Wu, Meng-Ju; Ou, Winston; Bi, Alice Wen-Hsin; Yang, Carol; Chen, Bryan Y.; Pai, Rong-Wei; Yeh, Po-Yen; Chiang, Jo-Peng; Phoann, Iu-Tshian; Chang, Winnie; Cui, Chenxuan; Chen, Noel; Shi, Jiatong
arXiv.org Artificial Intelligence
Taiwanese Hokkien is declining in use and status due to a language shift towards Mandarin in Taiwan. This is partly why it is a low-resource language in NLP and speech research today. To ensure that the state of the art in speech processing does not leave Taiwanese Hokkien behind, we contribute a 1.5-hour dataset of Taiwanese Hokkien to ML-SUPERB's hidden set. Evaluating ML-SUPERB's suite of self-supervised learning (SSL) speech representations on our dataset, we find that model size does not consistently determine performance. In fact, certain smaller models outperform larger ones. Furthermore, linguistic alignment between pretraining data and the target language plays a crucial role.
Dec-5-2023
- Country:
  - North America > United States
    - Illinois (0.04)
    - Maryland (0.04)
    - Washington > King County
      - Seattle (0.04)
  - Europe
    - United Kingdom > England
      - Oxfordshire > Oxford (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
  - Asia
    - China (0.04)
    - Taiwan
      - Takao Province > Kaohsiung (0.04)
      - Taiwan Province > Taipei (0.04)
    - Middle East > UAE
      - Abu Dhabi Emirate > Abu Dhabi (0.04)
- Genre:
  - Research Report (0.82)
- Technology:
  - Information Technology > Artificial Intelligence
    - Natural Language (1.00)
    - Machine Learning (1.00)
    - Speech > Speech Recognition (0.94)