AfriHuBERT: A self-supervised speech representation model for African languages

Alabi, Jesujoba O., Liu, Xuechen, Klakow, Dietrich, Yamagishi, Junichi

Sep-30-2024–arXiv.org Artificial Intelligence

In this work, we present AfriHuBERT, an extension of mHuBERT-147, a state-of-the-art (SOTA) and compact self-supervised learning (SSL) model, originally pretrained on 147 languages. While mHuBERT-147 was pretrained on 16 African languages, we expand this to cover 39 African languages through continued pretraining on 6,500+ hours of speech data aggregated from diverse sources, including 23 newly added languages. We evaluate AfriHuBERT on two key speech tasks: Language Identification (LID) and Automatic Speech Recognition (ASR) using FLEURS dataset. Our results show a +4% F1 score improvement on average for LID and a -1.2% average Word Error Rate (WER) reduction for ASR. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization. Additionally, the analysis indicates that the FLEURS have data quality limitations that may affect their suitability for evaluating low-resource African languages, suggesting the need for better evaluation benchmarks for these languages.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

Sep-30-2024

arXiv.org PDF

Add feedback

Country:
- Africa (1.00)
- Europe (0.93)
- North America > United States
  - Minnesota > Hennepin County > Minneapolis (0.14)

Genre:
- Research Report > New Finding (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Performance Analysis
    - Accuracy (0.34)
  - Natural Language (1.00)
  - Speech > Speech Recognition (1.00)