mHuBERT-147: A Compact Multilingual HuBERT Model
Boito, Marcely Zanon, Iyer, Vivek, Lagos, Nikolaos, Besacier, Laurent, Calapodescu, Ioan
arXiv.org Artificial Intelligence
We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy that leverages both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, respectively, with state-of-the-art scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and is highly competitive with the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
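The abstract does not spell out the up-sampling strategy, so the following is only a minimal sketch of one common way to up-sample by both language and dataset: exponent-smoothed sampling applied at two levels. The function name `upsampling_probs`, the exponents `alpha_lang` and `alpha_data`, and their default values are hypothetical choices made for illustration, not the settings used for mHuBERT-147.

```python
from collections import defaultdict


def upsampling_probs(hours, alpha_lang=0.7, alpha_data=0.9):
    """Return a sampling probability for each (language, dataset) pair.

    hours: dict mapping (language, dataset) -> hours of audio.
    alpha_lang, alpha_data: hypothetical smoothing exponents in (0, 1];
    values below 1 shift probability mass toward smaller sources.
    """
    # Total hours available per language.
    lang_hours = defaultdict(float)
    for (lang, _dataset), h in hours.items():
        lang_hours[lang] += h
    total = sum(lang_hours.values())

    # Language level: p(lang) is proportional to (share of hours) ** alpha_lang.
    lang_w = {l: (h / total) ** alpha_lang for l, h in lang_hours.items()}
    lang_z = sum(lang_w.values())

    probs = {}
    for lang, lh in lang_hours.items():
        # Dataset level, within one language: same smoothing with alpha_data.
        data_w = {
            key: (h / lh) ** alpha_data
            for key, h in hours.items()
            if key[0] == lang
        }
        data_z = sum(data_w.values())
        for key, w in data_w.items():
            probs[key] = (lang_w[lang] / lang_z) * (w / data_z)
    return probs
```

With both exponents set to 1.0 the function reduces to sampling proportionally to hours; lowering them gives low-resource languages and small datasets a larger share of each batch than their raw size would.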
Jun-27-2024
- Country:
- South America > Chile
- North America
- United States
- Washington > King County
- Seattle (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.04)
- Canada > Ontario
- Toronto (0.04)
- Europe
- Asia
- Taiwan (0.04)
- Singapore (0.04)
- India (0.04)
- East Asia (0.04)
- China > Hong Kong (0.04)
- Central Asia (0.04)
- Myanmar > Chin State
- Hakha (0.04)
- Africa
- Sub-Saharan Africa (0.04)
- Niger (0.04)
- Genre:
- Research Report > New Finding (0.48)
- Technology:
- Information Technology
- Data Science (1.00)
- Information Management (0.93)
- Artificial Intelligence
- Natural Language (1.00)
- Machine Learning (1.00)
- Speech > Speech Recognition (0.46)