Towards Unsupervised Speech Recognition at the Syllable-Level
Liming Wang, Junrui Ni, Kai-Wei Chang, Saurabhchand Bhati, David Harwath, Mark Hasegawa-Johnson, James R. Glass
arXiv.org Artificial Intelligence
Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing phone-based approaches often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids both the need for a G2P and the instability of GAN-based methods. Our approach achieves up to a 40% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.
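The masked-language-modeling objective the abstract mentions can be illustrated with a minimal sketch: given a sequence of discrete (e.g. syllable-like) unit IDs, a fraction of positions is replaced by a mask token, and a model is trained to predict the original unit at exactly those positions. The names below (`MASK_ID`, `mask_units`) and the 15% default rate are illustrative assumptions in the style of BERT-like masking, not the paper's actual implementation.

```python
import random

MASK_ID = 0  # hypothetical reserved id for the [MASK] token

def mask_units(units, mask_prob=0.15, rng=None):
    """BERT-style masking over a discrete unit sequence.

    Returns (inputs, targets): masked positions carry MASK_ID in
    `inputs`; `targets` holds the original id at masked positions
    and -100 (a common loss ignore-index) everywhere else.
    """
    rng = rng or random.Random(0)
    inputs, targets = [], []
    for u in units:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # hide the unit from the model
            targets.append(u)        # ...but keep it as the label
        else:
            inputs.append(u)         # visible context
            targets.append(-100)     # excluded from the loss
    return inputs, targets
```

A model trained this way only needs a stream of discretized speech units, which is one reason masked prediction sidesteps the adversarial (GAN-based) matching step that makes phone-level UASR unstable.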
Oct 7, 2025
- Country:
- Asia
- Middle East > Jordan (0.04)
- Singapore (0.04)
- South Korea
- Europe
- North America
- Canada > British Columbia
- Vancouver (0.04)
- United States
- Illinois > Champaign County
- Urbana (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Rhode Island (0.04)
- Genre:
- Research Report (0.64)
- Industry:
- Education (0.67)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Speech > Speech Recognition (1.00)