CMU's ASR2K Pipeline Recognizes Speech in 1909 Languages Without Audio
AI-powered speech recognition systems have made great progress in recent years, and speech-to-text errors on well-resourced languages are now comparatively rare. Most contemporary models for this task, however, require massive labelled training datasets, which are easy to source for English, Chinese, and other widely spoken languages but hard to obtain for the low-resource languages that make up the majority of the world's roughly 8,000 languages. To address this issue, a Carnegie Mellon University research team has developed a speech recognition pipeline that can recognize 1909 languages without any audio for the target language. The pipeline is introduced in the paper ASR2K: Speech Recognition for Around 2000 Languages Without Audio and comprises separate acoustic, pronunciation, and language models. On the CMU Wilderness dataset it reaches a 45 percent character error rate (CER) and a 69 percent word error rate (WER) when given 10,000 raw text utterances. Both are error rates, so lower is better; the figures are high by the standards of supervised systems, but they are obtained without any target-language audio.
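For readers unfamiliar with the two metrics, the following is a minimal Python sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, over characters or words respectively. The function names are illustrative and not taken from the paper, and the paper's exact text normalization before scoring is not described here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference tokens into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by reference length (can exceed 1.0)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: tokens are individual characters."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())
```

Because a single wrong character makes the whole word count as an error, WER is typically higher than CER for the same output, which matches the gap between the two reported figures.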
ASR2K: Speech Recognition for Around 2000 Languages without Audio
Xinjian Li, Florian Metze, David R. Mortensen, Alan W. Black, Shinji Watanabe
Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10000 raw text utterances.
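To make the "language model from raw text" idea concrete, here is a minimal Python sketch of a character n-gram model estimated from a handful of utterances. It is an illustration only and not the authors' implementation: the function names, the `^`/`$` padding symbols, and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter


def train_char_ngrams(utterances, n=2):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts = Counter(), Counter()
    vocab = {"$"}  # symbols the model can predict; "$" marks end of utterance
    for text in utterances:
        vocab.update(text)
        padded = "^" * (n - 1) + text + "$"  # "^" pads the start context
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts, vocab


def log_prob(text, ngrams, contexts, vocab, n=2):
    """Log-probability of `text` under the model, with add-one smoothing."""
    padded = "^" * (n - 1) + text + "$"
    v = len(vocab)
    total = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # Unseen n-grams get count 0 from Counter, so smoothing keeps this finite.
        total += math.log((ngrams[gram] + 1) / (contexts[gram[:-1]] + v))
    return total


# Usage: strings that resemble the training text score higher.
ngrams, contexts, vocab = train_char_ngrams(["aba", "abb"])
score_ab = log_prob("ab", ngrams, contexts, vocab)
score_ba = log_prob("ba", ngrams, contexts, vocab)
```

When only pre-computed n-gram statistics are available rather than raw text, the counting step would be skipped and the counts loaded directly into the two `Counter` objects.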
The 3 Deep Learning Frameworks For End-to-End Speech Recognition That Power Your Devices
Speech recognition is invading our lives. It's built into our phones (Siri), our game consoles (Kinect), our smartwatches (Apple Watch), and even our homes (Amazon Echo). But speech recognition has been around for decades, so why is it only now hitting the mainstream? The reason is that deep learning finally made speech recognition accurate enough to be useful outside of carefully controlled environments. In this blog post, we'll learn how to perform speech recognition with three different deep learning frameworks.