Conventional ASR systems are generally made up of three components: an acoustic model that predicts phonemes from short segments of audio, a pronunciation lexicon that describes how those phonemes combine to form the words of a given language, and a language model that captures the relationships among those words.

Facebook engineers have deployed their model variations alongside a number of infrastructure optimizations, handling the additional livestream traffic while actually reducing the compute required despite the increased load.

Although the system was trained on many different types of speech, it is still far from perfect, particularly when it comes to accents. Because it is difficult to collect sufficient training data for every accent, Facebook researchers are now exploring ways to improve their models by having them also learn from the vast amounts of unlabelled audio available online.
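To make the three-component pipeline described above concrete, here is a toy sketch of how an acoustic model's phoneme scores, a pronunciation lexicon, and a language model might be combined when scoring candidate words. All probabilities, words, and the scoring scheme are invented for illustration; this is not Facebook's implementation, and real systems use far more sophisticated decoding (e.g. beam search over lattices).

```python
import math

# Acoustic model output (assumed/toy values): per-frame phoneme
# log-probabilities for three short segments of audio.
acoustic_logprobs = [
    {"k": math.log(0.70), "ae": math.log(0.20), "t": math.log(0.10)},
    {"k": math.log(0.10), "ae": math.log(0.80), "t": math.log(0.10)},
    {"k": math.log(0.05), "ae": math.log(0.05), "t": math.log(0.90)},
]

# Pronunciation lexicon: maps each word to its phoneme sequence.
lexicon = {
    "cat": ["k", "ae", "t"],
    "tack": ["t", "ae", "k"],
}

# Language model (here just unigram priors): log-probability of each word.
lm_logprobs = {"cat": math.log(0.6), "tack": math.log(0.4)}

def score_word(word, frames, lm_weight=1.0):
    """Combine the three components for one word: sum the acoustic
    log-probabilities of its phonemes across the frames, then add a
    weighted language-model log-probability."""
    phones = lexicon[word]
    if len(phones) != len(frames):
        return float("-inf")  # pronunciation does not fit this many frames
    acoustic = sum(frame[p] for frame, p in zip(frames, phones))
    return acoustic + lm_weight * lm_logprobs[word]

# Pick the word whose combined score best explains the audio frames.
best = max(lexicon, key=lambda w: score_word(w, acoustic_logprobs))
```

In this toy example the acoustic scores strongly favor the phoneme sequence k-ae-t, and the language-model prior reinforces "cat" over "tack", so `best` is `"cat"`. The `lm_weight` parameter (an assumption here) mirrors the common practice of tuning the relative influence of the language model against the acoustic model.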
Sep-17-2020, 05:20:18 GMT