A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale

Cal Peyser, Michael Picheny, Kyunghyun Cho, Rohit Prabhavalkar, Ronny Huang, Tara Sainath

arXiv.org Artificial Intelligence 

Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus. However, little guidance exists on deploying these methods to improve production ASR systems that are trained on very large supervised corpora and with realistic requirements like a constrained model size and CPU budget, streaming capability, and a rich lattice for rescoring and for downstream NLU tasks. In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting using joint training. We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during ...

Unlike previous work, we apply these methods to a state-of-the-art, 160M-parameter streaming Conformer [7] model that is already trained on a very large supervised corpus. We further depart from previous work by training supervised and unsupervised tasks jointly, which is increasingly being shown to be preferable to the conventional fine-tuning approach on very large datasets [8]. We find that under these conditions, none of the studied methods improve general WER at all. However, we report improvements in the decoder's computational load and in lattice density, as well as in several targeted WER measurements assessing performance on known categories of particularly difficult utterances. Through this comparison and analysis, we hope to offer a more nuanced and comprehensive view of the usefulness of unpaired audio and text in industrial ASR.
