Peyser, Cal
Text Injection for Neural Contextual Biasing
Meng, Zhong, Wu, Zelin, Prabhavalkar, Rohit, Peyser, Cal, Wang, Weiran, Chen, Nanxin, Sainath, Tara N., Ramabhadran, Bhuvana
Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and its biasing component. Unpaired text is converted into speech-like representations and used to guide the model's attention towards relevant bias phrases. Moreover, we introduce contextual text-injected (CTI) minimum word error rate (MWER) training, which minimizes the expected WER caused by contextual biasing when unpaired text is injected into the model. Experiments show that CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model. CTI-MWER provides a further relative improvement of 23.5%.
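A rough sense of the MWER objective that CTI-MWER extends can be given in a few lines of NumPy. This is only a generic sketch of minimum word error rate training over an N-best list, with illustrative function and argument names; the paper's contextual, text-injected variant computes the word errors under biasing with injected text.

```python
import numpy as np

def mwer_loss(nbest_log_probs, nbest_word_errors):
    """Generic MWER sketch: expected word errors over a renormalized N-best list.

    CTI-MWER (per the abstract) minimizes this kind of expectation for the
    errors introduced by contextual biasing when unpaired text is injected;
    the inputs here are illustrative stand-ins.
    """
    log_probs = np.asarray(nbest_log_probs, dtype=np.float64)
    errors = np.asarray(nbest_word_errors, dtype=np.float64)
    probs = np.exp(log_probs - log_probs.max())   # softmax over the N-best list
    probs /= probs.sum()
    baseline = errors.mean()                       # mean-error baseline for variance reduction
    return float(np.dot(probs, errors - baseline))

# Three hypotheses with model scores and word-error counts for one utterance.
print(mwer_loss([-1.2, -2.0, -3.5], [0, 2, 3]))
```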
Improving Joint Speech-Text Representations Without Alignment
Peyser, Cal, Meng, Zhong, Hu, Ke, Prabhavalkar, Rohit, Rosenberg, Andrew, Sainath, Tara N., Picheny, Michael, Cho, Kyunghyun
The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these methods show promise, they have required special treatment of the sequence-length mismatch inherent in speech and text, either by up-sampling heuristics or an explicit alignment model. In this work, we offer evidence that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length, and argue that consistency losses could forgive length differences and simply assume the best alignment. We show that such a loss improves downstream WER in both a large-parameter monolingual and multilingual system.
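To make "forgiving length differences and assuming the best alignment" concrete, here is a toy length-agnostic consistency loss in NumPy. It matches each text-token embedding to its nearest speech-frame embedding, so no upsampling heuristic or alignment model is required; the loss form and names are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def best_alignment_consistency(speech_frames, text_tokens):
    """Toy consistency loss that ignores the speech/text length mismatch.

    Each text-token embedding is charged only for its closest speech frame
    ("assume the best alignment"); shapes are (T_speech, D) and (T_text, D).
    """
    speech = np.asarray(speech_frames, dtype=np.float64)
    text = np.asarray(text_tokens, dtype=np.float64)
    diffs = text[:, None, :] - speech[None, :, :]      # (T_text, T_speech, D)
    dists = (diffs ** 2).sum(axis=-1)                   # pairwise squared distances
    return float(dists.min(axis=1).mean())              # nearest-frame match per token

# Mismatched lengths: 8 encoded speech frames vs. 3 encoded text tokens.
rng = np.random.default_rng(0)
print(best_alignment_consistency(rng.normal(size=(8, 4)), rng.normal(size=(3, 4))))
```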
A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale
Peyser, Cal, Picheny, Michael, Cho, Kyunghyun, Prabhavalkar, Rohit, Huang, Ronny, Sainath, Tara
Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus. However, little guidance exists on deploying these methods to improve production ASR systems that are trained on very large supervised corpora and with realistic requirements like a constrained model size and CPU budget, streaming capability, and a rich lattice for rescoring and for downstream NLU tasks. In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting using joint training. We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.

Unlike previous work, we apply these methods to a state-of-the-art, 160M-parameter streaming Conformer [7] model that is already trained on a very large supervised corpus. We further depart from previous work by training supervised and unsupervised tasks jointly, which is being increasingly shown to be preferable to the conventional fine-tuning approach on very large datasets [8]. We find that under these conditions, none of the studied methods improve general WER at all. However, we report improvements in the decoder's computational load and in lattice density, as well as in several targeted WER measurements assessing performance on known categories of particularly difficult utterances. Through this comparison and analysis, we hope to offer a more nuanced and comprehensive view of the usefulness of unpaired audio and text in industrial ASR.
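The "joint training" point can be stated as a one-line objective: every optimizer step mixes the supervised ASR loss with the unpaired-text and unpaired-audio losses, rather than fine-tuning on them in a separate stage. The weights below are illustrative placeholders, not values from the paper.

```python
def joint_semi_supervised_loss(supervised_loss, text_injection_loss, audio_ssl_loss,
                               w_text=0.3, w_audio=0.3):
    """Sketch of a jointly trained objective: supervised and unsupervised
    terms are optimized in the same step (weights are illustrative)."""
    return supervised_loss + w_text * text_injection_loss + w_audio * audio_ssl_loss

# Example step: the three per-batch losses would come from the shared model.
print(joint_semi_supervised_loss(0.82, 1.10, 0.95))
```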
E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model
Huang, W. Ronny, Chang, Shuo-Yiin, Sainath, Tara N., He, Yanzhang, Rybach, David, David, Robert, Prabhavalkar, Rohit, Allauzen, Cyril, Peyser, Cal, Strohman, Trevor D.
We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit an end-of-segment (EOS) signal in real-time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a novel dummy frame injection strategy allows for high-quality 2nd pass results and low finalization latency simultaneously. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.
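The dummy-frame idea can be sketched in a few lines: once the causal 1st pass emits EOS, the segment is padded with synthetic right-context frames so the non-causal 2nd pass can finalize immediately instead of waiting for ~900 ms of real audio. The frame counts, feature dimension, and zero-valued padding are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def finalize_segment_with_dummy_frames(features, eos_frame, right_context_frames=90,
                                       feature_dim=80):
    """Pad a finished segment with dummy frames so a non-causal 2nd pass
    can run without waiting for future audio (all sizes are illustrative)."""
    segment = np.asarray(features[: eos_frame + 1], dtype=np.float32)
    dummy = np.zeros((right_context_frames, feature_dim), dtype=np.float32)
    return np.concatenate([segment, dummy], axis=0)

# 300 frames of 80-dim features; the 1st pass declares EOS at frame 249.
features = np.random.randn(300, 80).astype(np.float32)
padded = finalize_segment_with_dummy_frames(features, eos_frame=249)
print(padded.shape)  # (250 + 90, 80)
```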
Dual Learning for Large Vocabulary On-Device ASR
Peyser, Cal, Huang, Ronny, Sainath, Tara, Prabhavalkar, Rohit, Picheny, Michael, Cho, Kyunghyun
Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once. In this scheme, each model is used to generate pseudo-labels for unlabeled examples that are used to train the other model. Dual learning has seen some use in speech processing by pairing ASR and TTS as dual tasks. However, these results mostly address only the case of using unpaired examples to compensate for very small supervised datasets, and mostly on large, non-streaming models. Dual learning has not yet been proven effective for using unsupervised data to improve realistic on-device streaming models that are already trained on large supervised corpora. We provide this missing piece through an analysis of an on-device-sized streaming conformer trained on the entirety of Librispeech, showing relative WER improvements of 10.7%/5.2% without an LM and 11.7%/16.4% with an LM.
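A skeletal dual-learning step looks like the following; `asr` and `tts` are stand-in objects with `transcribe`/`synthesize` and `train_on` methods, and the pairing of pseudo-labels with unlabeled examples is the whole mechanism. This is a hedged outline of the paradigm, not the paper's training code.

```python
def dual_learning_step(asr, tts, unpaired_audio, unpaired_text):
    """One round of dual learning: each model pseudo-labels data for the other.

    `asr` and `tts` are hypothetical interfaces (transcribe/synthesize plus
    train_on); in the paper's setting the ASR side would be the streaming
    conformer and the unlabeled data would come from Librispeech.
    """
    pseudo_text = [asr.transcribe(audio) for audio in unpaired_audio]
    pseudo_audio = [tts.synthesize(text) for text in unpaired_text]
    # Each model trains on pairs labeled by its dual task.
    tts.train_on(list(zip(pseudo_text, unpaired_audio)))
    asr.train_on(list(zip(pseudo_audio, unpaired_text)))
```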
Improving Performance of End-to-End ASR on Numeric Sequences
Peyser, Cal, Zhang, Hao, Sainath, Tara N., Wu, Zelin
Recognizing written domain numeric utterances (e.g. I need $1.25.) can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken domain utterances (e.g. I need one dollar and twenty five cents.), for which numeric sequences are composed of in-vocabulary numbers, and then using an FST verbalizer to denormalize the result. Unfortunately, conventional ASR models are not suitable for the low memory setting of on-device speech recognition. E2E models such as RNN-T are attractive for on-device ASR, as they fold the AM, PM and LM of a conventional model into one neural network. However, in the on-device setting the large memory footprint of an FST denormer makes spoken domain training more difficult. In this paper, we investigate techniques to improve E2E model performance on numeric data. We find that using a text-to-speech system to generate additional numeric training data, as well as using a small-footprint neural network to perform spoken-to-written domain denorming, yields improvement in several numeric classes. In the case of the longest numeric sequences, we see WER reduced by up to a factor of 8.
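The spoken-to-written "denorming" task that the small neural network learns (in place of an FST verbalizer) is easiest to see from example pairs like the one in the abstract; the second pair below is an invented illustration of the same mapping.

```python
# Spoken-domain ASR output -> written-domain target: the mapping learned by
# the small-footprint neural denormer. Pairs are illustrative examples only.
denorm_pairs = [
    ("i need one dollar and twenty five cents", "i need $1.25"),
    ("the total is three hundred dollars",      "the total is $300"),
]

for spoken, written in denorm_pairs:
    print(f"{spoken} -> {written}")
```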