E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Huang, W. Ronny, Chang, Shuo-Yiin, Sainath, Tara N., He, Yanzhang, Rybach, David, David, Robert, Prabhavalkar, Rohit, Allauzen, Cyril, Peyser, Cal, Strohman, Trevor D.

Mar-5-2023–arXiv.org Artificial Intelligence

We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit a end-of-segment (EOS) signal in real-time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a novel dummy frame injection strategy allows for simultaneous high quality 2nd pass results and low finalization latency. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.

machine learning, natural language, segmenter, (16 more...)

arXiv.org Artificial Intelligence

Mar-5-2023

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.68)
  - Natural Language (0.69)
  - Speech (0.96)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found