Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Shakeel, Muhammad, Sudo, Yui, Peng, Yifan, Watanabe, Shinji

May-22-2024–arXiv.org Artificial Intelligence

ABSTRACT End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes, streaming and non-streaming, each with its pros and cons. Streaming ASR processes speech frames in real time as they are received, while non-streaming ASR waits for the entire utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) the encoder integration of these ASR modules, followed by 2) separate decoders to make mode switching flexible, and 3) enhancing performance by incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% ...

Figure 1: Joint optimization of multi-decoder ASR model: A single ...
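The abstract names similarity-preserving knowledge distillation but does not give its loss. As a rough illustration only, the sketch below implements the generic similarity-preserving formulation: for a batch of b utterances, each module's activations are flattened to one vector per utterance, a b x b pairwise similarity matrix is built and row-normalized, and the loss is the squared Frobenius distance between the two matrices divided by b^2. The function names, the argument names `feats_a` and `feats_b`, and the choice of pure Python are assumptions made for this example; this is not the authors' implementation, and which module acts as teacher is not stated in the abstract.

```python
import math


def similarity_matrix(feats):
    """Row-normalized pairwise similarity matrix for a batch of feature vectors.

    feats: list of b vectors (one flattened activation vector per utterance).
    """
    b = len(feats)
    # Gram matrix: entry (i, j) is the dot product of utterance i and utterance j.
    gram = [
        [sum(x * y for x, y in zip(feats[i], feats[j])) for j in range(b)]
        for i in range(b)
    ]
    normed = []
    for row in gram:
        # L2-normalize each row so the scale of the activations drops out.
        norm = math.sqrt(sum(v * v for v in row))
        normed.append([v / norm if norm > 0 else 0.0 for v in row])
    return normed


def sp_distillation_loss(feats_a, feats_b):
    """Squared Frobenius distance between the two similarity matrices, over b^2.

    feats_a and feats_b may have different feature dimensions; only the
    batch size has to match, because the comparison is between b x b matrices.
    """
    b = len(feats_a)
    if b == 0 or b != len(feats_b):
        raise ValueError("both batches must be non-empty and the same size")
    g_a = similarity_matrix(feats_a)
    g_b = similarity_matrix(feats_b)
    squared = sum(
        (g_a[i][j] - g_b[i][j]) ** 2 for i in range(b) for j in range(b)
    )
    return squared / (b * b)
```

Because the loss compares batch-level similarity structure rather than raw activations, the two modules do not need matching feature dimensions, which is one reason this family of losses is convenient between differently shaped encoders or decoders.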

architecture, artificial intelligence, speech recognition


Country:
- Asia > Japan (0.14)
- North America > United States (0.14)

Genre:
- Research Report > New Finding (0.66)

Technology:
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
