Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification
Zhang, Chao, Li, Bo, Sainath, Tara, Strohman, Trevor, Mavandadi, Sepand, Chang, Shuo-yiin, Haghani, Parisa
arXiv.org Artificial Intelligence
Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and integrating it into multilingual end-to-end ASR as an additional task is beneficial. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and can achieve lower word error rates (WERs) using second-pass decoding with longer right-context. By leveraging these differences in right-context and a streaming implementation of statistics pooling, the proposed method achieves accurate streaming LID prediction with little extra test-time cost. Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average LID prediction accuracy of 96.2% and the same second-pass WER as that obtained by including the oracle LID in the input.
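The abstract does not say how the streaming statistics pooling is implemented. As a rough illustration of the general idea only, the sketch below keeps running sums so that a mean and standard deviation over all frames seen so far can be emitted at every frame, without revisiting past frames. The function name `streaming_stats_pooling` and the plain-list frame representation are assumptions made for this example, not the authors' code.

```python
import math
from typing import Iterable, Iterator, List, Tuple


def streaming_stats_pooling(
    frames: Iterable[List[float]],
) -> Iterator[Tuple[List[float], List[float]]]:
    """Yield (mean, std) over all frames seen so far, one pair per input frame.

    Hypothetical helper for illustration; not taken from the paper.
    """
    count = 0
    total: List[float] = []     # running sum per feature dimension
    total_sq: List[float] = []  # running sum of squares per feature dimension
    for frame in frames:
        if count == 0:
            total = [0.0] * len(frame)
            total_sq = [0.0] * len(frame)
        count += 1
        for d, x in enumerate(frame):
            total[d] += x
            total_sq[d] += x * x
        mean = [s / count for s in total]
        # Population variance E[x^2] - E[x]^2, clamped at zero against round-off.
        std = [
            math.sqrt(max(sq / count - m * m, 0.0))
            for sq, m in zip(total_sq, mean)
        ]
        yield mean, std


if __name__ == "__main__":
    # Two 2-dimensional frames; statistics are updated after each one.
    for mean, std in streaming_stats_pooling([[1.0, 2.0], [3.0, 6.0]]):
        print(mean, std)
```

Because only the frame count and two running sums are stored, the per-frame cost is constant in the utterance length, which is consistent with the abstract's claim of little extra test-time cost. How the pooled statistics feed the per-frame LID predictor is not specified in the abstract and is left out here.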
Sep-13-2022