Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-yiin Chang, Parisa Haghani

Sep-13-2022–arXiv.org Artificial Intelligence

Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and integrating it into multilingual end-to-end ASR as an additional task is beneficial. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and can achieve lower word error rates (WERs) using second-pass decoding with longer right-context. By leveraging this difference in right-context, together with a streaming implementation of statistics pooling, the proposed method achieves accurate streaming LID prediction at little extra test-time cost. Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average LID prediction accuracy of 96.2% and the same second-pass WER as that obtained by including oracle LID in the input.
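The abstract mentions a streaming implementation of statistics pooling but gives no details. As a rough illustration of the general idea only, the sketch below keeps running sums so that a per-frame mean and standard deviation over all frames seen so far can be emitted without storing or revisiting earlier frames. The class name `StreamingStatsPooling`, the `eps` parameter, and the choice to concatenate mean and standard deviation are assumptions made for this example, not the paper's design.

```python
import math


class StreamingStatsPooling:
    """Hypothetical helper: running mean/std over the frames seen so far."""

    def __init__(self, dim, eps=1e-6):
        self.count = 0                 # number of frames consumed
        self.total = [0.0] * dim       # per-dimension running sum
        self.total_sq = [0.0] * dim    # per-dimension running sum of squares
        self.eps = eps                 # assumed stabiliser added to the variance

    def update(self, frame):
        """Consume one frame and return [mean..., std...] for the prefix so far."""
        self.count += 1
        for i, value in enumerate(frame):
            self.total[i] += value
            self.total_sq[i] += value * value
        mean = [s / self.count for s in self.total]
        # Var[x] = E[x^2] - E[x]^2, clamped at zero against rounding error.
        var = [
            max(sq / self.count - m * m, 0.0)
            for sq, m in zip(self.total_sq, mean)
        ]
        std = [math.sqrt(v + self.eps) for v in var]
        return mean + std


# Toy two-dimensional frames; a real system would feed encoder outputs.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
pool = StreamingStatsPooling(dim=2, eps=0.0)
outputs = [pool.update(frame) for frame in frames]
```

Each call to `update` costs time proportional to the feature dimension and is independent of utterance length, which is consistent with the abstract's claim of little extra test-time cost. How the pooled statistics feed the LID predictor is not described in the abstract and is not modelled here.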

artificial intelligence, encoder, machine learning

Country:
- North America
  - United States > California
    - San Francisco County > San Francisco (0.04)
  - Canada
    - Ontario > Toronto (0.04)
    - Alberta > Census Division No. 6
      - Calgary Metropolitan Region > Calgary (0.05)
- Europe
  - United Kingdom (0.04)
  - Germany (0.04)
  - Czechia > South Moravian Region
    - Brno (0.04)
  - Austria > Styria
    - Graz (0.04)
- Asia
  - Singapore (0.04)
  - Taiwan (0.04)
  - Japan > Kyūshū & Okinawa
    - Okinawa (0.04)
  - China
    - Shanghai > Shanghai (0.04)
    - Hong Kong (0.04)
- Africa > South Africa
  - Western Cape > Cape Town (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.88)
