Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments
Papi, Sara, Wang, Peidong, Chen, Junkun, Xue, Jian, Li, Jinyu, Gaur, Yashesh
–arXiv.org Artificial Intelligence
ABSTRACT In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words while incrementally receiving additional speech content. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation, or even improves output quality, compared to separate ASR and ST models, yielding an average improvement of 1.1 WER.

In particular, only Weller et al., 2021 [10] proposed a unified-decoder solution for real-time applications that, however, leverages a fully attention-based encoder-decoder (AED) architecture [11], which is theoretically not well suited for the streaming scenario [12], and adopts the re-translation approach [13], which is well known to be affected by the flickering problem [14].
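The serialized-output idea above can be illustrated with a small sketch. The function below is a hypothetical illustration, not the authors' implementation: given source (transcript) words, target (translation) words, and word-level alignment pairs such as those produced by an off-the-shelf textual aligner, it emits each target word as soon as its aligned source word has appeared, yielding one interleaved training stream. The alignment format and the monotonic target-order assumption are simplifications for illustration.

```python
def interleave(src_words, tgt_words, alignment):
    """Serialize ASR (source) and ST (target) words into one token stream.

    alignment: list of (tgt_idx, src_idx) pairs from a word aligner
    (hypothetical format, for illustration only). A target word is
    emitted right after its aligned source word, mimicking the
    token-level interleaving used for low-latency joint generation.
    Assumes target words can be emitted in their original order.
    """
    # Source position each target word depends on (last pair wins
    # if a target word aligns to several source words).
    last_src = {t: s for t, s in alignment}
    out, next_tgt = [], 0
    for i, word in enumerate(src_words):
        out.append(("src", word))
        # Flush target words whose aligned source word is now available.
        while next_tgt < len(tgt_words) and last_src.get(next_tgt, 0) <= i:
            out.append(("tgt", tgt_words[next_tgt]))
            next_tgt += 1
    # Flush any remaining target words at the end of the utterance.
    out.extend(("tgt", w) for w in tgt_words[next_tgt:])
    return out
```

For an it-en pair such as "ciao mondo" / "hello world" with a one-to-one alignment, this produces the interleaved stream src:ciao, tgt:hello, src:mondo, tgt:world, which a single Transducer decoder can then be trained to generate incrementally.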
Oct-2-2023