Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments
Papi, Sara, Wang, Peidong, Chen, Junkun, Xue, Jian, Li, Jinyu, Gaur, Yashesh
–arXiv.org Artificial Intelligence
ABSTRACT In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words while incrementally receiving additional speech content. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation, or even improves output quality, compared to separate ASR and ST models, yielding an average improvement of 1.1 WER.

In particular, only Weller et al., 2021 [10] proposed a unified-decoder solution for real-time applications that, however, leverages a fully attention-based encoder-decoder (AED) architecture [11], which is theoretically not well suited for the streaming scenario [12], and adopts the re-translation approach [13], which is well known to be affected by the flickering problem [14].
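The serialized-output idea above can be illustrated with a small sketch. The function below is a hypothetical illustration, not the authors' implementation: given source (transcript) words, target (translation) words, and word-level alignment pairs such as those produced by an off-the-shelf textual aligner, it emits each target word as soon as its aligned source word has appeared, yielding one interleaved training stream. The alignment format and the monotonic target-order assumption are simplifications for illustration.

```python
def interleave(src_words, tgt_words, alignment):
    """Serialize ASR (source) and ST (target) words into one token stream.

    alignment: list of (tgt_idx, src_idx) pairs from a word aligner
    (hypothetical format, for illustration only). A target word is
    emitted right after its aligned source word, mimicking the
    token-level interleaving used for low-latency joint generation.
    Assumes target words can be emitted in their original order.
    """
    # Source position each target word depends on (last pair wins
    # if a target word aligns to several source words).
    last_src = {t: s for t, s in alignment}
    out, next_tgt = [], 0
    for i, word in enumerate(src_words):
        out.append(("src", word))
        # Flush target words whose aligned source word is now available.
        while next_tgt < len(tgt_words) and last_src.get(next_tgt, 0) <= i:
            out.append(("tgt", tgt_words[next_tgt]))
            next_tgt += 1
    # Flush any remaining target words at the end of the utterance.
    out.extend(("tgt", w) for w in tgt_words[next_tgt:])
    return out
```

For an it-en pair such as "ciao mondo" / "hello world" with a one-to-one alignment, this produces the interleaved stream src:ciao, tgt:hello, src:mondo, tgt:world, which a single Transducer decoder can then be trained to generate incrementally.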
Oct-2-2023