Decoder-only Architecture for Streaming End-to-end Speech Recognition
Tsunoo, Emiru, Futami, Hayato, Kashiwagi, Yosuke, Arora, Siddhant, Watanabe, Shinji
–arXiv.org Artificial Intelligence
Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable characteristic for streaming applications of ASR. In this work, we propose to use a decoder-only architecture for blockwise streaming ASR. In our approach, speech features are compressed using CTC output and context embedding using blockwise speech subnetwork, and are sequentially provided as prompts to the decoder.

This study aims to use a powerful yet efficient decoder-only architecture for blockwise streaming ASR. Speech utterances are processed in a blockwise conformer-based speech subnetwork, and each block produces prompts that represent acoustic information. The prompts are considerably compressed by removing unnecessary frames with the auxiliary CTC greedy search. The speech subnetwork introduces context embedding that is inherited from previous blocks and represents past context information. This context embedding is also provided to the decoder.
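The CTC-based frame compression described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the blank index, and the toy dimensions are assumptions; the idea is simply that frames whose CTC greedy label is blank are dropped before the block's features are handed to the decoder as prompts.

```python
import numpy as np

def compress_frames_ctc(frame_features, ctc_log_probs, blank_id=0):
    """Keep only frames whose CTC greedy label is non-blank.

    frame_features: (T, D) acoustic embeddings for one block.
    ctc_log_probs:  (T, V) per-frame CTC log-posteriors.
    Returns the compressed (T', D) features used as decoder prompts.
    """
    greedy = ctc_log_probs.argmax(axis=-1)  # (T,) greedy CTC labels per frame
    keep = greedy != blank_id               # mask out blank frames
    return frame_features[keep]

# Toy example: 6 frames, blanks everywhere except frames 2 and 4.
T, D, V = 6, 4, 5
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, D))
logp = np.full((T, V), -10.0)
labels = [0, 0, 3, 0, 2, 0]  # 0 is the blank symbol here
for t, lab in enumerate(labels):
    logp[t, lab] = 0.0
compressed = compress_frames_ctc(feats, logp)
print(compressed.shape)  # (2, 4): only the two non-blank frames survive
```

In a streaming setting this would run block by block, so the number of prompt tokens grows with the speech content rather than with the raw frame count.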
Jun-23-2024
- Country:
- Asia > Japan (0.04)
- North America > United States (0.04)
- Genre:
- Research Report (0.50)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Speech > Speech Recognition (1.00)