Decoder-only Architecture for Streaming End-to-end Speech Recognition
Tsunoo, Emiru, Futami, Hayato, Kashiwagi, Yosuke, Arora, Siddhant, Watanabe, Shinji
–arXiv.org Artificial Intelligence
Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable characteristic for streaming applications of ASR. In this work, we propose to use a decoder-only architecture for blockwise streaming ASR. In our approach, speech features are compressed using CTC output and context embedding using blockwise speech subnetwork, and are sequentially provided as prompts to the decoder.

This study aims to use a powerful yet efficient decoder-only architecture for blockwise streaming ASR. Speech utterances are processed in a blockwise conformer-based speech subnetwork, and each block produces prompts that represent acoustic information. The prompts are considerably compressed by removing unnecessary frames with the auxiliary CTC greedy search. The speech subnetwork introduces context embedding that is inherited from previous blocks and represents past context information. This context embedding is also provided to the decoder.
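The CTC-based frame compression described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the blank index, and the toy dimensions are assumptions; the idea is simply that frames whose CTC greedy label is blank are dropped before the block's features are handed to the decoder as prompts.

```python
import numpy as np

def compress_frames_ctc(frame_features, ctc_log_probs, blank_id=0):
    """Keep only frames whose CTC greedy label is non-blank.

    frame_features: (T, D) acoustic embeddings for one block.
    ctc_log_probs:  (T, V) per-frame CTC log-posteriors.
    Returns the compressed (T', D) features used as decoder prompts.
    """
    greedy = ctc_log_probs.argmax(axis=-1)  # (T,) greedy CTC labels per frame
    keep = greedy != blank_id               # mask out blank frames
    return frame_features[keep]

# Toy example: 6 frames, blanks everywhere except frames 2 and 4.
T, D, V = 6, 4, 5
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, D))
logp = np.full((T, V), -10.0)
labels = [0, 0, 3, 0, 2, 0]  # 0 is the blank symbol here
for t, lab in enumerate(labels):
    logp[t, lab] = 0.0
compressed = compress_frames_ctc(feats, logp)
print(compressed.shape)  # (2, 4): only the two non-blank frames survive
```

In a streaming setting this would run block by block, so the number of prompt tokens grows with the speech content rather than with the raw frame count.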
Jun-23-2024
- Country:
- Asia > Japan (0.04)
- North America > United States (0.04)
- Genre:
- Research Report (0.50)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Speech > Speech Recognition (1.00)