Retrieval-Augmented Speech Recognition Approach for Domain Challenges

Peng Shen, Xugang Lu, Hisashi Kawai


National Institute of Information and Communications Technology (NICT), Japan
peng.shen@nict.go.jp

Abstract -- Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during training, our model learns to exploit textual information supplied in prompts to the LLM decoder to improve speech recognition. Experiments on the CSJ database demonstrate that the proposed method significantly improves recognition accuracy and achieves state-of-the-art results on CSJ, even without relying on the full training data.

Automatic speech recognition (ASR) techniques have improved significantly due to advancements in system architecture and optimization algorithms [1]-[4].
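The abstract above describes conditioning an LLM decoder on domain-specific text supplied in the prompt at inference time. The following is a rough, hypothetical sketch of that inference-time idea, not the authors' implementation: it ranks domain sentences with a simple token-overlap score against a first-pass hypothesis and packs the top matches into a prompt for an LLM-based decoder. The corpus, the retrieval scoring, and the prompt template are illustrative assumptions.

# Minimal sketch (assumed, not from the paper): retrieve domain text for an
# utterance and expose it to the LLM decoder through the prompt.
from collections import Counter

def retrieve_domain_text(query_hypothesis: str, domain_corpus: list[str], top_k: int = 3) -> list[str]:
    """Rank domain sentences by token overlap with a first-pass hypothesis (toy retriever)."""
    query_tokens = Counter(query_hypothesis.lower().split())
    def overlap(sentence: str) -> int:
        # Count how many query tokens (with multiplicity) also appear in the sentence.
        return sum((Counter(sentence.lower().split()) & query_tokens).values())
    return sorted(domain_corpus, key=overlap, reverse=True)[:top_k]

def build_prompt(retrieved: list[str]) -> str:
    """Prepend retrieved domain text so the decoder can condition on it alongside the audio features."""
    context = "\n".join(f"- {s}" for s in retrieved)
    return f"Relevant domain text:\n{context}\nTranscribe the speech:"

if __name__ == "__main__":
    # Hypothetical domain corpus; in practice this would be domain-specific text
    # made available only at inference time.
    domain_corpus = [
        "The keynote covers spontaneous Japanese speech corpora.",
        "Quarterly sales figures are reported in the appendix.",
        "Lecture transcripts include fillers and disfluencies.",
    ]
    first_pass = "lecture speech with fillers"
    prompt = build_prompt(retrieve_domain_text(first_pass, domain_corpus))
    print(prompt)  # this prompt would be passed to the LLM decoder together with the speech representation

In this sketch the retrieved text is injected purely at decoding time, which mirrors the paper's premise that no domain-specific text is available during training; the actual retrieval mechanism and prompt format used by the authors may differ.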