Index-MSR: A high-efficiency multimodal fusion framework for speech recognition

Chen, Jinming, Wang, Lu, Song, Zheshu, Deng, Wei

arXiv.org Artificial Intelligence 

ABSTRACT Driven by large-scale datasets and LLM-based architectures, automatic speech recognition (ASR) systems have achieved remarkable improvements in accuracy. However, challenges persist for domain-specific terminology and for short utterances lacking semantic coherence, where recognition performance often degrades significantly. To address these cases, this work presents Index-MSR, a high-efficiency multimodal fusion framework for speech recognition. At its core is a novel Multimodal Fusion Decoder (MFD), which incorporates text-related information from videos (e.g., subtitles and presentation slides) into speech recognition. This cross-modal integration not only enhances overall ASR accuracy but also yields substantial reductions in substitution errors. Extensive evaluations on both an in-house subtitle dataset and a public AVSR dataset show that Index-MSR achieves state-of-the-art accuracy, with substitution errors reduced by 20-50%. These results indicate that our approach efficiently exploits text-related cues from video to improve speech recognition accuracy, showing strong potential in applications requiring strict audio-text synchronization, such as audio translation.
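
The abstract reports its main gain as a reduction in substitution errors. For readers unfamiliar with that breakdown, the following is a minimal sketch of how substitutions, deletions, and insertions are commonly counted from a minimum-edit-distance alignment between a reference transcript and a recognizer hypothesis. It is a generic illustration, not the scoring tool used in this work (which the abstract does not name); the function names `error_breakdown` and `relative_reduction` and the example sentences are hypothetical.

```python
from typing import List, Tuple


def error_breakdown(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) from one minimum-edit alignment.

    Generic Levenshtein alignment over tokens; an assumption for illustration,
    not the evaluation script of the paper.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete all i reference tokens
    for j in range(1, m + 1):
        cost[0][j] = j  # insert all j hypothesis tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if ref[i - 1] == hyp[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )

    # Backtrace from the end, preferring match/substitution when costs tie.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = 0 if ref[i - 1] == hyp[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                subs += mismatch
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def relative_reduction(baseline_errors: int, system_errors: int) -> float:
    """Fraction by which an error count drops relative to a baseline."""
    return (baseline_errors - system_errors) / baseline_errors


# Made-up example: one word is misrecognized, one is dropped.
reference = "the transformer decoder fuses subtitles".split()
hypothesis = "the transformer recorder fuses".split()
subs, dels, ins = error_breakdown(reference, hypothesis)
```

Under this convention, a "20-50% reduction in substitution errors" is a relative figure: the substitution count of the proposed system compared with that of a baseline, as `relative_reduction` computes. For languages scored at the character level, the same routine applies with characters in place of word tokens.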