Index-MSR: A high-efficiency multimodal fusion framework for speech recognition

Chen, Jinming, Wang, Lu, Song, Zheshu, Deng, Wei

Sep-30-2025–arXiv.org Artificial Intelligence

ABSTRACT Driven by large-scale datasets and LLM-based architectures, automatic speech recognition (ASR) systems have achieved remarkable improvements in accuracy. However, challenges persist for domain-specific terminology and for short utterances lacking semantic coherence, where recognition performance often degrades significantly. To address these cases, we propose Index-MSR, a high-efficiency multimodal fusion framework for speech recognition. At its core is a novel Multimodal Fusion Decoder (MFD), which effectively incorporates text-related information from videos (e.g., subtitles and presentation slides) into the speech recognition process. This cross-modal integration not only enhances overall ASR accuracy but also yields substantial reductions in substitution errors. Extensive evaluations on both an in-house subtitle dataset and a public AVSR dataset demonstrate that Index-MSR achieves state-of-the-art accuracy, with substitution errors reduced by 20-50%. These results show that our approach efficiently exploits text-related cues from video to improve speech recognition accuracy, with strong potential in applications requiring strict audio-text synchronization, such as audio translation.
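The abstract reports results in terms of substitution errors, which are one of the three error types (substitutions, deletions, insertions) obtained by aligning a reference transcript with a recognizer's hypothesis. The following is a minimal sketch of how such a breakdown and a relative reduction can be computed with a standard Levenshtein alignment. It is not the paper's evaluation code; the function names `error_counts` and `relative_reduction` and the example sentences are hypothetical, and when several minimum-cost alignments exist the split between error types depends on the tie-breaking rule chosen here.

```python
def error_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for one minimum-edit
    alignment of the token lists `ref` (reference) and `hyp` (hypothesis)."""
    n, m = len(ref), len(hyp)
    # cell[i][j] = (total_edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    cell = [[None] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cell[i][0] = (i, 0, i, 0)  # only deletions against an empty hypothesis
    for j in range(1, m + 1):
        cell[0][j] = (j, 0, 0, j)  # only insertions against an empty reference
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cell[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, k)          # match, no cost
            else:
                diag = (t + 1, s + 1, d, k)  # substitution
            t, s, d, k = cell[i - 1][j]
            dele = (t + 1, s, d + 1, k)      # reference token dropped
            t, s, d, k = cell[i][j - 1]
            ins = (t + 1, s, d, k + 1)       # extra hypothesis token
            # Ties are broken in the order: diagonal, deletion, insertion.
            cell[i][j] = min(diag, dele, ins, key=lambda c: c[0])
    _, subs, dels, inss = cell[n][m]
    return subs, dels, inss


def relative_reduction(baseline, improved):
    """Fractional reduction of an error count relative to a baseline."""
    if baseline == 0:
        raise ValueError("baseline error count must be positive")
    return (baseline - improved) / baseline


# Hypothetical example: a single mis-recognized word counts as one substitution.
reference = "the fusion decoder reads slides".split()
hypothesis = "the fusion decoder reads slide".split()
subs, dels, inss = error_counts(reference, hypothesis)
```

Under this definition, a reduction "by 20-50%" corresponds to `relative_reduction` values between 0.2 and 0.5 when comparing the substitution count of a baseline system with that of the improved system on the same test set.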

artificial intelligence, information, speech recognition

Genre:
- Research Report (0.84)

Technology:
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)