The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition

Gao, Ming; Wu, Shilong; Chen, Hang; Du, Jun; Lee, Chin-Hui; Watanabe, Shinji; Chen, Jingdong; Siniscalchi, Sabato Marco; Scharenborg, Odette

arXiv.org Artificial Intelligence 

Meetings are a valuable yet challenging scenario for speech applications due to complex acoustic conditions. This paper summarizes the outcomes of the MISP 2025 Challenge, hosted at Interspeech 2025, which focuses on multi-modal, multi-device meeting transcription by incorporating the video modality alongside audio. The tasks include Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR). We present the challenge's objectives, tasks, dataset, baseline systems, and the solutions proposed by participants. The best-performing systems achieved significant improvements over the baseline: the top AVSD model achieved a Diarization Error Rate (DER) of 8.09%, improving by 7.43%; the top AVSR system achieved a Character Error Rate (CER) of 9.48%, improving by 10.62%; and the best AVDR system achieved a concatenated minimum-permutation Character Error Rate (cpCER) of 11.56%, improving by 72.49%.
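As a minimal illustration of the recognition metric quoted above (not the challenge's official scoring tool), Character Error Rate can be sketched as the Levenshtein edit distance between hypothesis and reference character sequences, normalized by the reference length; the function name and example strings below are purely illustrative:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / reference length (sketch)."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming table for Levenshtein distance over characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference characters
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis characters
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(cer("abcd", "abed"))  # one substitution over four characters -> 0.25
```

The cpCER used for the AVDR task additionally concatenates each speaker's segments and scores under the speaker permutation that minimizes this error, which couples diarization quality to the final recognition score.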