Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction

Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma

arXiv.org Artificial Intelligence 

Audio-visual speaker extraction isolates a target speaker's speech from a mixture signal conditioned on a visual cue, typically a recording of the target speaker's face. In real-world scenarios, however, other co-occurring faces are often present on-screen, and they provide valuable speaker-activity cues about the scene. In this work, we introduce a plug-and-play inter-speaker attention module that processes a flexible number of co-occurring faces, enabling more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and the sparsely overlapped MISP, demonstrate that our approach consistently outperforms the baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
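The abstract does not specify the module's internals, but the core idea it names (attention from the target speaker over a variable number of co-occurring face streams) can be sketched as scaled dot-product attention with residual fusion. The function name, shapes, and the residual connection below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_speaker_attention(target, others):
    """Hypothetical sketch: fuse co-occurring face cues into the
    target speaker's visual embedding via frame-wise attention.

    target: (T, D) embedding sequence for the target speaker's face.
    others: (K, T, D) embeddings for K co-occurring faces (K may be 0).
    """
    if others.shape[0] == 0:
        return target  # no co-occurring faces: plug-and-play pass-through
    K, T, D = others.shape
    # queries from the target, keys/values from the other faces, per frame
    scores = np.einsum('td,ktd->tk', target, others) / np.sqrt(D)  # (T, K)
    attn = softmax(scores, axis=-1)                                # (T, K)
    context = np.einsum('tk,ktd->td', attn, others)                # (T, D)
    return target + context  # residual fusion keeps the target cue dominant

# handles any number of co-occurring faces, including none
fused = inter_speaker_attention(np.random.randn(50, 256),
                                np.random.randn(3, 50, 256))
```

Because the module reduces to an identity mapping when no other faces are visible, it can be dropped into an existing extractor such as AV-DPRNN or AV-TFGridNet without changing its interface.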