STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking
Yidi Li, Hong Liu, Bing Yang
arXiv.org Artificial Intelligence
Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform, and its accuracy and robustness can be improved by multi-modal fusion methods. Recently, several fusion methods have been proposed to model the correlation among multiple modalities. However, for the speaker tracking problem, the cross-modal interaction between audio and visual signals has not been well exploited. To this end, we present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model. We design a visual-guided acoustic measurement method that fuses heterogeneous cues in a unified localization space, employing visual observations via a camera model to construct an enhanced acoustic map. For feature fusion, a cross-modal attention module is adopted to jointly model multi-modal contexts and interactions, so that correlated information between audio and visual features is further exchanged within the fusion model. Moreover, the STNet-based tracker is applied to multi-speaker cases by a quality-aware module, which evaluates the reliability of multi-modal observations to achieve robust tracking in complex scenarios. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.

Speaker tracking is a fundamental task in human-computer interaction that determines the position of the speaker at each time step by analyzing data from sensors such as microphones and cameras [1]. It has wide applications in intelligent surveillance [2], multimedia systems [3], and robot navigation [4]. In general, the basic approaches to the tracking problem include computer vision-based face or body tracking methods [5-7] and auditory-based Sound Source Localization (SSL) methods [8, 9]. However, it is difficult for uni-modal methods to adapt to complex dynamic environments. For example, visual trackers are susceptible to object occlusion and to changes in illumination and appearance. Acoustic tracking is not subject to visual interference, but the intermittent nature of speech signals, background noise, and room reverberation constrain the performance of SSL-based trackers.

This work is supported by the National Natural Science Foundation of China (No. 62403345).
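The listing does not include the paper's code, but the cross-modal attention fusion it describes can be sketched generically. The snippet below is a minimal PyTorch illustration of bidirectional audio-visual attention, assuming hypothetical token counts and feature dimensions; the class name, layer sizes, and pooling choice are illustrative assumptions, not the authors' STNet implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Minimal cross-modal attention sketch: audio tokens attend to visual
    tokens and vice versa, then the two streams are merged.
    Shapes and layer sizes are illustrative, not taken from the paper."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Audio queries attend over visual keys/values, and vice versa.
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, Ta, dim) tokens, e.g. from an acoustic map
        # visual: (batch, Tv, dim) tokens, e.g. from a visual feature map
        a_ctx, _ = self.a2v(query=audio, key=visual, value=visual)
        v_ctx, _ = self.v2a(query=visual, key=audio, value=audio)
        a = self.norm_a(audio + a_ctx)   # residual + norm per stream
        v = self.norm_v(visual + v_ctx)
        # Pool each stream and concatenate into a joint representation.
        joint = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.fuse(joint)          # (batch, dim) fused embedding

# Example: fuse 64 audio tokens with 196 visual tokens (hypothetical sizes).
fusion = CrossModalAttentionFusion(dim=256, num_heads=4)
audio_feats = torch.randn(2, 64, 256)
visual_feats = torch.randn(2, 196, 256)
out = fusion(audio_feats, visual_feats)  # torch.Size([2, 256])
```

The key property this sketch shares with the described module is that each modality's queries consult the other modality's keys and values, so correlated audio-visual information is exchanged before the streams are fused.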
Oct-8-2024
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning
- Neural Networks > Deep Learning (1.00)
- Statistical Learning (0.93)
- Representation & Reasoning > Information Fusion (1.00)
- Speech (1.00)
- Vision (1.00)