Learning Representations from Audio-Visual Spatial Alignment
–Neural Information Processing Systems
We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) further discriminates negative pairs originating from the same video instance but at different moments in time. While these approaches learn high-quality representations for downstream tasks such as action recognition, they completely disregard the spatial cues that audio and visual signals naturally carry in the real world. To learn from these spatial cues, we task a network to perform contrastive audio-visual spatial alignment of 360° video and spatial audio.
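To make the contrastive spatial-alignment objective concrete, the sketch below shows a generic InfoNCE-style loss over paired audio and video embeddings, where row i of each batch is assumed to be an aligned pair (a video crop and spatial audio rotated to the same viewing direction) and all other rows act as negatives. This is a minimal illustration, not the paper's implementation: the encoders, the 360° cropping, and the ambisonic rotation are omitted, and all names and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_alignment_loss(video_emb: torch.Tensor,
                           audio_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Contrastive (InfoNCE-style) audio-visual spatial alignment sketch.

    video_emb: (B, D) embeddings of video crops, one per viewing direction.
    audio_emb: (B, D) embeddings of spatial audio rotated to the matching
        viewing directions; row i of each tensor forms the positive pair,
        while every other row serves as a negative (hypothetical setup).
    """
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature                  # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: video-to-audio and audio-to-video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In the setting the abstract describes, the negatives would include clips from other videos (as in AVC), temporally shifted clips (as in AVTS), and, crucially, spatially misaligned audio-video pairs from the same moment of the same 360° video, which is what forces the network to exploit spatial cues.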