Aligning Audio-Visual Joint Representations with an Agentic Workflow

Neural Information Processing Systems 

Visual content and its accompanying audio signals naturally form a joint representation that improves audio-visual (AV) applications. While prior studies have developed various AV representation learning frameworks, the importance of AV data alignment for achieving high-quality representations is usually overlooked. We observe that an audio signal may contain interference from background noise. In addition, the audio and video streams may not be synchronized.