Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

Jan-19-2025, 10:51:31 GMT–Neural Information Processing Systems

We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in the audio and visual modalities. Previous works concentrate only on denoising the overall video-level labels across modalities and overlook segment-level label noise, where adjacent video segments (i.e., 1-second video clips) may contain different events. However, recognizing events in a segment is challenging because its label could be any combination of the events that occur in the video. To address this issue, we consider tackling AVVP from the language perspective, since language can freely describe how various events appear in each segment beyond fixed labels. Specifically, we design language prompts to describe all cases of event appearance for each video.
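As a rough illustration of what "all cases of event appearance" could mean, the sketch below enumerates every subset of a video's video-level event labels and turns each subset into a sentence. This is only a minimal sketch under stated assumptions, not the paper's implementation: the function name `build_event_prompts`, the prompt wording, and the decision to include an empty "no event" case are all hypothetical.

```python
from itertools import combinations


def build_event_prompts(video_events, modality="audio"):
    """Return (event_subset, prompt_text) pairs, one per subset of the labels.

    Hypothetical helper: the template wording is an assumption made for
    illustration and is not taken from the paper.
    """
    events = sorted(set(video_events))  # deduplicate and fix the ordering
    prompts = []
    for size in range(len(events) + 1):
        for subset in combinations(events, size):
            if subset:
                text = f"This {modality} segment contains " + " and ".join(subset) + "."
            else:
                # Assumed extra case: a segment in which none of the events occur.
                text = f"This {modality} segment contains no event."
            prompts.append((subset, text))
    return prompts


if __name__ == "__main__":
    for subset, text in build_event_prompts(["speech", "dog"]):
        print(subset, "->", text)
```

A video with n video-level labels yields 2^n candidate descriptions, so each 1-second segment could then be compared against every candidate (for example with a text-audio or text-image similarity model) to choose the best-matching combination as its segment-level label.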

language perspective, language prompt, revisit weakly-supervised audio-visual video parsing



Technology:
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.65)