AITopics | revisit weakly-supervised audio-visual video parsing

revisit weakly-supervised audio-visual video parsing

Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

Neural Information Processing Systems, Dec-26-2025, 05:48:24 GMT

We focus on the weakly-supervised audio-visual video parsing (AVVP) task, which aims to identify and locate all events in the audio and visual modalities. Previous works concentrate only on denoising the overall video-level labels across modalities and overlook segment-level label noise, where adjacent video segments (i.e., 1-second video clips) may contain different events. However, recognizing events at the segment level is challenging because a segment's label could be any combination of the events that occur in the video. To address this issue, we tackle AVVP from the language perspective, since language can freely describe how various events appear in each segment, beyond fixed labels. Specifically, we design language prompts to describe all cases of event appearance for each video. Then, we calculate the similarity between the language prompts and each segment, and take the events of the most similar prompt as the segment-level label. In addition, to deal with mislabeled segments, we propose dynamic re-weighting of the unreliable segments to adjust their labels. Experiments show that our simple yet effective approach outperforms state-of-the-art methods by a large margin.
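The prompt-matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt template, the helper names (`event_combinations`, `make_prompt`, `label_segments`), the restriction to non-empty event subsets, and the toy 2-D embeddings are all assumptions made for illustration. In practice the embeddings would come from pretrained text and audio/visual encoders, which the abstract does not name.

```python
from itertools import combinations
from math import sqrt


def event_combinations(video_events):
    """Enumerate every non-empty subset of the video-level events.

    Assumption: one candidate prompt per non-empty subset.
    """
    events = sorted(video_events)
    return [
        frozenset(combo)
        for size in range(1, len(events) + 1)
        for combo in combinations(events, size)
    ]


def make_prompt(events):
    # Hypothetical template; the source does not give the prompt wording.
    return "a segment with " + " and ".join(sorted(events))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_segments(segment_embs, prompt_embs):
    """Give each segment the event set of its most similar prompt.

    prompt_embs maps a frozenset of events to that prompt's embedding.
    """
    return [
        max(prompt_embs, key=lambda events: cosine(seg, prompt_embs[events]))
        for seg in segment_embs
    ]


# Toy data (made up): a video weakly labeled with two events.
combos = event_combinations({"dog", "speech"})
prompts = [make_prompt(c) for c in combos]

# Stand-ins for text-encoder outputs, one vector per prompt.
toy_prompt_embs = {
    frozenset({"dog"}): [1.0, 0.0],
    frozenset({"speech"}): [0.0, 1.0],
    frozenset({"dog", "speech"}): [1.0, 1.0],
}
# Stand-ins for per-segment (1-second clip) embeddings.
toy_segment_embs = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.8]]

labels = label_segments(toy_segment_embs, toy_prompt_embs)
for index, events in enumerate(labels):
    print(index, sorted(events))
```

With these toy vectors, the first segment is matched to the "dog" prompt, the second to the combined prompt, and the third to the "speech" prompt. The dynamic re-weighting of unreliable segments is not sketched here, because the abstract does not say how the weights are computed.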

language perspective, name change, revisit weakly-supervised audio-visual video parsing

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.43)

Supplementary Material for Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

Neural Information Processing Systems, Oct-8-2025, 23:42:01 GMT

… Sec. B, we provide more examples of the similarity distribution with/without the event and visualize … To investigate the flexibility of our approach, we combine LSLD with different SOTA methods for the AVVP task. The experiments show that our denoised labels are indeed influential and can be properly employed on different SOTA methods. Effectiveness of modifying class names in prompts. … Table 2, we can see that the segment-level visual metric improves by 1.7 points when we add playing … As we transform objects like Accordion into human behavior (i.e. … Table 2: Study the impact of varying class names to make the prompt more contextual.

artificial intelligence, machine learning, natural language

Neural Information Processing Systems

Country: Asia > China > Hubei Province > Wuhan (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.48)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.42)

Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

Neural Information Processing Systems, Jan-19-2025, 10:51:31 GMT

language perspective, language prompt, revisit weakly-supervised audio-visual video parsing

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.65)
