Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing (Supplementary Material)

Feb-12-2026, 10:03:30 GMT–Neural Information Processing Systems

However, the number of learned group tokens in GroupViT is a hyper-parameter, and there is no constraint on it. The text embeddings are used in a contrastive loss to match the global visual representations.

Figure 1: Comparison of recall for all 25 classes between HAN [2] and the proposed MGN in terms of the event-level audio, visual, and audio-visual metrics, i.e., Event_A, Event_V, and Event_AV.
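The contrastive matching between text embeddings and global visual representations mentioned above for GroupViT can be illustrated with a minimal sketch. This is not the authors' or GroupViT's implementation: it assumes a symmetric InfoNCE-style loss over a batch of paired embeddings, a temperature of 0.07, and non-zero vectors, and the names `l2_normalize` and `contrastive_loss` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(text_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for paired (text, visual) embeddings.

    Row i of text_emb is assumed to be the positive match for row i of
    visual_emb; every other row in the batch acts as a negative.
    The temperature value is an assumption, not taken from the source.
    """
    t = [l2_normalize(v) for v in text_emb]
    g = [l2_normalize(v) for v in visual_emb]
    n = len(t)

    # logits[i][j]: temperature-scaled cosine similarity of text i and visual j
    logits = [
        [sum(a * b for a, b in zip(t[i], g[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Transpose to get the visual-to-text direction
    cols = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))
```

With correctly paired embeddings the loss approaches zero, while mismatched pairs yield a large loss, which is the signal that pulls each text embedding toward its own global visual representation.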

