Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing (Supplementary Material)
Neural Information Processing Systems
However, the number of learned group tokens in GroupViT is a hyper-parameter, and there is no constraint on it. The text embeddings are used in a contrastive loss to match them with the global visual representations.

Figure 1: Comparison of recall for all 25 classes between HAN [2] and the proposed MGN in terms of event-level audio, visual, and audio-visual metrics, i.e., Event_A, Event_V, and Event_AV.
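The contrastive matching of text embeddings to global visual representations mentioned for GroupViT can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style loss over a batch of paired embeddings, which is a common choice but is not confirmed by this text. The function name `symmetric_contrastive_loss` and the temperature value `0.07` are hypothetical and chosen only for illustration.

```python
import numpy as np


def symmetric_contrastive_loss(text_emb, visual_emb, temperature=0.07):
    """Illustrative symmetric contrastive loss (assumed form, not the paper's code).

    text_emb, visual_emb: arrays of shape (N, D); row i of each is a matched pair.
    temperature: assumed scaling factor for the similarity logits.
    """
    # L2-normalize so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)

    # Pairwise similarity logits, shape (N, N); matched pairs lie on the diagonal.
    logits = t @ v.T / temperature

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        # Negative log-likelihood of the matched (diagonal) entries.
        return -np.mean(np.diag(log_prob))

    # Average the text-to-visual and visual-to-text directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 8))
    print("matched pairs:   ", symmetric_contrastive_loss(emb, emb))
    print("mismatched pairs:", symmetric_contrastive_loss(emb, emb[::-1]))
```

In this sketch, correctly paired embeddings yield a lower loss than mismatched ones, which is the property such a loss relies on to align the two modalities.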
Feb-12-2026, 10:03:30 GMT