Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
Neural Information Processing Systems
The audio-visual video parsing task aims to parse a video into modality- and category-aware temporal segments. Previous work mainly focuses on weakly-supervised approaches, which learn from video-level event labels. During training, these methods know neither which modality perceives an event nor which temporal segment contains it. Because existing frameworks perform no explicit grouping, these modality and temporal uncertainties lead to false predictions. For instance, segments belonging to the same category may be assigned to different event classes.
multi-modal grouping network, temporal segment, weakly-supervised audio-visual video parsing
Jan-19-2025, 02:51:12 GMT