Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing

Shentong Mo, Carnegie Mellon University
Yapeng Tian, University of Texas at Dallas

Neural Information Processing Systems 

The audio-visual video parsing task aims to parse a video into modality- and category-aware temporal segments. Previous work mainly focuses on weakly-supervised approaches, which learn from video-level event labels.