MAViL: Masked Audio-Video Learners
Po-Yao Huang, Chaitanya Ryali

Mar-21-2025, 21:14:41 GMT–Neural Information Processing Systems

We present Masked Audio-Video Learners (MAViL) to learn audio-visual representations with three complementary forms of self-supervision: (1) reconstructing masked raw audio and video inputs, (2) intra-modal and inter-modal contrastive learning with masking, and (3) self-training to predict aligned and contextualized audio-video representations learned from the first two objectives. Empirically, MAViL achieves state-of-the-art audio-video classification performance on AudioSet (53.3 mAP) and VGGSound (67.1% accuracy), surpassing recent self-supervised models and supervised models that utilize external labeled data. Notably, pre-training with MAViL not only enhances performance in multimodal classification and retrieval tasks, but it also improves the representations of each modality in isolation, without relying on information from the other modality during uni-modal fine-tuning or inference.
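The masking and contrastive objectives in (1) and (2) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the mean pooling of visible tokens, the 0.8 mask ratio, and the 0.1 temperature are all assumptions chosen for illustration, and the loss shown is a generic symmetric InfoNCE loss standing in for the inter-modal contrastive term.

```python
import math
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted indices of the tokens left visible after random masking."""
    num_keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    return sorted(rng.sample(range(num_tokens), num_keep))


def mean_pool(tokens, keep):
    """Average the visible token vectors into one clip-level embedding."""
    dim = len(tokens[0])
    return [sum(tokens[i][d] for i in keep) / len(keep) for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE: the i-th query should match the i-th key against the rest of the batch."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denom - logits[i]
    return total / len(q)


def inter_modal_loss(audio_emb, video_emb, temperature=0.1):
    """Symmetric audio-to-video and video-to-audio contrastive loss (assumed form)."""
    return 0.5 * (info_nce(audio_emb, video_emb, temperature)
                  + info_nce(video_emb, audio_emb, temperature))


# Toy usage: mask 80% of 10 tokens (ratio is an assumption), then pool what remains.
rng = random.Random(0)
tokens = [[float(i), 1.0] for i in range(10)]
visible = random_mask(len(tokens), 0.8, rng)
clip_embedding = mean_pool(tokens, visible)

# Paired audio/video embeddings score a lower loss than mismatched ones.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = inter_modal_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
shuffled = inter_modal_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
```

An intra-modal variant would apply the same `info_nce` call to two independently masked views of one modality rather than to an audio-video pair.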

artificial intelligence, machine learning, representation


Country:
- Europe (1.00)
- North America > United States
  - California (0.28)

Genre:
- Research Report > New Finding (0.67)

Industry:
- Leisure & Entertainment (0.46)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
