MAViL: Masked Audio-Video Learners

Po-Yao Huang

Neural Information Processing Systems 

Empirically, MAViL achieves state-of-the-art audio-video classification performance on AudioSet (53.3 mAP) and VGGSound (67.1% accuracy), surpassing both recent self-supervised models and supervised models that use external labeled data.