Learning Representations from Audio-Visual Spatial Alignment

Neural Information Processing Systems

We introduce a novel self-supervised pretext task for learning representations from audio-visual content.