Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Neural Information Processing Systems

There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
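The "careful choice of negative examples" can be made concrete with a curriculum over negative difficulty: easy negatives pair a video with audio from a different clip, while hard negatives pair it with audio from the same clip shifted in time. The sketch below illustrates this idea; the function name, shift sizes, and clip representation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sample_negative(num_clips, idx, t, hard, rng, min_shift=10):
    """Pick an out-of-sync audio segment for video clip `idx` starting at frame `t`.

    Easy negative: audio taken from a different clip entirely.
    Hard negative: audio from the same clip, temporally shifted by at
    least `min_shift` frames.  (Illustrative scheme; the exact shift
    range and indexing are assumptions.)
    """
    if hard:
        shift = rng.integers(min_shift, 2 * min_shift)
        return idx, t + shift                 # same clip, misaligned audio
    other = rng.choice([i for i in range(num_clips) if i != idx])
    return other, int(rng.integers(0, 100))   # different clip, arbitrary start

# A curriculum would draw mostly easy negatives early in training and
# mix in hard negatives later.
rng = np.random.default_rng(0)
print(sample_negative(num_clips=5, idx=2, t=30, hard=True, rng=rng))
print(sample_negative(num_clips=5, idx=2, t=30, hard=False, rng=rng))
```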


Reviews: Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Neural Information Processing Systems

The authors propose self-supervised learning of audio and video features by means of a curriculum learning setup. In particular, a deep neural network is trained with a contrastive loss function, producing feature distances that are as large as possible when the video and audio fragments are out of sync and as small as possible when they are in sync. The proposed self-supervised learning scheme gives good results for downstream processing, e.g., improving over purely supervised training from scratch. In general, I think this work is highly interesting, since it may significantly reduce the need for manually labelled data sets and hence could have a significant impact on video and audio processing algorithms. I also think the work is original, although this is not easy to determine with certainty.
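The loss behavior the review describes (small distances for in-sync pairs, large distances for out-of-sync pairs) matches a standard margin-based contrastive objective. A minimal NumPy sketch, assuming squared-distance and squared-hinge terms and an illustrative margin of 1.0 (the paper's exact formulation and hyperparameters may differ):

```python
import numpy as np

def contrastive_loss(audio_emb, video_emb, is_sync, margin=1.0):
    """Margin-based contrastive loss over audio/video embedding pairs.

    In-sync pairs (is_sync=1) are pulled together via squared distance;
    out-of-sync pairs (is_sync=0) are pushed at least `margin` apart
    via a squared hinge.
    """
    dist = np.linalg.norm(audio_emb - video_emb, axis=1)
    pos = is_sync * dist**2
    neg = (1 - is_sync) * np.maximum(margin - dist, 0.0)**2
    return (pos + neg).mean()

# Toy example: two pairs, one synchronized and one out of sync.
a = np.array([[0.0, 0.0], [0.0, 0.0]])
v = np.array([[0.0, 0.0], [3.0, 4.0]])   # second pair is distance 5 apart
sync = np.array([1.0, 0.0])
print(contrastive_loss(a, v, sync))      # → 0.0: positive pair matches, negative pair clears the margin
```

Gradient descent on this objective drives the audio and video subnets toward a shared embedding space in which temporal alignment is directly measurable as distance.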


Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Korbar, Bruno, Tran, Du, Torresani, Lorenzo

Neural Information Processing Systems