Reviews: Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Neural Information Processing Systems 

The authors propose self-supervised learning of audio and video features by means of a curriculum learning setup. In particular, a deep neural network is trained with a contrastive loss function: the distance between the audio and video features is pushed to be large when the video and audio fragments are out of sync, and small when they are in sync. The proposed self-supervised learning scheme gives good results on downstream tasks, e.g., improving over purely supervised training from scratch. In general, I find this work highly interesting, since it may significantly reduce the need for manually labelled datasets, and hence may have a significant impact on video and audio processing algorithms. I also think the work is original, although this is not easy to determine with certainty.
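
The margin-based contrastive objective described above can be sketched as follows. This is a minimal NumPy illustration of the general idea (small distances for in-sync pairs, distances pushed beyond a margin for out-of-sync pairs); the function name, `margin` value, and toy data are my own assumptions, not taken from the paper:

```python
import numpy as np

def contrastive_loss(dist, in_sync, margin=1.0):
    """Generic contrastive loss over audio-video feature distances.

    dist: Euclidean distances between audio and video features.
    in_sync: 1 if the pair is synchronized, 0 otherwise.
    """
    pos = in_sync * dist ** 2                                   # pull in-sync pairs together
    neg = (1 - in_sync) * np.maximum(0.0, margin - dist) ** 2   # push out-of-sync pairs beyond margin
    return np.mean(pos + neg)

# Toy example: two in-sync pairs (small distances) and two out-of-sync pairs.
dist = np.array([0.1, 0.2, 1.5, 0.3])
in_sync = np.array([1, 1, 0, 0])
print(contrastive_loss(dist, in_sync))  # -> 0.135
```

Note that the out-of-sync pair already beyond the margin (distance 1.5) contributes no loss, while the one inside the margin (distance 0.3) is penalized.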