Goto

Collaborating Authors

 xdc



Self-SupervisedLearningbyCross-Modal Audio-VideoClustering

Neural Information Processing Systems

The first challenge is the exorbitant cost of scaling up the size of manually-labeled video datasets. The recent creation of large-scale action recognition datasets [5,15,25,26]hasundoubtedly enabled amajor leap forwardinvideo models accuracies.







We thank reviewers for their thoughtful feedback

Neural Information Processing Systems

We thank reviewers for their thoughtful feedback. We outperform all the reviewers' additional references that "One model to learn them all" is We will add all mentioned references but we stress that they do not hurt our novelty claim nor our results. MIL-NCE by 2% in both UCF and HMDB, while enabling a task (ESC-50) that would not be possible with MIL-NCE. We agree that our novelty does not lie in the loss which is indeed not novel. We will clarify the impact of the deflation contribution in the paper.