Goto

Collaborating Authors

 xdc




Self-SupervisedLearningbyCross-Modal Audio-VideoClustering

Neural Information Processing Systems

The first challenge is the exorbitant cost of scaling up the size of manually-labeled video datasets. The recent creation of large-scale action recognition datasets [5,15,25,26]hasundoubtedly enabled amajor leap forwardinvideo models accuracies.