Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Korbar, Bruno, Tran, Du, Torresani, Lorenzo

Dec-31-2018–Neural Information Processing Systems

There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.

artificial intelligence, computer vision, machine learning, (17 more...)

Neural Information Processing Systems

Dec-31-2018

Conferences PDF

Add feedback

Country:
- Europe (1.00)
- North America
  - Canada (0.68)
  - United States > Minnesota (0.28)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (0.93)
    - Inductive Learning (0.89)

Duplicate Docs Excel Report

Title
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Similar Docs Excel Report more

Title	Similarity	Source
None found