Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Neural Information Processing Systems

There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
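The "careful choice of negative examples" can be made concrete with a curriculum over negative difficulty: easy negatives pair a video with audio from a different clip, while hard negatives pair it with audio from the same clip shifted in time. The sketch below illustrates this idea; the function name, shift sizes, and clip representation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sample_negative(num_clips, idx, t, hard, rng, min_shift=10):
    """Pick an out-of-sync audio segment for video clip `idx` starting at frame `t`.

    Easy negative: audio taken from a different clip entirely.
    Hard negative: audio from the same clip, temporally shifted by at
    least `min_shift` frames.  (Illustrative scheme; the exact shift
    range and indexing are assumptions.)
    """
    if hard:
        shift = rng.integers(min_shift, 2 * min_shift)
        return idx, t + shift                 # same clip, misaligned audio
    other = rng.choice([i for i in range(num_clips) if i != idx])
    return other, int(rng.integers(0, 100))   # different clip, arbitrary start

# A curriculum would draw mostly easy negatives early in training and
# mix in hard negatives later.
rng = np.random.default_rng(0)
print(sample_negative(num_clips=5, idx=2, t=30, hard=True, rng=rng))
print(sample_negative(num_clips=5, idx=2, t=30, hard=False, rng=rng))
```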


Reviews: Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Neural Information Processing Systems

The authors propose self-supervised learning of audio and video features by means of a curriculum learning setup. In particular, a deep neural network is trained with a contrastive loss function, producing feature distances that are as large as possible when the video and audio fragments are out of sync and as small as possible when they are in sync. The proposed self-supervised learning scheme gives good results for downstream processing, e.g., improving over purely supervised training from scratch. In general, I think this work is highly interesting, since it may significantly reduce the need for manually labelled data sets and hence could have a significant impact on video and audio processing algorithms. I also think the work is original, although this is not easy to determine with certainty.
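The loss behavior the review describes (small distances for in-sync pairs, large distances for out-of-sync pairs) matches a standard margin-based contrastive objective. A minimal NumPy sketch, assuming squared-distance and squared-hinge terms and an illustrative margin of 1.0 (the paper's exact formulation and hyperparameters may differ):

```python
import numpy as np

def contrastive_loss(audio_emb, video_emb, is_sync, margin=1.0):
    """Margin-based contrastive loss over audio/video embedding pairs.

    In-sync pairs (is_sync=1) are pulled together via squared distance;
    out-of-sync pairs (is_sync=0) are pushed at least `margin` apart
    via a squared hinge.
    """
    dist = np.linalg.norm(audio_emb - video_emb, axis=1)
    pos = is_sync * dist**2
    neg = (1 - is_sync) * np.maximum(margin - dist, 0.0)**2
    return (pos + neg).mean()

# Toy example: two pairs, one synchronized and one out of sync.
a = np.array([[0.0, 0.0], [0.0, 0.0]])
v = np.array([[0.0, 0.0], [3.0, 4.0]])   # second pair is distance 5 apart
sync = np.array([1.0, 0.0])
print(contrastive_loss(a, v, sync))      # → 0.0: positive pair matches, negative pair clears the margin
```

Gradient descent on this objective drives the audio and video subnets toward a shared embedding space in which temporal alignment is directly measurable as distance.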


Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Korbar, Bruno, Tran, Du, Torresani, Lorenzo

Neural Information Processing Systems