Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Neural Information Processing Systems

There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state of the art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization for improving the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
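
The abstract's key ingredient is a contrastive loss over audio-video pairs. Below is a minimal sketch of one such formulation, assuming a margin-based contrastive loss with Euclidean distance between the two subnetworks' embeddings; the function name, the margin value, and the batching convention are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def av_contrastive_loss(video_emb, audio_emb, is_synced, margin=0.99):
    # video_emb, audio_emb: (B, D) outputs of the visual and audio subnetworks.
    # is_synced: (B,) float tensor, 1.0 for temporally aligned pairs, 0.0 for negatives.
    dist = F.pairwise_distance(video_emb, audio_emb)              # Euclidean distance per pair
    pos_term = is_synced * dist.pow(2)                            # pull in-sync pairs together
    neg_term = (1.0 - is_synced) * F.relu(margin - dist).pow(2)   # push out-of-sync pairs beyond the margin
    return (pos_term + neg_term).mean()

Under such a formulation, the curriculum mentioned in the abstract typically amounts to sampling only easy negatives (audio taken from a different video) in early epochs and introducing hard negatives (audio from the same video but temporally shifted) later in training.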


Beyond Audio and Video: Using Claytronics to Enable Pario

AI Magazine

In this article, we describe the hardware and software challenges involved in realizing Claytronics, a form of programmable matter made out of very large numbers (potentially millions) of submillimeter-sized spherical robots. The goal of the Claytronics project is to create ensembles of cooperating submillimeter robots that work together to form dynamic 3D physical objects. For example, Claytronics might be used in telepresence to mimic, with high fidelity and in three-dimensional solid form, the look, feel, and motion of the person at the other end of a telephone call. To achieve this long-range vision, we are investigating hardware mechanisms for constructing submillimeter robots that can be manufactured en masse using photolithography. We also propose the creation of a new media type, which we call pario. The idea behind pario is to render arbitrary moving, physical three-dimensional objects that you can see, touch, and even hold in your hands. In parallel with our hardware effort, we are developing novel distributed programming languages and algorithms, LDP and Meld, to control the ensembles. Pario may fundamentally change how we communicate with others and interact with the world around us. Our research results to date suggest that there is a viable path to implementing both the hardware and software necessary for Claytronics. While we have made significant progress, much research remains before this vision becomes reality.


Π-Avida - A Personalized Interactive Audio and Video Portal

AAAI Conferences

We describe a system for recording, storing, and distributing multimedia data streams. For each modality (audio, speech, and video), characteristic features are extracted and used to classify the content into a range of topic categories. Using data mining techniques, classifier models are learned from training data. These models can assign existing and new multimedia documents to one or several topic categories. We describe the features used as inputs for these classifiers. We demonstrate that the classification of audio material can be improved by using phonemes and syllables instead of words. Finally, we show that categorization performance depends mainly on the quality of speech recognition and that the simple video features we tested are of only marginal utility.
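
As an illustration of the word-versus-subword comparison mentioned above (not the Π-Avida system itself), the following sketch trains two topic classifiers on noisy speech-recognizer transcripts: one over word tokens and one over character n-grams as a rough stand-in for syllable/phoneme units. The toy transcripts and topic labels are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder ASR output with typical recognition errors, and the associated topic labels.
transcripts = ["wether report for tonite with strong wind", "stock markt closes higher after trade talks"]
topics = ["weather", "finance"]

# Classifier over word tokens: sensitive to misrecognized words.
word_clf = make_pipeline(TfidfVectorizer(analyzer="word"), LogisticRegression())

# Classifier over character n-grams, a rough proxy for syllable/phoneme features:
# partial matches survive recognition errors such as "wether" or "markt".
subword_clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)), LogisticRegression())

word_clf.fit(transcripts, topics)
subword_clf.fit(transcripts, topics)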


Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis

arXiv.org Machine Learning

Multimodal language analysis often considers relationships between features based on text and those based on acoustic and visual properties. Text features typically outperform non-text features in sentiment analysis and emotion recognition tasks, in part because text features are derived from advanced language models or word embeddings trained on massive data sources, while audio and video features are human-engineered and comparatively underdeveloped. Given that the text, audio, and video describe the same utterance in different ways, we hypothesize that multimodal sentiment analysis and emotion recognition can be improved by learning (hidden) correlations between features extracted from the outer product of text and audio (which we call text-based audio) and the analogous text-based video. This paper proposes a novel model, the Interaction Canonical Correlation Network (ICCN), to learn such multimodal embeddings. ICCN learns correlations between all three modes via deep canonical correlation analysis (DCCA), and the proposed embeddings are then tested on several benchmark datasets against other state-of-the-art multimodal embedding algorithms. Empirical results and ablation studies confirm the effectiveness of ICCN in capturing useful information from all three views.
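
A minimal sketch of the two ideas named in the abstract follows: an outer-product construction for the text-based audio and text-based video views, and a deep-CCA-style objective that maximizes the total canonical correlation between them on a mini-batch. This is an illustration of DCCA under toy dimensions, not the ICCN reference implementation; the helper names and the regularization constant are assumptions.

import torch

def outer_product_features(text, other):
    # text: (B, Dt), other: (B, Da) -> (B, Dt*Da) flattened outer product per sample.
    return torch.einsum("bi,bj->bij", text, other).flatten(start_dim=1)

def _inv_sqrt(m):
    # Inverse matrix square root of a symmetric positive-definite matrix.
    w, v = torch.linalg.eigh(m)
    return v @ torch.diag(w.clamp_min(1e-12).rsqrt()) @ v.t()

def cca_loss(h1, h2, eps=1e-4):
    # Negative sum of canonical correlations between two mini-batch views (B, D1) and (B, D2).
    b = h1.size(0)
    h1 = h1 - h1.mean(dim=0, keepdim=True)
    h2 = h2 - h2.mean(dim=0, keepdim=True)
    s11 = h1.t() @ h1 / (b - 1) + eps * torch.eye(h1.size(1))
    s22 = h2.t() @ h2 / (b - 1) + eps * torch.eye(h2.size(1))
    s12 = h1.t() @ h2 / (b - 1)
    t = _inv_sqrt(s11) @ s12 @ _inv_sqrt(s22)
    return -torch.linalg.svdvals(t).sum()   # singular values of t are the canonical correlations

# Toy usage: text (B, 8), audio (B, 4), video (B, 4) features for one mini-batch.
text, audio, video = torch.randn(32, 8), torch.randn(32, 4), torch.randn(32, 4)
text_audio = outer_product_features(text, audio)   # "text-based audio" view, (32, 32)
text_video = outer_product_features(text, video)   # "text-based video" view, (32, 32)
loss = cca_loss(text_audio, text_video)            # minimizing this maximizes correlation between views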