Tell What You Hear From What You See - Video to Audio Generation Through Text
The content of visual and audio scenes is multi-faceted, such that a video stream can be paired with various audio streams and vice versa. Therefore, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input and generates audio and an optional textual description (caption) of the audio. Such a framework has two unique advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate for the video by producing audio captions.
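The abstract describes a model that maps a video, plus an optional text prompt, to an audio waveform and an optional audio caption. The sketch below only illustrates that interface; every name in it (VideoToAudioModel, GenerationResult, generate) is hypothetical and not taken from the paper or any released code.

```python
# Interface sketch of a controllable video-to-audio generator as described in
# the abstract. All class/function names here are hypothetical placeholders.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class GenerationResult:
    audio: np.ndarray          # generated waveform, e.g. shape (num_samples,)
    caption: Optional[str]     # textual description of the audio, if requested


class VideoToAudioModel:
    """Hypothetical wrapper around a video-to-audio generator with text control."""

    def generate(self,
                 video_frames: np.ndarray,
                 text_prompt: Optional[str] = None,
                 return_caption: bool = False) -> GenerationResult:
        # 1. Encode the video frames into visual features.
        # 2. If a text prompt is given, use it to steer/refine the generation.
        # 3. Decode an audio waveform; optionally also decode an audio caption.
        raise NotImplementedError("placeholder for the actual generative model")


# Usage illustrating the two behaviours described in the abstract:
#   out = model.generate(frames, return_caption=True)               # model proposes audio + caption
#   out = model.generate(frames, text_prompt="rain on a tin roof")  # text-steered audio
```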
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Akbari, Hassan; Yuan, Liangzhe; Qian, Rui; Chuang, Wei-Hong; Chang, Shih-Fu; Cui, Yin; Gong, Boqing
We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer that shares weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures on the downstream tasks. In particular, VATT's vision Transformer achieves top-1 accuracies of 82.1% on Kinetics-400, 83.6% on Kinetics-600, 72.7% on Kinetics-700, and 41.1% on Moments in Time, setting new records while avoiding supervised pre-training. Transferring to image classification yields 78.7% top-1 accuracy on ImageNet, compared to 64.7% when training the same Transformer from scratch, showing the generalizability of our model despite the domain gap between videos and images. VATT's audio Transformer also sets a new record on waveform-based audio event recognition, achieving an mAP of 39.4% on AudioSet without any supervised pre-training.
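The abstract mentions training end-to-end with multimodal contrastive losses. Below is a minimal, generic InfoNCE-style sketch of one pairwise loss in PyTorch; the function name, symmetric formulation, and temperature value are illustrative assumptions, not the paper's exact objective, which may differ in formulation and hyperparameters.

```python
# A minimal pairwise contrastive (InfoNCE-style) loss between two modalities,
# sketching the kind of objective the abstract refers to. This is a generic
# formulation, not the paper's exact loss.
import torch
import torch.nn.functional as F


def pairwise_contrastive_loss(z_a: torch.Tensor,
                              z_b: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings of the same clips from two modalities."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Matched pairs sit on the diagonal; every other clip in the batch is a
    # negative. Average the loss over both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Example: video and audio embeddings from the same batch of clips.
#   loss = pairwise_contrastive_loss(video_emb, audio_emb)
```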