Tell What You Hear From What You See - Video to Audio Generation Through Text

May-27-2025, 13:54:20 GMT–Neural Information Processing Systems

The content of visual and audio scenes is multi-faceted, such that a video stream can be paired with various audio streams and vice versa. Therefore, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While Video-to-Audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and an optional textual description (caption) of the audio. Such a framework has two unique advantages: i) the Video-to-Audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate for the video by generating audio captions.

audio caption, audio generation, video

Technology:
- Information Technology > Artificial Intelligence (0.39)