Tell What You Hear From What You See - Video to Audio Generation Through Text
–Neural Information Processing Systems
The content of visual and audio scenes is multi-faceted such that a video stream canbe paired with various audio streams and vice-versa. Thereby, in video-to-audiogeneration task, it is imperative to introduce steering approaches for controlling thegenerated audio. While Video-to-Audio generation is a well-established generativetask, existing methods lack such controllability. In this work, we propose VATT, amulti-modal generative framework that takes a video and an optional text promptas input, and generates audio and optional textual description (caption) of theaudio. Such a framework has two unique advantages: i) Video-to-Audio generationprocess can be refined and controlled via text which complements the contextof the visual information, and ii) The model can suggest what audio to generatefor the video by generating audio captions.
Neural Information Processing Systems
May-27-2025, 13:54:20 GMT
- Technology: