Tell What You Hear From What You See - Video to Audio Generation Through Text
Neural Information Processing Systems
The content of visual and audio scenes is multi-faceted: a video stream can be paired with various plausible audio streams, and vice versa. In the video-to-audio generation task, it is therefore imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input and generates audio together with an optional textual description (caption) of that audio. Such a framework has two unique advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate for a video by producing audio captions.
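To make the two usage modes concrete, below is a minimal sketch of the interface the abstract describes: video plus an optional text prompt in, audio plus an optional caption out. All names (`VATT`, `generate_audio`, `generate_caption`, the frame rate and sample rate) are illustrative assumptions, not the authors' released API, and the bodies are placeholders standing in for the actual generative model.

```python
# Hypothetical sketch of the VATT interface described in the abstract.
# Class and method names are assumptions for illustration only.

from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class VATTOutput:
    audio: np.ndarray        # generated waveform, shape (num_samples,)
    caption: Optional[str]   # textual description of the audio, if produced


class VATT:
    """Multi-modal framework: video (+ optional text prompt) -> audio (+ caption)."""

    def generate_audio(self, video: np.ndarray, prompt: Optional[str] = None) -> np.ndarray:
        """Generate an audio track for the video; the text prompt, if given,
        steers generation toward the described sound (placeholder body)."""
        num_samples = video.shape[0] * 16000 // 25  # assume 25 fps video, 16 kHz audio
        return np.zeros(num_samples)                # stand-in for the real waveform

    def generate_caption(self, video: np.ndarray) -> str:
        """Suggest what audio fits the video by producing an audio caption
        (placeholder body)."""
        return "a dog barking in the distance"

    def __call__(self, video: np.ndarray, prompt: Optional[str] = None) -> VATTOutput:
        # Without a prompt, the model first suggests a caption, then generates
        # audio conditioned on it; with a prompt, the prompt steers generation.
        caption = self.generate_caption(video) if prompt is None else None
        audio = self.generate_audio(video, prompt if prompt is not None else caption)
        return VATTOutput(audio=audio, caption=caption)


if __name__ == "__main__":
    model = VATT()
    video = np.random.rand(250, 224, 224, 3)  # toy input: 10 s of 25 fps frames
    # Controlled mode: text refines the generated audio.
    steered = model(video, prompt="waves crashing on a rocky shore")
    # Uncontrolled mode: the model proposes what audio to generate via a caption.
    suggested = model(video)
    print(steered.audio.shape, suggested.caption)
```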