Tell What You Hear From What You See - Video to Audio Generation Through Text
Neural Information Processing Systems
The content of visual and audio scenes is multi-faceted: a video stream can be paired with various plausible audio streams, and vice versa. In the video-to-audio generation task, it is therefore imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input and generates audio together with an optional textual description (caption) of that audio. Such a framework has two unique advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate for a video by producing audio captions.
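To make the two usage modes concrete, below is a minimal sketch of the interface the abstract describes: video plus an optional text prompt in, audio plus an optional caption out. All names (`VATT`, `generate_audio`, `generate_caption`, the frame rate and sample rate) are illustrative assumptions, not the authors' released API, and the bodies are placeholders standing in for the actual generative model.

```python
# Hypothetical sketch of the VATT interface described in the abstract.
# Class and method names are assumptions for illustration only.

from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class VATTOutput:
    audio: np.ndarray        # generated waveform, shape (num_samples,)
    caption: Optional[str]   # textual description of the audio, if produced


class VATT:
    """Multi-modal framework: video (+ optional text prompt) -> audio (+ caption)."""

    def generate_audio(self, video: np.ndarray, prompt: Optional[str] = None) -> np.ndarray:
        """Generate an audio track for the video; the text prompt, if given,
        steers generation toward the described sound (placeholder body)."""
        num_samples = video.shape[0] * 16000 // 25  # assume 25 fps video, 16 kHz audio
        return np.zeros(num_samples)                # stand-in for the real waveform

    def generate_caption(self, video: np.ndarray) -> str:
        """Suggest what audio fits the video by producing an audio caption
        (placeholder body)."""
        return "a dog barking in the distance"

    def __call__(self, video: np.ndarray, prompt: Optional[str] = None) -> VATTOutput:
        # Without a prompt, the model first suggests a caption, then generates
        # audio conditioned on it; with a prompt, the prompt steers generation.
        caption = self.generate_caption(video) if prompt is None else None
        audio = self.generate_audio(video, prompt if prompt is not None else caption)
        return VATTOutput(audio=audio, caption=caption)


if __name__ == "__main__":
    model = VATT()
    video = np.random.rand(250, 224, 224, 3)  # toy input: 10 s of 25 fps frames
    # Controlled mode: text refines the generated audio.
    steered = model(video, prompt="waves crashing on a rocky shore")
    # Uncontrolled mode: the model proposes what audio to generate via a caption.
    suggested = model(video)
    print(steered.audio.shape, suggested.caption)
```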