SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos

Dellali, Amir, Lanzendörfer, Luca A., Grötschla, Florian, Wattenhofer, Roger

arXiv.org Artificial Intelligence 

We propose SALSA-V, a multimodal video-to-audio generation model capable of synthesizing highly synchronized, high-fidelity long-form audio from silent video content. Our approach introduces a masked diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve rapid generation of high-quality audio samples in as few as eight sampling steps, paving the way for near-real-time applications without requiring dedicated fine-tuning or retraining. We demonstrate that SALSA-V significantly outperforms existing state-of-the-art methods in both audiovisual alignment and synchronization with video content, in quantitative evaluation as well as in a human listening study. Furthermore, our use of random masking during training enables our model to match spectral characteristics of reference audio samples, broadening its applicability to professional audio synthesis tasks such as Foley generation and sound design.

Video-to-audio (V2A) generation, sometimes referred to as "computational Foley", aims to produce realistic sounds for the visual events occurring in a silent video clip. Unlike background music or speech synthesis, Foley focuses on diegetic sounds, that is, sounds implied by the current on-screen content (e.g., the sound of rain and thunder when a storm is shown, or a dog's bark echoing in a room). Achieving realism requires both semantic alignment (the model must recognize what is happening so it can select the right acoustic event) and temporal alignment (it must identify when that event occurs). Temporal alignment is especially crucial, as humans are sensitive to asynchrony of as little as tens of milliseconds (Keetels & Vroomen, 2005). Early generative machine learning models for video-to-audio were trained from scratch on modestly sized audio-visual corpora and struggled to cover the acoustic diversity of in-the-wild video.
Recent work has addressed this issue by borrowing scale from adjacent modalities.
