Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation
Zhang, Kang; Pham, Trung X.; Lee, Suyeon; Niu, Axi; Senocak, Arda; Chung, Joon Son
arXiv.org Artificial Intelligence
We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation that introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism in which the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40 and substantially surpassing the best classifier-free guidance baselines; it consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio
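The headline number above is a Fréchet Audio Distance (FAD), where lower is better. As a rough illustration of what that metric measures, the sketch below computes the Fréchet distance between two Gaussians fitted to embedding sets. It is a minimal sketch, not the paper's evaluation code: the function names `embedding_stats` and `frechet_distance` are hypothetical, the random arrays stand in for real audio embeddings, and the abstract does not say which embedding model the reported FAD uses.

```python
import numpy as np


def embedding_stats(embeddings):
    """Mean vector and covariance matrix of an (n_samples, dim) embedding array."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    return mu, cov


def frechet_distance(mu_a, cov_a, mu_b, cov_b):
    """Frechet distance between N(mu_a, cov_a) and N(mu_b, cov_b).

    Computes ||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 (cov_a cov_b)^{1/2}).
    """
    diff = mu_a - mu_b
    # Tr((cov_a cov_b)^{1/2}) is the sum of square roots of the eigenvalues
    # of cov_a @ cov_b; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder data: in practice these would be embeddings of reference
    # and generated audio clips from a pretrained audio model (assumption).
    reference = rng.normal(size=(500, 8))
    generated = rng.normal(loc=0.1, size=(500, 8))
    score = frechet_distance(*embedding_stats(reference), *embedding_stats(generated))
    print(f"Frechet distance on toy embeddings: {score:.4f}")
```

Identical statistics give a distance of zero, and the value grows as either the means or the covariances of the two embedding sets drift apart.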
Oct-29-2025
- Country:
- Asia
- China (0.04)
- South Korea (0.04)
- North America > United States (0.04)
- South America > Chile
- Genre:
- Research Report > Promising Solution (0.46)
- Industry:
- Leisure & Entertainment > Sports (0.68)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.93)
- Natural Language (1.00)
- Representation & Reasoning (1.00)
- Vision (1.00)