Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

Jun-16-2026, 04:18:43 GMT–Neural Information Processing Systems

We present MGAudio, a novel flow-based framework for open-domain video-toaudio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifierfree guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudioachieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics.

large language model, machine learning, mgaudio, (20 more...)

Neural Information Processing Systems

Jun-16-2026, 04:18:43 GMT

Conferences PDF

Add feedback

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.67)

Industry:
- Leisure & Entertainment > Sports (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Machine Learning > Neural Networks (1.00)
  - Natural Language > Large Language Model (0.68)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found