JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Liu, Kai, Li, Wei, Chen, Lai, Wu, Shengqiong, Zheng, Yanhao, Ji, Jiayi, Zhou, Fan, Jiang, Rongxin, Luo, Jiebo, Fei, Hao, Chua, Tat-Seng

Mar-30-2025–arXiv.org Artificial Intelligence

This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. Further, we specifically devise a robust metric for evaluating the synchronization between generated audio-video pairs in real-world complex content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Mar-30-2025

arXiv.org PDF

Add feedback

Country:
- Asia
  - Singapore (0.04)
  - China (0.04)
  - Japan > Honshū
    - Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre:
- Research Report > New Finding (0.34)

Industry:
- Media > Music (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found