Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video

Yao, David Yifan; Zhai, Albert J.; Wang, Shenlong

Mar-27-2025–arXiv.org Artificial Intelligence

This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.

artificial intelligence, image understanding, machine learning

Country:
- North America > United States
  - Illinois (0.04)
- Europe
  - Italy > Tuscany
    - Florence (0.04)
  - Greece > Ionian Islands
    - Corfu (0.04)
- Asia > Japan
  - Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre:
- Research Report > New Finding (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Vision > Image Understanding (0.56)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)
