VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Jun-21-2026, 23:23:48 GMT – Neural Information Processing Systems

Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content because of their limited inherent ability to understand physics accurately. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To address this, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physically plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, which leverages spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods.
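
The abstract does not give the exact form of the TRD loss, so the following is only a minimal sketch of one plausible reading of "aligning token-level relations": build a pairwise cosine-similarity matrix over the tokens of each model and penalize the difference between the two matrices. The function names, the choice of cosine similarity, and the mean-absolute-difference penalty are all assumptions made for illustration, not the authors' implementation. Tokens are assumed to be flattened across frames and spatial positions, so a single matrix covers both spatial and temporal pairs.

```python
import math


def cosine_similarity_matrix(tokens):
    """Pairwise cosine similarity between token feature vectors.

    `tokens` is a list of equal-length feature vectors, one per
    spatio-temporal token (hypothetical layout, assumed for this sketch).
    """
    # Guard against zero vectors by falling back to a norm of 1.0.
    norms = [math.sqrt(sum(x * x for x in t)) or 1.0 for t in tokens]
    return [
        [sum(a * b for a, b in zip(ti, tj)) / (ni * nj)
         for tj, nj in zip(tokens, norms)]
        for ti, ni in zip(tokens, norms)
    ]


def relation_alignment_loss(student_tokens, teacher_tokens):
    """Mean absolute difference between the two token-relation matrices.

    Assumption: both inputs contain the same number of tokens in the same
    order. Their feature dimensions may differ, because only the
    token-to-token relations are compared, never the raw features.
    """
    s = cosine_similarity_matrix(student_tokens)
    t = cosine_similarity_matrix(teacher_tokens)
    n = len(s)
    return sum(abs(s[i][j] - t[i][j])
               for i in range(n) for j in range(n)) / (n * n)


# Hypothetical toy features: 3 tokens from a teacher (dim 2) and a student (dim 3).
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_aligned = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
student_collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

loss_aligned = relation_alignment_loss(student_aligned, teacher)
loss_collapsed = relation_alignment_loss(student_collapsed, teacher)
```

In this sketch the student is penalized only when its tokens relate to one another differently from the teacher's tokens, which is one way to read the "soft guidance" the abstract contrasts with direct feature alignment: the student with matching relations incurs essentially no loss even though its feature dimension and scale differ from the teacher's.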

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Leisure & Entertainment (0.46)
- Education > Curriculum
  - Subject-Specific Education (0.40)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)
