Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov

arXiv.org Artificial Intelligence 

Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
