Bench to Time-lapse Video Generation
Neural Information Processing Systems
The emergence of large-scale text-to-image models [92, 60, 59, 58, 42, 5, 94, 14, 54, 40] has significantly advanced the field of Text-to-Video (T2V) generation [66, 6, 7, 21, 73, 90]. Existing T2V architectures fall into two categories: U-Net-based and DiT-based. The latter aims to recreate open-source structures similar to Sora [9], using the DiT (Diffusion Transformer) [57] framework for T2V generation [43, 95, 93, 20].

When calculating the MTScore, the video retrieval model uses these texts to evaluate each frame of the video, assigning probabilities based on the matches. The final result is obtained by summing the general probability and the metamorphic probability.
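
The final summation step can be illustrated with a minimal sketch. It is not the benchmark's implementation: the function names, the input layout (one row of per-text probabilities per frame), the averaging over frames, and the grouping of texts by index are all assumptions made for illustration. The video retrieval model itself is not modeled here; its per-frame probabilities are taken as given inputs.

```python
from statistics import fmean
from typing import Sequence


def group_probability(frame_probs: Sequence[Sequence[float]],
                      text_indices: Sequence[int]) -> float:
    """Mean over frames of the probability assigned to the selected texts.

    Assumption: frame_probs[f][t] is the probability the retrieval model
    assigned to text t for frame f. Averaging over frames is also an
    assumption; the source only says that each frame is evaluated.
    """
    if not frame_probs:
        raise ValueError("need at least one frame")
    per_frame = [sum(frame[i] for i in text_indices) for frame in frame_probs]
    return fmean(per_frame)


def mt_score_sketch(frame_probs: Sequence[Sequence[float]],
                    general_idx: Sequence[int],
                    metamorphic_idx: Sequence[int]) -> float:
    """Sum the general and the metamorphic probability, as the text describes."""
    general = group_probability(frame_probs, general_idx)
    metamorphic = group_probability(frame_probs, metamorphic_idx)
    return general + metamorphic


# Hypothetical values: two frames, three candidate texts.
# Column 0 is treated as a "general" text, column 1 as a "metamorphic" text,
# and column 2 as some other text; this split is purely illustrative.
example_probs = [
    [0.25, 0.5, 0.25],
    [0.5, 0.25, 0.25],
]
example_score = mt_score_sketch(example_probs, general_idx=[0], metamorphic_idx=[1])
```

Keeping the two group probabilities as separate intermediate values makes it easy to inspect how much of the score comes from each group before they are summed.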
Feb-9-2026, 15:56:51 GMT