Bench to Time-lapse Video Generation

Neural Information Processing Systems 

The emergence of large-scale text-to-image models [92, 60, 59, 58, 42, 5, 94, 14, 54, 40] has significantly advanced the field of Text-to-Video (T2V) generation [66, 6, 7, 21, 73, 90]. Existing T2V architectures fall into two categories: U-Net-based and DiT-based. The latter aim to recreate, in open-source form, structures similar to Sora [9], using the DiT (Diffusion Transformer) [57] framework for T2V generation [43, 95, 93, 20].

When calculating the MTScore, the video retrieval model uses these texts to evaluate each frame of the video, assigning probabilities based on the matches. The final result is obtained by summing the general probability and the metamorphic probability.
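The aggregation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `aggregate_match_probabilities`, the split of texts into index lists, and the choice to average each text's probability over frames are all assumptions made for illustration, and the per-frame probabilities are taken as given rather than computed by an actual retrieval model.

```python
def aggregate_match_probabilities(frame_probs, general_idx, metamorphic_idx):
    """Hypothetical helper: combine per-frame text-match probabilities.

    frame_probs: list of frames, each a list with one probability per text.
    general_idx / metamorphic_idx: positions of the texts in each group
    (this grouping is an assumption, not taken from the source).
    """
    if not frame_probs:
        raise ValueError("at least one frame is required")

    n_frames = len(frame_probs)
    n_texts = len(frame_probs[0])

    # Assumption: each text's probability is averaged over all frames.
    per_text = [
        sum(frame[i] for frame in frame_probs) / n_frames
        for i in range(n_texts)
    ]

    # Group-level probabilities, then the sum described in the text.
    general = sum(per_text[i] for i in general_idx)
    metamorphic = sum(per_text[i] for i in metamorphic_idx)
    return {
        "general": general,
        "metamorphic": metamorphic,
        "total": general + metamorphic,
    }


# Two frames scored against four texts (made-up numbers).
example = aggregate_match_probabilities(
    frame_probs=[[0.1, 0.2, 0.3, 0.4], [0.3, 0.2, 0.1, 0.4]],
    general_idx=[0, 1],
    metamorphic_idx=[2, 3],
)
```

Returning the two group probabilities alongside their sum keeps the intermediate quantities visible, which may help when checking how a particular video was scored.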
