When and Where do Events Switch in Multi-Event Video Generation?
Ruotong Liao, Guowen Huang, Qing Cheng, Thomas Seidl, Daniel Cremers, Volker Tresp
–arXiv.org Artificial Intelligence
Text-to-video (T2V) generation faces a persistent challenge: producing a long video that depicts multiple sequential events with temporal coherence and controllable content. Existing methods that extend T2V models to multi-event generation do not examine the intrinsic factors that govern event shifts. This paper asks a central question: when and where do multi-event prompts control event transitions during T2V generation? It introduces MEve, a self-curated prompt suite for evaluating multi-event T2V generation, and conducts a systematic study of two representative model families, OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention, both in the denoising steps and in the block-wise model layers, revealing an essential factor for multi-event video generation and highlighting possibilities for multi-event conditioning in future models.
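The finding that event transitions are decided by early denoising steps can be illustrated with a toy sketch. This is not the paper's method: the names (`generate_with_switch`, `switch_step`) and the loop itself are hypothetical stand-ins for the conditioning schedule inside a real diffusion model such as OpenSora or CogVideoX, which this sketch does not reproduce.

```python
# Toy illustration of switching the conditioning prompt early in a
# denoising loop. In a real T2V model, each step would update a video
# latent; here we only record which prompt conditions each step.

def generate_with_switch(prompts, num_steps=50, switch_step=10):
    """Condition on prompts[0] until `switch_step`, then on prompts[1].

    A small `switch_step` (early intervention) gives the second event
    prompt control over most denoising steps, echoing the paper's
    observation that early steps decide event transitions.
    """
    schedule = []
    for t in range(num_steps):
        cond = prompts[0] if t < switch_step else prompts[1]
        # Real model would do: latent = denoise(latent, cond, t)
        schedule.append(cond)
    return schedule

schedule = generate_with_switch(
    ["a cat sits on a sofa", "the cat jumps off the sofa"],
    num_steps=50,
    switch_step=10,
)
```

Varying `switch_step` is the kind of intervention the study sweeps: too late a switch leaves the second event little influence over the generated video.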
Oct-6-2025