Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model

Huang, Haoyang; Ma, Guoqing; Duan, Nan; Chen, Xing; Wan, Changyi; Ming, Ranchen; Wang, Tianyu; Wang, Bo; Lu, Zhiying; Li, Aojie; Zeng, Xianfang; Zhang, Xinhao; Yu, Gang; Yin, Yuhe; Wu, Qiling; Sun, Wen; An, Kang; Han, Xin; Sun, Deshan; Ji, Wei; Huang, Bizhu; Li, Brian; Wu, Chenfei; Huang, Guanzhe; Xiong, Huixin; He, Jiaxin; Wu, Jianchang; Yuan, Jianlong; Wu, Jie; Liu, Jiashuai; Guo, Junjing; Tan, Kaijun; Chen, Liangyu; Chen, Qiaohui; Sun, Ran; Yuan, Shanshan; Yin, Shengming; Liu, Sitong; Chen, Wei; Dai, Yaqi; Luo, Yuchu; Ge, Zheng; Guan, Zhisheng; Song, Xiaoniu; Zhou, Yu; Jiao, Binxing; Chen, Jiansheng; Li, Jing; Zhou, Shuchang; Zhang, Xiangyu; Xiu, Yi; Zhu, Yibo; Shum, Heung-Yeung; Jiang, Daxin

arXiv.org Artificial Intelligence 

We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video (TI2V) generation model with 30B parameters that generates videos of up to 102 frames conditioned on both text and image inputs. We also build Step-Video-TI2V-Eval, a new benchmark for the TI2V task, and use it to compare Step-Video-TI2V with open-source and commercial TI2V engines. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V on the image-to-video generation task.