Huang, Shijie
Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
Peng, Xiangyu, Zheng, Zangwei, Shen, Chenhui, Young, Tom, Guo, Xinying, Wang, Binluo, Xu, Hang, Liu, Hongxin, Jiang, Mingyan, Li, Wenjun, Wang, Yuhui, Ye, Anbang, Ren, Gang, Ma, Qianran, Liang, Wanying, Lian, Xiang, Wu, Xiwen, Zhong, Yuting, Li, Zhuangyan, Gong, Chaoyu, Lei, Guojun, Cheng, Leijun, Zhang, Limin, Li, Minghao, Zhang, Ruijie, Hu, Silan, Huang, Shijie, Wang, Xiaokang, Zhao, Yuanheng, Wang, Yuqi, Wei, Ziang, You, Yang
Video generation models have achieved remarkable progress in the past year. The quality of AI video continues to improve, but at the cost of larger model size, increased data quantity, and greater demand for training compute. In this report, we present Open-Sora 2.0, a commercial-level video generation model trained for only $200k. With this model, we demonstrate that the cost of training a top-performing video generation model is highly controllable. We detail all techniques that contribute to this efficiency breakthrough, including data curation, model architecture, training strategy, and system optimization. According to human evaluation results and VBench scores, Open-Sora 2.0 is comparable to globally leading video generation models, including the open-source HunyuanVideo and the closed-source Runway Gen-3 Alpha. By making Open-Sora 2.0 fully open-source, we aim to democratize access to advanced video generation technology, fostering broader innovation and creativity in content creation. All resources are publicly available at: https://github.com/hpcaitech/Open-Sora.
ProcessPainter: Learn Painting Process from Sequence Data
Song, Yiren, Huang, Shijie, Yao, Chen, Ye, Xiaojun, Ci, Hai, Liu, Jiaming, Zhang, Yuxuan, Shou, Mike Zheng
The painting process of artists is inherently stepwise and varies significantly among different painters and styles. Generating detailed, step-by-step painting processes is essential for art education and research, yet remains largely underexplored. Traditional stroke-based rendering methods break down images into sequences of brushstrokes, yet they fall short of replicating the authentic processes of artists, as they are limited to basic brushstroke modifications. Text-to-image models based on diffusion processes generate images through iterative denoising, which also diverges substantially from artists' painting processes. To address these challenges, we introduce ProcessPainter, a text-to-video model that is initially pre-trained on synthetic data and subsequently fine-tuned on a select set of artists' painting sequences using LoRA. This approach successfully generates painting processes from text prompts for the first time. Furthermore, we introduce an Artwork Replication Network capable of accepting arbitrary-frame input, which facilitates controlled generation of painting processes, decomposition of images into painting sequences, and completion of semi-finished artworks. This paper offers new perspectives and tools for advancing art education and image generation technology.
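To make the two-stage recipe in the abstract concrete, here is a minimal sketch, not the authors' code, of LoRA fine-tuning on painting sequences: a pre-trained text-to-video denoiser is frozen and only low-rank adapters are trained. The tiny stand-in backbone, the tensor shapes, and all hyperparameters are illustrative assumptions; only the general idea of pre-training then LoRA fine-tuning follows the abstract.

```python
# Minimal LoRA fine-tuning sketch (assumed setup, not ProcessPainter's code).
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Hypothetical stand-in for a pre-trained text-to-video denoiser.
class TinyVideoDenoiser(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):          # x: (batch, frames, dim) latent tokens
        h, _ = self.attn(x, x, x)  # temporal attention across frames
        return self.proj(h)        # predict the noise residual

model = TinyVideoDenoiser()

# Freeze the backbone and inject LoRA adapters into the named linear layer,
# so only a few low-rank matrices are updated on the painting sequences.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["proj"])
model = get_peft_model(model, lora_cfg)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy "painting sequence" batch: 8 frames of 64-dim latents (assumed shape).
frames = torch.randn(2, 8, 64)
noise = torch.randn_like(frames)
noisy = frames + noise             # crude stand-in for diffusion noising

pred = model(noisy)
loss = nn.functional.mse_loss(pred, noise)  # standard denoising objective
loss.backward()
optimizer.step()
```

Because only the adapter weights receive gradients, a few artists' sequences suffice to specialize the model to a painting style without disturbing the pre-trained backbone, which is the usual motivation for LoRA in this setting.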