Lin, Junzhe
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Huang, Ailin, Wu, Boyong, Wang, Bruce, Yan, Chao, Hu, Chen, Feng, Chengli, Tian, Fei, Shen, Feiyu, Li, Jingbei, Chen, Mingrui, Liu, Peng, Miao, Ruihang, You, Wang, Chen, Xi, Yang, Xuerui, Huang, Yechang, Zhang, Yuxiang, Gong, Zheng, Zhang, Zixin, Zhou, Hongyu, Sun, Jianjian, Li, Brian, Feng, Chengting, Wan, Changyi, Hu, Hanpeng, Wu, Jianchang, Zhen, Jiangjie, Ming, Ranchen, Yuan, Song, Zhang, Xuelin, Zhou, Yu, Li, Bingxin, Ma, Buyun, Wang, Hongyuan, An, Kang, Ji, Wei, Li, Wen, Wen, Xuan, Kong, Xiangwen, Ma, Yuankai, Liang, Yuanwei, Mou, Yun, Ahmidi, Bahtiyar, Wang, Bin, Li, Bo, Miao, Changxin, Xu, Chen, Wang, Chenrun, Shi, Dapeng, Sun, Deshan, Hu, Dingyuan, Sai, Dula, Liu, Enle, Huang, Guanzhe, Yan, Gulin, Wang, Heng, Jia, Haonan, Zhang, Haoyang, Gong, Jiahao, Guo, Junjing, Liu, Jiashuai, Liu, Jiahong, Feng, Jie, Wu, Jie, Wu, Jiaoren, Yang, Jie, Wang, Jinguo, Zhang, Jingyang, Lin, Junzhe, Li, Kaixiang, Xia, Lei, Zhou, Li, Zhao, Liang, Gu, Longlong, Chen, Mei, Wu, Menglin, Li, Ming, Li, Mingxiao, Li, Mingliang, Liang, Mingyao, Wang, Na, Hao, Nie, Wu, Qiling, Tan, Qinyuan, Sun, Ran, Shuai, Shuai, Pang, Shaoliang, Yang, Shiliang, Gao, Shuli, Yuan, Shanshan, Liu, Siqi, Deng, Shihong, Jiang, Shilei, Liu, Sitong, Cao, Tiancheng, Wang, Tianyu, Deng, Wenjin, Xie, Wuxun, Ming, Weipeng, He, Wenqing, Sun, Wen, Han, Xin, Huang, Xin, Deng, Xiaomin, Liu, Xiaojia, Wu, Xin, Zhao, Xu, Wei, Yanan, Yu, Yanbo, Cao, Yang, Li, Yangguang, Ma, Yangzhen, Xu, Yanming, Wang, Yaoyu, Shi, Yaqiang, Wang, Yilei, Zhou, Yizhuang, Zhong, Yinmin, Zhang, Yang, Wei, Yaoben, Luo, Yu, Lu, Yuanwei, Yin, Yuhe, Luo, Yuchu, Ding, Yuanhao, Yan, Yuting, Dai, Yaqi, Yang, Yuxiang, Xie, Zhe, Ge, Zheng, Sun, Zheng, Huang, Zhewei, Chang, Zhichao, Guan, Zhisheng, Yang, Zidong, Zhang, Zili, Jiao, Binxing, Jiang, Daxin, Shum, Heung-Yeung, Chen, Jiansheng, Li, Jing, Zhou, Shuchang, Zhang, Xiangyu, Zhang, Xinhao, Zhu, Yibo
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Ma, Guoqing, Huang, Haoyang, Yan, Kun, Chen, Liangyu, Duan, Nan, Yin, Shengming, Wan, Changyi, Ming, Ranchen, Song, Xiaoniu, Chen, Xing, Zhou, Yu, Sun, Deshan, Zhou, Deyu, Zhou, Jian, Tan, Kaijun, An, Kang, Chen, Mei, Ji, Wei, Wu, Qiling, Sun, Wen, Han, Xin, Wei, Yanan, Ge, Zheng, Li, Aojie, Wang, Bin, Huang, Bizhu, Wang, Bo, Li, Brian, Miao, Changxing, Xu, Chen, Wu, Chenfei, Yu, Chenguang, Shi, Dapeng, Hu, Dingyuan, Liu, Enle, Yu, Gang, Yang, Ge, Huang, Guanzhe, Yan, Gulin, Feng, Haiyang, Nie, Hao, Jia, Haonan, Hu, Hanpeng, Chen, Hanqi, Yan, Haolong, Wang, Heng, Guo, Hongcheng, Xiong, Huilin, Xiong, Huixin, Gong, Jiahao, Wu, Jianchang, Wu, Jiaoren, Wu, Jie, Yang, Jie, Liu, Jiashuai, Li, Jiashuo, Zhang, Jingyang, Guo, Junjing, Lin, Junzhe, Li, Kaixiang, Liu, Lei, Xia, Lei, Zhao, Liang, Tan, Liguo, Huang, Liwen, Shi, Liying, Li, Ming, Li, Mingliang, Cheng, Muhua, Wang, Na, Chen, Qiaohui, He, Qinglin, Liang, Qiuyan, Sun, Quan, Sun, Ran, Wang, Rui, Pang, Shaoliang, Yang, Shiliang, Liu, Sitong, Liu, Siqi, Gao, Shuli, Cao, Tiancheng, Wang, Tianyu, Ming, Weipeng, He, Wenqing, Zhao, Xu, Zhang, Xuelin, Zeng, Xianfang, Liu, Xiaojia, Yang, Xuan, Dai, Yaqi, Yu, Yanbo, Li, Yang, Deng, Yineng, Wang, Yingming, Wang, Yilei, Lu, Yuanwei, Chen, Yu, Luo, Yu, Luo, Yuchu, Yin, Yuhe, Feng, Yuheng, Yang, Yuxiang, Tang, Zecheng, Zhang, Zekai, Yang, Zidong, Jiao, Binxing, Chen, Jiansheng, Li, Jing, Zhou, Shuchang, Zhang, Xiangyu, Zhang, Xinhao, Zhu, Yibo, Shum, Heung-Yeung, Jiang, Daxin
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
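To make the compression ratios quoted in the abstract above concrete, the following is a minimal arithmetic sketch, not the Video-VAE implementation. The function names, the 512x512 resolution, and the choice to round up when a dimension is not divisible by its ratio are all assumptions made for illustration; the actual model may pad or crop differently.

```python
import math


def compression_factor(t_ratio=8, s_ratio=16):
    """Positions in pixel space covered by one latent position (per channel)."""
    return t_ratio * s_ratio * s_ratio


def latent_shape(frames, height, width, t_ratio=8, s_ratio=16):
    """Hypothetical latent grid size, rounding up (an assumption, not the paper's rule)."""
    return (
        math.ceil(frames / t_ratio),
        math.ceil(height / s_ratio),
        math.ceil(width / s_ratio),
    )


if __name__ == "__main__":
    # 8x temporal and 16x16 spatial compression, as stated in the abstract.
    print(compression_factor())
    # 204 frames at an illustrative 512x512 resolution.
    print(latent_shape(204, 512, 512))
```

Under these assumptions, each latent position summarizes 8 x 16 x 16 = 2048 pixel positions per channel, which is why full 3D attention over the latent grid is far cheaper than attention over raw frames.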
DNN Partitioning, Task Offloading, and Resource Allocation in Dynamic Vehicular Networks: A Lyapunov-Guided Diffusion-Based Reinforcement Learning Approach
Liu, Zhang, Du, Hongyang, Lin, Junzhe, Gao, Zhibin, Huang, Lianfen, Hosseinalipour, Seyyedali, Niyato, Dusit
The rapid advancement of Artificial Intelligence (AI) has introduced Deep Neural Network (DNN)-based tasks to the ecosystem of vehicular networks. These tasks are often computation-intensive, requiring substantial computation resources, which are beyond the capability of a single vehicle. To address this challenge, Vehicular Edge Computing (VEC) has emerged as a solution, offering computing services for DNN-based tasks through resource pooling via Vehicle-to-Vehicle/Infrastructure (V2V/V2I) communications. In this paper, we formulate the problem of joint DNN partitioning, task offloading, and resource allocation in VEC as a dynamic long-term optimization. Our objective is to minimize the DNN-based task completion time while guaranteeing system stability over time. To this end, we first leverage a Lyapunov optimization technique to decouple the original long-term optimization with stability constraints into a per-slot deterministic problem. Afterwards, we propose a Multi-Agent Diffusion-based Deep Reinforcement Learning (MAD2RL) algorithm, incorporating the innovative use of diffusion models to determine the optimal DNN partitioning and task offloading decisions. Furthermore, we integrate convex optimization techniques into MAD2RL as a subroutine to allocate computation resources, enhancing the learning efficiency. Through simulations under real-world movement traces of vehicles, we demonstrate the superior performance of our proposed algorithm compared to existing benchmark solutions.
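The Lyapunov step described in the abstract above, turning a long-term problem with a stability constraint into a per-slot deterministic one, can be illustrated with the generic drift-plus-penalty pattern. This is a sketch of that textbook technique under simplifying assumptions, not the paper's formulation: the single task queue, the `cost` standing in for completion time, the trade-off weight `v`, and the two candidate actions are all hypothetical.

```python
def update_queue(backlog, arrivals, service):
    """Standard queue dynamics: serve what is available, then add new arrivals."""
    return max(backlog - service, 0.0) + arrivals


def drift_plus_penalty(backlog, arrivals, service, cost, v):
    """Per-slot objective: weighted penalty plus a queue-drift term."""
    return v * cost + backlog * (arrivals - service)


def choose_action(backlog, arrivals, actions, v):
    """Pick the (name, service, cost) tuple minimizing the per-slot objective."""
    return min(
        actions,
        key=lambda a: drift_plus_penalty(backlog, arrivals, a[1], a[2], v),
    )


if __name__ == "__main__":
    # Hypothetical actions: (name, service rate, per-slot cost).
    actions = [("local", 1.0, 2.0), ("offload", 3.0, 5.0)]
    # Empty queue: the cheaper action wins.
    print(choose_action(0.0, 2.0, actions, v=1.0)[0])
    # Large backlog: the drift term dominates and the faster action wins.
    print(choose_action(10.0, 2.0, actions, v=1.0)[0])
```

The weight `v` controls the trade-off: a larger value favors low per-slot cost, while a smaller value drains the backlog more aggressively. Each slot is decided using only the current backlog, which is what makes the per-slot problem deterministic.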