Sun, Jianjian
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Huang, Ailin, Wu, Boyong, Wang, Bruce, Yan, Chao, Hu, Chen, Feng, Chengli, Tian, Fei, Shen, Feiyu, Li, Jingbei, Chen, Mingrui, Liu, Peng, Miao, Ruihang, You, Wang, Chen, Xi, Yang, Xuerui, Huang, Yechang, Zhang, Yuxiang, Gong, Zheng, Zhang, Zixin, Zhou, Hongyu, Sun, Jianjian, Li, Brian, Feng, Chengting, Wan, Changyi, Hu, Hanpeng, Wu, Jianchang, Zhen, Jiangjie, Ming, Ranchen, Yuan, Song, Zhang, Xuelin, Zhou, Yu, Li, Bingxin, Ma, Buyun, Wang, Hongyuan, An, Kang, Ji, Wei, Li, Wen, Wen, Xuan, Kong, Xiangwen, Ma, Yuankai, Liang, Yuanwei, Mou, Yun, Ahmidi, Bahtiyar, Wang, Bin, Li, Bo, Miao, Changxin, Xu, Chen, Wang, Chenrun, Shi, Dapeng, Sun, Deshan, Hu, Dingyuan, Sai, Dula, Liu, Enle, Huang, Guanzhe, Yan, Gulin, Wang, Heng, Jia, Haonan, Zhang, Haoyang, Gong, Jiahao, Guo, Junjing, Liu, Jiashuai, Liu, Jiahong, Feng, Jie, Wu, Jie, Wu, Jiaoren, Yang, Jie, Wang, Jinguo, Zhang, Jingyang, Lin, Junzhe, Li, Kaixiang, Xia, Lei, Zhou, Li, Zhao, Liang, Gu, Longlong, Chen, Mei, Wu, Menglin, Li, Ming, Li, Mingxiao, Li, Mingliang, Liang, Mingyao, Wang, Na, Hao, Nie, Wu, Qiling, Tan, Qinyuan, Sun, Ran, Shuai, Shuai, Pang, Shaoliang, Yang, Shiliang, Gao, Shuli, Yuan, Shanshan, Liu, Siqi, Deng, Shihong, Jiang, Shilei, Liu, Sitong, Cao, Tiancheng, Wang, Tianyu, Deng, Wenjin, Xie, Wuxun, Ming, Weipeng, He, Wenqing, Sun, Wen, Han, Xin, Huang, Xin, Deng, Xiaomin, Liu, Xiaojia, Wu, Xin, Zhao, Xu, Wei, Yanan, Yu, Yanbo, Cao, Yang, Li, Yangguang, Ma, Yangzhen, Xu, Yanming, Wang, Yaoyu, Shi, Yaqiang, Wang, Yilei, Zhou, Yizhuang, Zhong, Yinmin, Zhang, Yang, Wei, Yaoben, Luo, Yu, Lu, Yuanwei, Yin, Yuhe, Luo, Yuchu, Ding, Yuanhao, Yan, Yuting, Dai, Yaqi, Yang, Yuxiang, Xie, Zhe, Ge, Zheng, Sun, Zheng, Huang, Zhewei, Chang, Zhichao, Guan, Zhisheng, Yang, Zidong, Zhang, Zili, Jiao, Binxing, Jiang, Daxin, Shum, Heung-Yeung, Chen, Jiansheng, Li, Jing, Zhou, Shuchang, Zhang, Xiangyu, Zhang, Xinhao, Zhu, Yibo
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Unhackable Temporal Rewarding for Scalable Video MLLMs
Yu, En, Lin, Kangheng, Zhao, Liang, Wei, Yana, Zhu, Zining, Wei, Haoran, Sun, Jianjian, Ge, Zheng, Zhang, Xiangyu, Wang, Jingyu, Tao, Wenbing
In the pursuit of superior video-processing MLLMs, we have encountered a perplexing paradox: the "anti-scaling law", where more data and larger models lead to worse performance. This study unmasks the culprit: "temporal hacking", a phenomenon where models shortcut by fixating on select frames, missing the full video narrative. In this work, we systematically establish a comprehensive theory of temporal hacking, defining it from a reinforcement learning perspective, introducing the Temporal Perplexity (TPL) score to assess this misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework to mitigate the temporal hacking. Both theoretically and empirically, TPL proves to be a reliable indicator of temporal modeling quality, correlating strongly with frame activation patterns. Extensive experiments reveal that UTR not only counters temporal hacking but significantly elevates video comprehension capabilities. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development.
PerPO: Perceptual Preference Optimization via Discriminative Rewarding
Zhu, Zining, Zhao, Liang, Lin, Kangheng, Yang, Jinze, Yu, En, Liu, Chenglong, Wei, Haoran, Sun, Jianjian, Ge, Zheng, Zhang, Xiangyu
This paper presents Perceptual Preference Optimization (PerPO), a perception alignment method aimed at addressing the visual discrimination challenges in generative pre-trained multimodal large language models (MLLMs). To align MLLMs with human visual perception process, PerPO employs discriminative rewarding to gather diverse negative samples, followed by listwise preference optimization to rank them.By utilizing the reward as a quantitative margin for ranking, our method effectively bridges generative preference optimization and discriminative empirical risk minimization. PerPO significantly enhances MLLMs' visual discrimination capabilities while maintaining their generative strengths, mitigates image-unconditional reward hacking, and ensures consistent performance across visual tasks. This work marks a crucial step towards more perceptually aligned and versatile MLLMs. We also hope that PerPO will encourage the community to rethink MLLM alignment strategies.
DreamLLM: Synergistic Multimodal Comprehension and Creation
Dong, Runpei, Han, Chunrui, Peng, Yuang, Qi, Zekun, Ge, Zheng, Yang, Jinrong, Zhao, Liang, Sun, Jianjian, Zhou, Hongyu, Wei, Haoran, Kong, Xiangwen, Zhang, Xiangyu, Ma, Kaisheng, Yi, Li
This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, and a more thorough multimodal understanding is obtained. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents, along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, reaping from the enhanced learning synergy.
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
Zhao, Liang, Yu, En, Ge, Zheng, Yang, Jinrong, Wei, Haoran, Zhou, Hongyu, Sun, Jianjian, Peng, Yuang, Dong, Runpei, Han, Chunrui, Zhang, Xiangyu
Human-AI interactivity is a critical aspect that reflects the usability of multimodal large language models (MLLMs). However, existing end-to-end MLLMs only allow users to interact with them through language instructions, leading to the limitation of the interactive accuracy and efficiency. In this study, we present precise referring instructions that utilize diverse reference representations such as points and boxes as referring prompts to refer to the special region. This enables MLLMs to focus on the region of interest and achieve finer-grained interaction. Based on precise referring instruction, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience. We also construct a multi-grained vision-language instruction-following dataset based on existing datasets and GPT-4 generating. Furthermore, we design a series of evaluation tasks to assess the effectiveness of region recognition and interaction. Experimental results showcase ChatSpot's promising performance.