Chen, Mingrui
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Huang, Ailin, Wu, Boyong, Wang, Bruce, Yan, Chao, Hu, Chen, Feng, Chengli, Tian, Fei, Shen, Feiyu, Li, Jingbei, Chen, Mingrui, Liu, Peng, Miao, Ruihang, You, Wang, Chen, Xi, Yang, Xuerui, Huang, Yechang, Zhang, Yuxiang, Gong, Zheng, Zhang, Zixin, Zhou, Hongyu, Sun, Jianjian, Li, Brian, Feng, Chengting, Wan, Changyi, Hu, Hanpeng, Wu, Jianchang, Zhen, Jiangjie, Ming, Ranchen, Yuan, Song, Zhang, Xuelin, Zhou, Yu, Li, Bingxin, Ma, Buyun, Wang, Hongyuan, An, Kang, Ji, Wei, Li, Wen, Wen, Xuan, Kong, Xiangwen, Ma, Yuankai, Liang, Yuanwei, Mou, Yun, Ahmidi, Bahtiyar, Wang, Bin, Li, Bo, Miao, Changxin, Xu, Chen, Wang, Chenrun, Shi, Dapeng, Sun, Deshan, Hu, Dingyuan, Sai, Dula, Liu, Enle, Huang, Guanzhe, Yan, Gulin, Wang, Heng, Jia, Haonan, Zhang, Haoyang, Gong, Jiahao, Guo, Junjing, Liu, Jiashuai, Liu, Jiahong, Feng, Jie, Wu, Jie, Wu, Jiaoren, Yang, Jie, Wang, Jinguo, Zhang, Jingyang, Lin, Junzhe, Li, Kaixiang, Xia, Lei, Zhou, Li, Zhao, Liang, Gu, Longlong, Chen, Mei, Wu, Menglin, Li, Ming, Li, Mingxiao, Li, Mingliang, Liang, Mingyao, Wang, Na, Hao, Nie, Wu, Qiling, Tan, Qinyuan, Sun, Ran, Shuai, Shuai, Pang, Shaoliang, Yang, Shiliang, Gao, Shuli, Yuan, Shanshan, Liu, Siqi, Deng, Shihong, Jiang, Shilei, Liu, Sitong, Cao, Tiancheng, Wang, Tianyu, Deng, Wenjin, Xie, Wuxun, Ming, Weipeng, He, Wenqing, Sun, Wen, Han, Xin, Huang, Xin, Deng, Xiaomin, Liu, Xiaojia, Wu, Xin, Zhao, Xu, Wei, Yanan, Yu, Yanbo, Cao, Yang, Li, Yangguang, Ma, Yangzhen, Xu, Yanming, Wang, Yaoyu, Shi, Yaqiang, Wang, Yilei, Zhou, Yizhuang, Zhong, Yinmin, Zhang, Yang, Wei, Yaoben, Luo, Yu, Lu, Yuanwei, Yin, Yuhe, Luo, Yuchu, Ding, Yuanhao, Yan, Yuting, Dai, Yaqi, Yang, Yuxiang, Xie, Zhe, Ge, Zheng, Sun, Zheng, Huang, Zhewei, Chang, Zhichao, Guan, Zhisheng, Yang, Zidong, Zhang, Zili, Jiao, Binxing, Jiang, Daxin, Shum, Heung-Yeung, Chen, Jiansheng, Li, Jing, Zhou, Shuchang, Zhang, Xiangyu, Zhang, Xinhao, Zhu, Yibo
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
On the Hidden Mystery of OCR in Large Multimodal Models
Liu, Yuliang, Li, Zhang, Li, Hongliang, Yu, Wenwen, Liu, Yang, Yang, Biao, Huang, Mingxin, Peng, Dezhi, Liu, Mingyu, Chen, Mingrui, Li, Chunyuan, Yin, Xucheng, Liu, Cheng-lin, Jin, Lianwen, Bai, Xiang
Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. It remains less explored about their efficacy in text-related visual tasks. We conducted a comprehensive study of existing publicly available multimodal models, evaluating their performance in text recognition (document text, artistic text, handwritten text, scene text), text-based visual question answering (document text, scene text, and bilingual text), key information extraction (receipts, documents, and nutrition facts) and handwritten mathematical expression recognition. Our findings reveal strengths and weaknesses in these models, which primarily rely on semantic understanding for word recognition and exhibit inferior perception of individual character shapes. They also display indifference towards text length and have limited capabilities in detecting finegrained features in images. Consequently, these results demonstrate that even the current most powerful large multimodal models cannot match domain-specific methods in traditional text tasks and face greater challenges in more complex tasks. Most importantly, the baseline results showcased in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. Evaluation pipeline is available at https://github.com/Yuliang-Liu/MultimodalOCR.