Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
StepFun: Wang, Bin, Wang, Bojun, Wan, Changyi, Huang, Guanzhe, Hu, Hanpeng, Jia, Haonan, Nie, Hao, Li, Mingliang, Chen, Nuo, Chen, Siyu, Yuan, Song, Xie, Wuxun, Song, Xiaoniu, Chen, Xing, Yang, Xingping, Zhang, Xuelin, Yu, Yanbo, Wang, Yaoyu, Zhu, Yibo, Jiang, Yimin, Zhou, Yu, Lu, Yuanwei, Li, Houyi, Hu, Jingcheng, Lo, Ka Man, Huang, Ailin, Jiao, Binxing, Li, Bo, Chen, Boyu, Miao, Changxin, Lou, Chang, Hu, Chen, Xu, Chen, Yu, Chenfeng, Yao, Chengyuan, Lv, Daokuan, Shi, Dapeng, Sun, Deshan, Huang, Ding, Hu, Dingyuan, Pang, Dongqing, Liu, Enle, Zhang, Fajie, Wan, Fanqi, Yan, Gulin, Zhang, Han, Zhou, Han, Wu, Hanghao, Guo, Hangyu, Chen, Hanqi, Zhang, Hanshan, Wu, Hao, Zhang, Haocheng, Yan, Haolong, Lv, Haoran, Wei, Haoran, Zhou, Hebin, Wang, Heng, Wang, Heng, Li, Hongxin, Zhou, Hongyu, Wang, Hongyuan, Guo, Huiyong, Wang, Jia, Gong, Jiahao, Xie, Jialing, Zhou, Jian, Sun, Jianjian, Wu, Jiaoren, Zhang, Jiaran, Liu, Jiayu, Cheng, Jie, Luo, Jie, Yan, Jie, Yang, Jie, Hou, Jieyi, Zhang, Jinguang, Cao, Jinlan, Yin, Jisheng, Liu, Junfeng, Huang, Junhao, Lin, Junzhe, Tan, Kaijun, Li, Kaixiang, An, Kang, Lin, Kangheng, Liu, Kenkun, Yang, Lei, Zhao, Liang, Chen, Liangyu, Shi, Lieyu, Tan, Liguo, Lin, Lin, Zhang, Lin, Chen, Lina, Huang, Liwen, Shi, Liying, Gu, Longlong, Chen, Mei, Ren, Mengqiang, Li, Ming, Chen, Mingzhe, Wang, Na, Wu, Nan, Han, Qi, Zhao, Qian, Zhang, Qiang, Liu, Qianni, Chen, Qiaohui, Wu, Qiling, He, Qinglin, Tan, Qinyuan, Wang, Qiufeng, Wu, Qiuping, Liang, Qiuyan, Sun, Quan, Li, Rui, Miao, Ruihang, Wan, Ruosi, Guo, Ruyan, Zhong, Shangwu, Pang, Shaoliang, Fan, Shengjie, Shang, Shijie, Jiang, Shilei, Yang, Shiliang, Hao, Shiming, Gao, Shuli, Huang, Siming, Liu, Siqi, Cao, Tiancheng, Cheng, Tianhao, Peng, Tianhao, You, Wang, Ji, Wei, Sun, Wen, Deng, Wenjin, He, Wenqing, Zheng, Wenzhen, Chen, Xi, Kong, Xiangwen, Luo, Xianzhen, Yang, Xiaobo, Liu, Xiaojia, Ren, Xiaoxiao, Han, Xin, Li, Xin, Wu, Xin, Zhao, Xu, Wei, Yanan, Li, Yang, Li, Yangguang, Xu, Yangshijie, Xu, Yanming, Shi, Yaqiang, Shen, Yeqing, Yang, Yi, Yang, Yifei, Gong, Yifeng, Chen, Yihan, Yang, Yijing, Zhang, Yinmin, Zhou, Yizhuang, Ding, Yuanhao, Fan, Yuantao, Yang, Yuanzhen, Luo, Yuchu, Peng, Yue, Lu, Yufan, Deng, Yuhang, Yin, Yuhe, Liu, Yujie, Chen, Yukun, Zhao, Yuling, Mou, Yun, Li, Yunlong, Ju, Yunzhou, Li, Yusheng, Yang, Yuxiang, Zhang, Yuxiang, Chen, Yuyang, Weng, Zejia, Xie, Zhe, Ge, Zheng, Gong, Zheng, Lu, Zhenyi, Huang, Zhewei, Chang, Zhichao, Huang, Zhiguo, Wang, Zhirui, Yang, Zidong, Wang, Zili, Wang, Ziqi, Zhang, Zixin, Jiao, Binxing, Jiang, Daxin, Shum, Heung-Yeung, Zhang, Xiangyu
arXiv.org Artificial Intelligence
Large language models (LLMs) suffer from low hardware efficiency during decoding, especially on long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter vision-language model (VLM) built with hardware-aware model-system co-design to minimize decoding costs. Step-3 innovates in two key dimensions: (1) a novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design significantly reduces theoretical decoding costs compared with models such as DeepSeek-V3 and Qwen3 MoE 235B, and the gains widen at longer context lengths. Step-3 achieves this low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in scenarios favorable to DeepSeek-V3. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms time-per-output-token (TPOT) SLA (4K context, FP8, no multi-token prediction (MTP)). This exceeds DeepSeek-V3's 2,324 tokens per second per GPU in the same setup and sets a new Pareto frontier for LLM decoding.
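As a rough illustration of what the reported figures imply, the following sketch derives three quantities from numbers stated in the abstract: the throughput ratio over DeepSeek-V3, the fraction of parameters activated per token, and the number of concurrent sequences per GPU implied by the throughput if every sequence decodes exactly at the 50 ms TPOT limit. The function names are hypothetical, and the concurrency figure rests on that assumption; it is an upper-bound estimate, not a number reported by the authors.

```python
# Back-of-the-envelope arithmetic on figures quoted in the abstract.
# All function names are illustrative; none come from the paper.

STEP3_TPS_PER_GPU = 4039.0       # tokens/s/GPU reported for Step-3
DEEPSEEK_V3_TPS_PER_GPU = 2324.0 # tokens/s/GPU reported for DeepSeek-V3
TPOT_SLA_MS = 50.0               # time-per-output-token SLA
TOTAL_PARAMS_B = 321.0           # total parameters, in billions
ACTIVE_PARAMS_B = 38.0           # activated parameters per token, in billions


def throughput_ratio(new_tps: float, baseline_tps: float) -> float:
    """Ratio of two per-GPU decoding throughputs."""
    return new_tps / baseline_tps


def active_fraction(active_b: float, total_b: float) -> float:
    """Fraction of model parameters activated for each token."""
    return active_b / total_b


def implied_concurrency(tps_per_gpu: float, tpot_ms: float) -> float:
    """Sequences per GPU if each one emits a token every `tpot_ms` ms.

    Assumption: every sequence runs exactly at the SLA limit. Faster
    per-sequence decoding would mean fewer concurrent sequences.
    """
    return tps_per_gpu * tpot_ms / 1000.0


ratio = throughput_ratio(STEP3_TPS_PER_GPU, DEEPSEEK_V3_TPS_PER_GPU)
fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
concurrency = implied_concurrency(STEP3_TPS_PER_GPU, TPOT_SLA_MS)

print(f"throughput ratio: {ratio:.2f}x")
print(f"active parameter fraction: {fraction:.1%}")
print(f"implied sequences per GPU at the SLA limit: {concurrency:.0f}")
```

Under these assumptions the reported throughput is about 1.74 times the DeepSeek-V3 figure, roughly 12% of Step-3's parameters are active per token, and the per-GPU throughput corresponds to about 200 sequences decoding at the 50 ms limit.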
Jul-28-2025