Dong, Mengnan
Muon is Scalable for LLM Training
Liu, Jingyuan, Su, Jianlin, Yao, Xingcheng, Jiang, Zhejun, Lai, Guokun, Du, Yulun, Qin, Yidao, Xu, Weixin, Lu, Enzhe, Yan, Junjie, Chen, Yanru, Zheng, Huabin, Liu, Yibo, Liu, Shaowei, Yin, Bohong, He, Weiran, Zhu, Han, Wang, Yuzhi, Wang, Jianzhou, Dong, Mengnan, Zhang, Zheng, Kang, Yongsheng, Zhang, Hao, Xu, Xinran, Zhang, Yutao, Wu, Yuxin, Zhou, Xinyu, Yang, Zhilin
Recently, the Muon optimizer, which is based on matrix orthogonalization, has demonstrated strong results in training small-scale language models, but its scalability to larger models has not been demonstrated. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out of the box in large-scale training without the need for hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ the computational efficiency of AdamW under compute-optimal training. Based on these improvements, we introduce Moonlight, a Mixture-of-Experts (MoE) model with 3B activated / 16B total parameters trained on 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
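To give a concrete picture of the two techniques named in the abstract, the following is a minimal sketch of a Muon-style update step, not the paper's released implementation. It combines Newton-Schulz orthogonalization of the momentum (the core of Muon) with decoupled, AdamW-style weight decay and a per-parameter rescaling of the orthogonalized update. The function names, the quintic coefficients, and the 0.2 * sqrt(max dim) rescaling rule are illustrative assumptions, not values stated in the abstract.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize a 2D matrix via a quintic Newton-Schulz
    # iteration; coefficients follow the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)          # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

def muon_like_step(param, grad, momentum_buf,
                   lr=2e-2, beta=0.95, weight_decay=0.1):
    # heavy-ball momentum accumulation of the raw gradient
    momentum_buf.mul_(beta).add_(grad)
    # core Muon step: use the orthogonalized momentum as the update direction
    update = newton_schulz_orthogonalize(momentum_buf)
    # (2) per-parameter update-scale adjustment (assumed rule): rescale so the
    # update magnitude is roughly comparable to an AdamW update for this shape
    update = update * 0.2 * max(param.size(0), param.size(1)) ** 0.5
    # (1) decoupled weight decay, as in AdamW
    param.mul_(1.0 - lr * weight_decay)
    param.add_(update, alpha=-lr)

# toy usage on a single weight matrix
W = torch.randn(512, 2048)
M = torch.zeros_like(W)
g = torch.randn_like(W)  # stand-in for a gradient
muon_like_step(W, g, M)
```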
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team, Du, Angang, Gao, Bofei, Xing, Bowei, Jiang, Changjiu, Chen, Cheng, Li, Cheng, Xiao, Chenjun, Du, Chenzhuang, Liao, Chonghua, Tang, Chuning, Wang, Congcong, Zhang, Dehao, Yuan, Enming, Lu, Enzhe, Tang, Fengxiang, Sung, Flood, Wei, Guangda, Lai, Guokun, Guo, Haiqing, Zhu, Han, Ding, Hao, Hu, Hao, Yang, Hao, Zhang, Hao, Yao, Haotian, Zhao, Haotian, Lu, Haoyu, Li, Haoze, Yu, Haozhen, Gao, Hongcheng, Zheng, Huabin, Yuan, Huan, Chen, Jia, Guo, Jianhang, Su, Jianlin, Wang, Jianzhou, Zhao, Jie, Zhang, Jin, Liu, Jingyuan, Yan, Junjie, Wu, Junyan, Shi, Lidong, Ye, Ling, Yu, Longhui, Dong, Mengnan, Zhang, Neo, Ma, Ningchen, Pan, Qiwei, Gong, Qucheng, Liu, Shaowei, Ma, Shengling, Wei, Shupeng, Cao, Sihan, Huang, Siying, Jiang, Tao, Gao, Weihao, Xiong, Weimin, He, Weiran, Huang, Weixiao, Wu, Wenhao, He, Wenyang, Wei, Xianghui, Jia, Xianqing, Wu, Xingzhe, Xu, Xinran, Zu, Xinxing, Zhou, Xinyu, Pan, Xuehai, Charles, Y., Li, Yang, Hu, Yangyang, Liu, Yangyang, Chen, Yanru, Wang, Yejie, Liu, Yibo, Qin, Yidao, Liu, Yifeng, Yang, Ying, Bao, Yiping, Du, Yulun, Wu, Yuxin, Wang, Yuzhi, Zhou, Zaida, Wang, Zhaoji, Li, Zhaowei, Zhu, Zhen, Zhang, Zheng, Wang, Zhexu, Yang, Zhilin, Huang, Zhiqi, Huang, Zihao, Xu, Ziyao, Yang, Zonghan
Language model pretraining with next-token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simple yet effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities (e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista), matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results (e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on LiveCodeBench) and outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).
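For reference, the snippet below shows a generic outcome-reward policy-gradient loss of the kind a framework without value functions or process reward models can build on. It is an illustrative baseline (REINFORCE with a mean-reward baseline), not Kimi k1.5's actual policy-optimization method, and every name in it is a placeholder.

```python
import torch
import torch.nn.functional as F

def outcome_reward_pg_loss(logits, response_ids, response_mask, rewards):
    # logits:        [batch, seq_len, vocab] policy logits over generated tokens
    # response_ids:  [batch, seq_len] sampled token ids
    # response_mask: [batch, seq_len] 1 for response tokens, 0 for prompt/padding
    # rewards:       [batch] scalar outcome reward per sampled response
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    seq_logp = (token_logp * response_mask).sum(dim=-1)   # log p(response)
    advantages = rewards - rewards.mean()                  # simple mean baseline
    return -(advantages.detach() * seq_logp).mean()

# toy usage: batch of 4 sampled responses, 16 tokens each, vocab of 100
logits = torch.randn(4, 16, 100, requires_grad=True)
ids = torch.randint(0, 100, (4, 16))
mask = torch.ones(4, 16)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])  # e.g., answer correct / incorrect
loss = outcome_reward_pg_loss(logits, ids, mask, rewards)
loss.backward()
```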