Pan, Xuehai
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team, Du, Angang, Gao, Bofei, Xing, Bowei, Jiang, Changjiu, Chen, Cheng, Li, Cheng, Xiao, Chenjun, Du, Chenzhuang, Liao, Chonghua, Tang, Chuning, Wang, Congcong, Zhang, Dehao, Yuan, Enming, Lu, Enzhe, Tang, Fengxiang, Sung, Flood, Wei, Guangda, Lai, Guokun, Guo, Haiqing, Zhu, Han, Ding, Hao, Hu, Hao, Yang, Hao, Zhang, Hao, Yao, Haotian, Zhao, Haotian, Lu, Haoyu, Li, Haoze, Yu, Haozhen, Gao, Hongcheng, Zheng, Huabin, Yuan, Huan, Chen, Jia, Guo, Jianhang, Su, Jianlin, Wang, Jianzhou, Zhao, Jie, Zhang, Jin, Liu, Jingyuan, Yan, Junjie, Wu, Junyan, Shi, Lidong, Ye, Ling, Yu, Longhui, Dong, Mengnan, Zhang, Neo, Ma, Ningchen, Pan, Qiwei, Gong, Qucheng, Liu, Shaowei, Ma, Shengling, Wei, Shupeng, Cao, Sihan, Huang, Siying, Jiang, Tao, Gao, Weihao, Xiong, Weimin, He, Weiran, Huang, Weixiao, Wu, Wenhao, He, Wenyang, Wei, Xianghui, Jia, Xianqing, Wu, Xingzhe, Xu, Xinran, Zu, Xinxing, Zhou, Xinyu, Pan, Xuehai, Charles, Y., Li, Yang, Hu, Yangyang, Liu, Yangyang, Chen, Yanru, Wang, Yejie, Liu, Yibo, Qin, Yidao, Liu, Yifeng, Yang, Ying, Bao, Yiping, Du, Yulun, Wu, Yuxin, Wang, Yuzhi, Zhou, Zaida, Wang, Zhaoji, Li, Zhaowei, Zhu, Zhen, Zhang, Zheng, Wang, Zhexu, Yang, Zhilin, Huang, Zhiqi, Huang, Zihao, Xu, Ziyao, Yang, Zonghan
Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).
Reward Generalization in RLHF: A Topological Perspective
Qiu, Tianyi, Zeng, Fanzhi, Ji, Jiaming, Yan, Dong, Wang, Kaile, Zhou, Jiayi, Han, Yang, Dai, Josef, Pan, Xuehai, Yang, Yaodong
Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theoretical framework for investigating reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present induced Bayesian networks as a theory of reward generalization in RLHF, introducing fine-grained dataset topologies into generalization bounds. Combining analysis on both levels, we propose reward modeling from tree-structured preference information. It is shown to reduce reward uncertainty by up to $\Theta(\log n/\log\log n)$ times compared to baselines, where $n$ is the dataset size. Validation on three NLP tasks shows that our tree-based reward model achieves an average win rate of 65% against baseline methods, thus improving reward generalization for free via topology design.
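To give a feel for the $\Theta(\log n/\log\log n)$ factor quoted in the abstract above, the following minimal sketch evaluates $\log n/\log\log n$ for a few dataset sizes. The function name `growth_factor` and the chosen values of $n$ are illustrative assumptions; the constants hidden by the $\Theta$ notation are not given in the abstract, so only the growth trend is meaningful, not the absolute numbers.

```python
import math


def growth_factor(n):
    # Evaluates log(n) / log(log(n)) with natural logarithms.
    # Hypothetical helper for illustration; the Theta(...) bound in the
    # abstract hides constant factors that this sketch does not model.
    if n <= math.e:
        raise ValueError("n must exceed e so that log(log(n)) is positive")
    return math.log(n) / math.log(math.log(n))


if __name__ == "__main__":
    # Illustrative dataset sizes (assumed, not taken from the paper).
    for n in (10**3, 10**6, 10**9):
        print(n, round(growth_factor(n), 3))
```

The factor grows with $n$, but slowly: multiplying the dataset size by a thousand raises it only modestly.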
Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction
Ji, Jiaming, Chen, Boyuan, Lou, Hantao, Hong, Donghai, Zhang, Borong, Pan, Xuehai, Dai, Juntao, Yang, Yaodong
Efforts to align Large Language Models (LLMs) are mainly conducted via Reinforcement Learning from Human Feedback (RLHF) methods. However, RLHF encounters major challenges, including training reward models, actor-critic engineering, and, importantly, the need for access to LLM parameters. Here we introduce Aligner, a new efficient alignment paradigm that bypasses the whole RLHF process by learning the correctional residuals between the aligned and the unaligned answers. Our Aligner offers several key advantages. Firstly, it is an autoregressive seq2seq model that is trained on the query-answer-correction dataset via supervised learning; this offers a parameter-efficient alignment solution with minimal resources. Secondly, the Aligner facilitates weak-to-strong generalization; finetuning large pretrained models with Aligner's supervisory signals demonstrates a strong performance boost. Thirdly, Aligner functions as a model-agnostic plug-and-play module, allowing for its direct application on different open-source and API-based models. Remarkably, Aligner-7B improves 11 different LLMs by 21.9% in helpfulness and 23.8% in harmlessness on average (GPT-4 by 17.5% and 26.9%). When finetuning (strong) Llama2-70B with (weak) Aligner-13B's supervision, we can improve Llama2 by 8.2% in helpfulness and 61.6% in harmlessness. See our dataset and code at https://aligner2024.github.io
AI Alignment: A Comprehensive Survey
Ji, Jiaming, Qiu, Tianyi, Chen, Boyuan, Zhang, Borong, Lou, Hantao, Wang, Kaile, Duan, Yawen, He, Zhonghao, Zhou, Jiayi, Zhang, Zhaowei, Zeng, Fanzhi, Ng, Kwan Yee, Dai, Juntao, Pan, Xuehai, O'Gara, Aidan, Lei, Yingshan, Xu, Hua, Tse, Brian, Fu, Jie, McAleer, Stephen, Yang, Yaodong, Wang, Yizhou, Zhu, Song-Chun, Guo, Yike, Gao, Wen
AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems grow more capable, so do risks from misalignment. To provide a comprehensive and up-to-date overview of the alignment field, in this survey, we delve into the core concepts, methodology, and practice of alignment. First, we identify four principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE). Guided by these four principles, we outline the landscape of current alignment research and decompose them into two key components: forward alignment and backward alignment. The former aims to make AI systems aligned via alignment training, while the latter aims to gain evidence about the systems' alignment and govern them appropriately to avoid exacerbating misalignment risks. On forward alignment, we discuss techniques for learning from feedback and learning under distribution shift. On backward alignment, we discuss assurance techniques and governance practices. We also release and continually update the website (www.alignmentsurvey.com) which features tutorials, collections of papers, blog posts, and other resources.
Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark
Ji, Jiaming, Zhang, Borong, Zhou, Jiayi, Pan, Xuehai, Huang, Weidong, Sun, Ruiyang, Geng, Yiran, Zhong, Yifan, Dai, Juntao, Yang, Yaodong
Artificial intelligence (AI) systems possess significant potential to drive societal progress. However, their deployment often faces obstacles due to substantial safety concerns. Safe reinforcement learning (SafeRL) emerges as a solution to optimize policies while simultaneously adhering to multiple constraints, thereby addressing the challenge of integrating reinforcement learning in safety-critical scenarios. In this paper, we present an environment suite called Safety-Gymnasium, which encompasses safety-critical tasks in both single and multi-agent scenarios, accepting vector and vision-only input. Additionally, we offer a library of algorithms named Safe Policy Optimization (SafePO), comprising 16 state-of-the-art SafeRL algorithms. This comprehensive library can serve as a validation tool for the research community. By introducing this benchmark, we aim to facilitate the evaluation and comparison of safety performance, thus fostering the development of reinforcement learning for safer, more reliable, and responsible real-world applications. The website of this project can be accessed at https://sites.google.com/view/safety-gymnasium.
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Ji, Jiaming, Liu, Mickel, Dai, Juntao, Pan, Xuehai, Zhang, Chi, Bian, Ce, Zhang, Chi, Sun, Ruiyang, Wang, Yizhou, Yang, Yaodong
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, we have gathered safety meta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. Our project page is available at the following URL: https://sites.google.com/view/pku-beavertails.
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Dai, Josef, Pan, Xuehai, Sun, Ruiyang, Ji, Jiaming, Xu, Xinbo, Liu, Mickel, Wang, Yizhou, Yang, Yaodong
With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers' confusion about the tension and allowing us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task of maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through a three-round fine-tuning using Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. Experimentally, we finetuned the Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations. Warning: This paper contains example data that may be offensive or harmful. Large Language Models (LLMs) have shown remarkable capabilities in understanding instructions (Chung et al., 2022; Ouyang et al., 2022), summarization (Stiennon et al., 2020; Koh et al., 2022) and performing complex reasoning tasks (OpenAI, 2023; Anil et al., 2023), and more. Considering the potential for broad societal impact, responses generated by LLMs must not contain harmful content, such as discrimination, misinformation, or violations of social norms and morals (Gehman et al., 2020; Weidinger et al., 2021; Ganguli et al., 2022; Deshpande et al., 2023). 
Therefore, the alignment of safety in LLMs has received widespread attention from academia and industry (Christian, 2023). An essential component of safety alignment involves minimizing the tendency of a model to generate harmful responses through fine-tuning. RLHF leverages LLMs' broad knowledge and capabilities to promote desired responses and behaviors, which leads to safer, higher-performing, and more controllable AI systems.
Figure 1: Safe RLHF pipeline compared to conventional RLHF method. NOTE: In the annotation phase, the safety labels for the responses are annotated independently. These responses can both be labeled safe or both be labeled unsafe.
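The Safe RLHF abstract above describes maximizing reward subject to a cost constraint and using the Lagrangian method to rebalance the two objectives during fine-tuning. A minimal sketch of that general idea follows, under stated assumptions: the helper names `combined_signal` and `update_multiplier`, the `(1 + lam)` normalization, the threshold, the learning rate, and the toy batch are all hypothetical and are not taken from the paper or its code.

```python
def combined_signal(reward, cost, lam):
    # Hypothetical helper: Lagrangian-weighted training signal, i.e. the
    # reward minus lam times the cost. Dividing by (1 + lam) is an assumed
    # normalization that keeps the scale bounded as lam grows.
    return (reward - lam * cost) / (1.0 + lam)


def update_multiplier(lam, mean_cost, threshold, lr):
    # Projected gradient ascent on the Lagrange multiplier: lam grows while
    # the average cost exceeds the threshold (constraint violated), shrinks
    # otherwise, and is clamped so it never goes below zero.
    return max(0.0, lam + lr * (mean_cost - threshold))


if __name__ == "__main__":
    lam = 1.0
    # Toy (reward, cost) pairs standing in for reward-model and cost-model
    # scores of sampled responses; the values are made up for illustration.
    batch = [(0.8, 0.6), (0.5, -0.2), (0.9, 0.9)]
    mean_cost = sum(cost for _, cost in batch) / len(batch)
    signals = [combined_signal(r, c, lam) for r, c in batch]
    lam = update_multiplier(lam, mean_cost, threshold=0.0, lr=0.1)
    print(signals, lam)
```

In this sketch a larger multiplier shifts the training signal toward harmlessness, while a multiplier of zero reduces it to plain reward maximization.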
Red Teaming Game: A Game-Theoretic Framework for Red Teaming Language Models
Ma, Chengdong, Yang, Ziran, Gao, Minquan, Ci, Hai, Gao, Jun, Pan, Xuehai, Yang, Yaodong
Deployable Large Language Models (LLMs) must conform to the criterion of helpfulness and harmlessness, thereby achieving consistency between LLM outputs and human values. Red-teaming techniques constitute a critical way towards this criterion. Existing work relies solely on manual red team designs and heuristic adversarial prompts for vulnerability detection and optimization. These approaches lack rigorous mathematical formulation, thus limiting the exploration of diverse attack strategies within a quantifiable measure and the optimization of LLMs under convergence guarantees. In this paper, we present Red-teaming Game (RTG), a general game-theoretic framework without manual annotation. RTG is designed for analyzing the multi-turn attack and defense interactions between Red-team Language Models (RLMs) and Blue-team Language Model (BLM). Within the RTG, we propose Gamified Red-teaming Solver (GRTS) with a diversity measure of the semantic space. GRTS is an automated red teaming technique to solve RTG towards Nash equilibrium through meta-game analysis, which corresponds to the theoretically guaranteed optimization direction of both RLMs and BLM. Empirical results in multi-turn attacks with RLMs show that GRTS autonomously discovered diverse attack strategies and effectively improved security of LLMs, outperforming existing heuristic red-team designs. Overall, RTG has established a foundational framework for red teaming tasks and constructed a new scalable oversight technique for alignment.
Baichuan 2: Open Large-scale Language Models
Yang, Aiyuan, Xiao, Bin, Wang, Bingning, Zhang, Borong, Bian, Ce, Yin, Chao, Lv, Chenxu, Pan, Da, Wang, Dian, Yan, Dong, Yang, Fan, Deng, Fei, Wang, Feng, Liu, Feng, Ai, Guangwei, Dong, Guosheng, Zhao, Haizhou, Xu, Hang, Sun, Haoze, Zhang, Hongda, Liu, Hui, Ji, Jiaming, Xie, Jian, Dai, JunTao, Fang, Kun, Su, Lei, Song, Liang, Liu, Lifeng, Ru, Liyun, Ma, Luyao, Wang, Mang, Liu, Mickel, Lin, MingAn, Nie, Nuolan, Guo, Peidong, Sun, Ruiyang, Zhang, Tao, Li, Tianpeng, Li, Tianyu, Cheng, Wei, Chen, Weipeng, Zeng, Xiangrong, Wang, Xiaochuan, Chen, Xiaoxi, Men, Xin, Yu, Xin, Pan, Xuehai, Shen, Yanjun, Wang, Yiding, Li, Yiyu, Jiang, Youxin, Gao, Yuchen, Zhang, Yupeng, Zhou, Zenan, Wu, Zhiying
Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.
OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research
Ji, Jiaming, Zhou, Jiayi, Zhang, Borong, Dai, Juntao, Pan, Xuehai, Sun, Ruiyang, Huang, Weidong, Geng, Yiran, Liu, Mickel, Yang, Yaodong
AI systems empowered by reinforcement learning (RL) algorithms harbor the immense potential to catalyze societal advancement, yet their deployment is often impeded by significant safety concerns. Particularly in safety-critical applications, researchers have raised concerns about unintended harms or unsafe behaviors of unaligned RL agents. The philosophy of safe reinforcement learning (SafeRL) is to align RL agents with harmless intentions and safe behavioral patterns. In SafeRL, agents learn to develop optimal policies by receiving feedback from the environment, while also fulfilling the requirement of minimizing the risk of unintended harm or unsafe behavior. However, due to the intricate nature of SafeRL algorithm implementation, combining methodologies across various domains presents a formidable challenge. This has led to an absence of a cohesive and efficacious learning framework within the contemporary SafeRL research milieu. In this work, we introduce a foundational framework designed to expedite SafeRL research endeavors. Our comprehensive framework encompasses an array of algorithms spanning different RL domains and places heavy emphasis on safety elements. Our aim is to make the SafeRL-related research process more streamlined and efficient, thereby facilitating further research in AI safety.