Joglekar, Manas
Deliberative Alignment: Reasoning Enables Safer Language Models
Guan, Melody Y., Joglekar, Manas, Wallace, Eric, Jain, Saachi, Barak, Boaz, Helyar, Alec, Dias, Rachel, Vallone, Andrea, Ren, Hongyu, Wei, Jason, Chung, Hyung Won, Toyer, Sam, Heidecke, Johannes, Beutel, Alex, Glaese, Amelia
Modern Large Language Models (LLMs) are safety trained using Supervised Fine Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to mitigate harmful, undesirable, or otherwise disallowed outputs [2]-[4]. Despite ongoing advances in these methods, today's models still exhibit safety shortcomings: they can be tricked into revealing harmful content, often refuse legitimate requests, and remain vulnerable to jailbreak attacks [5]-[8]. We argue that many of these failures arise from two limitations in modern safety training. First, LLMs must respond instantly to user requests using a fixed amount of compute, without deliberation even for complex safety scenarios. Second, LLMs must infer underlying safety standards indirectly from large sets of labeled examples, rather than directly learning the safety specifications that govern them. This reliance on implicit, pattern-based learning leads to poor data efficiency and makes it challenging for models to generalize when facing unfamiliar scenarios or adversarial attacks. We propose deliberative alignment, a training approach that teaches LLMs to explicitly reason through safety specifications before producing an answer. By applying this method to OpenAI's o-series models [1], we enable them to use chain-of-thought (CoT) reasoning to examine user prompts, identify relevant policy guidelines, and generate safer responses (e.g., Figure 1).
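A minimal sketch of the inference-time behavior this abstract describes (condition on an explicit safety specification, reason over it in a chain of thought, then answer) is given below. The spec text, prompt format, and `generate` callback are illustrative assumptions; the paper trains this behavior into the model with SFT and RL rather than relying on prompting, so this is only a sketch of the idea, not OpenAI's implementation.

```python
# Minimal sketch of deliberative-alignment-style inference. The safety spec,
# message format, and the `generate` callback are illustrative assumptions,
# not OpenAI's actual training or serving pipeline.

SAFETY_SPEC = """\
1. Refuse requests for operational details that enable serious harm.
2. Allow high-level, educational discussion of sensitive topics.
3. When refusing, briefly explain why and offer a safe alternative.
"""

def deliberative_answer(user_prompt: str, generate) -> str:
    """Have the model reason over the safety spec in a chain of thought
    (kept between <cot> tags) before writing the user-visible reply."""
    messages = [
        {"role": "system",
         "content": ("Safety specification:\n" + SAFETY_SPEC +
                     "\nFirst reason step by step inside <cot>...</cot> about "
                     "which clauses apply to the request, then write the "
                     "final reply after </cot>.")},
        {"role": "user", "content": user_prompt},
    ]
    completion = generate(messages)  # any chat-completion backend returning text
    # Only the final answer, after the chain of thought, is shown to the user.
    return completion.split("</cot>", 1)[-1].strip()
```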
OpenAI o1 System Card
OpenAI, Jaech, Aaron, Kalai, Adam, Lerer, Adam, Richardson, Adam, El-Kishky, Ahmed, Low, Aiden, Helyar, Alec, Madry, Aleksander, Beutel, Alex, Carney, Alex, Iftimie, Alex, Karpenko, Alex, Passos, Alex Tachard, Neitz, Alexander, Prokofiev, Alexander, Wei, Alexander, Tam, Allison, Bennett, Ally, Kumar, Ananya, Saraiva, Andre, Vallone, Andrea, Duberstein, Andrew, Kondrich, Andrew, Mishchenko, Andrey, Applebaum, Andy, Jiang, Angela, Nair, Ashvin, Zoph, Barret, Ghorbani, Behrooz, Rossen, Ben, Sokolowsky, Benjamin, Barak, Boaz, McGrew, Bob, Minaiev, Borys, Hao, Botao, Baker, Bowen, Houghton, Brandon, McKinzie, Brandon, Eastman, Brydon, Lugaresi, Camillo, Bassin, Cary, Hudson, Cary, Li, Chak Ming, de Bourcy, Charles, Voss, Chelsea, Shen, Chen, Zhang, Chong, Koch, Chris, Orsinger, Chris, Hesse, Christopher, Fischer, Claudia, Chan, Clive, Roberts, Dan, Kappler, Daniel, Levy, Daniel, Selsam, Daniel, Dohan, David, Farhi, David, Mely, David, Robinson, David, Tsipras, Dimitris, Li, Doug, Oprica, Dragos, Freeman, Eben, Zhang, Eddie, Wong, Edmund, Proehl, Elizabeth, Cheung, Enoch, Mitchell, Eric, Wallace, Eric, Ritter, Erik, Mays, Evan, Wang, Fan, Such, Felipe Petroski, Raso, Filippo, Leoni, Florencia, Tsimpourlas, Foivos, Song, Francis, von Lohmann, Fred, Sulit, Freddie, Salmon, Geoff, Parascandolo, Giambattista, Chabot, Gildas, Zhao, Grace, Brockman, Greg, Leclerc, Guillaume, Salman, Hadi, Bao, Haiming, Sheng, Hao, Andrin, Hart, Bagherinezhad, Hessam, Ren, Hongyu, Lightman, Hunter, Chung, Hyung Won, Kivlichan, Ian, O'Connell, Ian, Osband, Ian, Gilaberte, Ignasi Clavera, Akkaya, Ilge, Kostrikov, Ilya, Sutskever, Ilya, Kofman, Irina, Pachocki, Jakub, Lennon, James, Wei, Jason, Harb, Jean, Twore, Jerry, Feng, Jiacheng, Yu, Jiahui, Weng, Jiayi, Tang, Jie, Yu, Jieqi, Candela, Joaquin Quiñonero, Palermo, Joe, Parish, Joel, Heidecke, Johannes, Hallman, John, Rizzo, John, Gordon, Jonathan, Uesato, Jonathan, Ward, Jonathan, Huizinga, Joost, Wang, Julie, Chen, Kai, Xiao, Kai, Singhal, Karan, Nguyen, Karina, Cobbe, Karl, Shi, Katy, Wood, Kayla, Rimbach, Kendra, Gu-Lemberg, Keren, Liu, Kevin, Lu, Kevin, Stone, Kevin, Yu, Kevin, Ahmad, Lama, Yang, Lauren, Liu, Leo, Maksin, Leon, Ho, Leyton, Fedus, Liam, Weng, Lilian, Li, Linden, McCallum, Lindsay, Held, Lindsey, Kuhn, Lorenz, Kondraciuk, Lukas, Kaiser, Lukasz, Metz, Luke, Boyd, Madelaine, Trebacz, Maja, Joglekar, Manas, Chen, Mark, Tintor, Marko, Meyer, Mason, Jones, Matt, Kaufer, Matt, Schwarzer, Max, Shah, Meghan, Yatbaz, Mehmet, Guan, Melody Y., Xu, Mengyuan, Yan, Mengyuan, Glaese, Mia, Chen, Mianna, Lampe, Michael, Malek, Michael, Wang, Michele, Fradin, Michelle, McClay, Mike, Pavlov, Mikhail, Wang, Miles, Wang, Mingxuan, Murati, Mira, Bavarian, Mo, Rohaninejad, Mostafa, McAleese, Nat, Chowdhury, Neil, Ryder, Nick, Tezak, Nikolas, Brown, Noam, Nachum, Ofir, Boiko, Oleg, Murk, Oleg, Watkins, Olivia, Chao, Patrick, Ashbourne, Paul, Izmailov, Pavel, Zhokhov, Peter, Dias, Rachel, Arora, Rahul, Lin, Randall, Lopes, Rapha Gontijo, Gaon, Raz, Miyara, Reah, Leike, Reimar, Hwang, Renny, Garg, Rhythm, Brown, Robin, James, Roshan, Shu, Rui, Cheu, Ryan, Greene, Ryan, Jain, Saachi, Altman, Sam, Toizer, Sam, Toyer, Sam, Miserendino, Samuel, Agarwal, Sandhini, Hernandez, Santiago, Baker, Sasha, McKinney, Scott, Yan, Scottie, Zhao, Shengjia, Hu, Shengli, Santurkar, Shibani, Chaudhuri, Shraman Ray, Zhang, Shuyuan, Fu, Siyuan, Papay, Spencer, Lin, Steph, Balaji, Suchir, Sanjeev, Suvansh, Sidor, Szymon, Broda, Tal, Clark, Aidan, Wang, Tao, Gordon,
Taylor, Sanders, Ted, Patwardhan, Tejal, Sottiaux, Thibault, Degry, Thomas, Dimson, Thomas, Zheng, Tianhao, Garipov, Timur, Stasi, Tom, Bansal, Trapit, Creech, Trevor, Peterson, Troy, Eloundou, Tyna, Qi, Valerie, Kosaraju, Vineet, Monaco, Vinnie, Pong, Vitchyr, Fomenko, Vlad, Zheng, Weiyi, Zhou, Wenda, McCabe, Wes, Zaremba, Wojciech, Dubois, Yann, Lu, Yinghai, Chen, Yining, Cha, Young, Bai, Yu, He, Yuchen, Zhang, Yuchen, Wang, Yunyun, Shao, Zheng, Li, Zhuohan
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Burns, Collin, Izmailov, Pavel, Kirchner, Jan Hendrik, Baker, Bowen, Gao, Leo, Aschenbrenner, Leopold, Chen, Yining, Ecoffet, Adrien, Joglekar, Manas, Leike, Jan, Sutskever, Ilya, Wu, Jeff
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
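A hedged sketch of the auxiliary confidence loss the abstract alludes to is given below, under an assumed hard-label classification setup: the strong student is trained against a mixture of the weak supervisor's labels and its own hardened (argmax) predictions. The function name, tensor shapes, and fixed mixing weight are illustrative choices rather than the paper's implementation.

```python
# Sketch of an auxiliary confidence loss for weak-to-strong finetuning.
# Names, the hard-label setup, and the fixed mixing weight `alpha` are
# assumptions for illustration, not the paper's code.
import torch
import torch.nn.functional as F

def confidence_aux_loss(student_logits: torch.Tensor,
                        weak_labels: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """student_logits: (batch, num_classes); weak_labels: (batch,) hard
    labels produced by the weak supervisor."""
    # Term 1: imitate the weak supervisor's labels.
    weak_term = F.cross_entropy(student_logits, weak_labels)
    # Term 2: reinforce the student's own hardened predictions, letting it
    # stay confident where it disagrees with the weak labels.
    self_targets = student_logits.argmax(dim=-1).detach()
    self_term = F.cross_entropy(student_logits, self_targets)
    return (1.0 - alpha) * weak_term + alpha * self_term
```

In practice one would likely ramp the mixing weight up over the course of training rather than holding it fixed; that schedule is omitted here for brevity.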