Pong, Vitchyr
OpenAI o1 System Card
OpenAI, null, :, null, Jaech, Aaron, Kalai, Adam, Lerer, Adam, Richardson, Adam, El-Kishky, Ahmed, Low, Aiden, Helyar, Alec, Madry, Aleksander, Beutel, Alex, Carney, Alex, Iftimie, Alex, Karpenko, Alex, Passos, Alex Tachard, Neitz, Alexander, Prokofiev, Alexander, Wei, Alexander, Tam, Allison, Bennett, Ally, Kumar, Ananya, Saraiva, Andre, Vallone, Andrea, Duberstein, Andrew, Kondrich, Andrew, Mishchenko, Andrey, Applebaum, Andy, Jiang, Angela, Nair, Ashvin, Zoph, Barret, Ghorbani, Behrooz, Rossen, Ben, Sokolowsky, Benjamin, Barak, Boaz, McGrew, Bob, Minaiev, Borys, Hao, Botao, Baker, Bowen, Houghton, Brandon, McKinzie, Brandon, Eastman, Brydon, Lugaresi, Camillo, Bassin, Cary, Hudson, Cary, Li, Chak Ming, de Bourcy, Charles, Voss, Chelsea, Shen, Chen, Zhang, Chong, Koch, Chris, Orsinger, Chris, Hesse, Christopher, Fischer, Claudia, Chan, Clive, Roberts, Dan, Kappler, Daniel, Levy, Daniel, Selsam, Daniel, Dohan, David, Farhi, David, Mely, David, Robinson, David, Tsipras, Dimitris, Li, Doug, Oprica, Dragos, Freeman, Eben, Zhang, Eddie, Wong, Edmund, Proehl, Elizabeth, Cheung, Enoch, Mitchell, Eric, Wallace, Eric, Ritter, Erik, Mays, Evan, Wang, Fan, Such, Felipe Petroski, Raso, Filippo, Leoni, Florencia, Tsimpourlas, Foivos, Song, Francis, von Lohmann, Fred, Sulit, Freddie, Salmon, Geoff, Parascandolo, Giambattista, Chabot, Gildas, Zhao, Grace, Brockman, Greg, Leclerc, Guillaume, Salman, Hadi, Bao, Haiming, Sheng, Hao, Andrin, Hart, Bagherinezhad, Hessam, Ren, Hongyu, Lightman, Hunter, Chung, Hyung Won, Kivlichan, Ian, O'Connell, Ian, Osband, Ian, Gilaberte, Ignasi Clavera, Akkaya, Ilge, Kostrikov, Ilya, Sutskever, Ilya, Kofman, Irina, Pachocki, Jakub, Lennon, James, Wei, Jason, Harb, Jean, Twore, Jerry, Feng, Jiacheng, Yu, Jiahui, Weng, Jiayi, Tang, Jie, Yu, Jieqi, Candela, Joaquin Quiรฑonero, Palermo, Joe, Parish, Joel, Heidecke, Johannes, Hallman, John, Rizzo, John, Gordon, Jonathan, Uesato, Jonathan, Ward, Jonathan, Huizinga, Joost, Wang, Julie, Chen, Kai, Xiao, Kai, Singhal, Karan, Nguyen, Karina, Cobbe, Karl, Shi, Katy, Wood, Kayla, Rimbach, Kendra, Gu-Lemberg, Keren, Liu, Kevin, Lu, Kevin, Stone, Kevin, Yu, Kevin, Ahmad, Lama, Yang, Lauren, Liu, Leo, Maksin, Leon, Ho, Leyton, Fedus, Liam, Weng, Lilian, Li, Linden, McCallum, Lindsay, Held, Lindsey, Kuhn, Lorenz, Kondraciuk, Lukas, Kaiser, Lukasz, Metz, Luke, Boyd, Madelaine, Trebacz, Maja, Joglekar, Manas, Chen, Mark, Tintor, Marko, Meyer, Mason, Jones, Matt, Kaufer, Matt, Schwarzer, Max, Shah, Meghan, Yatbaz, Mehmet, Guan, Melody Y., Xu, Mengyuan, Yan, Mengyuan, Glaese, Mia, Chen, Mianna, Lampe, Michael, Malek, Michael, Wang, Michele, Fradin, Michelle, McClay, Mike, Pavlov, Mikhail, Wang, Miles, Wang, Mingxuan, Murati, Mira, Bavarian, Mo, Rohaninejad, Mostafa, McAleese, Nat, Chowdhury, Neil, Chowdhury, Neil, Ryder, Nick, Tezak, Nikolas, Brown, Noam, Nachum, Ofir, Boiko, Oleg, Murk, Oleg, Watkins, Olivia, Chao, Patrick, Ashbourne, Paul, Izmailov, Pavel, Zhokhov, Peter, Dias, Rachel, Arora, Rahul, Lin, Randall, Lopes, Rapha Gontijo, Gaon, Raz, Miyara, Reah, Leike, Reimar, Hwang, Renny, Garg, Rhythm, Brown, Robin, James, Roshan, Shu, Rui, Cheu, Ryan, Greene, Ryan, Jain, Saachi, Altman, Sam, Toizer, Sam, Toyer, Sam, Miserendino, Samuel, Agarwal, Sandhini, Hernandez, Santiago, Baker, Sasha, McKinney, Scott, Yan, Scottie, Zhao, Shengjia, Hu, Shengli, Santurkar, Shibani, Chaudhuri, Shraman Ray, Zhang, Shuyuan, Fu, Siyuan, Papay, Spencer, Lin, Steph, Balaji, Suchir, Sanjeev, Suvansh, Sidor, Szymon, Broda, Tal, Clark, Aidan, Wang, Tao, Gordon, Taylor, Sanders, Ted, Patwardhan, Tejal, Sottiaux, Thibault, Degry, Thomas, Dimson, Thomas, Zheng, Tianhao, Garipov, Timur, Stasi, Tom, Bansal, Trapit, Creech, Trevor, Peterson, Troy, Eloundou, Tyna, Qi, Valerie, Kosaraju, Vineet, Monaco, Vinnie, Pong, Vitchyr, Fomenko, Vlad, Zheng, Weiyi, Zhou, Wenda, McCabe, Wes, Zaremba, Wojciech, Dubois, Yann, Lu, Yinghai, Chen, Yining, Cha, Young, Bai, Yu, He, Yuchen, Zhang, Yuchen, Wang, Yunyun, Shao, Zheng, Li, Zhuohan
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
GPT-4 Technical Report
OpenAI, null, :, null, Achiam, Josh, Adler, Steven, Agarwal, Sandhini, Ahmad, Lama, Akkaya, Ilge, Aleman, Florencia Leoni, Almeida, Diogo, Altenschmidt, Janko, Altman, Sam, Anadkat, Shyamal, Avila, Red, Babuschkin, Igor, Balaji, Suchir, Balcom, Valerie, Baltescu, Paul, Bao, Haiming, Bavarian, Mo, Belgum, Jeff, Bello, Irwan, Berdine, Jake, Bernadett-Shapiro, Gabriel, Berner, Christopher, Bogdonoff, Lenny, Boiko, Oleg, Boyd, Madelaine, Brakman, Anna-Luisa, Brockman, Greg, Brooks, Tim, Brundage, Miles, Button, Kevin, Cai, Trevor, Campbell, Rosie, Cann, Andrew, Carey, Brittany, Carlson, Chelsea, Carmichael, Rory, Chan, Brooke, Chang, Che, Chantzis, Fotis, Chen, Derek, Chen, Sully, Chen, Ruby, Chen, Jason, Chen, Mark, Chess, Ben, Cho, Chester, Chu, Casey, Chung, Hyung Won, Cummings, Dave, Currier, Jeremiah, Dai, Yunxing, Decareaux, Cory, Degry, Thomas, Deutsch, Noah, Deville, Damien, Dhar, Arka, Dohan, David, Dowling, Steve, Dunning, Sheila, Ecoffet, Adrien, Eleti, Atty, Eloundou, Tyna, Farhi, David, Fedus, Liam, Felix, Niko, Fishman, Simรณn Posada, Forte, Juston, Fulford, Isabella, Gao, Leo, Georges, Elie, Gibson, Christian, Goel, Vik, Gogineni, Tarun, Goh, Gabriel, Gontijo-Lopes, Rapha, Gordon, Jonathan, Grafstein, Morgan, Gray, Scott, Greene, Ryan, Gross, Joshua, Gu, Shixiang Shane, Guo, Yufei, Hallacy, Chris, Han, Jesse, Harris, Jeff, He, Yuchen, Heaton, Mike, Heidecke, Johannes, Hesse, Chris, Hickey, Alan, Hickey, Wade, Hoeschele, Peter, Houghton, Brandon, Hsu, Kenny, Hu, Shengli, Hu, Xin, Huizinga, Joost, Jain, Shantanu, Jain, Shawn, Jang, Joanne, Jiang, Angela, Jiang, Roger, Jin, Haozhun, Jin, Denny, Jomoto, Shino, Jonn, Billie, Jun, Heewoo, Kaftan, Tomer, Kaiser, ลukasz, Kamali, Ali, Kanitscheider, Ingmar, Keskar, Nitish Shirish, Khan, Tabarak, Kilpatrick, Logan, Kim, Jong Wook, Kim, Christina, Kim, Yongjik, Kirchner, Hendrik, Kiros, Jamie, Knight, Matt, Kokotajlo, Daniel, Kondraciuk, ลukasz, Kondrich, Andrew, Konstantinidis, Aris, Kosic, Kyle, Krueger, Gretchen, Kuo, Vishal, Lampe, Michael, Lan, Ikai, Lee, Teddy, Leike, Jan, Leung, Jade, Levy, Daniel, Li, Chak Ming, Lim, Rachel, Lin, Molly, Lin, Stephanie, Litwin, Mateusz, Lopez, Theresa, Lowe, Ryan, Lue, Patricia, Makanju, Anna, Malfacini, Kim, Manning, Sam, Markov, Todor, Markovski, Yaniv, Martin, Bianca, Mayer, Katie, Mayne, Andrew, McGrew, Bob, McKinney, Scott Mayer, McLeavey, Christine, McMillan, Paul, McNeil, Jake, Medina, David, Mehta, Aalok, Menick, Jacob, Metz, Luke, Mishchenko, Andrey, Mishkin, Pamela, Monaco, Vinnie, Morikawa, Evan, Mossing, Daniel, Mu, Tong, Murati, Mira, Murk, Oleg, Mรฉly, David, Nair, Ashvin, Nakano, Reiichiro, Nayak, Rajeev, Neelakantan, Arvind, Ngo, Richard, Noh, Hyeonwoo, Ouyang, Long, O'Keefe, Cullen, Pachocki, Jakub, Paino, Alex, Palermo, Joe, Pantuliano, Ashley, Parascandolo, Giambattista, Parish, Joel, Parparita, Emy, Passos, Alex, Pavlov, Mikhail, Peng, Andrew, Perelman, Adam, Peres, Filipe de Avila Belbute, Petrov, Michael, Pinto, Henrique Ponde de Oliveira, Michael, null, Pokorny, null, Pokrass, Michelle, Pong, Vitchyr, Powell, Tolly, Power, Alethea, Power, Boris, Proehl, Elizabeth, Puri, Raul, Radford, Alec, Rae, Jack, Ramesh, Aditya, Raymond, Cameron, Real, Francis, Rimbach, Kendra, Ross, Carl, Rotsted, Bob, Roussez, Henri, Ryder, Nick, Saltarelli, Mario, Sanders, Ted, Santurkar, Shibani, Sastry, Girish, Schmidt, Heather, Schnurr, David, Schulman, John, Selsam, Daniel, Sheppard, Kyla, Sherbakov, Toki, Shieh, Jessica, Shoker, Sarah, Shyam, Pranav, Sidor, Szymon, Sigler, Eric, Simens, Maddie, Sitkin, Jordan, Slama, Katarina, Sohl, Ian, Sokolowsky, Benjamin, Song, Yang, Staudacher, Natalie, Such, Felipe Petroski, Summers, Natalie, Sutskever, Ilya, Tang, Jie, Tezak, Nikolas, Thompson, Madeleine, Tillet, Phil, Tootoonchian, Amin, Tseng, Elizabeth, Tuggle, Preston, Turley, Nick, Tworek, Jerry, Uribe, Juan Felipe Cerรณn, Vallone, Andrea, Vijayvergiya, Arun, Voss, Chelsea, Wainwright, Carroll, Wang, Justin Jay, Wang, Alvin, Wang, Ben, Ward, Jonathan, Wei, Jason, Weinmann, CJ, Welihinda, Akila, Welinder, Peter, Weng, Jiayi, Weng, Lilian, Wiethoff, Matt, Willner, Dave, Winter, Clemens, Wolrich, Samuel, Wong, Hannah, Workman, Lauren, Wu, Sherwin, Wu, Jeff, Wu, Michael, Xiao, Kai, Xu, Tao, Yoo, Sarah, Yu, Kevin, Yuan, Qiming, Zaremba, Wojciech, Zellers, Rowan, Zhang, Chong, Zhang, Marvin, Zhao, Shengjia, Zheng, Tianhao, Zhuang, Juntang, Zhuk, William, Zoph, Barret
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
Visual Reinforcement Learning with Imagined Goals
Nair, Ashvin V., Pong, Vitchyr, Dalal, Murtaza, Bahl, Shikhar, Lin, Steven, Levine, Sergey
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching.
Visual Reinforcement Learning with Imagined Goals
Nair, Ashvin V., Pong, Vitchyr, Dalal, Murtaza, Bahl, Shikhar, Lin, Steven, Levine, Sergey
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals in a real-world physical system, and substantially outperforms prior techniques.
Visual Reinforcement Learning with Imagined Goals
Nair, Ashvin V., Pong, Vitchyr, Dalal, Murtaza, Bahl, Shikhar, Lin, Steven, Levine, Sergey
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals in a real-world physical system, and substantially outperforms prior techniques.
Visual Reinforcement Learning with Imagined Goals
Nair, Ashvin, Pong, Vitchyr, Dalal, Murtaza, Bahl, Shikhar, Lin, Steven, Levine, Sergey
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques.
Composable Deep Reinforcement Learning for Robotic Manipulation
Haarnoja, Tuomas, Pong, Vitchyr, Zhou, Aurick, Dalal, Murtaza, Abbeel, Pieter, Levine, Sergey
Model-free deep reinforcement learning has been shown to exhibit good performance in domains ranging from video games to simulated robotic manipulation and locomotion. However, model-free methods are known to perform poorly when the interaction time with the environment is limited, as is the case for most real-world robotic tasks. In this paper, we study how maximum entropy policies trained using soft Q-learning can be applied to real-world robotic manipulation. The application of this method to real-world manipulation is facilitated by two important features of soft Q-learning. First, soft Q-learning can learn multimodal exploration strategies by learning policies represented by expressive energy-based models. Second, we show that policies learned with soft Q-learning can be composed to create new policies, and that the optimality of the resulting policy can be bounded in terms of the divergence between the composed policies. This compositionality provides an especially valuable tool for real-world manipulation, where constructing new policies by composing existing skills can provide a large gain in efficiency over training from scratch. Our experimental evaluation demonstrates that soft Q-learning is substantially more sample efficient than prior model-free deep reinforcement learning methods, and that compositionality can be performed for both simulated and real-world tasks.