Kosaraju, Vineet
Competitive Programming with Large Reasoning Models
OpenAI: El-Kishky, Ahmed, Wei, Alexander, Saraiva, Andre, Minaev, Borys, Selsam, Daniel, Dohan, David, Song, Francis, Lightman, Hunter, Clavera, Ignasi, Pachocki, Jakub, Tworek, Jerry, Kuhn, Lorenz, Kaiser, Lukasz, Chen, Mark, Schwarzer, Max, Rohaninejad, Mostafa, McAleese, Nat, o3 contributors, Mürk, Oleg, Garg, Rhythm, Shu, Rui, Sidor, Szymon, Kosaraju, Vineet, Zhou, Wenda
We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.
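For intuition, the sketch below illustrates one simple form that a hand-crafted, sample-and-filter test-time strategy can take: generate many candidate programs and submit the one that passes the most public example tests. The `sample_solution` generator, the two-second timeout, and the sample count are hypothetical placeholders for illustration, not details of the o1-ioi pipeline.

```python
# Hedged sketch of a sample-and-filter test-time strategy (illustrative only).
import subprocess
import tempfile

def passes_public_tests(source_code: str, tests: list[tuple[str, str]]) -> int:
    """Count how many (stdin, expected_stdout) pairs a candidate program passes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source_code)
        path = f.name
    passed = 0
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                ["python", path], input=stdin_text,
                capture_output=True, text=True, timeout=2,
            )
            if result.stdout.strip() == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # treat timeouts as failures
    return passed

def select_submission(sample_solution, problem: str, public_tests, n_samples: int = 50) -> str:
    """Sample many candidate programs, then submit the one passing the most public tests."""
    candidates = [sample_solution(problem) for _ in range(n_samples)]
    return max(candidates, key=lambda code: passes_public_tests(code, public_tests))
```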
OpenAI o1 System Card
OpenAI: Jaech, Aaron, Kalai, Adam, Lerer, Adam, Richardson, Adam, El-Kishky, Ahmed, Low, Aiden, Helyar, Alec, Madry, Aleksander, Beutel, Alex, Carney, Alex, Iftimie, Alex, Karpenko, Alex, Passos, Alex Tachard, Neitz, Alexander, Prokofiev, Alexander, Wei, Alexander, Tam, Allison, Bennett, Ally, Kumar, Ananya, Saraiva, Andre, Vallone, Andrea, Duberstein, Andrew, Kondrich, Andrew, Mishchenko, Andrey, Applebaum, Andy, Jiang, Angela, Nair, Ashvin, Zoph, Barret, Ghorbani, Behrooz, Rossen, Ben, Sokolowsky, Benjamin, Barak, Boaz, McGrew, Bob, Minaiev, Borys, Hao, Botao, Baker, Bowen, Houghton, Brandon, McKinzie, Brandon, Eastman, Brydon, Lugaresi, Camillo, Bassin, Cary, Hudson, Cary, Li, Chak Ming, de Bourcy, Charles, Voss, Chelsea, Shen, Chen, Zhang, Chong, Koch, Chris, Orsinger, Chris, Hesse, Christopher, Fischer, Claudia, Chan, Clive, Roberts, Dan, Kappler, Daniel, Levy, Daniel, Selsam, Daniel, Dohan, David, Farhi, David, Mely, David, Robinson, David, Tsipras, Dimitris, Li, Doug, Oprica, Dragos, Freeman, Eben, Zhang, Eddie, Wong, Edmund, Proehl, Elizabeth, Cheung, Enoch, Mitchell, Eric, Wallace, Eric, Ritter, Erik, Mays, Evan, Wang, Fan, Such, Felipe Petroski, Raso, Filippo, Leoni, Florencia, Tsimpourlas, Foivos, Song, Francis, von Lohmann, Fred, Sulit, Freddie, Salmon, Geoff, Parascandolo, Giambattista, Chabot, Gildas, Zhao, Grace, Brockman, Greg, Leclerc, Guillaume, Salman, Hadi, Bao, Haiming, Sheng, Hao, Andrin, Hart, Bagherinezhad, Hessam, Ren, Hongyu, Lightman, Hunter, Chung, Hyung Won, Kivlichan, Ian, O'Connell, Ian, Osband, Ian, Gilaberte, Ignasi Clavera, Akkaya, Ilge, Kostrikov, Ilya, Sutskever, Ilya, Kofman, Irina, Pachocki, Jakub, Lennon, James, Wei, Jason, Harb, Jean, Tworek, Jerry, Feng, Jiacheng, Yu, Jiahui, Weng, Jiayi, Tang, Jie, Yu, Jieqi, Candela, Joaquin Quiñonero, Palermo, Joe, Parish, Joel, Heidecke, Johannes, Hallman, John, Rizzo, John, Gordon, Jonathan, Uesato, Jonathan, Ward, Jonathan, Huizinga, Joost, Wang, Julie, Chen, Kai, Xiao, Kai, Singhal, Karan, Nguyen, Karina, Cobbe, Karl, Shi, Katy, Wood, Kayla, Rimbach, Kendra, Gu-Lemberg, Keren, Liu, Kevin, Lu, Kevin, Stone, Kevin, Yu, Kevin, Ahmad, Lama, Yang, Lauren, Liu, Leo, Maksin, Leon, Ho, Leyton, Fedus, Liam, Weng, Lilian, Li, Linden, McCallum, Lindsay, Held, Lindsey, Kuhn, Lorenz, Kondraciuk, Lukas, Kaiser, Lukasz, Metz, Luke, Boyd, Madelaine, Trebacz, Maja, Joglekar, Manas, Chen, Mark, Tintor, Marko, Meyer, Mason, Jones, Matt, Kaufer, Matt, Schwarzer, Max, Shah, Meghan, Yatbaz, Mehmet, Guan, Melody Y., Xu, Mengyuan, Yan, Mengyuan, Glaese, Mia, Chen, Mianna, Lampe, Michael, Malek, Michael, Wang, Michele, Fradin, Michelle, McClay, Mike, Pavlov, Mikhail, Wang, Miles, Wang, Mingxuan, Murati, Mira, Bavarian, Mo, Rohaninejad, Mostafa, McAleese, Nat, Chowdhury, Neil, Ryder, Nick, Tezak, Nikolas, Brown, Noam, Nachum, Ofir, Boiko, Oleg, Murk, Oleg, Watkins, Olivia, Chao, Patrick, Ashbourne, Paul, Izmailov, Pavel, Zhokhov, Peter, Dias, Rachel, Arora, Rahul, Lin, Randall, Lopes, Rapha Gontijo, Gaon, Raz, Miyara, Reah, Leike, Reimar, Hwang, Renny, Garg, Rhythm, Brown, Robin, James, Roshan, Shu, Rui, Cheu, Ryan, Greene, Ryan, Jain, Saachi, Altman, Sam, Toizer, Sam, Toyer, Sam, Miserendino, Samuel, Agarwal, Sandhini, Hernandez, Santiago, Baker, Sasha, McKinney, Scott, Yan, Scottie, Zhao, Shengjia, Hu, Shengli, Santurkar, Shibani, Chaudhuri, Shraman Ray, Zhang, Shuyuan, Fu, Siyuan, Papay, Spencer, Lin, Steph, Balaji, Suchir, Sanjeev, Suvansh, Sidor, Szymon, Broda, Tal, Clark, Aidan, Wang, Tao, Gordon,
Taylor, Sanders, Ted, Patwardhan, Tejal, Sottiaux, Thibault, Degry, Thomas, Dimson, Thomas, Zheng, Tianhao, Garipov, Timur, Stasi, Tom, Bansal, Trapit, Creech, Trevor, Peterson, Troy, Eloundou, Tyna, Qi, Valerie, Kosaraju, Vineet, Monaco, Vinnie, Pong, Vitchyr, Fomenko, Vlad, Zheng, Weiyi, Zhou, Wenda, McCabe, Wes, Zaremba, Wojciech, Dubois, Yann, Lu, Yinghai, Chen, Yining, Cha, Young, Bai, Yu, He, Yuchen, Zhang, Yuchen, Wang, Yunyun, Shao, Zheng, Li, Zhuohan
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
Let's Verify Step by Step
Lightman, Hunter, Kosaraju, Vineet, Burda, Yura, Edwards, Harri, Baker, Bowen, Lee, Teddy, Leike, Jan, Schulman, John, Sutskever, Ilya, Cobbe, Karl
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
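As a rough illustration of the distinction between the two forms of supervision (not the paper's reward-model training code; the callables and the product aggregation rule below are assumptions made for the sketch):

```python
# Hedged sketch contrasting outcome supervision with process supervision.
from typing import Callable, List

def outcome_score(solution_steps: List[str],
                  final_answer_is_correct: Callable[[str], bool]) -> float:
    """Outcome supervision: a single label for the whole solution,
    determined only by whether the final result is correct."""
    return 1.0 if final_answer_is_correct(solution_steps[-1]) else 0.0

def process_score(solution_steps: List[str],
                  step_reward_model: Callable[[str], float]) -> float:
    """Process supervision: feedback for each intermediate reasoning step.
    Here the per-step correctness probabilities are multiplied, so one
    clearly wrong step sinks the whole solution."""
    score = 1.0
    for step in solution_steps:
        score *= step_reward_model(step)
    return score
```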
WebGPT: Browser-assisted question-answering with human feedback
Nakano, Reiichiro, Hilton, Jacob, Balaji, Suchir, Wu, Jeff, Ouyang, Long, Kim, Christina, Hesse, Christopher, Jain, Shantanu, Kosaraju, Vineet, Saunders, William, Jiang, Xu, Cobbe, Karl, Eloundou, Tyna, Krueger, Gretchen, Button, Kevin, Knight, Matthew, Chess, Benjamin, Schulman, John
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
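A minimal sketch of the rejection-sampling (best-of-n) stage described above, with placeholder callables standing in for the behavior-cloned policy and the preference-trained reward model; the names and the value of n are illustrative assumptions, not WebGPT's actual interfaces.

```python
# Hedged sketch of best-of-n rejection sampling against a reward model.
from typing import Callable, List

def best_of_n(question: str,
              generate_answer: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n answers from the fine-tuned policy and return the one the
    reward model, trained to predict human preferences, scores highest."""
    candidates: List[str] = [generate_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_model(question, answer))
```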
Asymmetric self-play for automatic goal discovery in robotic manipulation
OpenAI: Plappert, Matthias, Sampedro, Raul, Xu, Tao, Akkaya, Ilge, Kosaraju, Vineet, Welinder, Peter, D'Sa, Ruben, Petron, Arthur, Pinto, Henrique Ponde de Oliveira, Paino, Alex, Noh, Hyeonwoo, Weng, Lilian, Yuan, Qiming, Chu, Casey, Zaremba, Wojciech
We train a single, goal-conditioned policy that can solve many robotic manipulation tasks, including tasks with previously unseen goals and objects. We rely on asymmetric self-play for goal discovery, where two agents, Alice and Bob, play a game. Alice is asked to propose challenging goals and Bob aims to solve them. We show that this method can discover highly diverse and complex goals without any human priors. Bob can be trained with only sparse rewards, because the interaction between Alice and Bob results in a natural curriculum and Bob can learn from Alice's trajectory when relabeled as a goal-conditioned demonstration. Finally, our method scales, resulting in a single policy that can generalize to many unseen tasks such as setting a table, stacking blocks, and solving simple puzzles.

We are motivated to train a single goal-conditioned policy (Kaelbling, 1993) that can solve any robotic manipulation task that a human may request in a given environment. In this work, we make progress towards this goal by solving a robotic manipulation problem in a tabletop setting where the robot's task is to change the initial configuration of a variable number of objects on a table to match a given goal configuration. This problem is simple in its formulation but likely to challenge a wide variety of cognitive abilities of a robot as objects become diverse and goals become complex. Motivated by the recent success of deep reinforcement learning for robotics (Levine et al., 2016; Gu et al., 2017; Hwangbo et al., 2019; OpenAI et al., 2019a), we tackle this problem using deep reinforcement learning on a very large training distribution. An open question in this approach is how we can build a training distribution rich enough to achieve generalization to many unseen manipulation tasks. This involves defining both an environment's initial state distribution and a goal distribution.
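A hedged sketch of one Alice/Bob self-play episode with goal relabeling follows. The environment and policy interfaces (reset_to_initial, goal_reached, add_demonstration, Gym-style step returns) are hypothetical names assumed for the sketch, not the paper's API, and the real system runs this loop at scale with deep RL.

```python
# Illustrative Alice/Bob asymmetric self-play episode (assumed interfaces).
def self_play_episode(env, alice_policy, bob_policy, alice_steps=50, bob_steps=50):
    # Alice acts freely from the initial state; her final state becomes the proposed goal.
    state = env.reset()
    alice_trajectory = []
    for _ in range(alice_steps):
        action = alice_policy.act(state)
        next_state, _, done, _ = env.step(action)
        alice_trajectory.append((state, action))
        state = next_state
        if done:
            break
    goal = state

    # Bob starts from the same initial configuration and tries to reach Alice's goal.
    state = env.reset_to_initial()
    solved = False
    for _ in range(bob_steps):
        state, _, _, _ = env.step(bob_policy.act(state, goal))
        if env.goal_reached(state, goal):
            solved = True
            break

    # If Bob fails, relabel Alice's trajectory as a goal-conditioned demonstration:
    # Alice demonstrably reached the goal, so Bob can imitate her.
    if not solved:
        bob_policy.add_demonstration(goal=goal, trajectory=alice_trajectory)
    return solved
```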
BERT Learns (and Teaches) Chemistry
Payne, Josh, Srouji, Mario, Yap, Dian Ang, Kosaraju, Vineet
Modern computational organic chemistry is becoming increasingly data-driven. There remain a large number of important unsolved problems in this area, such as product prediction given reactants, drug discovery, and metric-optimized molecule synthesis, but efforts to solve these problems using machine learning have also increased in recent years. In this work, we propose the use of attention to study functional groups and other property-impacting molecular substructures from a data-driven perspective, using a transformer-based model (BERT) on datasets of string representations of molecules and analyzing the behavior of its attention heads. We then apply the representations of functional groups and atoms learned by the model to tackle problems of toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of BERT. Finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly identify important substructures in various chemical properties.
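In the spirit of the attention analysis described above, here is a small sketch that extracts and inspects attention weights from a BERT model over a SMILES string using Hugging Face Transformers. The generic bert-base-uncased checkpoint and its word-piece tokenization of SMILES are placeholders; the paper's chemistry-pretrained weights and tokenizer are not assumed to be available here.

```python
# Hedged sketch: inspect BERT attention over a SMILES string (placeholder checkpoint).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # placeholder, not chemistry-pretrained
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
inputs = tokenizer(smiles, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of per-layer tensors shaped
# (batch, num_heads, seq_len, seq_len). Average over heads in the last
# layer to see which tokens each position attends to most.
last_layer = outputs.attentions[-1]         # (1, heads, seq, seq)
mean_attention = last_layer.mean(dim=1)[0]  # (seq, seq)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for i, tok in enumerate(tokens):
    top = mean_attention[i].topk(3).indices.tolist()
    print(tok, "->", [tokens[j] for j in top])
```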
Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks
Kosaraju, Vineet, Sadeghian, Amir, Martín-Martín, Roberto, Reid, Ian, Rezatofighi, Hamid, Savarese, Silvio
Predicting the future trajectories of multiple interacting pedestrians in a scene has become an increasingly important problem for many different applications ranging from control of autonomous vehicles and social robots to security and surveillance. This problem is compounded by the presence of social interactions between humans and their physical interactions with the scene. While the existing literature has explored some of these cues, it has mainly ignored the multimodal nature of each human's future trajectory, which is noticeably influenced by intricate social interactions. In this paper, we present Social-BiGAT, a graph-based generative adversarial network that generates realistic, multimodal trajectory predictions for multiple pedestrians in a scene. Our method is based on a graph attention network (GAT) that learns feature representations that encode the social interactions between humans in the scene, and a recurrent encoder-decoder architecture that is trained adversarially to predict, based on the features, the humans' paths.
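To make the social-interaction encoding concrete, below is a minimal single-head graph-attention sketch over per-pedestrian features in plain PyTorch. The dimensions, the fully connected interaction graph, and the class name are illustrative assumptions rather than the Social-BiGAT architecture itself.

```python
# Hedged sketch of single-head graph attention over pedestrian features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SocialAttention(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_pedestrians, in_dim) encoder features, one row per pedestrian.
        z = self.proj(h)                                   # (N, out_dim)
        n = z.size(0)
        pairs = torch.cat(                                 # all ordered pairs (i, j)
            [z.unsqueeze(1).expand(n, n, -1),
             z.unsqueeze(0).expand(n, n, -1)], dim=-1)     # (N, N, 2*out_dim)
        scores = F.leaky_relu(self.attn(pairs)).squeeze(-1)  # (N, N)
        weights = torch.softmax(scores, dim=-1)            # attention over all pedestrians
        return weights @ z                                 # socially pooled features

# Example: 4 pedestrians with 32-dim encoder states.
pooled = SocialAttention(32, 64)(torch.randn(4, 32))
print(pooled.shape)  # torch.Size([4, 64])
```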