Pertsch, Karl
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Khazatsky, Alexander, Pertsch, Karl, Nair, Suraj, Balakrishna, Ashwin, Dasari, Sudeep, Karamcheti, Siddharth, Nasiriany, Soroush, Srirama, Mohan Kumar, Chen, Lawrence Yunliang, Ellis, Kirsty, Fagan, Peter David, Hejna, Joey, Itkina, Masha, Lepert, Marion, Ma, Yecheng Jason, Miller, Patrick Tree, Wu, Jimmy, Belkhale, Suneel, Dass, Shivin, Ha, Huy, Jain, Arhan, Lee, Abraham, Lee, Youngwoon, Memmel, Marius, Park, Sungjae, Radosavovic, Ilija, Wang, Kaiyuan, Zhan, Albert, Black, Kevin, Chi, Cheng, Hatch, Kyle Beltran, Lin, Shan, Lu, Jingpei, Mercat, Jean, Rehman, Abdul, Sanketi, Pannag R, Sharma, Archit, Simpson, Cody, Vuong, Quan, Walke, Homer Rich, Wulfe, Blake, Xiao, Ted, Yang, Jonathan Heewon, Yavary, Arefeh, Zhao, Tony Z., Agia, Christopher, Baijal, Rohan, Castro, Mateo Guaman, Chen, Daphne, Chen, Qiuyu, Chung, Trinity, Drake, Jaimyn, Foster, Ethan Paul, Gao, Jensen, Herrera, David Antonio, Heo, Minho, Hsu, Kyle, Hu, Jiaheng, Jackson, Donovon, Le, Charlotte, Li, Yunshuang, Lin, Kevin, Lin, Roy, Ma, Zehan, Maddukuri, Abhiram, Mirchandani, Suvir, Morton, Daniel, Nguyen, Tony, O'Neill, Abigail, Scalise, Rosario, Seale, Derick, Son, Victor, Tian, Stephen, Tran, Emi, Wang, Andrew E., Wu, Yilin, Xie, Annie, Yang, Jingyun, Yin, Patrick, Zhang, Yunchu, Bastani, Osbert, Berseth, Glen, Bohg, Jeannette, Goldberg, Ken, Gupta, Abhinav, Gupta, Abhishek, Jayaraman, Dinesh, Lim, Joseph J, Malik, Jitendra, Martín-Martín, Roberto, Ramamoorthy, Subramanian, Sadigh, Dorsa, Song, Shuran, Wu, Jiajun, Yip, Michael C., Zhu, Yuke, Kollar, Thomas, Levine, Sergey, Finn, Chelsea
The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.
SPRINT: Scalable Policy Pre-Training via Language Instruction Relabeling
Zhang, Jesse, Pertsch, Karl, Zhang, Jiahui, Lim, Joseph J.
Pre-training robot policies with a rich set of skills can substantially accelerate the learning of downstream tasks. Prior works have defined pre-training tasks via natural language instructions, but doing so requires tedious human annotation of hundreds of thousands of instructions. Thus, we propose SPRINT, a scalable offline policy pre-training approach which substantially reduces the human effort needed for pre-training a diverse set of skills. Our method uses two core ideas to automatically expand a base set of pre-training tasks: instruction relabeling via large language models and cross-trajectory skill chaining through offline reinforcement learning. As a result, SPRINT pre-training equips robots with a much richer repertoire of skills. Experimental results in a household simulator and on a real robot kitchen manipulation task show that SPRINT leads to substantially faster learning of new long-horizon tasks than previous pre-training approaches. Website at https://clvrai.com/sprint.
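The instruction-relabeling idea lends itself to a compact sketch. Below is a minimal illustration of how an LLM can merge consecutive annotated sub-trajectories into a single higher-level pre-training task; the prompt format and the `query_llm` callable are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of LLM-based instruction relabeling (one of SPRINT's two
# core ideas). The prompt and the query_llm interface are illustrative
# assumptions, not the paper's implementation.
def relabel_instructions(sub_instructions, query_llm):
    """Ask an LLM to summarize consecutive sub-task instructions into one
    higher-level instruction, yielding a new pre-training task for free."""
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sub_instructions))
    prompt = (
        "Summarize the following sequence of robot sub-tasks "
        f"as one short instruction:\n{numbered}\nSummary:"
    )
    return query_llm(prompt)

# Example: relabel_instructions(["pick up the mug", "put the mug in the sink"],
# query_llm) might return "place the mug in the sink", extending the task set
# without additional human annotation.
```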
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Collaboration, Open X-Embodiment, Padalkar, Abhishek, Pooley, Acorn, Mandlekar, Ajay, Jain, Ajinkya, Tung, Albert, Bewley, Alex, Herzog, Alex, Irpan, Alex, Khazatsky, Alexander, Rai, Anant, Singh, Anikait, Garg, Animesh, Brohan, Anthony, Raffin, Antonin, Wahid, Ayzaan, Burgess-Limerick, Ben, Kim, Beomjoon, Schölkopf, Bernhard, Ichter, Brian, Lu, Cewu, Xu, Charles, Finn, Chelsea, Xu, Chenfeng, Chi, Cheng, Huang, Chenguang, Chan, Christine, Pan, Chuer, Fu, Chuyuan, Devin, Coline, Driess, Danny, Pathak, Deepak, Shah, Dhruv, Büchler, Dieter, Kalashnikov, Dmitry, Sadigh, Dorsa, Johns, Edward, Ceola, Federico, Xia, Fei, Stulp, Freek, Zhou, Gaoyue, Sukhatme, Gaurav S., Salhotra, Gautam, Yan, Ge, Schiavi, Giulio, Kahn, Gregory, Su, Hao, Fang, Hao-Shu, Shi, Haochen, Amor, Heni Ben, Christensen, Henrik I, Furuta, Hiroki, Walke, Homer, Fang, Hongjie, Mordatch, Igor, Radosavovic, Ilija, Leal, Isabel, Liang, Jacky, Abou-Chakra, Jad, Kim, Jaehyung, Peters, Jan, Schneider, Jan, Hsu, Jasmine, Bohg, Jeannette, Bingham, Jeffrey, Wu, Jiajun, Wu, Jialin, Luo, Jianlan, Gu, Jiayuan, Tan, Jie, Oh, Jihoon, Malik, Jitendra, Booher, Jonathan, Tompson, Jonathan, Yang, Jonathan, Lim, Joseph J., Silvério, João, Han, Junhyek, Rao, Kanishka, Pertsch, Karl, Hausman, Karol, Go, Keegan, Gopalakrishnan, Keerthana, Goldberg, Ken, Byrne, Kendra, Oslund, Kenneth, Kawaharazuka, Kento, Zhang, Kevin, Rana, Krishan, Srinivasan, Krishnan, Chen, Lawrence Yunliang, Pinto, Lerrel, Fei-Fei, Li, Tan, Liam, Ott, Lionel, Lee, Lisa, Tomizuka, Masayoshi, Spero, Max, Du, Maximilian, Ahn, Michael, Zhang, Mingtong, Ding, Mingyu, Srirama, Mohan Kumar, Sharma, Mohit, Kim, Moo Jin, Kanazawa, Naoaki, Hansen, Nicklas, Heess, Nicolas, Joshi, Nikhil J, Suenderhauf, Niko, Di Palo, Norman, Shafiullah, Nur Muhammad Mahi, Mees, Oier, Kroemer, Oliver, Sanketi, Pannag R, Wohlhart, Paul, Xu, Peng, Sermanet, Pierre, Sundaresan, Priya, Vuong, Quan, Rafailov, Rafael, Tian, Ran, Doshi, Ria, Martín-Martín, Roberto, Mendonca, Russell, Shah, Rutav, Hoque, Ryan, Julian, Ryan, Bustamante, Samuel, Kirmani, Sean, Levine, Sergey, Moore, Sherry, Bahl, Shikhar, Dass, Shivin, Sonawani, Shubham, Song, Shuran, Xu, Sichun, Haldar, Siddhant, Adebola, Simeon, Guist, Simon, Nasiriany, Soroush, Schaal, Stefan, Welker, Stefan, Tian, Stephen, Dasari, Sudeep, Belkhale, Suneel, Osa, Takayuki, Harada, Tatsuya, Matsushima, Tatsuya, Xiao, Ted, Yu, Tianhe, Ding, Tianli, Davchev, Todor, Zhao, Tony Z., Armstrong, Travis, Darrell, Trevor, Jain, Vidhi, Vanhoucke, Vincent, Zhan, Wei, Zhou, Wenxuan, Burgard, Wolfram, Chen, Xi, Wang, Xiaolong, Zhu, Xinghao, Li, Xuanlin, Lu, Yao, Chebotar, Yevgen, Zhou, Yifan, Zhu, Yifeng, Xu, Ying, Wang, Yixuan, Bisk, Yonatan, Cho, Yoonyoung, Lee, Youngwoon, Cui, Yuchen, Wu, Yueh-Hua, Tang, Yujin, Zhu, Yuke, Li, Yunzhu, Iwasawa, Yusuke, Matsuo, Yutaka, Xu, Zhuo, Cui, Zichen Jeff
Large, high-capacity models trained on diverse datasets have shown remarkable success in efficiently tackling downstream applications. In domains from NLP to computer vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website: https://robotics-transformer-x.github.io.
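For concreteness, here is one way such standardized (RLDS/TFDS-style) datasets can be consumed with `tensorflow_datasets`; the bucket path, version, and field names below are assumptions for illustration — consult the project website for the actual dataset catalog.

```python
# Hedged sketch of iterating one Open X-Embodiment dataset in RLDS format.
# The GCS path/version and field names are assumptions; see the project
# website for the real catalog and paths.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("gs://gresearch/robotics/bridge/0.1.0")
dataset = builder.as_dataset(split="train[:10]")

for episode in dataset:
    # RLDS stores each episode as a nested dataset of time steps.
    for step in episode["steps"]:
        observation, action = step["observation"], step["action"]
```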
LiFT: Unsupervised Reinforcement Learning with Foundation Models as Teachers
Nam, Taewook, Lee, Juyong, Zhang, Jesse, Hwang, Sung Ju, Lim, Joseph J., Pertsch, Karl
We propose a framework that leverages foundation models as teachers, guiding a reinforcement learning agent to acquire semantically meaningful behavior without human feedback. In our framework, the agent receives task instructions grounded in a training environment from large language models. Then, a vision-language model guides the agent in learning a multi-task, language-conditioned policy by providing reward feedback. We demonstrate that our method can learn semantically meaningful skills in the challenging open-ended MineDojo environment, where prior unsupervised skill discovery methods struggle. Additionally, we discuss observed challenges of using off-the-shelf foundation models as teachers and our efforts to address them.
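A minimal sketch of the teacher loop described above might look as follows; the environment interface and the two foundation-model helpers (`llm_propose_tasks`, `vlm_score`) are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of a LiFT-style loop: an LLM proposes grounded task
# instructions, the agent rolls out a language-conditioned policy, and a VLM
# scores the behavior to produce reward. All helper and env names are
# illustrative assumptions.
import random

def unsupervised_step(env, policy, llm_propose_tasks, vlm_score):
    instruction = random.choice(llm_propose_tasks(env.description))
    obs, frames = env.reset(), []
    for _ in range(env.max_steps):
        action = policy.act(obs, instruction)  # language-conditioned policy
        obs, done = env.step(action)
        frames.append(obs.image)
        if done:
            break
    reward = vlm_score(frames, instruction)    # VLM feedback, no human labels
    policy.update(instruction, frames, reward)
```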
Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
Chebotar, Yevgen, Vuong, Quan, Irpan, Alex, Hausman, Karol, Xia, Fei, Lu, Yao, Kumar, Aviral, Yu, Tianhe, Herzog, Alexander, Pertsch, Karl, Gopalakrishnan, Keerthana, Ibarz, Julian, Nachum, Ofir, Sontakke, Sumedh, Salazar, Grecia, Tran, Huong T, Peralta, Jodilyn, Tan, Clayton, Manjunath, Deeksha, Singh, Jaspiar, Zitkovich, Brianna, Jackson, Tomas, Rao, Kanishka, Finn, Chelsea, Levine, Sergey
In this work, we present a scalable reinforcement learning method for training multi-task policies from large offline datasets that can leverage both human demonstrations and autonomously collected data. Our method uses a Transformer to provide a scalable representation for Q-functions trained via offline temporal difference backups. We therefore refer to the method as Q-Transformer. By discretizing each action dimension and representing the Q-value of each action dimension as separate tokens, we can apply effective high-capacity sequence modeling techniques for Q-learning. We present several design decisions that enable good performance with offline RL training, and show that Q-Transformer outperforms prior offline RL algorithms and imitation learning techniques on a large diverse real-world robotic manipulation task suite. The project's website and videos can be found at https://qtransformer.github.io
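The per-dimension autoregressive Q-function can be sketched compactly: each action dimension is discretized into bins, and the Q-head for dimension i conditions on the bins already chosen for earlier dimensions. The sketch below uses plain MLP heads rather than a Transformer, and all names and sizes are simplifying assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch of a per-dimension autoregressive Q-function over
# discretized actions. Network shapes are illustrative assumptions.
import torch
import torch.nn as nn

class AutoregressiveQ(nn.Module):
    def __init__(self, state_dim, action_dims=7, num_bins=256, hidden=256):
        super().__init__()
        # One Q-head per action dimension; head i sees the state plus the
        # bin indices already chosen for dimensions 0..i-1.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + i, hidden),
                nn.ReLU(),
                nn.Linear(hidden, num_bins),
            )
            for i in range(action_dims)
        )

    def greedy_action(self, state):
        """Pick action bins one dimension at a time (argmax over Q-values)."""
        bins = []
        for head in self.heads:
            prev = torch.tensor(bins, dtype=torch.float32)
            q_values = head(torch.cat([state, prev]))
            bins.append(q_values.argmax().item())
        return bins  # discrete bin indices, mapped back to continuous actions

q_net = AutoregressiveQ(state_dim=32)
print(q_net.greedy_action(torch.randn(32)))
```

Treating each dimension as a separate token keeps the per-step action space small (num_bins rather than num_bins^action_dims), which is what makes sequence-model-style Q-learning tractable here.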
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance
Zhang, Jesse, Zhang, Jiahui, Pertsch, Karl, Liu, Ziyi, Ren, Xiang, Chang, Minsuk, Sun, Shao-Hua, Lim, Joseph J.
We propose BOSS, an approach that automatically learns to solve new long-horizon, complex, and meaningful tasks by growing a learned skill library with minimal supervision. Prior work in reinforcement learning requires expert supervision, in the form of demonstrations or rich reward functions, to learn long-horizon tasks. Instead, our approach BOSS (BOotStrapping your own Skills) learns to accomplish new tasks by performing "skill bootstrapping," where an agent with a set of primitive skills interacts with the environment to practice new skills without receiving reward feedback for tasks outside of the initial skill set. This bootstrapping phase is guided by large language models (LLMs) that inform the agent of meaningful skills to chain together. Through this process, BOSS builds a wide range of complex and useful behaviors from a basic set of primitive skills. We demonstrate through experiments in realistic household environments that agents trained with our LLM-guided bootstrapping procedure outperform those trained with naive bootstrapping as well as prior unsupervised skill acquisition methods on zero-shot execution of unseen, long-horizon tasks in new environments. Website at clvrai.com/boss.
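The skill-bootstrapping loop can be sketched as follows; the `query_llm_next_skill` and `try_chain` helpers are assumptions about the interface, not the paper's implementation.

```python
# Hedged sketch of LLM-guided skill bootstrapping: practice LLM-suggested
# skill chains and add successful chains to the library. Helper names are
# illustrative assumptions.
import random

def skill_bootstrapping(env, skill_library, query_llm_next_skill, try_chain,
                        num_rounds=100):
    for _ in range(num_rounds):
        first = random.choice(sorted(skill_library))
        # The LLM proposes a semantically meaningful continuation,
        # e.g. "pick up the sponge" -> "wipe the counter".
        second = query_llm_next_skill(first, sorted(skill_library))
        # Practice the chain in the environment without any task reward.
        if try_chain(env, [first, second]):
            skill_library.add(f"{first}, then {second}")
    return skill_library
```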
RoboCLIP: One Demonstration is Enough to Learn Robot Policies
Sontakke, Sumedh A, Zhang, Jesse, Arnold, Sébastien M. R., Pertsch, Karl, Bıyık, Erdem, Sadigh, Dorsa, Finn, Chelsea, Itti, Laurent
Reward specification is a notoriously difficult problem in reinforcement learning, requiring extensive expert supervision to design robust reward functions. Imitation learning (IL) methods attempt to circumvent these problems by utilizing expert demonstrations, but typically require a large number of in-domain expert demonstrations. Inspired by advances in the field of video-and-language models (VLMs), we present RoboCLIP, an online imitation learning method that uses a single demonstration (overcoming the large data requirement) in the form of a video demonstration or a textual description of the task to generate rewards without manual reward function design. Additionally, RoboCLIP can utilize out-of-domain demonstrations, such as videos of humans solving the task, for reward generation, circumventing the need to have the same demonstration and deployment domains. RoboCLIP utilizes pretrained VLMs without any finetuning for reward generation. Reinforcement learning agents trained with RoboCLIP rewards demonstrate 2-3 times higher zero-shot performance than competing imitation learning methods on downstream robot manipulation tasks, doing so using only one video/text demonstration.
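The reward mechanism reduces to a similarity score in the VLM's embedding space. The sketch below illustrates this with generic encoder callables; the encoders and the episode-level (sparse) reward placement are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of a RoboCLIP-style reward: embed the agent's rollout with a
# pretrained video-language model and compare it to the embedding of a single
# demonstration (video or text). encode_video / encode_text stand in for the
# pretrained VLM's encoders and are assumptions here.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def episode_reward(rollout_frames, demo_embedding, encode_video):
    """Sparse reward issued once at the end of an episode: similarity between
    the embedded rollout and the embedded demonstration."""
    return cosine_similarity(encode_video(rollout_frames), demo_embedding)

# Usage: demo_embedding = encode_text("a robot closing the drawer"), or
# encode_video(human_demo_frames) for an out-of-domain video demonstration;
# the RL agent is trained to maximize episode_reward.
```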
RT-1: Robotics Transformer for Real-World Control at Scale
Brohan, Anthony, Brown, Noah, Carbajal, Justice, Chebotar, Yevgen, Dabis, Joseph, Finn, Chelsea, Gopalakrishnan, Keerthana, Hausman, Karol, Herzog, Alex, Hsu, Jasmine, Ibarz, Julian, Ichter, Brian, Irpan, Alex, Jackson, Tomas, Jesmonth, Sally, Joshi, Nikhil J, Julian, Ryan, Kalashnikov, Dmitry, Kuang, Yuheng, Leal, Isabel, Lee, Kuang-Huei, Levine, Sergey, Lu, Yao, Malla, Utsav, Manjunath, Deeksha, Mordatch, Igor, Nachum, Ofir, Parada, Carolina, Peralta, Jodilyn, Perez, Emily, Pertsch, Karl, Quiambao, Jornell, Rao, Kanishka, Ryoo, Michael, Salazar, Grecia, Sanketi, Pannag, Sayed, Kevin, Singh, Jaspiar, Sontakke, Sumedh, Stone, Austin, Tan, Clayton, Tran, Huong, Vanhoucke, Vincent, Vega, Steve, Vuong, Quan, Xia, Fei, Xiao, Ted, Xu, Peng, Xu, Sichun, Yu, Tianhe, Zitkovich, Brianna
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing, or speech recognition, it remains to be shown in robotics, where the generalization capabilities of models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies in open-ended, task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of data size, model size, and data diversity, based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer1.github.io
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Brohan, Anthony, Brown, Noah, Carbajal, Justice, Chebotar, Yevgen, Chen, Xi, Choromanski, Krzysztof, Ding, Tianli, Driess, Danny, Dubey, Avinava, Finn, Chelsea, Florence, Pete, Fu, Chuyuan, Arenas, Montse Gonzalez, Gopalakrishnan, Keerthana, Han, Kehang, Hausman, Karol, Herzog, Alexander, Hsu, Jasmine, Ichter, Brian, Irpan, Alex, Joshi, Nikhil, Julian, Ryan, Kalashnikov, Dmitry, Kuang, Yuheng, Leal, Isabel, Lee, Lisa, Lee, Tsang-Wei Edward, Levine, Sergey, Lu, Yao, Michalewski, Henryk, Mordatch, Igor, Pertsch, Karl, Rao, Kanishka, Reymann, Krista, Ryoo, Michael, Salazar, Grecia, Sanketi, Pannag, Sermanet, Pierre, Singh, Jaspiar, Singh, Anikait, Soricut, Radu, Tran, Huong, Vanhoucke, Vincent, Vuong, Quan, Wahid, Ayzaan, Welker, Stefan, Wohlhart, Paul, Wu, Jialin, Xia, Fei, Xiao, Ted, Xu, Peng, Xu, Sichun, Yu, Tianhe, Zitkovich, Brianna
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to this category of models as vision-language-action (VLA) models and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
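The "actions as text tokens" recipe can be illustrated with a simple discretization round trip; the bin count and normalization range below are assumptions for illustration, not RT-2's exact scheme.

```python
# Hedged sketch of representing continuous actions as text tokens so a VLM
# can emit them like ordinary language. Bin count and ranges are assumptions.
import numpy as np

NUM_BINS = 256  # assumed number of discrete bins per action dimension

def action_to_tokens(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map a continuous action vector to a string of integer tokens."""
    clipped = np.clip(action, low, high)
    bins = ((clipped - low) / (high - low) * (num_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)  # e.g. "140 63 242"

def tokens_to_action(token_str, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Invert the mapping: decode a token string back to a continuous action."""
    bins = np.array([int(t) for t in token_str.split()])
    return low + bins / (num_bins - 1) * (high - low)

action = np.array([0.1, -0.5, 0.9])   # e.g. end-effector deltas
tokens = action_to_tokens(action)     # tokens the model is trained to emit
recovered = tokens_to_action(tokens)  # decoded and executed on the robot
```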
PATO: Policy Assisted TeleOperation for Scalable Robot Data Collection
Dass, Shivin, Pertsch, Karl, Zhang, Hejia, Lee, Youngwoon, Lim, Joseph J., Nikolaidis, Stefanos
Large-scale data is an essential component of machine learning as demonstrated in recent advances in natural language processing and computer vision research. However, collecting large-scale robotic data is much slower and more expensive, since each operator can control only a single robot at a time. To make this costly data collection process efficient and scalable, we propose Policy Assisted TeleOperation (PATO), a system which automates part of the demonstration collection process using a learned assistive policy. PATO autonomously executes repetitive behaviors in data collection and asks for human input only when it is uncertain about which subtask or behavior to execute. We conduct teleoperation user studies both with a real robot and a simulated robot fleet and demonstrate that our assisted teleoperation system reduces human operators' mental load while improving data collection efficiency. Further, it enables a single operator to control multiple robots in parallel, which is a first step towards scalable robotic data collection. For code and video results, see https://clvrai.com/pato
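The handoff logic at PATO's core can be sketched with an ensemble-disagreement uncertainty estimate; the threshold and this specific uncertainty proxy are assumptions about one common way to implement such gating, not the paper's exact mechanism.

```python
# Hedged sketch of uncertainty-gated control handoff: the assistive policy
# acts autonomously and yields to the human operator only when uncertain.
# The ensemble-variance proxy and threshold are illustrative assumptions.
import numpy as np

def teleop_step(obs, policies, request_human_input, var_threshold=0.1):
    """Run the assistive policy; fall back to the human when uncertain."""
    # Disagreement across an ensemble of policies as an uncertainty proxy.
    actions = np.stack([policy(obs) for policy in policies])
    if actions.var(axis=0).mean() > var_threshold:
        return request_human_input(obs)  # uncertain: ask the operator
    return actions.mean(axis=0)          # confident: act autonomously
```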