Devin, Coline
Gemini Robotics: Bringing AI into the Physical World
Gemini Robotics Team, null, Abeyruwan, Saminda, Ainslie, Joshua, Alayrac, Jean-Baptiste, Arenas, Montserrat Gonzalez, Armstrong, Travis, Balakrishna, Ashwin, Baruch, Robert, Bauza, Maria, Blokzijl, Michiel, Bohez, Steven, Bousmalis, Konstantinos, Brohan, Anthony, Buschmann, Thomas, Byravan, Arunkumar, Cabi, Serkan, Caluwaerts, Ken, Casarini, Federico, Chang, Oscar, Chen, Jose Enrique, Chen, Xi, Chiang, Hao-Tien Lewis, Choromanski, Krzysztof, D'Ambrosio, David, Dasari, Sudeep, Davchev, Todor, Devin, Coline, Di Palo, Norman, Ding, Tianli, Dostmohamed, Adil, Driess, Danny, Du, Yilun, Dwibedi, Debidatta, Elabd, Michael, Fantacci, Claudio, Fong, Cody, Frey, Erik, Fu, Chuyuan, Giustina, Marissa, Gopalakrishnan, Keerthana, Graesser, Laura, Hasenclever, Leonard, Heess, Nicolas, Hernaez, Brandon, Herzog, Alexander, Hofer, R. Alex, Humplik, Jan, Iscen, Atil, Jacob, Mithun George, Jain, Deepali, Julian, Ryan, Kalashnikov, Dmitry, Karagozler, M. Emre, Karp, Stefani, Kew, Chase, Kirkland, Jerad, Kirmani, Sean, Kuang, Yuheng, Lampe, Thomas, Laurens, Antoine, Leal, Isabel, Lee, Alex X., Lee, Tsang-Wei Edward, Liang, Jacky, Lin, Yixin, Maddineni, Sharath, Majumdar, Anirudha, Michaely, Assaf Hurwitz, Moreno, Robert, Neunert, Michael, Nori, Francesco, Parada, Carolina, Parisotto, Emilio, Pastor, Peter, Pooley, Acorn, Rao, Kanishka, Reymann, Krista, Sadigh, Dorsa, Saliceti, Stefano, Sanketi, Pannag, Sermanet, Pierre, Shah, Dhruv, Sharma, Mohit, Shea, Kathryn, Shu, Charles, Sindhwani, Vikas, Singh, Sumeet, Soricut, Radu, Springenberg, Jost Tobias, Sterneck, Rachel, Surdulescu, Razvan, Tan, Jie, Tompson, Jonathan, Vanhoucke, Vincent, Varley, Jake, Vesom, Grace, Vezzani, Giulia, Vinyals, Oriol, Wahid, Ayzaan, Welker, Stefan, Wohlhart, Paul, Xia, Fei, Xiao, Ted, Xie, Annie, Xie, Jinyu, Xu, Peng, Xu, Sichun, Xu, Ying, Xu, Zhuo, Yang, Yuxiang, Yao, Rui, Yaroshenko, Sergey, Yu, Wenhao, Yuan, Wentao, Zhang, Jingwei, Zhang, Tingnan, Zhou, Allan, Zhou, Yuxiang
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because Gemini Robotics builds on top of the Gemini Robotics-ER model, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics including object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions. We show how this novel combination can support a variety of robotics applications. We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realizes AI's potential in the physical world.
Robot Data Curation with Mutual Information Estimators
Hejna, Joey, Mirchandani, Suvir, Balakrishna, Ashwin, Xie, Annie, Wahid, Ayzaan, Tompson, Jonathan, Sanketi, Pannag, Shah, Dhruv, Devin, Coline, Sadigh, Dorsa
The performance of imitation learning policies often hinges on the datasets with which they are trained. Consequently, investment in data collection for robotics has grown across both industrial and academic labs. However, despite the marked increase in the quantity of demonstrations collected, little work has sought to assess the quality of said data despite mounting evidence of its importance in other areas such as vision and language. In this work, we take a critical step towards addressing the data quality in robotics. Given a dataset of demonstrations, we aim to estimate the relative quality of individual demonstrations in terms of both state diversity and action predictability. To do so, we estimate the average contribution of a trajectory towards the mutual information between states and actions in the entire dataset, which precisely captures both the entropy of the state distribution and the state-conditioned entropy of actions. Though commonly used mutual information estimators require vast amounts of data often beyond the scale available in robotics, we introduce a novel technique based on k-nearest neighbor estimates of mutual information on top of simple VAE embeddings of states and actions. Empirically, we demonstrate that our approach is able to partition demonstration datasets by quality according to human expert scores across a diverse set of benchmarks spanning simulation and real world environments. Moreover, training policies based on data filtered by our method leads to a 5-10% improvement in RoboMimic and better performance on real ALOHA and Franka setups.
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Bousmalis, Konstantinos, Vezzani, Giulia, Rao, Dushyant, Devin, Coline, Lee, Alex X., Bauza, Maria, Davchev, Todor, Zhou, Yuxiang, Gupta, Agrim, Raju, Akhil, Laurens, Antoine, Fantacci, Claudio, Dalibard, Valentin, Zambelli, Martina, Martins, Murilo, Pevceviciute, Rugile, Blokzijl, Michiel, Denil, Misha, Batchelor, Nathan, Lampe, Thomas, Parisotto, Emilio, Żołna, Konrad, Reed, Scott, Colmenarejo, Sergio Gómez, Scholz, Jon, Abdolmaleki, Abbas, Groth, Oliver, Regli, Jean-Baptiste, Sushkov, Oleg, Rothörl, Tom, Chen, José Enrique, Aytar, Yusuf, Barker, Dave, Ortiz, Joy, Riedmiller, Martin, Springenberg, Jost Tobias, Hadsell, Raia, Nori, Francesco, Heess, Nicolas
The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task generalist agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot as well as through adaptation using only 100-1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities, with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Collaboration, Open X-Embodiment, Padalkar, Abhishek, Pooley, Acorn, Mandlekar, Ajay, Jain, Ajinkya, Tung, Albert, Bewley, Alex, Herzog, Alex, Irpan, Alex, Khazatsky, Alexander, Rai, Anant, Singh, Anikait, Garg, Animesh, Brohan, Anthony, Raffin, Antonin, Wahid, Ayzaan, Burgess-Limerick, Ben, Kim, Beomjoon, Schölkopf, Bernhard, Ichter, Brian, Lu, Cewu, Xu, Charles, Finn, Chelsea, Xu, Chenfeng, Chi, Cheng, Huang, Chenguang, Chan, Christine, Pan, Chuer, Fu, Chuyuan, Devin, Coline, Driess, Danny, Pathak, Deepak, Shah, Dhruv, Büchler, Dieter, Kalashnikov, Dmitry, Sadigh, Dorsa, Johns, Edward, Ceola, Federico, Xia, Fei, Stulp, Freek, Zhou, Gaoyue, Sukhatme, Gaurav S., Salhotra, Gautam, Yan, Ge, Schiavi, Giulio, Kahn, Gregory, Su, Hao, Fang, Hao-Shu, Shi, Haochen, Amor, Heni Ben, Christensen, Henrik I, Furuta, Hiroki, Walke, Homer, Fang, Hongjie, Mordatch, Igor, Radosavovic, Ilija, Leal, Isabel, Liang, Jacky, Abou-Chakra, Jad, Kim, Jaehyung, Peters, Jan, Schneider, Jan, Hsu, Jasmine, Bohg, Jeannette, Bingham, Jeffrey, Wu, Jiajun, Wu, Jialin, Luo, Jianlan, Gu, Jiayuan, Tan, Jie, Oh, Jihoon, Malik, Jitendra, Booher, Jonathan, Tompson, Jonathan, Yang, Jonathan, Lim, Joseph J., Silvério, João, Han, Junhyek, Rao, Kanishka, Pertsch, Karl, Hausman, Karol, Go, Keegan, Gopalakrishnan, Keerthana, Goldberg, Ken, Byrne, Kendra, Oslund, Kenneth, Kawaharazuka, Kento, Zhang, Kevin, Rana, Krishan, Srinivasan, Krishnan, Chen, Lawrence Yunliang, Pinto, Lerrel, Fei-Fei, Li, Tan, Liam, Ott, Lionel, Lee, Lisa, Tomizuka, Masayoshi, Spero, Max, Du, Maximilian, Ahn, Michael, Zhang, Mingtong, Ding, Mingyu, Srirama, Mohan Kumar, Sharma, Mohit, Kim, Moo Jin, Kanazawa, Naoaki, Hansen, Nicklas, Heess, Nicolas, Joshi, Nikhil J, Suenderhauf, Niko, Di Palo, Norman, Shafiullah, Nur Muhammad Mahi, Mees, Oier, Kroemer, Oliver, Sanketi, Pannag R, Wohlhart, Paul, Xu, Peng, Sermanet, Pierre, Sundaresan, Priya, Vuong, Quan, Rafailov, Rafael, Tian, Ran, Doshi, Ria, Martín-Martín, Roberto, Mendonca, Russell, Shah, Rutav, Hoque, Ryan, Julian, Ryan, Bustamante, Samuel, Kirmani, Sean, Levine, Sergey, Moore, Sherry, Bahl, Shikhar, Dass, Shivin, Sonawani, Shubham, Song, Shuran, Xu, Sichun, Haldar, Siddhant, Adebola, Simeon, Guist, Simon, Nasiriany, Soroush, Schaal, Stefan, Welker, Stefan, Tian, Stephen, Dasari, Sudeep, Belkhale, Suneel, Osa, Takayuki, Harada, Tatsuya, Matsushima, Tatsuya, Xiao, Ted, Yu, Tianhe, Ding, Tianli, Davchev, Todor, Zhao, Tony Z., Armstrong, Travis, Darrell, Trevor, Jain, Vidhi, Vanhoucke, Vincent, Zhan, Wei, Zhou, Wenxuan, Burgard, Wolfram, Chen, Xi, Wang, Xiaolong, Zhu, Xinghao, Li, Xuanlin, Lu, Yao, Chebotar, Yevgen, Zhou, Yifan, Zhu, Yifeng, Xu, Ying, Wang, Yixuan, Bisk, Yonatan, Cho, Yoonyoung, Lee, Youngwoon, Cui, Yuchen, Wu, Yueh-Hua, Tang, Yujin, Zhu, Yuke, Li, Yunzhu, Iwasawa, Yusuke, Matsuo, Yutaka, Xu, Zhuo, Cui, Zichen Jeff
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website $\href{https://robotics-transformer-x.github.io}{\text{robotics-transformer-x.github.io}}$.
Modularity Improves Out-of-Domain Instruction Following
Corona, Rodolfo, Fried, Daniel, Devin, Coline, Klein, Dan, Darrell, Trevor
We propose a modular architecture for following natural language instructions that describe sequences of diverse subgoals, such as navigating to landmarks or picking up objects. Standard, non-modular, architectures used in instruction following do not exploit subgoal compositionality and often struggle on out-of-distribution tasks and environments. In our approach, subgoal modules each carry out natural language instructions for a specific subgoal type. A sequence of modules to execute is chosen by learning to segment the instructions and predicting a subgoal type for each segment. When compared to standard sequence-to-sequence approaches on ALFRED, a challenging instruction following benchmark, we find that modularization improves generalization to environments unseen in training and to novel tasks.
Compositional Plan Vectors
Devin, Coline, Geng, Daniel, Abbeel, Pieter, Darrell, Trevor, Levine, Sergey
Autonomous agents situated in real-world environments must be able to master large repertoires of skills. While a single short skill can be learned quickly, it would be impractical to learn every task independently. Instead, the agent should share knowledge across behaviors such that each task can be learned efficiently, and such that the resulting model can generalize to new tasks, especially ones that are compositions or subsets of tasks seen previously. A policy conditioned on a goal or demonstration has the potential to share knowledge between tasks if it sees enough diversity of inputs. However, these methods may not generalize to a more complex task at test time.
Grasp2Vec: Learning Object Representations from Self-Supervised Grasping
Jang, Eric, Devin, Coline, Vanhoucke, Vincent, Levine, Sergey
Well structured visual representations can make robot learning faster and can improve generalization. In this paper, we study how we can acquire effective object-centric representations for robotic manipulation tasks without human labeling by using autonomous robot interaction with the environment. Such representation learning methods can benefit from continuous refinement of the representation as the robot collects more experience, allowing them to scale effectively without human intervention. Our representation learning approach is based on object persistence: when a robot removes an object from a scene, the representation of that scene should change according to the features of the object that was removed. We formulate an arithmetic relationship between feature vectors from this observation, and use it to learn a representation of scenes and objects that can then be used to identify object instances, localize them in the scene, and perform goal-directed grasping tasks where the robot must retrieve commanded objects from a bin. The same grasping procedure can also be used to automatically collect training data for our method, by recording images of scenes, grasping and removing an object, and recording the outcome. Our experiments demonstrate that this self-supervised approach for tasked grasping substantially outperforms direct reinforcement learning from images and prior representation learning methods.
Deep Object Centric Policies for Autonomous Driving
Wang, Dequan, Devin, Coline, Cai, Qi-Zhi, Yu, Fisher, Darrell, Trevor
Abstract-- While learning visuomotor skills in an end-toend manner is appealing, deep neural networks are often uninterpretable and fail in surprising ways. For robotics tasks, such as autonomous driving, models that explicitly represent objects may be more robust to new scenes and provide intuitive visualizations. We describe a taxonomy of "object-centric" models which leverage both object instances and end-to-end learning. In the Grand Theft Auto V simulator, we show that object centric models outperform object-agnostic methods in scenes with other vehicles and pedestrians, even with an imperfect detector. We also demonstrate that our architectures perform well on real world environments by evaluating on the Berkeley DeepDrive Video dataset. I. INTRODUCTION End-to-end approaches to visuomotor learning are appealing in their ability to discover which features of an observed environment are most relevant for a task, and to be able to exploit large amounts of training data to discover both a policy and a codependent visual representation. Yet, the key benefit of such approaches--that they learn from task experience--is also their Achilles heel when it comes to many real-world settings, where behavioral training data is not unlimited and correct perception of "long-tail" visual phenomena can be critical for robust performance.