Collaborating Authors

 Fantacci, Claudio


Gemini Robotics: Bringing AI into the Physical World

arXiv.org Artificial Intelligence

Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks, is robust to variations in object types and positions, handles unseen environments, and follows diverse, open-vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities, including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations, and adapting to completely novel robot embodiments. This is made possible because Gemini Robotics builds on top of Gemini Robotics-ER, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics, including object detection, pointing, trajectory and grasp prediction, multi-view correspondence, and 3D bounding box prediction. We show how this novel combination can support a variety of robotics applications. We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realize AI's potential in the physical world.
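As a rough illustration of what "directly controlling robots" with a VLA model involves, the sketch below shows a generic closed-loop control structure: camera images plus a language instruction are mapped to short chunks of robot actions. All names here (`VLAPolicy`, `robot`, and their methods) are hypothetical stand-ins, not the actual Gemini Robotics interface.

```python
# Hypothetical sketch of a VLA perception-to-control loop; the classes and
# methods are illustrative placeholders, not the Gemini Robotics API.
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Action:
    joint_deltas: Sequence[float]  # one delta per controllable joint

class VLAPolicy:
    def predict_actions(self, image, instruction: str) -> List[Action]:
        """Return a short horizon of actions conditioned on vision and language."""
        raise NotImplementedError  # stands in for the learned model

def control_loop(policy: VLAPolicy, robot, instruction: str, steps: int = 100):
    for _ in range(steps):
        image = robot.get_camera_image()                    # latest observation
        for action in policy.predict_actions(image, instruction):
            robot.apply(action)                             # reactive, closed-loop control
```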


GATS: Gather-Attend-Scatter

arXiv.org Artificial Intelligence

As the AI community increasingly adopts large-scale models, it is crucial to develop general and flexible tools to integrate them. We introduce Gather-Attend-Scatter (GATS), a novel module that enables seamless combination of pretrained foundation models, both trainable and frozen, into larger multimodal networks. GATS empowers AI systems to process and generate information across multiple modalities at different rates. In contrast to traditional fine-tuning, GATS allows for the original component models to remain frozen, avoiding the risk of them losing important knowledge acquired during the pretraining phase. We demonstrate the utility and versatility of GATS with a few experiments across games, robotics, and multimodal input-output systems.
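The gather/attend/scatter decomposition lends itself to a compact sketch. The block below is a minimal illustration of the idea under the assumption of a PyTorch setting, not the paper's implementation: token embeddings are gathered from several (frozen) component models, a small trainable attention block mixes them, and the result is scattered back residually to each stream.

```python
# Minimal, illustrative Gather-Attend-Scatter block (assumes PyTorch; not the
# paper's implementation). Only this block would be trained; the component
# models producing the input streams can stay frozen.
from typing import List

import torch
import torch.nn as nn

class GatherAttendScatter(nn.Module):
    def __init__(self, embed_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, streams: List[torch.Tensor]) -> List[torch.Tensor]:
        lengths = [s.shape[1] for s in streams]
        gathered = torch.cat(streams, dim=1)                # gather: (batch, sum_len, dim)
        mixed, _ = self.attn(gathered, gathered, gathered)  # attend across modalities
        parts = torch.split(mixed, lengths, dim=1)          # scatter back per stream
        return [s + p for s, p in zip(streams, parts)]      # residual update

# Example: two modalities with different sequence lengths (e.g. vision and text),
# which in practice may also be updated at different rates.
vision_tokens = torch.randn(2, 16, 256)
text_tokens = torch.randn(2, 8, 256)
new_vision, new_text = GatherAttendScatter()([vision_tokens, text_tokens])
```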


RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

arXiv.org Artificial Intelligence

The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task generalist agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot and through adaptation using only 100-1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities, with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.
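The self-improvement loop mentioned in the abstract can be written schematically as below. The helper callables (`train`, `fine_tune`, `collect_episodes`, `hindsight_label`) and their signatures are hypothetical placeholders, not RoboCat's actual training pipeline.

```python
# Schematic sketch of a self-improvement loop of the kind described above.
# All helpers are hypothetical callables supplied by the caller; this is not
# RoboCat's actual training code.
def self_improvement_loop(train, fine_tune, collect_episodes, hindsight_label,
                          initial_dataset, new_task, n_iterations=2,
                          demos_per_task=500):
    dataset = list(initial_dataset)
    agent = train(dataset)                              # multi-task, multi-embodiment generalist
    for _ in range(n_iterations):
        # 1. Adapt to the new task from a small number of demonstrations.
        specialist = fine_tune(agent, task=new_task, num_demos=demos_per_task)
        # 2. Use the adapted agent to generate fresh experience on that task.
        episodes = collect_episodes(specialist, task=new_task)
        # 3. Relabel the experience (e.g. with hindsight goals) and grow the dataset.
        dataset.extend(hindsight_label(episodes))
        # 4. Retrain the generalist on the enlarged, more diverse dataset.
        agent = train(dataset)
    return agent
```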


$\pi2\text{vec}$: Policy Representations with Successor Features

arXiv.org Artificial Intelligence

Robot time is an important bottleneck in applying reinforcement learning in real life. The lack of sufficient training data has driven progress in sim2real, offline reinforcement learning (offline RL), and data-efficient learning. However, these approaches do not address the data requirements of policy evaluation. Various proxy metrics have been introduced to replace evaluation on the real robotic system. For example, in sim2real we might measure performance in simulation (Lee et al., 2021), while in offline RL we can rely on Off-policy Policy Evaluation (OPE) methods (Precup, 2000; Li et al., 2011; Gulcehre et al., 2020; Fu et al., 2021). As we are usually interested in deploying a policy in the real world, recent works narrowed the problem by focusing on Offline Policy Selection (OPS), where the goal is picking the best-performing policy from offline data. While these methods are useful for determining the coarse relative performance of policies, one still needs time on the real robot for more reliable estimates.
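As a concrete illustration of the Offline Policy Selection setting described here, the sketch below scores each candidate policy with an offline estimate of its return and picks the highest-scoring one. The `ope_estimate` callable is an assumed stand-in for any OPE method; it is not the $\pi2\text{vec}$ approach itself.

```python
# Minimal sketch of Offline Policy Selection (OPS): rank candidate policies by
# an offline estimate of their return and deploy only the best one on the real
# robot. `ope_estimate` is an assumed stand-in for any OPE method.
def select_policy(candidate_policies, offline_dataset, ope_estimate):
    scores = {name: ope_estimate(policy, offline_dataset)
              for name, policy in candidate_policies.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores
```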


Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation

arXiv.org Artificial Intelligence

Recent works have shown that large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems, as well as a variety of robotic manipulation tasks. While prior work on robotic manipulation has predominantly used frozen pretrained features, we demonstrate that in robotics this approach can fail to reach optimal performance, and that fine-tuning of the full model can lead to significantly better results. We introduce lossless adaptation to address this shortcoming of classical fine-tuning. We demonstrate that appropriate placement of our parameter-efficient adapters can significantly reduce the performance gap between frozen pretrained representations and full end-to-end fine-tuning without changes to the original representation, thus preserving the original capabilities of the pretrained model. We perform a comprehensive investigation across three major model architectures (ViTs, NFNets, and ResNets), supervised (ImageNet-1K classification) and self-supervised pretrained weights (CLIP, BYOL, Visual MAE) in 3 task domains and 35 individual tasks, and demonstrate that our claims are strongly validated in various settings. Please see real-world videos at https://sites.google.com/view/robo-adapters.

Pretrained general-purpose vision models, often also referred to as vision foundation models (Yuan et al., 2021), have developed a growing set of perceptual capabilities in recent years. Large-scale vision-language models such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) are examples of these highly capable general-purpose vision models which have enabled many applications for image generation/editing (Ramesh et al., 2022; Saharia et al.) and image-based dialog (Alayrac et al., 2022). Existing self-supervised pretrained visual models, such as SimCLR (Chen et al., 2020), BYOL (Grill et al., 2020) or Visual MAE (He et al., 2022), have also been shown to provide strong initializations for a wide range of visual downstream tasks. How can we unlock the power of these models for increasingly novel and challenging control applications? One solution is to add an output head for each control task and fine-tune the entire architecture. However, fine-tuning degrades performance on the original task(s) the model was trained for, and therefore requires maintaining copies of the model for all tasks we wish to concurrently support. This strategy quickly becomes infeasible as we move towards more general and multi-task agents. For instance, embodied agents acting in the real world will end up solving thousands of downstream manipulation tasks, and given the limited hardware capabilities of robots, keeping separate copies of increasingly large models is impractical. This is further exacerbated for robot manipulation, wherein hardware and tool differences can result in different task configurations which may require different representations.
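The following sketch shows one common form of parameter-efficient adapter in the spirit of the approach described above, assuming PyTorch; it is not the paper's exact module. The residual bottleneck's output projection is zero-initialized, so at initialization the block is an identity mapping and the frozen pretrained representation is reproduced exactly.

```python
# Illustrative residual bottleneck adapter (assumes PyTorch; not the paper's
# exact module). Zero-initializing the up-projection makes the adapter an
# identity at initialization, so the pretrained representation is preserved.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity residual
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))   # parameter-efficient residual update

# Usage: keep the pretrained block frozen and train only the adapter parameters.
frozen_block = nn.Linear(768, 768).requires_grad_(False)
adapter = BottleneckAdapter(768)
features = adapter(frozen_block(torch.randn(4, 768)))
```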


SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration

arXiv.org Artificial Intelligence

The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection, but without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks, and we use a broad set of ablations to highlight the importance of the different components of our method.
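The conceptual split described above (a scheduler sequences existing skills to drive exploration, while the final policy is trained on the raw transitions that exploration produces) can be sketched as follows. Every object here (`env`, `skills`, `scheduler`, `final_policy`, `replay_buffer`) is a hypothetical stand-in rather than the paper's implementation, and the environment interface is simplified.

```python
# Schematic sketch of skill-sequencing exploration with a separately learned
# final policy. All objects are hypothetical stand-ins; the env here returns a
# simplified (obs, reward, done) tuple.
def skill_sequencing_training(env, skills, scheduler, final_policy,
                              replay_buffer, episodes=100, skill_horizon=20):
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            skill_id = scheduler.select(obs)              # pick which skill to execute next
            skill = skills[skill_id]
            for _ in range(skill_horizon):                # temporally-extended execution
                action = skill.act(obs)
                next_obs, reward, done = env.step(action)
                replay_buffer.add(obs, action, reward, next_obs)  # keep the raw experience
                obs = next_obs
                if done:
                    break
            scheduler.update(reward)                      # scheduler learns from task outcomes
        final_policy.train_on(replay_buffer)              # final policy learned from raw data,
                                                          # unconstrained by the skills themselves
    return final_policy
```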