Arnaud, Sergio
Unifying 2D and 3D Vision-Language Understanding
Jain, Ayush, Swerdlow, Alexander, Wang, Yuzhou, Arnaud, Sergio, Martin, Ada, Sax, Alexander, Meier, Franziska, Fragkiadaki, Katerina
Progress in 3D vision-language learning has been hindered by the scarcity of large-scale 3D datasets. We introduce UniVLG, a unified architecture for 2D and 3D vision-language understanding that bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems. Our approach initializes most model weights from pre-trained 2D models and trains on both 2D and 3D vision-language data. We propose a novel language-conditioned mask decoder shared across 2D and 3D modalities to ground objects effectively in both RGB and RGB-D images, outperforming box-based approaches. To further reduce the domain gap between 2D and 3D, we incorporate 2D-to-3D lifting strategies, enabling UniVLG to utilize 2D data to enhance 3D performance. With these innovations, our model achieves state-of-the-art performance across multiple 3D vision-language grounding tasks, demonstrating the potential of transferring advances from 2D vision-language learning to the data-constrained 3D domain. Furthermore, co-training on both 2D and 3D data enhances performance across modalities without sacrificing 2D capabilities. By removing the reliance on 3D mesh reconstruction and ground-truth object proposals, UniVLG sets a new standard for realistic, embodied-aligned evaluation. Code and additional visualizations are available at https://univlg.github.io.
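To make the 2D-to-3D lifting idea concrete, here is a minimal sketch (not UniVLG's implementation; the function name and tensor layout are illustrative) of one common lifting strategy: unprojecting an RGB-D frame into a camera-frame point cloud using depth and camera intrinsics, the geometric step that lets pixels or features from a 2D backbone be placed in 3D space.

```python
import torch

def lift_rgbd_to_pointcloud(depth: torch.Tensor, intrinsics: torch.Tensor) -> torch.Tensor:
    """Unproject an RGB-D frame into 3D.

    depth: (H, W) metric depth map.
    intrinsics: (3, 3) camera matrix K.
    Returns: (H*W, 3) points in the camera frame.
    """
    H, W = depth.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # Pixel coordinate grid (v = row index, u = column index).
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx   # back-project along the x axis
    y = (v.reshape(-1) - cy) * z / fy   # back-project along the y axis
    return torch.stack([x, y, z], dim=-1)
```

The same unprojection applied to per-pixel feature maps (rather than raw pixels) is what allows a model trained largely on 2D data to reason over 3D scenes.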
ASC: Adaptive Skill Coordination for Robotic Mobile Manipulation
Yokoyama, Naoki, Clegg, Alex, Truong, Joanne, Undersander, Eric, Yang, Tsung-Yen, Arnaud, Sergio, Ha, Sehoon, Batra, Dhruv, Rai, Akshara
We present Adaptive Skill Coordination (ASC) -- an approach for accomplishing long-horizon tasks like mobile pick-and-place (i.e., navigating to an object, picking it, navigating to another location, and placing it). ASC consists of three components -- (1) a library of basic visuomotor skills (navigation, pick, place), (2) a skill coordination policy that chooses which skill to use when, and (3) a corrective policy that adapts pre-trained skills in out-of-distribution states. All components of ASC rely only on onboard visual and proprioceptive sensing, without requiring detailed maps with obstacle layouts or precise object locations, easing real-world deployment. We train ASC in simulated indoor environments, and deploy it zero-shot (without any real-world experience or fine-tuning) on the Boston Dynamics Spot robot in eight novel real-world environments (one apartment, one lab, two microkitchens, two lounges, one office space, one outdoor courtyard). In rigorous quantitative comparisons in two environments, ASC achieves near-perfect performance (59/60 episodes, or 98%), while sequentially executing skills succeeds in only 44/60 (73%) episodes. Extensive perturbation experiments show that ASC is robust to hand-off errors, changes in the environment layout, dynamic obstacles (e.g., people), and unexpected disturbances. Supplementary videos at adaptiveskillcoordination.github.io.
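For a concrete picture of how the three components fit together, the following is an illustrative Python sketch (my own simplification, not the released ASC controller; all class and attribute names are hypothetical) of the control loop: a coordination policy selects a skill from onboard observations, the selected visuomotor skill proposes an action, and a corrective policy may adjust that action in out-of-distribution states.

```python
from typing import Any, Callable, Dict

class ASCController:
    """Sketch of the ASC structure: skill library + coordination policy + corrective policy."""

    def __init__(self,
                 skills: Dict[str, Callable[[Any], Any]],
                 coordinator: Callable[[Any], str],
                 corrector: Callable[[Any, Any], Any]):
        self.skills = skills            # e.g. {"navigate": ..., "pick": ..., "place": ...}
        self.coordinator = coordinator  # maps an observation to the name of the skill to run
        self.corrector = corrector      # maps (observation, action) to a corrected action

    def act(self, obs: Any) -> Any:
        skill_name = self.coordinator(obs)      # choose which skill to use now
        action = self.skills[skill_name](obs)   # skill acts from visual + proprioceptive input
        return self.corrector(obs, action)      # adapt the pre-trained skill in OOD states
```

Note that every component consumes only onboard observations, which is what allows the abstract's claim of deployment without maps or precise object locations.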
What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?
Silwal, Sneha, Yadav, Karmesh, Wu, Tingfan, Vakil, Jay, Majumdar, Arjun, Arnaud, Sergio, Chen, Claire, Berges, Vincent-Pierre, Batra, Dhruv, Rajeswaran, Aravind, Kalakrishnan, Mrinal, Meier, Franziska, Maksymets, Oleksandr
We present a large empirical investigation on the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study spans five different PVRs, two policy-learning paradigms (imitation and reinforcement learning), and three different robots across five distinct manipulation and indoor navigation tasks. From this effort, we arrive at three insights: 1) the performance trends of PVRs in simulation are generally indicative of their trends in the real world, 2) the use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits from variations in PVRs, primarily data augmentation and fine-tuning, also transfer to real-world performance. See the project website for additional details and visuals.
Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
Majumdar, Arjun, Yadav, Karmesh, Arnaud, Sergio, Ma, Yecheng Jason, Chen, Claire, Silwal, Sneha, Jain, Aryan, Berges, Vincent-Pierre, Abbeel, Pieter, Malik, Jitendra, Batra, Dhruv, Lin, Yixin, Maksymets, Oleksandr, Rajeswaran, Aravind, Meier, Franziska
We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous manipulation, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none is universally dominant. To study the effect of pre-training data scale and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) with ImageNet and train vision transformers of different sizes using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Finally, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving performance competitive with or superior to the best known results on all benchmarks in CortexBench. These models required over 10,000 GPU-hours to train and are available on our website for the benefit of the research community.
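As a concrete illustration of the Masked Auto-Encoding recipe used for this kind of pre-training (a minimal sketch, not the VC-1 training code; it assumes ViT patch embeddings have already been computed), the core step is to randomly drop a large fraction of patch tokens before encoding and later reconstruct the masked content.

```python
import torch

def random_masking(patch_tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking.

    patch_tokens: (B, N, D) patch embeddings of an image.
    Returns the visible tokens, a binary mask (1 = masked, 0 = visible),
    and the indices needed to restore the original patch ordering.
    """
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patch_tokens.device)  # random score per patch
    ids_shuffle = noise.argsort(dim=1)                     # lowest scores are kept
    ids_restore = ids_shuffle.argsort(dim=1)

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=patch_tokens.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```

The encoder processes only the visible tokens, a lightweight decoder reconstructs the masked patches, and the training loss is typically mean-squared error computed on the masked positions only.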