Hoque, Ryan
ARMADA: Augmented Reality for Robot Manipulation and Robot-Free Data Acquisition
Nechyporenko, Nataliya, Hoque, Ryan, Webb, Christopher, Sivapurapu, Mouli, Zhang, Jian
Teleoperation for robot imitation learning is bottlenecked by hardware availability. Can high-quality robot data be collected without a physical robot? We present a system for augmenting Apple Vision Pro with real-time virtual robot feedback. By providing users with an intuitive understanding of how their actions translate to robot motions, we enable the collection of natural barehanded human data that is compatible with the limitations of physical robot hardware. We conducted a user study with 15 participants demonstrating 3 different tasks each under 3 different feedback conditions and directly replayed the collected trajectories on physical robot hardware. Results suggest that live robot feedback dramatically improves the quality of the collected data, pointing to a new avenue for scalable human data collection without access to robot hardware. Videos and more are available at https://nataliya.dev/armada.
IntervenGen: Interventional Data Generation for Robust and Data-Efficient Robot Imitation Learning
Hoque, Ryan, Mandlekar, Ajay, Garrett, Caelan, Goldberg, Ken, Fox, Dieter
Imitation learning is a promising paradigm for training robot control policies, but these policies can suffer from distribution shift, where the conditions at evaluation time differ from those in the training data. A popular approach for increasing policy robustness to distribution shift is interactive imitation learning (i.e., DAgger and variants), where a human operator provides corrective interventions during policy rollouts. However, collecting a sufficient amount of interventions to cover the distribution of policy mistakes can be burdensome for human operators. We propose IntervenGen (I-Gen), a novel data generation system that can autonomously produce a large set of corrective interventions with rich coverage of the state space from a small number of human interventions. We apply I-Gen to 4 simulated environments and 1 physical environment with object pose estimation error and show that it can increase policy robustness by up to 39x with only 10 human interventions. Videos and more results are available at https://sites.google.com/view/intervengen2024.
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Collaboration, Open X-Embodiment, Padalkar, Abhishek, Pooley, Acorn, Mandlekar, Ajay, Jain, Ajinkya, Tung, Albert, Bewley, Alex, Herzog, Alex, Irpan, Alex, Khazatsky, Alexander, Rai, Anant, Singh, Anikait, Garg, Animesh, Brohan, Anthony, Raffin, Antonin, Wahid, Ayzaan, Burgess-Limerick, Ben, Kim, Beomjoon, Schölkopf, Bernhard, Ichter, Brian, Lu, Cewu, Xu, Charles, Finn, Chelsea, Xu, Chenfeng, Chi, Cheng, Huang, Chenguang, Chan, Christine, Pan, Chuer, Fu, Chuyuan, Devin, Coline, Driess, Danny, Pathak, Deepak, Shah, Dhruv, Büchler, Dieter, Kalashnikov, Dmitry, Sadigh, Dorsa, Johns, Edward, Ceola, Federico, Xia, Fei, Stulp, Freek, Zhou, Gaoyue, Sukhatme, Gaurav S., Salhotra, Gautam, Yan, Ge, Schiavi, Giulio, Kahn, Gregory, Su, Hao, Fang, Hao-Shu, Shi, Haochen, Amor, Heni Ben, Christensen, Henrik I, Furuta, Hiroki, Walke, Homer, Fang, Hongjie, Mordatch, Igor, Radosavovic, Ilija, Leal, Isabel, Liang, Jacky, Abou-Chakra, Jad, Kim, Jaehyung, Peters, Jan, Schneider, Jan, Hsu, Jasmine, Bohg, Jeannette, Bingham, Jeffrey, Wu, Jiajun, Wu, Jialin, Luo, Jianlan, Gu, Jiayuan, Tan, Jie, Oh, Jihoon, Malik, Jitendra, Booher, Jonathan, Tompson, Jonathan, Yang, Jonathan, Lim, Joseph J., Silvério, João, Han, Junhyek, Rao, Kanishka, Pertsch, Karl, Hausman, Karol, Go, Keegan, Gopalakrishnan, Keerthana, Goldberg, Ken, Byrne, Kendra, Oslund, Kenneth, Kawaharazuka, Kento, Zhang, Kevin, Rana, Krishan, Srinivasan, Krishnan, Chen, Lawrence Yunliang, Pinto, Lerrel, Fei-Fei, Li, Tan, Liam, Ott, Lionel, Lee, Lisa, Tomizuka, Masayoshi, Spero, Max, Du, Maximilian, Ahn, Michael, Zhang, Mingtong, Ding, Mingyu, Srirama, Mohan Kumar, Sharma, Mohit, Kim, Moo Jin, Kanazawa, Naoaki, Hansen, Nicklas, Heess, Nicolas, Joshi, Nikhil J, Suenderhauf, Niko, Di Palo, Norman, Shafiullah, Nur Muhammad Mahi, Mees, Oier, Kroemer, Oliver, Sanketi, Pannag R, Wohlhart, Paul, Xu, Peng, Sermanet, Pierre, Sundaresan, Priya, Vuong, Quan, Rafailov, Rafael, Tian, Ran, Doshi, Ria, Martín-Martín, Roberto, Mendonca, Russell, Shah, Rutav, Hoque, Ryan, Julian, Ryan, Bustamante, Samuel, Kirmani, Sean, Levine, Sergey, Moore, Sherry, Bahl, Shikhar, Dass, Shivin, Sonawani, Shubham, Song, Shuran, Xu, Sichun, Haldar, Siddhant, Adebola, Simeon, Guist, Simon, Nasiriany, Soroush, Schaal, Stefan, Welker, Stefan, Tian, Stephen, Dasari, Sudeep, Belkhale, Suneel, Osa, Takayuki, Harada, Tatsuya, Matsushima, Tatsuya, Xiao, Ted, Yu, Tianhe, Ding, Tianli, Davchev, Todor, Zhao, Tony Z., Armstrong, Travis, Darrell, Trevor, Jain, Vidhi, Vanhoucke, Vincent, Zhan, Wei, Zhou, Wenxuan, Burgard, Wolfram, Chen, Xi, Wang, Xiaolong, Zhu, Xinghao, Li, Xuanlin, Lu, Yao, Chebotar, Yevgen, Zhou, Yifan, Zhu, Yifeng, Xu, Ying, Wang, Yixuan, Bisk, Yonatan, Cho, Yoonyoung, Lee, Youngwoon, Cui, Yuchen, Wu, Yueh-Hua, Tang, Yujin, Zhu, Yuke, Li, Yunzhu, Iwasawa, Yusuke, Matsuo, Yutaka, Xu, Zhuo, Cui, Zichen Jeff
Large, high-capacity models trained on diverse datasets have shown remarkable successes in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.
Semantic Mechanical Search with Large Vision and Language Models
Sharma, Satvik, Huang, Huang, Shivakumar, Kaushik, Chen, Lawrence Yunliang, Hoque, Ryan, Ichter, Brian, Goldberg, Ken
Moving objects to find a fully-occluded target object, known as mechanical search, is a challenging problem in robotics. As objects are often organized semantically, we conjecture that semantic information about object relationships can facilitate mechanical search and reduce search time. Large pretrained vision and language models (VLMs and LLMs) have shown promise in generalizing to uncommon objects and previously unseen real-world environments. In this work, we propose a novel framework called Semantic Mechanical Search (SMS). SMS conducts scene understanding and generates a semantic occupancy distribution explicitly using LLMs. Compared to methods that rely on visual similarities offered by CLIP embeddings, SMS leverages the deep reasoning capabilities of LLMs. Unlike prior work that uses VLMs and LLMs as end-to-end planners, which may not integrate well with specialized geometric planners, SMS can serve as a plug-in semantic module for downstream manipulation or navigation policies. For mechanical search in closed-world settings such as shelves, we compare with a geometric-based planner and show that SMS improves mechanical search performance by 24% across the pharmacy, kitchen, and office domains in simulation and 47.1% in physical experiments. For open-world real environments, SMS can produce better semantic distributions compared to CLIP-based methods, with the potential to be integrated with downstream navigation policies to improve object navigation tasks. Code, data, videos, and the appendix are available: https://sites.google.com/view/semantic-mechanical-search
IIFL: Implicit Interactive Fleet Learning from Heterogeneous Human Supervisors
Datta, Gaurav, Hoque, Ryan, Gu, Anrui, Solowjow, Eugen, Goldberg, Ken
Imitation learning has been applied to a range of robotic tasks, but can struggle when robots encounter edge cases that are not represented in the training data (i.e., distribution shift). Interactive fleet learning (IFL) mitigates distribution shift by allowing robots to access remote human supervisors during task execution and learn from them over time, but different supervisors may demonstrate the task in different ways. Recent work proposes Implicit Behavior Cloning (IBC), which is able to represent multimodal demonstrations using energy-based models (EBMs). In this work, we propose Implicit Interactive Fleet Learning (IIFL), an algorithm that builds on IBC for interactive imitation learning from multiple heterogeneous human supervisors. A key insight in IIFL is a novel approach for uncertainty quantification in EBMs using Jeffreys divergence. While IIFL is more computationally expensive than explicit methods, results suggest that IIFL achieves a 2.8x higher success rate in simulation experiments and a 4.5x higher return on human effort in a physical block pushing task over (Explicit) IFL, IBC, and other baselines.
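For readers unfamiliar with the Jeffreys divergence mentioned above, the following minimal sketch shows only its textbook definition, the symmetrized KL divergence KL(p || q) + KL(q || p), on small discrete distributions. It is not the paper's estimator for energy-based models; the function names and the list-based representation are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever
    p_i is positive, otherwise the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def jeffreys_divergence(p, q):
    """Symmetrized KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


if __name__ == "__main__":
    agree = jeffreys_divergence([0.5, 0.5], [0.5, 0.5])
    disagree = jeffreys_divergence([0.9, 0.1], [0.1, 0.9])
    print(agree, disagree)
```

Unlike the plain KL divergence, the result does not depend on the order of the arguments, so it can be read as a symmetric measure of disagreement between two distributions.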
Self-Supervised Visuo-Tactile Pretraining to Locate and Follow Garment Features
Kerr, Justin, Huang, Huang, Wilcox, Albert, Hoque, Ryan, Ichnowski, Jeffrey, Calandra, Roberto, Goldberg, Ken
Humans make extensive use of vision and touch as complementary senses, with vision providing global information about the scene and touch measuring local information during manipulation without suffering from occlusions. While prior work demonstrates the efficacy of tactile sensing for precise manipulation of deformables, it typically relies on supervised, human-labeled datasets. We propose Self-Supervised Visuo-Tactile Pretraining (SSVTP), a framework for learning multi-task visuo-tactile representations in a self-supervised manner through cross-modal supervision. We design a mechanism that enables a robot to autonomously collect precisely spatially-aligned visual and tactile image pairs, then train visual and tactile encoders to embed these pairs into a shared latent space using cross-modal contrastive loss. We apply this latent space to downstream perception and control of deformable garments on flat surfaces, and evaluate the flexibility of the learned representations without fine-tuning on 5 tasks: feature classification, contact localization, anomaly detection, feature search from a visual query (e.g., garment feature localization under occlusion), and edge following along cloth edges. The pretrained representations achieve a 73-100% success rate on these 5 tasks.
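The cross-modal contrastive objective described above can be illustrated with a small sketch. The abstract does not specify the exact loss, so this assumes a symmetric InfoNCE-style formulation over cosine similarities; the function names, the temperature value, and the use of plain Python lists in place of encoder outputs are all hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def cross_modal_contrastive_loss(visual, tactile, temperature=0.1):
    """Symmetric InfoNCE-style loss over paired embeddings.

    visual[i] and tactile[i] are assumed to come from the same
    spatially aligned image pair; all other pairings act as negatives.
    """
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in tactile]
    # Cosine similarity between every visual and every tactile embedding.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    # Transpose to score tactile-to-visual retrieval as well.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    aligned = cross_modal_contrastive_loss(visual, [[1.0, 0.0], [0.0, 1.0]])
    shuffled = cross_modal_contrastive_loss(visual, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

Minimizing a loss of this kind pulls matched visual and tactile embeddings together in the shared latent space and pushes mismatched ones apart.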
FogROS2-SGC: A ROS2 Cloud Robotics Platform for Secure Global Connectivity
Chen, Kaiyuan, Hoque, Ryan, Dharmarajan, Karthik, LLontop, Edith, Adebola, Simeon, Ichnowski, Jeffrey, Kubiatowicz, John, Goldberg, Ken
The Robot Operating System (ROS2) is the most widely used software platform for building robotics applications. FogROS2 extends ROS2 to allow robots to access cloud computing on demand. However, ROS2 and FogROS2 assume that all robots are locally connected and that each robot has full access and control of the other robots. With applications like distributed multi-robot systems, remote robot control, and mobile robots, robotics increasingly involves the global Internet and complex trust management. Existing approaches for connecting disjoint ROS2 networks lack key features such as security, compatibility, efficiency, and ease of use. We introduce FogROS2-SGC, an extension of FogROS2 that can effectively connect robot systems across different physical locations, networks, and Data Distribution Services (DDS). With globally unique and location-independent identifiers, FogROS2-SGC securely and efficiently routes data between robotics components around the globe. FogROS2-SGC is agnostic to the ROS2 distribution and configuration, is compatible with non-ROS2 software, and seamlessly extends existing ROS2 applications without any code modification. Experiments suggest FogROS2-SGC is 19x faster than rosbridge (a ROS2 package with comparable features, but lacking security). We also apply FogROS2-SGC to 4 robots and compute nodes that are 3600km apart. Videos and code are available on the project website https://sites.google.com/view/fogros2-sgc.
Fleet-DAgger: Interactive Robot Fleet Learning with Scalable Human Supervision
Hoque, Ryan, Chen, Lawrence Yunliang, Sharma, Satvik, Dharmarajan, Karthik, Thananjeyan, Brijen, Abbeel, Pieter, Goldberg, Ken
Amazon, Nimble, Plus One, Waymo, and Zoox use remote human supervision of robot fleets in applications ranging from self-driving taxis to automated warehouse fulfillment [1, 2, 3, 4, 5]. These robots intermittently cede control during task execution to remote human supervisors for corrective interventions. The interventions take place either during learning, when they are used to improve the robot policy, or during execution, when the policy is no longer updated but robots can still request human assistance when needed to improve reliability. In the continual learning setting, these occur simultaneously: the robot policy has been deployed but continues to be updated indefinitely with additional intervention data. Furthermore, any individual robot can share its intervention data with the rest of the fleet. As opposed to robot swarms that must coordinate with each other to achieve a common objective, a robot fleet is a set of independent robots simultaneously executing the same control policy in parallel environments. We refer to the setting of a robot fleet learning via interactive requests for human supervision (see Figure 1) as Interactive Fleet Learning (IFL). Of central importance in IFL is the supervisor allocation problem: how should limited human supervision be allocated to robots in a manner that maximizes the throughput of the fleet?
LazyDAgger: Reducing Context Switching in Interactive Imitation Learning
Hoque, Ryan, Balakrishna, Ashwin, Putterman, Carl, Luo, Michael, Brown, Daniel S., Seita, Daniel, Thananjeyan, Brijen, Novoseller, Ellen, Goldberg, Ken
Corrective interventions while a robot is learning to automate a task provide an intuitive method for a human supervisor to assist the robot and convey information about desired behavior. However, these interventions can impose significant burden on a human supervisor, as each intervention interrupts other work the human is doing, incurs latency with each context switch between supervisor and autonomous control, and requires time to perform. We present LazyDAgger, which extends the interactive imitation learning (IL) algorithm SafeDAgger to reduce context switches between supervisor and autonomous control. We find that LazyDAgger improves the performance and robustness of the learned policy during both learning and execution while limiting burden on the supervisor. Simulation experiments suggest that LazyDAgger can reduce context switches by an average of 60% over SafeDAgger on 3 continuous control tasks while maintaining state-of-the-art policy performance. In physical fabric manipulation experiments with an ABB YuMi robot, LazyDAgger reduces context switches by 60% while achieving a 60% higher success rate than SafeDAgger at execution time.
VisuoSpatial Foresight for Physical Sequential Fabric Manipulation
Hoque, Ryan, Seita, Daniel, Balakrishna, Ashwin, Ganapathi, Aditya, Tanwani, Ajay Kumar, Jamali, Nawid, Yamane, Katsu, Iba, Soshi, Goldberg, Ken
Robotic fabric manipulation has applications in home robotics, textiles, senior care and surgery. Existing fabric manipulation techniques, however, are designed for specific tasks, making it difficult to generalize across different but related tasks. We build upon the Visual Foresight framework to learn fabric dynamics that can be efficiently reused to accomplish different sequential fabric manipulation tasks with a single goal-conditioned policy. We extend our earlier work on VisuoSpatial Foresight (VSF), which learns visual dynamics on domain randomized RGB images and depth maps simultaneously and completely in simulation. In this earlier work, we evaluated VSF on multi-step fabric smoothing and folding tasks against 5 baseline methods in simulation and on the da Vinci Research Kit (dVRK) surgical robot without any demonstrations at train or test time. A key finding was that depth sensing significantly improves performance: RGBD data yields an 80% improvement in fabric folding success rate in simulation over pure RGB data. In this work, we vary 4 components of VSF, including data generation, the choice of visual dynamics model, cost function, and optimization procedure. Results suggest that training visual dynamics models using longer, corner-based actions can improve the efficiency of fabric folding by 76% and enable a physical sequential fabric folding task that VSF could not previously perform with 90% reliability. Code, data, videos, and supplementary material are available at https://sites.google.com/view/fabric-vsf/.