Puig, Xavier
RobotMover: Learning to Move Large Objects by Imitating the Dynamic Chain
Li, Tianyu, Truong, Joanne, Yang, Jimmy, Clegg, Alexander, Rai, Akshara, Ha, Sehoon, Puig, Xavier
Moving large objects, such as furniture, is a critical capability for robots operating in human environments. This task presents significant challenges due to two key factors: the need to synchronize whole-body movements to prevent collisions between the robot and the object, and the under-actuated dynamics arising from the substantial size and weight of the objects. These challenges also make such tasks difficult to perform via teleoperation. In this work, we introduce RobotMover, a generalizable learning framework that leverages human-object interaction demonstrations to enable robots to perform large-object manipulation tasks. Central to our approach is the Dynamic Chain, a novel representation that abstracts human-object interactions so that they can be retargeted to robotic morphologies. The Dynamic Chain is a spatial descriptor connecting the human and object root positions via a chain of nodes, each encoding the position and velocity of an interaction keypoint. We train policies in simulation using Dynamic-Chain-based imitation rewards and domain randomization, enabling zero-shot transfer to real-world settings without fine-tuning. Our approach outperforms both learning-based methods and teleoperation baselines across six evaluation metrics when tested on three distinct object types, both in simulation and on physical hardware. Furthermore, we successfully apply the learned policies to real-world tasks, such as moving a trash cart and rearranging chairs.
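The Dynamic Chain is described above only in prose; the sketch below shows one way such a descriptor and its imitation reward could be structured in code. The node layout, the flattening into a feature vector, and the exponentiated tracking-error reward shape are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class ChainNode:
    position: np.ndarray  # (3,) world-frame position of an interaction keypoint
    velocity: np.ndarray  # (3,) linear velocity of that keypoint

@dataclass
class DynamicChain:
    """Ordered nodes linking the agent root, interaction keypoints, and object root."""
    nodes: List[ChainNode]

    def as_vector(self) -> np.ndarray:
        # Flatten the chain into a single feature vector for reward computation.
        return np.concatenate([np.concatenate([n.position, n.velocity]) for n in self.nodes])

def imitation_reward(demo: DynamicChain, robot: DynamicChain, sigma: float = 0.5) -> float:
    """Exponentiated tracking error between the demonstrated and the robot's chain
    (a common imitation-reward shape; the exact reward terms are an assumption)."""
    err = np.linalg.norm(demo.as_vector() - robot.as_vector())
    return float(np.exp(-(err ** 2) / sigma ** 2))
```

Because the chain abstracts away limb-level detail, the same demonstrated chain can in principle be tracked by morphologies with different kinematics, which is what makes retargeting from human demonstrations plausible.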
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks
Chang, Matthew, Chhablani, Gunjan, Clegg, Alexander, Cote, Mikael Dallaire, Desai, Ruta, Hlavac, Michal, Karashchuk, Vladimir, Krantz, Jacob, Mottaghi, Roozbeh, Parashar, Priyam, Patki, Siddharth, Prasad, Ishita, Puig, Xavier, Rai, Akshara, Ramrakhya, Ram, Tran, Daniel, Truong, Joanne, Turner, John M., Undersander, Eric, Yang, Tsung-Yen
We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We employ a semi-automated task generation pipeline using Large Language Models (LLMs), incorporating simulation in the loop for grounding and verification. PARTNR stands as the largest benchmark of its kind, comprising 100,000 natural language tasks, spanning 60 houses and 5,819 unique objects. We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution. The analysis reveals significant limitations in SoTA models, such as poor coordination and failures in task tracking and recovery from errors. When LLMs are paired with real humans, they require 1.5x as many steps as two humans collaborating and 1.1x more steps than a single human, underscoring the potential for improvement in these models. We further show that fine-tuning smaller LLMs with planning data can achieve performance on par with models 9 times larger, while being 8.6x faster at inference. Overall, PARTNR highlights significant challenges facing collaborative embodied agents and aims to drive research in this direction.
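The semi-automated generation pipeline lends itself to a short sketch. The interfaces below (`propose`, `objects_in_scene`, `goal_reachable`) are hypothetical stand-ins for the LLM proposal and simulation-in-the-loop verification steps, not PARTNR's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TaskCandidate:
    instruction: str                 # natural-language task, e.g. "Put the mugs in the sink"
    referenced_objects: List[str]    # objects the instruction mentions

def generate_verified_tasks(
    propose: Callable[[str], TaskCandidate],          # LLM proposal step (assumed interface)
    objects_in_scene: Callable[[str], List[str]],     # simulator grounding query (assumed)
    goal_reachable: Callable[[TaskCandidate], bool],  # simulator verification (assumed)
    scenes: List[str],
    n_tasks: int,
    max_attempts: int = 10_000,
) -> List[TaskCandidate]:
    """Keep only LLM-proposed tasks that ground and verify in simulation."""
    tasks: List[TaskCandidate] = []
    for attempt in range(max_attempts):
        if len(tasks) >= n_tasks:
            break
        scene = scenes[attempt % len(scenes)]
        cand = propose(scene)
        # Simulation in the loop: reject tasks that mention objects absent from
        # the scene or whose goal cannot actually be satisfied in it.
        if set(cand.referenced_objects) <= set(objects_in_scene(scene)) and goal_reachable(cand):
            tasks.append(cand)
    return tasks
```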
Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots
Puig, Xavier, Undersander, Eric, Szot, Andrew, Cote, Mikael Dallaire, Yang, Tsung-Yen, Partsey, Ruslan, Desai, Ruta, Clegg, Alexander William, Hlavac, Michal, Min, So Yeon, Vondruš, Vladimír, Gervet, Theophile, Berges, Vincent-Pierre, Turner, John M., Maksymets, Oleksandr, Kira, Zsolt, Kalakrishnan, Mrinal, Malik, Jitendra, Chaplot, Devendra Singh, Jain, Unnat, Batra, Dhruv, Rai, Akshara, Mottaghi, Roozbeh
We present Habitat 3.0: a simulation platform for studying collaborative human-robot tasks in home environments. Habitat 3.0 offers contributions across three dimensions: (1) Accurate humanoid simulation: addressing challenges in modeling complex deformable bodies and diversity in appearance and motion, all while ensuring high simulation speed. (2) Human-in-the-loop infrastructure: enabling real human interaction with simulated robots via mouse/keyboard or a VR interface, facilitating evaluation of robot policies with human input. (3) Collaborative tasks: studying two collaborative tasks, Social Navigation and Social Rearrangement. Social Navigation investigates a robot's ability to locate and follow humanoid avatars in unseen environments, whereas Social Rearrangement addresses collaboration between a humanoid and a robot while rearranging a scene. These contributions allow us to study end-to-end learned and heuristic baselines for human-robot collaboration in depth, as well as to evaluate them with humans in the loop. Our experiments demonstrate that learned robot policies lead to efficient task completion when collaborating with unseen humanoid agents and human partners who may exhibit behaviors the robot has not encountered before. Additionally, we observe emergent behaviors during collaborative task execution, such as the robot yielding space when obstructing a humanoid agent, thereby allowing the humanoid to complete the task effectively. Furthermore, our experiments using the human-in-the-loop tool demonstrate that automated evaluation with humanoids can indicate the relative ordering of different policies when evaluated with real human collaborators. Habitat 3.0 unlocks interesting new features in simulators for Embodied AI, and we hope it paves the way for a new frontier of embodied human-AI interaction capabilities.
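A rough sketch of the kind of two-agent evaluation loop Habitat 3.0 enables follows. This is not the habitat-lab API; the environment hooks, policy signatures, and metric dictionary are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, Tuple

def run_social_episode(
    reset: Callable[[], Dict[str, Any]],                                  # env reset -> per-agent observations
    step: Callable[[Dict[str, Any]], Tuple[Dict[str, Any], bool, Dict]],  # env step(actions) -> (obs, done, info)
    robot_policy: Callable[[Any], Any],
    humanoid_policy: Callable[[Any], Any],  # scripted avatar, or a real human via the HITL interface
    max_steps: int = 500,
) -> Dict:
    """Roll out one collaborative episode and return its metrics."""
    obs = reset()
    info: Dict = {}
    for _ in range(max_steps):
        # Both agents act in the same simulated scene at every step.
        actions = {"robot": robot_policy(obs["robot"]),
                   "humanoid": humanoid_policy(obs["humanoid"])}
        obs, done, info = step(actions)
        if done:
            break
    return info  # e.g. success and efficiency metrics for the pair
```

The same loop supports both automated evaluation (a scripted humanoid policy) and human-in-the-loop evaluation (a human driving the humanoid), which is what lets simulated rankings be checked against real collaborators.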
NOPA: Neurally-guided Online Probabilistic Assistance for Building Socially Intelligent Home Assistants
Puig, Xavier, Shu, Tianmin, Tenenbaum, Joshua B., Torralba, Antonio
In this work, we study how to build socially intelligent robots to assist people in their homes. In particular, we focus on assistance with online goal inference, where robots must simultaneously infer humans' goals and how to help them achieve those goals. Prior assistance methods either lack the adaptivity to adjust helping strategies (i.e., when and how to help) in response to uncertainty about goals or the scalability to conduct fast inference in a large goal space. Our NOPA (Neurally-guided Online Probabilistic Assistance) method addresses both of these challenges. NOPA consists of (1) an online goal inference module combining neural goal proposals with inverse planning and particle filtering for robust inference under uncertainty, and (2) a helping planner that discovers valuable subgoals to help with and is aware of the uncertainty in goal inference. We compare NOPA against multiple baselines in a new embodied AI assistance challenge: Online Watch-And-Help, in which a helper agent needs to simultaneously watch a main agent's actions, infer its goal, and help perform a common household task faster in realistic virtual home environments. Experiments show that our helper agent robustly updates its goal inference and adapts its helping plans to the changing level of uncertainty.
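The goal-inference module combines neural proposals with particle filtering; a simplified sketch of that combination is shown below. The proposal, likelihood (inverse-planning), and resampling interfaces are assumptions chosen for illustration, not NOPA's exact implementation.

```python
import random
from typing import Callable, List, Tuple

def infer_goals(
    propose_goals: Callable[[object], List[object]],  # neural goal proposals from an observation (assumed)
    likelihood: Callable[[object, object], float],    # P(observed action | goal) via inverse planning (assumed)
    observations: List[object],
    n_particles: int = 100,
) -> List[Tuple[object, float]]:
    """Maintain a weighted particle set over candidate goals, updated online."""
    particles = propose_goals(observations[0])[:n_particles]
    weights = [1.0 / len(particles)] * len(particles)
    for obs in observations[1:]:
        # Reweight each goal hypothesis by how well it explains the new action.
        weights = [w * likelihood(obs, g) for g, w in zip(particles, weights)]
        total = sum(weights) or 1.0
        weights = [w / total for w in weights]
        # Resample when the effective sample size collapses, injecting fresh
        # neural proposals so the hypothesis set stays adaptive.
        ess = 1.0 / sum(w * w for w in weights)
        if ess < len(particles) / 2:
            particles = random.choices(particles, weights=weights, k=max(len(particles) - 10, 1))
            particles += propose_goals(obs)[:10]
            weights = [1.0 / len(particles)] * len(particles)
    return list(zip(particles, weights))
```

A helping planner can then pick subgoals that are valuable across many high-weight hypotheses, rather than committing early to a single inferred goal.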
Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration
Puig, Xavier, Shu, Tianmin, Li, Shuang, Wang, Zilin, Tenenbaum, Joshua B., Fidler, Sanja, Torralba, Antonio
In this paper, we introduce Watch-And-Help (WAH), a challenge for testing social intelligence in agents. In WAH, an AI agent needs to help a humanlike agent perform a complex household task efficiently. To succeed, the AI agent needs to i) understand the underlying goal of the task by watching a single demonstration of the humanlike agent performing the same task (social perception), and ii) coordinate with the humanlike agent to solve the task in an unseen environment as fast as possible (human-AI collaboration). For this challenge, we build VirtualHome-Social, a multi-agent household environment, and provide a benchmark including both planning- and learning-based baselines. We evaluate the performance of AI agents with the humanlike agent as well as with real humans using objective metrics and subjective user ratings. Experimental results demonstrate that the proposed challenge and virtual environment enable a systematic evaluation of important aspects of machine social intelligence at scale.
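The two-stage structure (watch a single demonstration, then help in a new environment) can be summarized in a short sketch. The goal-inference and planning callables below are illustrative assumptions, not the benchmark's real interfaces.

```python
from typing import Callable, List, Tuple

def watch_and_help(
    demo: List[object],                            # the humanlike agent's single demonstration (Watch stage)
    infer_goal: Callable[[List[object]], object],  # social-perception model (assumed)
    plan_step: Callable[[object, object], object], # helper planner: (goal, state) -> action (assumed)
    env_reset: Callable[[], object],
    env_step: Callable[[object], Tuple[object, bool]],  # returns (state, done)
    max_steps: int = 250,
) -> Tuple[object, object]:
    goal = infer_goal(demo)      # Watch: recover the task goal from observation alone
    state = env_reset()
    for _ in range(max_steps):   # Help: act in an unseen environment toward that goal
        state, done = env_step(plan_step(goal, state))
        if done:
            break
    return goal, state
```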
VirtualHome: Simulating Household Activities via Programs
Puig, Xavier, Ra, Kevin, Boben, Marko, Li, Jiaman, Wang, Tingwu, Fidler, Sanja, Torralba, Antonio
In this paper, we are interested in modeling complex activities that occur in a typical household. We propose to use programs, i.e., sequences of atomic actions and interactions, as a high-level representation of complex tasks. Programs are interesting because they provide a non-ambiguous representation of a task and allow agents to execute them. However, no existing database provides this type of information. Towards this goal, we first crowd-source programs for a variety of activities that happen in people's homes, via a game-like interface used for teaching kids how to code. Using the collected dataset, we show how we can learn to extract programs directly from natural language descriptions or from videos. We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to "drive" an artificial agent to execute tasks in a simulated household environment. Our VirtualHome simulator allows us to create a large activity video dataset with rich ground truth, enabling training and testing of video understanding models. We further showcase examples of our agent performing tasks in VirtualHome based on language descriptions.
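A program here is just an ordered list of atomic (inter)actions; the sketch below shows one plausible encoding and a stub executor. The `Instruction` class and `do_step` hook are illustrative assumptions; VirtualHome's own scripts use a text syntax rather than Python objects.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Instruction:
    action: str                # atomic (inter)action, e.g. "Walk", "Grab", "SwitchOn"
    obj: Optional[str] = None  # object the action applies to, if any

Program = List[Instruction]

# A toy program in the spirit of the crowd-sourced ones.
make_coffee: Program = [
    Instruction("Walk", "kitchen"),
    Instruction("Grab", "mug"),
    Instruction("SwitchOn", "coffee_maker"),
]

def execute(program: Program, do_step: Callable[[str, Optional[str]], bool]) -> bool:
    """Drive an agent by running instructions in order; `do_step` is a
    hypothetical hook into the simulator that reports per-action success."""
    return all(do_step(ins.action, ins.obj) for ins in program)
```

Representing tasks this way keeps them unambiguous and executable, which is what allows the same program to serve both as a learning target (extracted from language or video) and as a control signal for the simulated agent.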