Srivastava, Sanjana, Li, Chengshu, Lingelbach, Michael, Martín-Martín, Roberto, Xia, Fei, Vainio, Kent, Lian, Zheng, Gokmen, Cem, Buch, Shyamal, Liu, C. Karen, Savarese, Silvio, Gweon, Hyowon, Wu, Jiajun, Fei-Fei, Li
Embodied AI refers to the study and development of artificial agents that can perceive, reason, and interact with the environment with the capabilities and limitations of a physical body. Recently, significant progress has been made in developing solutions to embodied AI problems such as (visual) navigation [1-5], interactive Q&A [6-10], instruction following [11-15], and manipulation [16-22]. To calibrate this progress, several pioneering efforts have benchmarked embodied AI in simulated environments, including Rearrangement [23, 24], TDW Transport Challenge, VirtualHome, ALFRED, Interactive Gibson Benchmark, MetaWorld, and RLBench, among others [30-32]. These efforts are inspiring, but their activities represent only a fraction of the challenges that humans face in their daily lives. To develop artificial agents that can eventually perform and assist with everyday activities with human-level robustness and flexibility, we need a comprehensive benchmark with activities that are more realistic, diverse, and complex. But this is easier said than done. Three major challenges have prevented existing benchmarks from accommodating more realistic, diverse, and complex activities:
- Definition: identifying and defining meaningful activities for benchmarking;
- Realization: developing simulated environments that realistically support such activities;
- Evaluation: defining success and objective metrics for evaluating performance.
We introduce a visually-guided and physics-driven task-and-motion planning benchmark, which we call the ThreeDWorld Transport Challenge. In this challenge, an embodied agent equipped with two 9-DOF articulated arms is spawned randomly in a simulated physical home environment. The agent is required to find a small set of objects scattered around the house, pick them up, and transport them to a desired final location. We also position containers around the house that can be used as tools to assist with transporting objects efficiently. To complete the task, an embodied agent must plan a sequence of actions to change the state of a large number of objects in the face of realistic physical constraints. We build this benchmark challenge using the ThreeDWorld simulation: a virtual 3D environment where all objects respond to physics, and where agents can be controlled using a fully physics-driven navigation and interaction API. We evaluate several existing agents on this benchmark. Experimental results suggest that: 1) a pure RL model struggles on this challenge; 2) hierarchical planning-based agents can transport some objects but are still far from solving this task. We anticipate that this benchmark will empower researchers to develop more intelligent physics-driven robots for the physical world.
Li, Chengshu, Xia, Fei, Martín-Martín, Roberto, Lingelbach, Michael, Srivastava, Sanjana, Shen, Bokui, Vainio, Kent, Gokmen, Cem, Dharan, Gokul, Jain, Tanish, Kurenkov, Andrey, Liu, C. Karen, Gweon, Hyowon, Wu, Jiajun, Fei-Fei, Li, Savarese, Silvio
Recent research in embodied AI has been boosted by the use of simulation environments to develop and train robot learning approaches. However, the use of simulation has skewed attention toward tasks that only require what robotics simulators can simulate: motion and physical contact. We present iGibson 2.0, an open-source simulation environment that supports the simulation of a more diverse set of household tasks through three key innovations. First, iGibson 2.0 supports object states, including temperature, wetness level, cleanliness level, and toggled and sliced states, which are necessary to cover a wider range of tasks. Second, iGibson 2.0 implements a set of predicate logic functions that map the simulator states to logic states like Cooked or Soaked. Additionally, given a logic state, iGibson 2.0 can sample valid physical states that satisfy it. This functionality can generate potentially infinite instances of tasks with minimal effort from users. The sampling mechanism also allows our scenes to be more densely populated with small objects in semantically meaningful locations. Third, iGibson 2.0 includes a virtual reality (VR) interface to immerse humans in its scenes to collect demonstrations. As a result, we can collect demonstrations from humans on these new types of tasks and use them for imitation learning. We evaluate the new capabilities of iGibson 2.0 to enable robot learning of novel tasks, in the hope of demonstrating the potential of this new simulator to support new research in embodied AI. iGibson 2.0 and its new dataset will be publicly available at http://svl.stanford.edu/igibson/.
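The two directions described above (mapping continuous simulator states to logic states, and sampling physical states that satisfy a given logic state) can be illustrated with a minimal sketch. This is not the iGibson 2.0 API: the class `ObjectState`, the functions `is_cooked`, `is_soaked`, and `sample_state`, the threshold values, and the choice to base Cooked on the highest temperature reached are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for this illustration.
COOK_TEMPERATURE_C = 70.0
SOAK_WETNESS = 0.5


@dataclass
class ObjectState:
    """Continuous (simulator-level) state of a single object."""
    temperature_c: float = 20.0
    max_temperature_c: float = 20.0  # hottest temperature reached so far
    wetness: float = 0.0             # 0.0 = dry, 1.0 = fully wet

    def heat_to(self, temperature_c: float) -> None:
        """Set the current temperature and update the running maximum."""
        self.temperature_c = temperature_c
        self.max_temperature_c = max(self.max_temperature_c, temperature_c)


def is_cooked(state: ObjectState) -> bool:
    # Assumption of this sketch: an object stays cooked after it cools,
    # so the predicate looks at the highest temperature ever reached.
    return state.max_temperature_c >= COOK_TEMPERATURE_C


def is_soaked(state: ObjectState) -> bool:
    return state.wetness >= SOAK_WETNESS


def sample_state(predicate, rng: random.Random, max_tries: int = 1000) -> ObjectState:
    """Rejection-sample a continuous state that satisfies a logic predicate."""
    for _ in range(max_tries):
        candidate = ObjectState()
        candidate.heat_to(rng.uniform(0.0, 120.0))
        candidate.wetness = rng.uniform(0.0, 1.0)
        if predicate(candidate):
            return candidate
    raise RuntimeError("no satisfying state found within max_tries")


if __name__ == "__main__":
    rng = random.Random(0)
    cooked = sample_state(is_cooked, rng)
    print(is_cooked(cooked), is_soaked(cooked))
```

Rejection sampling is used here only because it is the simplest way to show the idea; each call with a fresh random draw yields a different valid instance, which is the sense in which a single logic-state description can generate many task instances.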
The domain of Embodied AI has recently witnessed substantial progress, particularly in navigating agents within their environments. These early successes have laid the building blocks for the community to tackle tasks that require agents to actively interact with objects in their environment. Object manipulation is an established research domain within the robotics community and poses several challenges, including manipulator motion, grasping, and long-horizon planning, particularly when dealing with oft-overlooked practical setups involving visually rich and complex scenes, manipulation using mobile agents (as opposed to tabletop manipulation), and generalization to unseen environments and objects. We propose a framework for object manipulation built upon the physics-enabled, visually rich AI2-THOR framework and present a new challenge to the Embodied AI community known as ArmPointNav. This task extends the popular point navigation task to object manipulation and offers new challenges, including 3D obstacle avoidance, manipulating objects in the presence of occlusion, and multi-object manipulation that necessitates long-term planning. Popular learning paradigms that are successful on PointNav challenges show promise, but leave substantial room for improvement.
Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study empirical methodology in navigation research. The present document summarizes the consensus recommendations of this working group. We discuss different problem statements and the role of generalization, present evaluation measures, and provide standard scenarios that can be used for benchmarking.
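The abstract above does not name the evaluation measures it presents. One measure widely used in navigation benchmarking is Success weighted by Path Length (SPL), which averages, over episodes, the success indicator multiplied by the ratio of the shortest-path length to the larger of the shortest-path length and the path the agent actually took. The sketch below is an illustration under that assumption; the function name `spl`, the tuple layout, and the handling of zero-length paths are choices made here, not taken from the text.

```python
from typing import Iterable, Tuple

# Each episode: (success, shortest_path_length, agent_path_length)
Episode = Tuple[bool, float, float]


def spl(episodes: Iterable[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    episodes = list(episodes)
    if not episodes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for success, shortest, taken in episodes:
        if not success:
            continue  # failed episodes contribute zero
        denominator = max(taken, shortest)
        # Assumption of this sketch: a successful episode whose start
        # already coincides with the goal (both lengths zero) scores 1.0.
        total += shortest / denominator if denominator > 0 else 1.0
    return total / len(episodes)


if __name__ == "__main__":
    runs = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
    print(spl(runs))
```

With the three episodes in the usage example, the first succeeds along the shortest path, the second succeeds with a path twice as long, and the third fails, so the individual contributions are 1, 0.5, and 0.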