ai2-thor
GRIP: A Unified Framework for Grid-Based Relay and Co-Occurrence-Aware Planning in Dynamic Environments
Alanazi, Ahmed, Ho, Duy, Lee, Yugyung
Robots navigating dynamic, cluttered, and semantically complex environments must integrate perception, symbolic reasoning, and spatial planning to generalize across diverse layouts and object categories. Existing methods often rely on static priors or limited memory, constraining adaptability under partial observability and semantic ambiguity. We present GRIP, Grid-based Relay with Intermediate Planning, a unified, modular framework with three scalable variants: GRIP-L (Lightweight), optimized for symbolic navigation via semantic occupancy grids; GRIP-F (Full), supporting multi-hop anchor chaining and LLM-based introspection; and GRIP-R (Real-World), enabling physical robot deployment under perceptual uncertainty. GRIP integrates dynamic 2D grid construction, open-vocabulary object grounding, co-occurrence-aware symbolic planning, and hybrid policy execution using behavioral cloning, D* search, and grid-conditioned control. Empirical results on AI2-THOR and RoboTHOR benchmarks show that GRIP achieves up to 9.6% higher success rates and over $2\times$ improvement in path efficiency (SPL and SAE) on long-horizon tasks. Qualitative analyses reveal interpretable symbolic plans in ambiguous scenes. Real-world deployment on a Jetbot further validates GRIP's generalization under sensor noise and environmental variation. These results position GRIP as a robust, scalable, and explainable framework bridging simulation and real-world navigation.
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
Yue, Junpeng, Xu, Xinru, Karlsson, Börje F., Lu, Zongqing
MLLM agents demonstrate potential for complex embodied tasks by retrieving multimodal task-relevant trajectory data. However, current retrieval methods primarily focus on surface-level similarities of textual or visual cues in trajectories, neglecting their effectiveness for the specific task at hand. To address this issue, we propose a novel method, MLLM as ReTriever (MART), which enhances the performance of embodied agents by utilizing interaction data to fine-tune an MLLM retriever based on preference learning, such that the retriever fully considers the effectiveness of trajectories and prioritize them for unseen tasks. We also introduce Trajectory Abstraction, a mechanism that leverages MLLMs' summarization capabilities to represent trajectories with fewer tokens while preserving key information, enabling agents to better comprehend milestones in the trajectory. Experimental results across various environments demonstrate our method significantly improves task success rates in unseen scenes compared to baseline methods. This work presents a new paradigm for multimodal retrieval in embodied agents, by fine-tuning a general-purpose MLLM as the retriever to assess trajectory effectiveness. All benchmark task sets and simulator code modifications for action and observation spaces will be released.
ManipulaTHOR: a framework for visual object manipulation
The Allen Institute for AI (AI2) announced the 3.0 release of its embodied artificial intelligence framework AI2-THOR, which adds active object manipulation to its testing framework. ManipulaTHOR is a first of its kind virtual agent with a highly articulated robot arm equipped with three joints of equal limb length and composed entirely of swivel joints to bring a more human-like approach to object manipulation. AI2-THOR is the first testing framework to study the problem of object manipulation in more than 100 visually rich, physics-enabled rooms. By enabling the training and evaluation of generalized capabilities in manipulation models, ManipulaTHOR allows for much faster training in more complex environments as compared to current real-world training methods, while also being far safer and more cost-effective. "Imagine a robot being able to navigate a kitchen, open a refrigerator and pull out a can of soda. This is one of the biggest and yet often overlooked challenges in robotics and AI2-THOR is the first to design a benchmark for the task of moving objects to various locations in virtual rooms, enabling reproducibility and measuring progress," said Dr. Oren Etzioni, CEO at AI2. "After five years of hard work, we can now begin to train robots to perceive and navigate the world more like we do, making real-world usage models more attainable than ever before."
A Cordial Sync: Going Beyond Marginal Policies for Multi-Agent Embodied Tasks
Jain, Unnat, Weihs, Luca, Kolve, Eric, Farhadi, Ali, Lazebnik, Svetlana, Kembhavi, Aniruddha, Schwing, Alexander
Autonomous agents must learn to collaborate. It is not scalable to develop a new centralized agent every time a task's difficulty outpaces a single agent's abilities. While multi-agent collaboration research has flourished in gridworld-like environments, relatively little work has considered visually rich domains. Addressing this, we introduce the novel task FurnMove in which agents work together to move a piece of furniture through a living room to a goal. Unlike existing tasks, FurnMove requires agents to coordinate at every timestep. We identify two challenges when training agents to complete FurnMove: existing decentralized action sampling procedures do not permit expressive joint action policies and, in tasks requiring close coordination, the number of failed actions dominates successful actions. To confront these challenges we introduce SYNC-policies (synchronize your actions coherently) and CORDIAL (coordination loss). Using SYNC-policies and CORDIAL, our agents achieve a 58% completion rate on FurnMove, an impressive absolute gain of 25 percentage points over competitive decentralized baselines. Our dataset, code, and pretrained models are available at https://unnat.github.io/cordial-sync .
Artificial Agents Learn Flexible Visual Representations by Playing a Hiding Game
Weihs, Luca, Kembhavi, Aniruddha, Han, Winson, Herrasti, Alvaro, Kolve, Eric, Schwenk, Dustin, Mottaghi, Roozbeh, Farhadi, Ali
The ubiquity of embodied gameplay, observed in a wide variety of animal species including turtles and ravens, has led researchers to question what advantages play provides to the animals engaged in it. Mounting evidence suggests that play is critical in developing the neural flexibility for creative problem solving, socialization, and can improve the plasticity of the medial prefrontal cortex. Comparatively little is known regarding the impact of gameplay upon embodied artificial agents. While recent work has produced artificial agents proficient in abstract games, the environments these agents act within are far removed the real world and thus these agents provide little insight into the advantages of embodied play. Hiding games have arisen in multiple cultures and species, and provide a rich ground for studying the impact of embodied gameplay on representation learning in the context of perspective taking, secret keeping, and false belief understanding. Here we are the first to show that embodied adversarial reinforcement learning agents playing cache, a variant of hide-and-seek, in a high fidelity, interactive, environment, learn representations of their observations encoding information such as occlusion, object permanence, free space, and containment; on par with representations learnt by the most popular modern paradigm for visual representation learning which requires large datasets independently labeled for each new task. Our representations are enhanced by intent and memory, through interaction and play, moving closer to biologically motivated learning strategies. These results serve as a model for studying how facets of vision and perspective taking develop through play, provide an experimental framework for assessing what is learned by artificial agents, and suggest that representation learning should move from static datasets and towards experiential, interactive, learning.
Two Body Problem: Collaborative Visual Task Completion
Jain, Unnat, Weihs, Luca, Kolve, Eric, Rastegari, Mohammad, Lazebnik, Svetlana, Farhadi, Ali, Schwing, Alexander, Kembhavi, Aniruddha
Collaboration is a necessary skill to perform tasks that are beyond one agent's capabilities. Addressed extensively in both conventional and modern AI, multi-agent collaboration has often been studied in the context of simple grid worlds. We argue that there are inherently visual aspects to collaboration which should be studied in visually rich environments. A key element in collaboration is communication that can be either explicit, through messages, or implicit, through perception of the other agents and the visual world. Learning to collaborate in a visual environment entails learning (1) to perform the task, (2) when and what to communicate, and (3) how to act based on these communications and the perception of the visual world. In this paper we study the problem of learning to collaborate directly from pixels in AI2-THOR and demonstrate the benefits of explicit and implicit modes of communication to perform visual tasks. Refer to our project page for more details: https://prior.allenai.org/projects/two-body-problem
AI2-THOR: An Interactive 3D Environment for Visual AI
Kolve, Eric, Mottaghi, Roozbeh, Han, Winson, VanderBilt, Eli, Weihs, Luca, Herrasti, Alvaro, Gordon, Daniel, Zhu, Yuke, Gupta, Abhinav, Farhadi, Ali
We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at http://ai2thor.allenai.org. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.
On Evaluation of Embodied Navigation Agents
Anderson, Peter, Chang, Angel, Chaplot, Devendra Singh, Dosovitskiy, Alexey, Gupta, Saurabh, Koltun, Vladlen, Kosecka, Jana, Malik, Jitendra, Mottaghi, Roozbeh, Savva, Manolis, Zamir, Amir R.
Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study empirical methodology in navigation research. The present document summarizes the consensus recommendations of this working group. We discuss different problem statements and the role of generalization, present evaluation measures, and provide standard scenarios that can be used for benchmarking.
AI2-THOR Interactive Simulation Teaches AI About Real World
Training a robot butler to make the perfect omlette could require breaking a lot of eggs and throwing out many imperfect attempts in a real-life kitchen. That's why researchers have been rolling out virtual training grounds as a more efficient alternative to putting AI agents through costly and time-consuming experiments in the real world. Virtual environments could prove especially useful in training the most popular AI based on machine learning algorithms that often require thousands of trial-and-error runs to learn new skills. Companies such as Waymo have already built their own internal simulators with virtual roads and traffic intersections to train their AI to safely take the wheel of self-driving cars. But a new, open-source virtual training ground called AI2-THOR enables AI agents to learn how to interact with objects in familiar home settings such as kitchens and bedrooms.