Chaplot, Devendra Singh
Navigating to Objects Specified by Images
Krantz, Jacob, Gervet, Theophile, Yadav, Karmesh, Wang, Austin, Paxton, Chris, Mottaghi, Roozbeh, Batra, Dhruv, Malik, Jitendra, Lee, Stefan, Chaplot, Devendra Singh
Images are a convenient way to specify which particular object instance an embodied agent should navigate to. Solving this task requires semantic visual reasoning and exploration of unknown environments. We present a system that can perform this task in both simulation and the real world. Our modular method solves sub-tasks of exploration, goal instance re-identification, goal localization, and local navigation. We re-identify the goal instance in egocentric vision using feature-matching and localize the goal instance by projecting matched features to a map. Each sub-task is solved using off-the-shelf components requiring zero fine-tuning. On the HM3D InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL policy 7x and a state-of-the-art ImageNav model 2.3x (56% vs 25% success). We deploy this system to a mobile robot platform and demonstrate effective real-world performance, achieving an 88% success rate across a home and an office environment.
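As a rough illustration of the goal-localization step, here is a minimal sketch (assuming a pinhole camera, a depth image in meters, and a placeholder keypoint matcher that already produced matched pixel coordinates; the function name and map parameters are hypothetical) that back-projects matched features and votes them into a top-down grid to pick a goal cell.

```python
# A minimal sketch of projecting matched goal features onto a top-down map.
import numpy as np

def project_matches_to_map(matched_px, depth, K, cam_to_world,
                           map_size=480, cell_size=0.05):
    """Back-project matched pixels to 3D and vote into a top-down grid."""
    u, v = matched_px[:, 0], matched_px[:, 1]          # matched pixel coords (N, 2)
    z = depth[v.astype(int), u.astype(int)]            # depth at each match
    valid = z > 0
    u, v, z = u[valid], v[valid], z[valid]

    # Pinhole back-projection into the camera frame.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)       # (4, N)

    # Transform to the world frame and drop the homogeneous coordinate.
    pts_world = (cam_to_world @ pts_cam)[:3].T                   # (N, 3)

    # Vote matched points into map cells; the densest cell is the goal estimate.
    goal_map = np.zeros((map_size, map_size), dtype=np.int32)
    cols = np.clip((pts_world[:, 0] / cell_size).astype(int) + map_size // 2,
                   0, map_size - 1)
    rows = np.clip((pts_world[:, 2] / cell_size).astype(int) + map_size // 2,
                   0, map_size - 1)
    np.add.at(goal_map, (rows, cols), 1)
    return np.unravel_index(goal_map.argmax(), goal_map.shape)
```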
Navigating to Objects in the Real World
Gervet, Theophile, Chintala, Soumith, Batra, Dhruv, Malik, Jitendra, Chaplot, Devendra Singh
Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, while modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. But learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation methods comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from 77% simulation to 23% real-world success rate due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks - (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes - and propose concrete steps forward.
PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning
Ramakrishnan, Santhosh Kumar, Chaplot, Devendra Singh, Al-Halah, Ziad, Malik, Jitendra, Grauman, Kristen
State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of 'where to look?' for an object and 'how to navigate to (x, y)?'. Our key insight is that 'where to look?' can be treated purely as a perception problem, and learned without environment interactions. To address this, we propose a network that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object. We train the potential function network using supervised learning on a passive dataset of top-down semantic maps, and integrate it into a modular framework to perform ObjectGoal navigation.
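For intuition only, the sketch below is not the paper's architecture: a small encoder-decoder maps a top-down semantic map to two potential functions whose sum selects a long-term goal cell. Channel counts, layer sizes, and the helper function are assumptions.

```python
# A minimal sketch of a potential-function network over a semantic map.
import torch
import torch.nn as nn

class PotentialFunctionNet(nn.Module):
    def __init__(self, num_sem_channels=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(num_sem_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Two heads: an "area" potential (unexplored area) and an "object"
        # potential (likelihood the target object is nearby).
        self.area_head = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        self.object_head = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, semantic_map):
        feats = self.encoder(semantic_map)
        return self.area_head(feats), self.object_head(feats)

def select_long_term_goal(net, semantic_map):
    """Pick the map cell with the highest combined potential (batch of 1)."""
    area_pot, obj_pot = net(semantic_map)
    combined = (area_pot + obj_pot).squeeze(0).squeeze(0)   # (H, W)
    idx = int(combined.flatten().argmax())
    return divmod(idx, combined.shape[1])                   # (row, col)
```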
Differentiable Spatial Planning using Transformers
Chaplot, Devendra Singh, Pathak, Deepak, Malik, Jitendra
We consider the problem of spatial path planning. In contrast to classical solutions, which optimize a new plan from scratch and assume access to the full map with ground-truth obstacle locations, we learn a planner from data in a differentiable manner, allowing us to leverage statistical regularities from past data. We propose Spatial Planning Transformers (SPT), which, given an obstacle map, learn to generate actions by planning over long-range spatial dependencies, unlike prior data-driven planners that propagate information locally via convolutional structure in an iterative manner. In the setting where the ground-truth map is not known to the agent, we leverage pre-trained SPTs in an end-to-end framework that has the structure of mapper and planner built into it, which allows seamless generalization to out-of-distribution maps and goals. SPTs outperform prior state-of-the-art differentiable planners across all setups for both manipulation and navigation tasks, leading to an absolute improvement of 7-19%.
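A rough sketch of the idea in PyTorch, not the SPT architecture itself: map cells become tokens, self-attention propagates goal information across the entire map, and each token predicts a planning value for its cell. The token features, layer sizes, and output format are assumptions.

```python
# A minimal sketch of a transformer-style spatial planner over map tokens.
import torch
import torch.nn as nn

class SpatialPlannerSketch(nn.Module):
    def __init__(self, map_size=15, d_model=64, nhead=4, num_layers=4):
        super().__init__()
        self.map_size = map_size
        # Each cell carries two scalars: obstacle occupancy and goal indicator.
        self.token_embed = nn.Linear(2, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(map_size * map_size, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.value_head = nn.Linear(d_model, 1)   # predicted planning value per cell

    def forward(self, obstacle_map, goal_map):
        # obstacle_map, goal_map: (B, H, W) float grids of occupancy / goal indicator.
        b = obstacle_map.shape[0]
        cells = torch.stack([obstacle_map, goal_map], dim=-1)     # (B, H, W, 2)
        tokens = self.token_embed(cells.flatten(1, 2))            # (B, HW, D)
        tokens = tokens + self.pos_embed                          # add positions
        feats = self.encoder(tokens)                              # global self-attention
        values = self.value_head(feats).squeeze(-1)               # (B, HW)
        return values.view(b, self.map_size, self.map_size)
```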
SEAL: Self-supervised Embodied Active Learning using Exploration and 3D Consistency
Chaplot, Devendra Singh, Dalal, Murtaza, Gupta, Saurabh, Malik, Jitendra, Salakhutdinov, Ruslan
In this paper, we explore how we can build upon the data and models of Internet images and use them to adapt to robot vision without requiring any extra labels. We present a framework called Self-supervised Embodied Active Learning (SEAL). It utilizes perception models trained on internet images to learn an active exploration policy. The observations gathered by this exploration policy are labelled using 3D consistency and used to improve the perception model. We build and utilize 3D semantic maps to learn both action and perception in a completely self-supervised manner. The semantic map is used to compute an intrinsic motivation reward for training the exploration policy and for labelling the agent observations using spatio-temporal 3D consistency and label propagation. We demonstrate that the SEAL framework can be used to close the action-perception loop: it improves object detection and instance segmentation performance of a pretrained perception model by just moving around in training environments and the improved perception model can be used to improve Object Goal Navigation.
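The 3D-consistency labelling idea can be sketched as follows; the voxel size, the frame format, and both helper functions are assumptions rather than the SEAL implementation. Per-frame predicted labels are voted into a voxel map, and the majority label per voxel is propagated back to every frame that observes it.

```python
# A minimal sketch of label aggregation and propagation via 3D consistency.
import numpy as np
from collections import defaultdict

def aggregate_votes(frames, voxel_size=0.1):
    """frames: list of (points_world (N, 3), predicted_labels (N,)) pairs."""
    votes = defaultdict(lambda: defaultdict(int))
    for pts, labels in frames:
        keys = np.floor(pts / voxel_size).astype(int)
        for key, label in zip(map(tuple, keys), labels):
            votes[key][int(label)] += 1
    # Majority vote per voxel = the 3D-consistent label.
    return {k: max(v, key=v.get) for k, v in votes.items()}

def propagate_labels(frames, voxel_labels, voxel_size=0.1):
    """Relabel every frame's points from the aggregated voxel map."""
    relabeled = []
    for pts, labels in frames:
        keys = map(tuple, np.floor(pts / voxel_size).astype(int))
        relabeled.append(np.array([voxel_labels.get(k, l)
                                   for k, l in zip(keys, labels)]))
    return relabeled
```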
Building Intelligent Autonomous Navigation Agents
Chaplot, Devendra Singh
Breakthroughs in machine learning in the last decade have led to 'digital intelligence', i.e. machine learning models capable of learning from vast amounts of labeled data to perform several digital tasks such as speech recognition, face recognition, machine translation and so on. The goal of this thesis is to make progress towards designing algorithms capable of 'physical intelligence', i.e. building intelligent autonomous navigation agents capable of learning to perform complex navigation tasks in the physical world involving visual perception, natural language understanding, reasoning, planning, and sequential decision making. Despite several advances in classical navigation methods in the last few decades, current navigation agents struggle at long-term semantic navigation tasks. In the first part of the thesis, we discuss our work on short-term navigation using end-to-end reinforcement learning to tackle challenges such as obstacle avoidance, semantic perception, language grounding, and reasoning. In the second part, we present a new class of navigation methods based on modular learning and structured explicit map representations, which leverage the strengths of both classical and end-to-end learning methods, to tackle long-term navigation tasks. We show that these methods are able to effectively tackle challenges such as localization, mapping, long-term planning, exploration and learning semantic priors. These modular learning methods are capable of long-term spatial and semantic understanding and achieve state-of-the-art results on various navigation tasks.
Planning with Submodular Objective Functions
Wang, Ruosong, Zhang, Hanrui, Chaplot, Devendra Singh, Garagić, Denis, Salakhutdinov, Ruslan
Modern reinforcement learning and planning algorithms have achieved tremendous successes on various tasks [Mnih et al., 2015, Silver et al., 2017]. However, most of these algorithms work in the standard Markov decision process (MDP) framework where the goal is to maximize the cumulative reward and thus it can be difficult to apply them to various practical sequential decision-making problems. In this paper, we study planning in generalized MDPs, where instead of maximizing the cumulative reward, the goal is to maximize the objective value induced by a submodular function. To motivate our approach, let us consider the following scenario: a company manufactures cars, and as part of its customer service, continuously monitors the status of all cars produced by the company. Each car is equipped with a number of sensors, each of which constantly produces noisy measurements of some attribute of the car, e.g., speed, location, temperature, etc. Due to bandwidth constraints, at any moment, each car may choose to transmit data generated by a single sensor to the company. The goal is to combine the statistics collected over a fixed period of time, presumably from multiple sensors, to gather as much information about the car as possible. Perhaps one seemingly natural strategy is to transmit only data generated by the most "informative" sensor. However, as the output of a sensor remains the same between two samples, it is pointless to transmit the same data multiple times. One may alternatively try to order sensors by their "informativity" and always choose the most informative sensor that has not yet transmitted data since the last sample was generated.
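A toy illustration of such a submodular, coverage-style objective (the sensor names and coverage sets below are invented) shows why repeatedly transmitting the single most informative sensor plateaus while a diverse transmission plan keeps adding value:

```python
# A toy submodular objective: the value of a plan is the size of the union
# of attributes covered, so repeated selections add nothing.
def coverage_value(selected, info_sets):
    """Submodular objective: number of distinct attributes covered."""
    covered = set()
    for sensor in selected:
        covered |= info_sets[sensor]
    return len(covered)

info_sets = {
    "gps":   {"location", "speed"},
    "imu":   {"speed", "heading"},
    "therm": {"temperature"},
}

# Always transmitting the single "most informative" sensor plateaus ...
print(coverage_value(["gps", "gps", "gps"], info_sets))        # 2
# ... while a diverse plan covers strictly more attributes.
print(coverage_value(["gps", "imu", "therm"], info_sets))      # 4
```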
Semantic Curiosity for Active Visual Learning
Chaplot, Devendra Singh, Jiang, Helen, Gupta, Saurabh, Gupta, Abhinav
In this paper, we study the task of embodied interactive learning for object detection. Given a set of environments (and some labeling budget), our goal is to learn an object detector by having an agent select what data to obtain labels for. How should an exploration policy decide which trajectory should be labeled? One possibility is to use a trained object detector's failure cases as an external reward. However, this would require labeling the millions of frames needed to train RL policies, which is infeasible. Instead, we explore a self-supervised approach for training our exploration policy by introducing a notion of semantic curiosity. Our semantic curiosity policy is based on a simple observation: the detection outputs should be consistent. Therefore, our semantic curiosity rewards trajectories with inconsistent labeling behavior and encourages the exploration policy to explore such areas. The exploration policy trained via semantic curiosity generalizes to novel scenes and helps train an object detector that outperforms baselines trained with alternatives such as random exploration, prediction-error curiosity, and coverage-maximizing exploration.
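One way to sketch such a reward (the tracking and map-projection machinery is abstracted away, and the function names are hypothetical) is to score the label entropy a detector accumulates at each spatial location over a trajectory:

```python
# A minimal sketch of a semantic-curiosity style reward from label inconsistency.
import numpy as np
from collections import Counter

def label_entropy(labels):
    """Entropy of the label distribution observed at one location."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def semantic_curiosity_reward(labels_per_location):
    """labels_per_location: dict mapping a map cell to detector labels seen there."""
    return sum(label_entropy(labels) for labels in labels_per_location.values())

# A cell consistently labeled "chair" contributes no reward; a cell flipping
# between "chair" and "sofa" does.
print(semantic_curiosity_reward({(3, 4): ["chair", "chair", "chair"],
                                 (7, 1): ["chair", "sofa", "chair", "sofa"]}))
```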
Neural Topological SLAM for Visual Navigation
Chaplot, Devendra Singh, Salakhutdinov, Ruslan, Gupta, Abhinav, Gupta, Saurabh
This paper studies the problem of image-goal navigation which involves navigating to the location indicated by a goal image in a novel previously unseen environment. To tackle this problem, we design topological representations for space that effectively leverage semantics and afford approximate geometric reasoning. At the heart of our representations are nodes with associated semantic features, that are interconnected using coarse geometric information. We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation. Experimental study in visually and physically realistic simulation suggests that our method builds effective representations that capture structural regularities and efficiently solve long-horizon navigation problems. We observe a relative improvement of more than 50% over existing methods that study this task.
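A minimal sketch of such a topological representation, with the feature extractor, similarity threshold, and class interface as assumptions rather than the paper's implementation: nodes store semantic features for places, edges store coarse relative poses, and localization is nearest-neighbor matching in feature space.

```python
# A minimal sketch of a topological map with semantic node features.
import numpy as np

class TopologicalMap:
    def __init__(self, match_threshold=0.8):
        self.features = []            # one feature vector per node (place)
        self.edges = {}               # (i, j) -> approximate relative pose
        self.match_threshold = match_threshold

    def localize(self, feature):
        """Return the index of the best-matching node, or None if no match."""
        if not self.features:
            return None
        sims = [float(np.dot(f, feature) /
                      (np.linalg.norm(f) * np.linalg.norm(feature) + 1e-8))
                for f in self.features]
        best = int(np.argmax(sims))
        return best if sims[best] >= self.match_threshold else None

    def add_node(self, feature, prev_node=None, rel_pose=None):
        """Add a new place node, optionally linked to the previous node."""
        self.features.append(feature)
        node = len(self.features) - 1
        if prev_node is not None:
            # Coarse geometric information connecting neighboring places.
            self.edges[(prev_node, node)] = rel_pose
        return node
```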
Embodied Multimodal Multitask Learning
Chaplot, Devendra Singh, Lee, Lisa, Salakhutdinov, Ruslan, Parikh, Devi, Batra, Dhruv
Recent efforts on training visual navigation agents conditioned on language using deep reinforcement learning have been successful in learning policies for different multimodal tasks, such as semantic goal navigation and embodied question answering. In this paper, we propose a multitask model capable of jointly learning these multimodal tasks, and transferring knowledge of words and their grounding in visual objects across the tasks. The proposed model uses a novel Dual-Attention unit to disentangle the knowledge of words in the textual representations and visual concepts in the visual representations, and align them with each other. This disentangled task-invariant alignment of representations facilitates grounding and knowledge transfer across both tasks. We show that the proposed model outperforms a range of baselines on both tasks in simulated 3D environments. We also show that this disentanglement of representations makes our model modular, interpretable, and allows for transfer to instructions containing new words by leveraging object detectors.
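As a simplified stand-in for word/visual alignment via gating (this is not the paper's Dual-Attention unit; dimensions and the gating scheme are assumptions), each instruction produces per-channel gates over convolutional visual features so that channels aligned with mentioned words are emphasized.

```python
# A minimal sketch of word-conditioned gated attention over visual features.
import torch
import torch.nn as nn

class GatedAttentionSketch(nn.Module):
    def __init__(self, vocab_size=100, embed_dim=32, vis_channels=64):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.gate = nn.Linear(embed_dim, vis_channels)

    def forward(self, word_ids, visual_feats):
        # word_ids: (B, L) token ids; visual_feats: (B, C, H, W) conv features.
        words = self.word_embed(word_ids).mean(dim=1)        # (B, E) instruction rep
        gates = torch.sigmoid(self.gate(words))              # (B, C) channel gates
        return visual_feats * gates[:, :, None, None]        # gated visual features
```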