ObjectNav



Neural Information Processing Systems

Contributions of ADVISOR: fast simulation enables extensive experimentation and a robustness study; ADVISOR can be applied in continuous, multi-agent environments; its performance is studied within a rich visual environment; it succeeds in diverse 3D environments; and the influence of the size of the imitation gap on performance is examined. Objective: cover black landmarks and avoid collisions. In particular, see Tab. 1 and Tab. 2 for our results; D-LH results are deferred to the Appendix.



ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Neural Information Processing Systems

We present a scalable approach for learning open-world object-goal navigation (ObjectNav) - the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot - i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink," "bathroom sink," etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% - 20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").
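The key mechanism above is that image goals and language goals are projected into one shared semantic embedding space, so a policy trained on image goals can be handed a phrase at test time. The sketch below illustrates that interface with toy stand-in encoders (the shared per-concept direction is our simplification; in ZSON the two towers of a CLIP-style multimodal model provide the alignment from raw pixels and text):

```python
import numpy as np

rng = np.random.default_rng(0)
_label_dirs = {}
DIM = 16

def _label_dir(label):
    # One shared unit direction per concept: a stand-in for the image/text
    # alignment that a CLIP-style multimodal encoder learns.
    if label not in _label_dirs:
        v = rng.standard_normal(DIM)
        _label_dirs[label] = v / np.linalg.norm(v)
    return _label_dirs[label]

def encode_image_goal(image_label):
    # Stand-in for the image tower: a photo of a concept lands near its direction.
    v = _label_dir(image_label) + 0.05 * rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_text_goal(text):
    # Stand-in for the text tower: the phrase maps onto the same direction.
    return _label_dir(text)

# The SemanticNav policy consumes only the goal embedding, so swapping a goal
# image (training) for a phrase like "sink" (evaluation) needs no retraining.
img_goal = encode_image_goal("sink")
txt_goal = encode_text_goal("sink")
```

Because both modalities land in the same space, the embedding handed to the policy is interchangeable, which is what makes the zero-shot transfer from ImageNav training to language-specified ObjectNav possible.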


PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory

Jin, Qunchao, Wu, Yilin, Chen, Changhao

arXiv.org Artificial Intelligence

Zero-shot object navigation (ZSON) in unseen environments remains a challenging problem for household robots, requiring strong perceptual understanding and decision-making capabilities. While recent methods leverage metric maps and Large Language Models (LLMs), they often depend on depth sensors or prebuilt maps, limiting the spatial reasoning ability of Multimodal Large Language Models (MLLMs). Mapless ZSON approaches have emerged to address this, but they typically make short-sighted decisions, leading to local deadlocks due to a lack of historical context. We propose PanoNav, a fully RGB-only, mapless ZSON framework that integrates a Panoramic Scene Parsing module to unlock the spatial parsing potential of MLLMs from panoramic RGB inputs, and a Memory-guided Decision-Making mechanism enhanced by a Dynamic Bounded Memory Queue to incorporate exploration history and avoid local deadlocks. Experiments on the public navigation benchmark show that PanoNav significantly outperforms representative baselines in both success rate (SR) and success weighted by path length (SPL).
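The "bounded memory queue that feeds history into the decision prompt" idea can be sketched in a few lines. This is a minimal illustration under our own assumptions (class and method names are ours, not PanoNav's): keep the K most recent scene summaries and serialise them into the MLLM's prompt so repeated locations can be avoided.

```python
from collections import deque

class BoundedMemoryQueue:
    """Keep the K most recent scene summaries; oldest entries are evicted
    automatically, so the prompt stays bounded as exploration grows."""

    def __init__(self, k=3):
        self.buf = deque(maxlen=k)

    def push(self, summary):
        self.buf.append(summary)

    def as_prompt(self):
        if not self.buf:
            return "No exploration history yet."
        lines = "\n".join(f"- {s}" for s in self.buf)
        return f"Recently visited (avoid re-entering):\n{lines}"

mem = BoundedMemoryQueue(k=3)
for scene in ["hallway with doors", "kitchen entrance",
              "living room", "hallway with doors"]:
    mem.push(scene)
```

A `deque(maxlen=k)` gives the bounded-eviction behaviour for free; the design point is that the decision-maker always sees recent history, which is exactly what mapless short-sighted baselines lack.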



When Engineering Outruns Intelligence: Rethinking Instruction-Guided Navigation

Aghaei, Matin, Zhang, Lingfeng, Alomrani, Mohammad Ali, Biparva, Mahdi, Zhang, Yingxue

arXiv.org Artificial Intelligence

Recent ObjectNav systems credit large language models (LLMs) for sizable zero-shot gains, yet it remains unclear how much comes from language versus geometry. We conduct a controlled study on HM3D and MP3D that revisits language-for-navigation through the lens of geometry-first exploration. Beyond ObjectNav, large foundation models are increasingly being employed in various other embodied tasks. ObjectNav asks an agent to reach any instance of a named object category (e.g., "Find a …"). At each time step, RGB-D and pose are fused into a 2D navigability map (free space vs. obstacles); islands will later serve as anchor sets for scoring or selection. InstructNav (Long et al., 2024) turns the instruction and … When the named goal object is observed, InstructNav's … LFG (Shah et al., 2023) is a complementary paradigm: instead of composing multiple value maps, it … LFG does not assume open-vocabulary detectors or a VLM "intuition" map; its only learned … SHF's prompt templates are included in Appendix B. All experiments run in Habitat (release 3) with default navigation mesh and physics (Puig et al.). Success is declared when the goal object is visible and the agent is within 0.25 m.
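Two concrete pieces of the setup above can be sketched: the per-step fusion of depth and pose into a 2D navigability grid, and the success criterion (goal visible and within 0.25 m). The projection from RGB-D to world-frame cells is elided here; the cell lists and constants are illustrative assumptions, except the 0.25 m radius, which is stated in the text.

```python
import math
import numpy as np

UNKNOWN, FREE, OBSTACLE = 0, 1, 2
SUCCESS_RADIUS_M = 0.25  # threshold stated in the evaluation setup

def fuse_step(grid, free_cells, obstacle_cells):
    """One step of map fusion: observed cells are rasterized into the 2D
    navigability grid; obstacle evidence overrides free space."""
    for r, c in free_cells:
        if grid[r, c] == UNKNOWN:
            grid[r, c] = FREE
    for r, c in obstacle_cells:
        grid[r, c] = OBSTACLE
    return grid

def is_success(agent_xy, goal_xy, goal_visible):
    """Success: the goal object is visible and the agent is within 0.25 m."""
    return goal_visible and math.dist(agent_xy, goal_xy) <= SUCCESS_RADIUS_M

grid = fuse_step(np.zeros((4, 4), dtype=int),
                 free_cells=[(1, 1), (1, 2)],
                 obstacle_cells=[(1, 2)])
```

Connected free regions ("islands") of such a grid are what the study uses as geometric anchor sets for scoring or selection.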


FiLM-Nav: Efficient and Generalizable Navigation via VLM Fine-tuning

Yokoyama, Naoki, Ha, Sehoon

arXiv.org Artificial Intelligence

Enabling robotic assistants to navigate complex environments and locate objects described in free-form language is a critical capability for real-world deployment. While foundation models, particularly Vision-Language Models (VLMs), offer powerful semantic understanding, effectively adapting their web-scale knowledge for embodied decision-making remains a key challenge. We present FiLM-Nav (Fine-tuned Language Model for Navigation), an approach that directly fine-tunes a pre-trained VLM as the navigation policy. In contrast to methods that use foundation models primarily in a zero-shot manner or for map annotation, FiLM-Nav learns to select the next best exploration frontier by conditioning directly on raw visual trajectory history and the navigation goal. Leveraging targeted simulated embodied experience allows the VLM to ground its powerful pre-trained representations in the specific dynamics and visual patterns relevant to goal-driven navigation. Critically, fine-tuning on a diverse data mixture combining ObjectNav, OVON, ImageNav, and an auxiliary spatial reasoning task proves essential for achieving robustness and broad generalization. FiLM-Nav sets a new state-of-the-art in both SPL and success rate on HM3D ObjectNav among open-vocabulary methods, and sets a state-of-the-art SPL on the challenging HM3D-OVON benchmark, demonstrating strong generalization to unseen object categories. Our work validates that directly fine-tuning VLMs on diverse simulated embodied data is a highly effective pathway towards generalizable and efficient semantic navigation capabilities.
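The training recipe above amounts to supervised fine-tuning on records of (goal, trajectory history, candidate frontiers, expert choice). The sketch below shows one plausible shape for such a record and its serialisation into a multiple-choice query; all field names and the prompt format are our assumptions, not FiLM-Nav's actual data schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NavSample:
    """Hypothetical fine-tuning record: the VLM is trained to pick the best
    exploration frontier from raw trajectory history and the goal, across
    ObjectNav-, OVON-, and ImageNav-style tasks."""
    goal: str                # e.g. an ObjectNav category or OVON phrase
    frame_ids: List[str]     # visual trajectory history, most recent last
    frontiers: List[str]     # candidate exploration frontiers
    expert_choice: int       # index of the frontier the supervision picks

def to_prompt(s: NavSample) -> str:
    # Serialise a sample into the multiple-choice query the policy answers.
    options = " ".join(f"({i}) {f}" for i, f in enumerate(s.frontiers))
    return (f"Goal: {s.goal}. History: {len(s.frame_ids)} frames. "
            f"Which frontier next? {options}")

sample = NavSample("find a sink", ["f0", "f1", "f2"],
                   ["north doorway", "west hallway"], 1)
```

Mixing such samples across tasks (plus an auxiliary spatial reasoning task) is what the paper credits for robustness and generalization.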



Open Scene Graphs for Open-World Object-Goal Navigation

Loo, Joel, Wu, Zhanxin, Hsu, David

arXiv.org Artificial Intelligence

How can we build general-purpose robot systems for open-world semantic navigation, e.g., searching a novel environment for a target object specified in natural language? To tackle this challenge, we introduce OSG Navigator, a modular system composed of foundation models, for open-world Object-Goal Navigation (ObjectNav). Foundation models provide enormous semantic knowledge about the world, but struggle to organise and maintain spatial information effectively at scale. Key to OSG Navigator is the Open Scene Graph representation, which acts as spatial memory for OSG Navigator. It organises spatial information hierarchically using OSG schemas, which are templates, each describing the common structure of a class of environments. OSG schemas can be automatically generated from simple semantic labels of a given environment, e.g., "home" or "supermarket". They enable OSG Navigator to adapt zero-shot to new environment types. We conducted experiments using both Fetch and Spot robots in simulation and in the real world, showing that OSG Navigator achieves state-of-the-art performance on ObjectNav benchmarks and generalises zero-shot over diverse goals, environments, and robot embodiments.
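The schema idea above, a reusable template describing the common hierarchy of an environment class, instantiated per environment, can be sketched as follows. The template contents and function names are our illustrative assumptions; in OSG Navigator, schemas are generated automatically from a semantic label such as "home" or "supermarket".

```python
def make_schema(env_label):
    """Hypothetical schema table: a template naming the hierarchy levels
    typical for a class of environments."""
    templates = {
        "home": ("floor", "room", "object"),
        "supermarket": ("section", "aisle", "object"),
    }
    return {"env": env_label, "hierarchy": templates[env_label]}

def insert_observation(graph, path, obj):
    """Record an observed object at a hierarchy path, e.g. ("floor1", "kitchen"),
    creating intermediate nodes on demand so the graph grows as spatial memory."""
    node = graph
    for key in path:
        node = node.setdefault(key, {})
    node.setdefault("objects", []).append(obj)
    return graph

schema = make_schema("home")
graph = {}
insert_observation(graph, ("floor1", "kitchen"), "sink")
```

Because only the schema changes between environment classes, the same navigator can adapt zero-shot to a new environment type by swapping in the schema generated for its label.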