ObjectNav



Neural Information Processing Systems

Contributions of ADVISOR: fast simulation enables extensive experimentation and a robustness study; ADVISOR can be applied in continuous, multi-agent environments; its performance is studied within a rich visual environment; it succeeds in diverse 3D environments; and the influence of the size of the imitation gap on performance is examined. Objective: cover black landmarks and avoid collisions. In particular, see Tab. 1 and Tab. 2 for our results; D-LH results are deferred to the Appendix.



ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Neural Information Processing Systems

We present a scalable approach for learning open-world object-goal navigation (ObjectNav) - the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot - i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink," "bathroom sink," etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% - 20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").
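The key mechanism above is that image goals and language goals are projected into one shared semantic embedding space, so a policy trained on image goals can be handed a phrase at test time. The sketch below illustrates that interface with toy stand-in encoders (the shared per-concept direction is our simplification; in ZSON the two towers of a CLIP-style multimodal model provide the alignment from raw pixels and text):

```python
import numpy as np

rng = np.random.default_rng(0)
_label_dirs = {}
DIM = 16

def _label_dir(label):
    # One shared unit direction per concept: a stand-in for the image/text
    # alignment that a CLIP-style multimodal encoder learns.
    if label not in _label_dirs:
        v = rng.standard_normal(DIM)
        _label_dirs[label] = v / np.linalg.norm(v)
    return _label_dirs[label]

def encode_image_goal(image_label):
    # Stand-in for the image tower: a photo of a concept lands near its direction.
    v = _label_dir(image_label) + 0.05 * rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_text_goal(text):
    # Stand-in for the text tower: the phrase maps onto the same direction.
    return _label_dir(text)

# The SemanticNav policy consumes only the goal embedding, so swapping a goal
# image (training) for a phrase like "sink" (evaluation) needs no retraining.
img_goal = encode_image_goal("sink")
txt_goal = encode_text_goal("sink")
```

Because both modalities land in the same space, the embedding handed to the policy is interchangeable, which is what makes the zero-shot transfer from ImageNav training to language-specified ObjectNav possible.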


PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory

Jin, Qunchao, Wu, Yilin, Chen, Changhao

arXiv.org Artificial Intelligence

Zero-shot object navigation (ZSON) in unseen environments remains a challenging problem for household robots, requiring strong perceptual understanding and decision-making capabilities. While recent methods leverage metric maps and Large Language Models (LLMs), they often depend on depth sensors or prebuilt maps, limiting the spatial reasoning ability of Multimodal Large Language Models (MLLMs). Mapless ZSON approaches have emerged to address this, but they typically make short-sighted decisions, leading to local deadlocks due to a lack of historical context. We propose PanoNav, a fully RGB-only, mapless ZSON framework that integrates a Panoramic Scene Parsing module to unlock the spatial parsing potential of MLLMs from panoramic RGB inputs, and a Memory-guided Decision-Making mechanism enhanced by a Dynamic Bounded Memory Queue to incorporate exploration history and avoid local deadlocks. Experiments on the public navigation benchmark show that PanoNav significantly outperforms representative baselines in both success rate (SR) and success weighted by path length (SPL).
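The "bounded memory queue that feeds history into the decision prompt" idea can be sketched in a few lines. This is a minimal illustration under our own assumptions (class and method names are ours, not PanoNav's): keep the K most recent scene summaries and serialise them into the MLLM's prompt so repeated locations can be avoided.

```python
from collections import deque

class BoundedMemoryQueue:
    """Keep the K most recent scene summaries; oldest entries are evicted
    automatically, so the prompt stays bounded as exploration grows."""

    def __init__(self, k=3):
        self.buf = deque(maxlen=k)

    def push(self, summary):
        self.buf.append(summary)

    def as_prompt(self):
        if not self.buf:
            return "No exploration history yet."
        lines = "\n".join(f"- {s}" for s in self.buf)
        return f"Recently visited (avoid re-entering):\n{lines}"

mem = BoundedMemoryQueue(k=3)
for scene in ["hallway with doors", "kitchen entrance",
              "living room", "hallway with doors"]:
    mem.push(scene)
```

A `deque(maxlen=k)` gives the bounded-eviction behaviour for free; the design point is that the decision-maker always sees recent history, which is exactly what mapless short-sighted baselines lack.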



When Engineering Outruns Intelligence: Rethinking Instruction-Guided Navigation

Aghaei, Matin, Zhang, Lingfeng, Alomrani, Mohammad Ali, Biparva, Mahdi, Zhang, Yingxue

arXiv.org Artificial Intelligence

Recent ObjectNav systems credit large language models (LLMs) for sizable zero-shot gains, yet it remains unclear how much comes from language versus geometry. We conduct a controlled study on HM3D and MP3D that revisits language-for-navigation through the lens of geometry-first exploration. Beyond ObjectNav, large foundation models are increasingly being employed in various other embodied tasks. ObjectNav asks an agent to reach any instance of a named object category (e.g., "Find a …"). At each time step, RGB-D and pose are fused into a 2D navigability map (free space vs. obstacles); islands will later serve as anchor sets for scoring or selection. InstructNav (Long et al., 2024) turns the instruction and … When the named goal object is observed, InstructNav's … LFG (Shah et al., 2023) is a complementary paradigm: instead of composing multiple value maps, it … LFG does not assume open-vocabulary detectors or a VLM "intuition" map; its only learned … SHF's prompt templates are included in Appendix B. All experiments run in Habitat (release 3) with default navigation mesh and physics (Puig et al.). Success is declared when the goal object is visible and the agent is within 0.25 m.
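Two concrete pieces of the setup above can be sketched: the per-step fusion of depth and pose into a 2D navigability grid, and the success criterion (goal visible and within 0.25 m). The projection from RGB-D to world-frame cells is elided here; the cell lists and constants are illustrative assumptions, except the 0.25 m radius, which is stated in the text.

```python
import math
import numpy as np

UNKNOWN, FREE, OBSTACLE = 0, 1, 2
SUCCESS_RADIUS_M = 0.25  # threshold stated in the evaluation setup

def fuse_step(grid, free_cells, obstacle_cells):
    """One step of map fusion: observed cells are rasterized into the 2D
    navigability grid; obstacle evidence overrides free space."""
    for r, c in free_cells:
        if grid[r, c] == UNKNOWN:
            grid[r, c] = FREE
    for r, c in obstacle_cells:
        grid[r, c] = OBSTACLE
    return grid

def is_success(agent_xy, goal_xy, goal_visible):
    """Success: the goal object is visible and the agent is within 0.25 m."""
    return goal_visible and math.dist(agent_xy, goal_xy) <= SUCCESS_RADIUS_M

grid = fuse_step(np.zeros((4, 4), dtype=int),
                 free_cells=[(1, 1), (1, 2)],
                 obstacle_cells=[(1, 2)])
```

Connected free regions ("islands") of such a grid are what the study uses as geometric anchor sets for scoring or selection.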


FiLM-Nav: Efficient and Generalizable Navigation via VLM Fine-tuning

Yokoyama, Naoki, Ha, Sehoon

arXiv.org Artificial Intelligence

Enabling robotic assistants to navigate complex environments and locate objects described in free-form language is a critical capability for real-world deployment. While foundation models, particularly Vision-Language Models (VLMs), offer powerful semantic understanding, effectively adapting their web-scale knowledge for embodied decision-making remains a key challenge. We present FiLM-Nav (Fine-tuned Language Model for Navigation), an approach that directly fine-tunes a pre-trained VLM as the navigation policy. In contrast to methods that use foundation models primarily in a zero-shot manner or for map annotation, FiLM-Nav learns to select the next best exploration frontier by conditioning directly on raw visual trajectory history and the navigation goal. Leveraging targeted simulated embodied experience allows the VLM to ground its powerful pre-trained representations in the specific dynamics and visual patterns relevant to goal-driven navigation. Critically, fine-tuning on a diverse data mixture combining ObjectNav, OVON, ImageNav, and an auxiliary spatial reasoning task proves essential for achieving robustness and broad generalization. FiLM-Nav sets a new state-of-the-art in both SPL and success rate on HM3D ObjectNav among open-vocabulary methods, and sets a state-of-the-art SPL on the challenging HM3D-OVON benchmark, demonstrating strong generalization to unseen object categories. Our work validates that directly fine-tuning VLMs on diverse simulated embodied data is a highly effective pathway towards generalizable and efficient semantic navigation capabilities.
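The training recipe above amounts to supervised fine-tuning on records of (goal, trajectory history, candidate frontiers, expert choice). The sketch below shows one plausible shape for such a record and its serialisation into a multiple-choice query; all field names and the prompt format are our assumptions, not FiLM-Nav's actual data schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NavSample:
    """Hypothetical fine-tuning record: the VLM is trained to pick the best
    exploration frontier from raw trajectory history and the goal, across
    ObjectNav-, OVON-, and ImageNav-style tasks."""
    goal: str                # e.g. an ObjectNav category or OVON phrase
    frame_ids: List[str]     # visual trajectory history, most recent last
    frontiers: List[str]     # candidate exploration frontiers
    expert_choice: int       # index of the frontier the supervision picks

def to_prompt(s: NavSample) -> str:
    # Serialise a sample into the multiple-choice query the policy answers.
    options = " ".join(f"({i}) {f}" for i, f in enumerate(s.frontiers))
    return (f"Goal: {s.goal}. History: {len(s.frame_ids)} frames. "
            f"Which frontier next? {options}")

sample = NavSample("find a sink", ["f0", "f1", "f2"],
                   ["north doorway", "west hallway"], 1)
```

Mixing such samples across tasks (plus an auxiliary spatial reasoning task) is what the paper credits for robustness and generalization.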



Open Scene Graphs for Open-World Object-Goal Navigation

Loo, Joel, Wu, Zhanxin, Hsu, David

arXiv.org Artificial Intelligence

How can we build general-purpose robot systems for open-world semantic navigation, e.g., searching a novel environment for a target object specified in natural language? To tackle this challenge, we introduce OSG Navigator, a modular system composed of foundation models, for open-world Object-Goal Navigation (ObjectNav). Foundation models provide enormous semantic knowledge about the world, but struggle to organise and maintain spatial information effectively at scale. Key to OSG Navigator is the Open Scene Graph representation, which acts as spatial memory for OSG Navigator. It organises spatial information hierarchically using OSG schemas, which are templates, each describing the common structure of a class of environments. OSG schemas can be automatically generated from simple semantic labels of a given environment, e.g., "home" or "supermarket". They enable OSG Navigator to adapt zero-shot to new environment types. We conducted experiments using both Fetch and Spot robots in simulation and in the real world, showing that OSG Navigator achieves state-of-the-art performance on ObjectNav benchmarks and generalises zero-shot over diverse goals, environments, and robot embodiments.
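The schema idea above, a reusable template describing the common hierarchy of an environment class, instantiated per environment, can be sketched as follows. The template contents and function names are our illustrative assumptions; in OSG Navigator, schemas are generated automatically from a semantic label such as "home" or "supermarket".

```python
def make_schema(env_label):
    """Hypothetical schema table: a template naming the hierarchy levels
    typical for a class of environments."""
    templates = {
        "home": ("floor", "room", "object"),
        "supermarket": ("section", "aisle", "object"),
    }
    return {"env": env_label, "hierarchy": templates[env_label]}

def insert_observation(graph, path, obj):
    """Record an observed object at a hierarchy path, e.g. ("floor1", "kitchen"),
    creating intermediate nodes on demand so the graph grows as spatial memory."""
    node = graph
    for key in path:
        node = node.setdefault(key, {})
    node.setdefault("objects", []).append(obj)
    return graph

schema = make_schema("home")
graph = {}
insert_observation(graph, ("floor1", "kitchen"), "sink")
```

Because only the schema changes between environment classes, the same navigator can adapt zero-shot to a new environment type by swapping in the schema generated for its label.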