FOM-Nav: Frontier-Object Maps for Object Goal Navigation

Chabal, Thomas; Chen, Shizhe; Ponce, Jean; Schmid, Cordelia

Dec-2-2025–arXiv.org Artificial Intelligence

Abstract: This paper addresses the Object Goal Navigation problem, where a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and high-level goal prediction, which is executed by a low-level planner for efficient trajectory generation. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly on the navigation efficiency metric SPL, and yields promising results on a real robot.

Autonomous navigation has been a long-standing challenge in robotics [1], dating back to the pioneering work on the robot Shakey [2] in the 1960s. While early work focused on navigating to specific points [3], [4] with a preconstructed map [5], [6], recent research has progressively shifted towards navigation in unknown environments using textual [7], [8] or visual [9] goals, which is an essential capability for enabling mobile manipulation systems [10], [11] to perform diverse real-world tasks. In this work, we focus on the object goal navigation task (ObjectNav) [8], where an agent must navigate to a target object category in an unknown environment using RGB-D observations. This task requires long-horizon multimodal scene understanding and efficient exploration. The robot should not only recognize objects within its current field of view but also use previous observations to develop a more accurate scene understanding.
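The abstract reports results on SPL (Success weighted by Path Length), the standard navigation efficiency metric for ObjectNav. As a reading aid, here is a minimal sketch of how SPL is commonly computed: each successful episode contributes the ratio of the shortest-path length to the length of the path the agent actually took, failed episodes contribute zero, and the result is averaged over all episodes. The function name `spl` and its argument names are illustrative assumptions, not taken from the paper or from any benchmark codebase.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    successes[i]        -- whether episode i reached the goal
    shortest_lengths[i] -- shortest-path length from start to goal (must be > 0)
    path_lengths[i]     -- length of the path the agent actually travelled
    All names are hypothetical and chosen for this illustration only.
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if not (n == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if success:
            # The ratio is capped at 1 because the denominator is never
            # smaller than the shortest-path length.
            total += shortest / max(taken, shortest)
    return total / n


# Three episodes: a detour (ratio 0.5), an optimal path (ratio 1.0), a failure (0).
example = spl([True, True, False], [5.0, 4.0, 6.0], [10.0, 4.0, 20.0])
print(example)
```

Under this definition an agent that always succeeds but wanders scores well below 1, which is why SPL is read as an efficiency measure rather than a pure success rate.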

large language model, machine learning, natural language

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Robots (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
