WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, Long Chen

arXiv.org Artificial Intelligence 

Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent progress in Vision-Language Model (VLM)-based agents has demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world-model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by Vision-Language Models (VLMs). It predicts the possible outcomes of decisions and builds memories that provide feedback to the policy module. To retain the predicted state of the environment, WMNav maintains an online Curiosity Value Map as part of the world model's memory, which provides dynamic configuration for the navigation policy. By decomposing the task according to a human-like thinking process, WMNav effectively alleviates the impact of model hallucination, making decisions based on the difference between the world-model plan and the observed feedback. To further boost efficiency, we implement a two-stage action-proposer strategy: broad exploration followed by precise localization.

INTRODUCTION

Effective navigation is a fundamental capability for domestic robots, allowing them to reach specific locations and execute assigned operations [1]. Zero-Shot Object Navigation (ZSON) is a critical component of this functionality: it demands that an agent locate and approach a target object of an unseen category through environmental understanding.
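As a concrete illustration of the loop the abstract describes, the following minimal Python sketch shows one way the pieces could fit together: a world-model query scores candidate subgoals, an online Curiosity Value Map retains that predicted state and is corrected by observation feedback, and a two-stage proposer switches from broad exploration to precise localization. Every name, threshold, and update rule here is a hypothetical stand-in; the paper's actual modules are prompt-driven VLM components, not this toy grid.

import random

GRID = 8               # side length of a toy occupancy grid (assumption)
EXPLORE, LOCALIZE = "explore", "localize"
CONF_THRESHOLD = 0.8   # confidence at which we switch to precise localization


def vlm_predict(key, goal):
    """Stand-in for a VLM world-model query: returns a predicted
    probability that `goal` is reachable/visible from `key`. A real
    system would prompt a vision-language model with the observation."""
    random.seed(hash((key, goal)) % (2**32))
    return random.random()


class CuriosityValueMap:
    """Online-maintained grid of curiosity values; visited cells decay so
    the policy is steered toward unexplored but promising regions."""

    def __init__(self):
        self.values = {(x, y): 1.0 for x in range(GRID) for y in range(GRID)}

    def update(self, cell, predicted, observed):
        # Penalize cells where the world-model prediction disagreed with
        # observation (a crude hallucination check), then decay the cell
        # so the policy does not revisit it immediately.
        self.values[cell] *= 0.2 + 0.8 * (1.0 - abs(predicted - observed))
        self.values[cell] *= 0.5

    def best_cell(self):
        return max(self.values, key=self.values.get)


def navigate(goal="chair", max_steps=20):
    cmap = CuriosityValueMap()
    stage = EXPLORE
    for step in range(max_steps):
        cell = cmap.best_cell()                 # policy reads the map
        predicted = vlm_predict(cell, goal)     # world-model rollout
        observed = random.random()              # toy stand-in for the real observation
        cmap.update(cell, predicted, observed)  # feedback into memory
        if stage == EXPLORE and observed > CONF_THRESHOLD:
            stage = LOCALIZE                    # switch to precise localization
        print(f"step {step:2d} stage={stage:8s} cell={cell} conf={observed:.2f}")
        if stage == LOCALIZE and observed > 0.95:
            return cell                         # target considered found
    return None


if __name__ == "__main__":
    navigate()

The design point the sketch tries to capture is the feedback path: the map is updated from the gap between the world model's prediction and the actual observation, which is how the abstract frames WMNav's mitigation of model hallucination.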