WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, Long Chen

arXiv.org Artificial Intelligence 

Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent progress in Vision-Language Model (VLM)-based agents has demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world-model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by Vision-Language Models (VLMs). It predicts the possible outcomes of decisions and builds memories that provide feedback to the policy module. To retain the predicted state of the environment, WMNav maintains an online Curiosity Value Map as part of the world model's memory, which provides dynamic configuration for the navigation policy. By decomposing the task according to a human-like thinking process, WMNav effectively alleviates the impact of model hallucination, making decisions based on the difference between the world-model plan and the observed feedback. To further boost efficiency, we implement a two-stage action-proposer strategy: broad exploration followed by precise localization.

INTRODUCTION

Effective navigation is a fundamental capability for domestic robots, allowing them to reach specific locations and execute assigned operations [1]. Zero-Shot Object Navigation (ZSON) is a critical component of this functionality: it demands that an agent locate and approach a target object of an unseen category through environmental understanding.
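As a concrete illustration of the loop the abstract describes, the following minimal Python sketch shows one way the pieces could fit together: a world-model query scores candidate subgoals, an online Curiosity Value Map retains that predicted state and is corrected by observation feedback, and a two-stage proposer switches from broad exploration to precise localization. Every name, threshold, and update rule here is a hypothetical stand-in; the paper's actual modules are prompt-driven VLM components, not this toy grid.

import random

GRID = 8               # side length of a toy occupancy grid (assumption)
EXPLORE, LOCALIZE = "explore", "localize"
CONF_THRESHOLD = 0.8   # confidence at which we switch to precise localization


def vlm_predict(key, goal):
    """Stand-in for a VLM world-model query: returns a predicted
    probability that `goal` is reachable/visible from `key`. A real
    system would prompt a vision-language model with the observation."""
    random.seed(hash((key, goal)) % (2**32))
    return random.random()


class CuriosityValueMap:
    """Online-maintained grid of curiosity values; visited cells decay so
    the policy is steered toward unexplored but promising regions."""

    def __init__(self):
        self.values = {(x, y): 1.0 for x in range(GRID) for y in range(GRID)}

    def update(self, cell, predicted, observed):
        # Penalize cells where the world-model prediction disagreed with
        # observation (a crude hallucination check), then decay the cell
        # so the policy does not revisit it immediately.
        self.values[cell] *= 0.2 + 0.8 * (1.0 - abs(predicted - observed))
        self.values[cell] *= 0.5

    def best_cell(self):
        return max(self.values, key=self.values.get)


def navigate(goal="chair", max_steps=20):
    cmap = CuriosityValueMap()
    stage = EXPLORE
    for step in range(max_steps):
        cell = cmap.best_cell()                 # policy reads the map
        predicted = vlm_predict(cell, goal)     # world-model rollout
        observed = random.random()              # toy stand-in for the real observation
        cmap.update(cell, predicted, observed)  # feedback into memory
        if stage == EXPLORE and observed > CONF_THRESHOLD:
            stage = LOCALIZE                    # switch to precise localization
        print(f"step {step:2d} stage={stage:8s} cell={cell} conf={observed:.2f}")
        if stage == LOCALIZE and observed > 0.95:
            return cell                         # target considered found
    return None


if __name__ == "__main__":
    navigate()

The design point the sketch tries to capture is the feedback path: the map is updated from the gap between the world model's prediction and the actual observation, which is how the abstract frames WMNav's mitigation of model hallucination.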