When Engineering Outruns Intelligence: Rethinking Instruction-Guided Navigation

Matin Aghaei, Lingfeng Zhang, Mohammad Ali Alomrani, Mahdi Biparva, Yingxue Zhang

arXiv.org Artificial Intelligence 

Recent ObjectNav systems credit large language models (LLMs) for sizable zero-shot gains, yet it remains unclear how much of the improvement comes from language versus geometry. We conduct a controlled study on HM3D and MP3D that revisits language-for-navigation through the lens of geometry-first exploration. Beyond ObjectNav, large foundation models are increasingly employed in various other embodied tasks.

ObjectNav asks an agent to reach any instance of a named object category (e.g., "Find a chair"). At each time step, RGB-D observations and pose are fused into a 2D navigability map that separates free space from obstacles; connected free-space regions (islands) later serve as anchor sets for scoring or selection.

InstructNav (Long et al., 2024) turns the instruction and current observations into a chain of sub-goals and composes multiple value maps to select the next waypoint; when the named goal object is observed, the agent navigates directly to it. LFG (Shah et al., 2023) is a complementary paradigm: instead of composing multiple value maps, it polls an LLM to score frontier subgoals directly. LFG does not assume open-vocabulary detectors or a VLM "intuition" map; its only learned component is the LLM used to score frontiers. SHF's prompt templates are included in Appendix B.

All experiments run in Habitat (release 3) with default navigation mesh and physics (Puig et al., 2023). Success is declared when the goal object is visible and the agent is within 0.25 m of it.
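The RGB-D-to-map fusion step can be sketched as a depth back-projection into the world frame followed by height thresholding. This is a minimal illustration, not the paper's implementation: the intrinsics `K`, the `h_free`/`h_obst` height thresholds, the map origin/resolution, and the assumption of a y-up world frame with the x-z ground plane are all choices made here for the sketch.

```python
import numpy as np

def update_navigability_map(depth, K, pose, grid, resolution=0.05,
                            origin=(-10.0, -10.0), h_free=0.1, h_obst=1.5):
    """Fuse one RGB-D frame into a top-down navigability grid.

    depth: (H, W) metres; K: 3x3 pinhole intrinsics; pose: 4x4 camera-to-world.
    grid: (N, N) int8 map with 0 = unknown, 1 = free, 2 = obstacle
    (an assumed cell convention, not the paper's).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    valid = z > 0
    # Back-project pixels to camera-frame 3D points.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x[valid], y[valid], z[valid], np.ones(valid.sum())])
    pts_w = pose @ pts_cam  # (4, M) world-frame homogeneous points
    # Project onto the assumed x-z ground plane and discretise.
    gx = ((pts_w[0] - origin[0]) / resolution).astype(int)
    gy = ((pts_w[2] - origin[1]) / resolution).astype(int)
    inb = (gx >= 0) & (gx < grid.shape[0]) & (gy >= 0) & (gy < grid.shape[1])
    height = pts_w[1]
    free = inb & (height < h_free)
    obst = inb & (height >= h_free) & (height < h_obst)
    # Free-space marks never downgrade a cell; obstacle marks always win.
    grid[gx[free], gy[free]] = np.maximum(grid[gx[free], gy[free]], 1)
    grid[gx[obst], gy[obst]] = 2
    return grid
```

In a real pipeline this would be called once per step with the simulator's intrinsics and pose; here the heights of the back-projected points alone decide free vs. obstacle, which is one common heuristic among several.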
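The grouping of free space into islands that later act as anchor sets can be illustrated with a plain connected-components pass over the grid. The `min_cells` filter and the centroid-as-anchor choice are assumptions of this sketch, not details from the paper.

```python
from collections import deque
import numpy as np

def extract_islands(grid, free_val=1, min_cells=5):
    """Group free cells of the navigability map into 4-connected
    components ('islands'); each island's centroid can serve as an
    anchor point for downstream scoring or selection."""
    H, W = grid.shape
    seen = np.zeros((H, W), dtype=bool)
    islands = []
    for sy in range(H):
        for sx in range(W):
            if grid[sy, sx] != free_val or seen[sy, sx]:
                continue
            # Flood-fill one component with BFS.
            comp, queue = [], deque([(sy, sx)])
            seen[sy, sx] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < H and 0 <= nx < W
                            and grid[ny, nx] == free_val and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(comp) >= min_cells:  # drop tiny speckle components
                ys, xs = zip(*comp)
                islands.append({"cells": comp,
                                "centroid": (sum(ys) / len(ys), sum(xs) / len(xs))})
    return islands
```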
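The LFG-style frontier polling described above can be sketched as repeated LLM queries whose votes become per-frontier scores, combined with a distance cost at selection time. The `query_llm` callable, the prompt wording, and the `alpha` trade-off are all hypothetical stand-ins; the real system prompts a chat model and parses structured answers.

```python
from collections import Counter

def poll_llm_frontier_scores(frontier_descriptions, goal, query_llm, n_samples=8):
    """LFG-style polling: ask the LLM n_samples times which frontier most
    likely leads to the goal, and turn vote counts into scores in [0, 1].
    query_llm(prompt) -> int is a hypothetical callable returning an index."""
    votes = Counter()
    for _ in range(n_samples):
        prompt = (f"Goal object: {goal}. Candidate frontier areas: "
                  + "; ".join(f"{i}: {d}" for i, d in enumerate(frontier_descriptions))
                  + ". Which area index is most likely to contain the goal?")
        votes[query_llm(prompt)] += 1
    return {i: votes[i] / n_samples for i in range(len(frontier_descriptions))}

def select_frontier(scores, distances, alpha=0.5):
    """Pick the frontier maximising semantic score minus a distance penalty,
    a simple way to keep geometry in the loop alongside the LLM."""
    return max(scores, key=lambda i: scores[i] - alpha * distances[i])
```

Sampling the LLM several times and aggregating votes is what makes the score robust to any single noisy completion; a deterministic single query would collapse to a hard argmax.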