ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Jan-18-2025, 23:19:04 GMT–Neural Information Processing Systems

We present a scalable approach for learning open-world object-goal navigation (ObjectNav) – the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot – i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink," "bathroom sink," etc.) by projecting language goals into the same multimodal, semantic embedding space.
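The goal-swapping idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary `CONCEPTS`, the functions `encode_image`, `encode_text`, and `act`, and the 0.9 threshold are all hypothetical stand-ins. The "image" here is just a list of object labels rather than pixels, and a real system would use learned encoders and a trained policy. The sketch only shows why a policy that consumes goal embeddings, rather than raw goals, can accept an image goal during training and a language goal afterwards.

```python
import math
from typing import List, Sequence

# Hypothetical toy vocabulary standing in for a learned semantic embedding space.
CONCEPTS = ["sink", "bed", "chair"]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def encode_image(visible_objects: Sequence[str]) -> List[float]:
    """Stand-in image encoder: the 'image' is a list of object labels, not pixels."""
    return [float(list(visible_objects).count(c)) for c in CONCEPTS]


def encode_text(instruction: str) -> List[float]:
    """Stand-in text encoder: maps an instruction into the same concept space."""
    words = instruction.lower().split()
    return [float(words.count(c)) for c in CONCEPTS]


def act(observation: Sequence[float], goal: Sequence[float],
        threshold: float = 0.9) -> str:
    """Toy policy: it sees only embeddings, so the goal's modality does not matter."""
    return "STOP" if cosine(observation, goal) >= threshold else "MOVE_FORWARD"


# Training-time style goal (from an image) and test-time style goal (from text)
# are interchangeable because both live in the same vector space.
view = encode_image(["sink"])
action_with_image_goal = act(view, encode_image(["sink"]))
action_with_text_goal = act(view, encode_text("find a bathroom sink"))
```

Because `act` never inspects where the goal vector came from, nothing in the policy needs to change when the goal encoder is switched from images to text; that is the property the abstract relies on for zero-shot transfer.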

agent, multimodal goal embedding, zero-shot object-goal navigation

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.67)
  - Robots (0.61)