Find Everything: A General Vision Language Model Approach to Multi-Object Search

Choi, Daniel, Fung, Angus, Wang, Haitong, Tan, Aaron Hao

arXiv.org Artificial Intelligence 

In various real-world robot applications, multi-object search (MOS) describes the problem of efficiently locating multiple objects [1], in domains such as warehouse management [2, 3], construction inspection [4], hospitality [5, 6, 7], and retail assistance [8, 9]. Existing MOS methods can be categorized into: 1) probabilistic planning (PP) [1, 10, 11, 12], and 2) deep reinforcement learning (DRL) methods [13, 14, 15, 16, 17, 18, 19, 20]. PP methods utilize Partially Observable Markov Decision Processes (POMDPs) to estimate belief states and plan actions under uncertainty in object locations, while DRL methods optimize action selection using a reward function [21]. However, both approaches face challenges such as inefficient exploration due to limited semantic modeling between objects and scenes [18], and poor generalization caused by the sim-to-real gap [19].

Recently, Large Foundation Models (LFMs) such as vision-language models (VLMs) and large language models (LLMs) have been applied to single object search (SOS) tasks by using either: 1) VLMs (e.g., CLIP, BLIP) to generate scene-level embeddings that capture the semantic correlations between the robot's environment and the target object, guiding the robot towards regions with a high likelihood of containing the target [19, 22, 23, 24, 25]; or 2) VLMs/LLMs to generate scene captions that describe both the spatial layout and semantic details of the robot's environment, which are then used to plan the robot's actions [26, 27, 28, 29, 30, 31, 32]. However, these SOS methods have two limitations: 1) they cannot be directly applied to MOS, as they lack explicit mechanisms to track and reason about multiple objects simultaneously; and 2) scene-level embeddings are often noisy and coarse [33], and therefore ineffective in object-dense environments, where fine-grained, object-level embeddings are needed.
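To make the embedding-guided search idea concrete, the following is a minimal, self-contained sketch (not the paper's actual method) of ranking candidate regions by the cosine similarity between a target object's embedding and object-level embeddings observed in each region. In practice the embeddings would come from a VLM such as CLIP; here they are hand-made toy vectors, and the function names (`cosine`, `rank_regions`) and region labels are illustrative assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_regions(target_emb, regions):
    """Score each region by its best-matching object-level embedding,
    then return regions sorted from most to least promising."""
    scores = {
        name: max(cosine(target_emb, obj) for obj in objs)
        for name, objs in regions.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: the "kitchen" contains an object embedding close to the target,
# so a search policy would visit it first.
target = [1.0, 0.0, 0.2]
regions = {
    "kitchen": [[0.9, 0.1, 0.2], [0.0, 1.0, 0.0]],
    "hallway": [[0.1, 0.9, 0.3]],
}
ranking = rank_regions(target, regions)
print(ranking[0][0])  # kitchen
```

Using per-object embeddings (the inner `max`) rather than a single pooled scene embedding is what distinguishes the fine-grained, object-level matching that the passage argues is needed in object-dense environments.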
In this paper, we introduce Finder, the first MOS approach that leverages VLMs to locate multiple target objects in various unknown environments.