
Collaborating Authors

 Choi, Daniel


Find Everything: A General Vision Language Model Approach to Multi-Object Search

arXiv.org Artificial Intelligence

In various real-world robot applications, multi-object search (MOS) describes the problem of locating multiple objects efficiently [1], in domains such as warehouse management [2, 3], construction inspection [4], hospitality [5, 6, 7], and retail assistance [8, 9]. Existing MOS methods can be categorized into: 1) probabilistic planning (PP) [1, 10, 11, 12], and 2) deep reinforcement learning (DRL) methods [13, 14, 15, 16, 17, 18, 19, 20]. PP methods utilize Partially Observable Markov Decision Processes (POMDPs) to estimate belief states and plan actions under uncertainty in object locations, while DRL methods optimize action selection using a reward function [21]. However, both approaches face challenges such as inefficient exploration due to limited semantic modeling between objects and scenes [18], and poor generalization caused by the sim-to-real gap [19]. Recently, Large Foundation Models (LFMs) such as vision-language models (VLMs) and large language models (LLMs) have been applied to single object search (SOS) tasks by using either: 1) VLMs (e.g., CLIP, BLIP, etc.) to generate scene-level embeddings that capture the semantic correlations between the robot's environment and the target object, guiding the robot towards regions with high target-object likelihood [19, 22, 23, 24, 25]; or, 2) VLMs/LLMs to generate scene captions that describe both the spatial layout and semantic details of the robot's environment, which are then used to plan the robot's actions [26, 27, 28, 29, 30, 31, 32]. However, these SOS methods have limitations: 1) they cannot be directly applied to MOS, as they lack explicit mechanisms to track and reason about multiple objects simultaneously, and 2) scene-level embeddings are often noisy and coarse [33], and cannot be effectively applied in object-dense environments; in such cases, fine-grained, object-level embeddings are needed. In this paper, we introduce Finder, the first MOS approach that leverages VLMs to locate multiple target objects in various unknown environments.
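As a rough illustration of the VLM-guided search idea the abstract describes, the sketch below scores a handful of candidate view regions against several target-object prompts using an off-the-shelf CLIP model from Hugging Face Transformers. This is not the authors' Finder pipeline; the checkpoint name, the target prompts, and the region crop files are placeholder assumptions, and the region-selection heuristic is only meant to show how image-text similarity could steer exploration toward multiple targets at once.

```python
# Minimal sketch (assumptions throughout, not the Finder implementation):
# rank candidate view regions by their CLIP similarity to multiple target
# objects, the basic mechanism behind VLM-guided multi-object search.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

target_objects = ["a mug", "a laptop", "a first-aid kit"]          # multiple search targets
region_paths = ["region_0.png", "region_1.png", "region_2.png"]    # hypothetical view crops
region_crops = [Image.open(p) for p in region_paths]

inputs = processor(text=target_objects, images=region_crops,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_regions, num_targets): image-text similarity.
scores = outputs.logits_per_image.softmax(dim=-1)

# Simple exploration heuristic: go to the region whose crop is most similar
# to any still-unfound target object.
best_region = scores.max(dim=-1).values.argmax().item()
print(f"Explore {region_paths[best_region]} next; per-target scores: {scores[best_region].tolist()}")
```

In an object-dense scene, the same scoring could be applied per detected object crop rather than per whole view, which is closer in spirit to the fine-grained, object-level embeddings the abstract argues for.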


OLiVia-Nav: An Online Lifelong Vision Language Approach for Mobile Robot Social Navigation

arXiv.org Artificial Intelligence

Service robots in human-centered environments such as hospitals, office buildings, and long-term care homes need to navigate while adhering to social norms to ensure the safety and comfort of the people they share the space with. Furthermore, they need to adapt to new social scenarios that can arise during robot navigation. In this paper, we present a novel Online Lifelong Vision Language architecture, OLiVia-Nav, which uniquely integrates vision-language models (VLMs) with an online lifelong learning framework for robot social navigation. We introduce a unique distillation approach, Social Context Contrastive Language Image Pre-training (SC-CLIP), to transfer the social reasoning capabilities of large VLMs to a lightweight VLM, in order for OLiVia-Nav to directly encode social and environment context during robot navigation. These encoded embeddings are used to generate and select socially compliant robot trajectories. The lifelong learning capabilities of SC-CLIP enable OLiVia-Nav to update the lightweight VLM with robot trajectory predictions over time as new social scenarios are encountered. We conducted extensive real-world experiments in diverse social navigation scenarios. The results showed that OLiVia-Nav outperformed existing state-of-the-art DRL and VLM methods in terms of mean squared error, Hausdorff loss, and personal space violation duration. Ablation studies also verified the design choices for OLiVia-Nav.
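The abstract does not detail SC-CLIP, but the general mechanism of transferring a large VLM's embeddings to a lightweight encoder can be sketched with a CLIP-style contrastive (InfoNCE) objective. The code below is an assumption-laden illustration rather than the paper's implementation: the embedding dimension, batch, and temperature are placeholders, and random tensors stand in for the frozen teacher VLM and the trainable lightweight student.

```python
# Minimal sketch (assumptions, not the paper's SC-CLIP code): distill a large
# "teacher" VLM's image embeddings into a lightweight "student" encoder with a
# symmetric CLIP-style contrastive (InfoNCE) loss.
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb, teacher_emb, temperature=0.07):
    """Symmetric InfoNCE between student and teacher embeddings.

    student_emb, teacher_emb: (batch, dim) tensors for the same image batch.
    Matching rows are positives; all other pairs in the batch are negatives.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature                    # (batch, batch) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # Average the student->teacher and teacher->student directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings standing in for encoder outputs.
teacher_emb = torch.randn(8, 512)                     # frozen large-VLM features
student_emb = torch.randn(8, 512, requires_grad=True) # lightweight encoder output
loss = contrastive_distillation_loss(student_emb, teacher_emb)
loss.backward()
print(f"distillation loss: {loss.item():.3f}")
```

In an online lifelong setting such as the one described here, a loss of this kind would be re-applied to small batches collected as new social scenarios are encountered, so the lightweight model keeps tracking the teacher's social-context encoding during deployment.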