
Collaborating Authors

 Choi, Daniel


Find Everything: A General Vision Language Model Approach to Multi-Object Search

arXiv.org Artificial Intelligence

In various real-world robot applications, multi-object search (MOS) describes the problem of locating multiple objects efficiently [1], in domains such as warehouse management [2, 3], construction inspection [4], hospitality [5, 6, 7], and retail assistance [8, 9]. Existing MOS methods can be categorized into: 1) probabilistic planning (PP) [1, 10, 11, 12], and 2) deep reinforcement learning (DRL) methods [13, 14, 15, 16, 17, 18, 19, 20]. PP methods utilize Partially Observable Markov Decision Processes (POMDPs) to estimate belief states and plan actions under uncertainty in object locations, while DRL methods optimize action selection using a reward function [21]. However, both approaches face challenges such as inefficient exploration due to limited semantic modeling between objects and scenes [18], and poor generalization caused by the sim-to-real gap [19]. Recently, Large Foundation Models (LFMs) such as vision-language models (VLMs) and large language models (LLMs) have been applied to single object search (SOS) tasks by using either: 1) VLMs (e.g., CLIP, BLIP, etc.) to generate scene-level embeddings that capture the semantic correlations between the robot's environment and the target object, guiding the robot towards regions with high target-object likelihood [19, 22, 23, 24, 25]; or, 2) VLMs/LLMs to generate scene captions that describe both the spatial layout and semantic details of the robot's environment, which are then used to plan the robot's actions [26, 27, 28, 29, 30, 31, 32]. However, these SOS methods have limitations: 1) they cannot be directly applied to MOS, as they lack explicit mechanisms to track and reason about multiple objects simultaneously, and 2) scene-level embeddings are often noisy and coarse [33], and cannot be effectively applied in object-dense environments; in such cases, fine-grained, object-level embeddings are needed. In this paper, we introduce Finder, the first MOS approach that leverages VLMs to locate multiple target objects in various unknown environments.
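As a rough illustration of the VLM-guided search idea the abstract describes, the sketch below scores a handful of candidate view regions against several target-object prompts using an off-the-shelf CLIP model from Hugging Face Transformers. This is not the authors' Finder pipeline; the checkpoint name, the target prompts, and the region crop files are placeholder assumptions, and the region-selection heuristic is only meant to show how image-text similarity could steer exploration toward multiple targets at once.

```python
# Minimal sketch (assumptions throughout, not the Finder implementation):
# rank candidate view regions by their CLIP similarity to multiple target
# objects, the basic mechanism behind VLM-guided multi-object search.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

target_objects = ["a mug", "a laptop", "a first-aid kit"]          # multiple search targets
region_paths = ["region_0.png", "region_1.png", "region_2.png"]    # hypothetical view crops
region_crops = [Image.open(p) for p in region_paths]

inputs = processor(text=target_objects, images=region_crops,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_regions, num_targets): image-text similarity.
scores = outputs.logits_per_image.softmax(dim=-1)

# Simple exploration heuristic: go to the region whose crop is most similar
# to any still-unfound target object.
best_region = scores.max(dim=-1).values.argmax().item()
print(f"Explore {region_paths[best_region]} next; per-target scores: {scores[best_region].tolist()}")
```

In an object-dense scene, the same scoring could be applied per detected object crop rather than per whole view, which is closer in spirit to the fine-grained, object-level embeddings the abstract argues for.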


OLiVia-Nav: An Online Lifelong Vision Language Approach for Mobile Robot Social Navigation

arXiv.org Artificial Intelligence

Service robots in human-centered environments such as hospitals, office buildings, and long-term care homes need to navigate while adhering to social norms to ensure the safety and comfort of the people they share the space with. Furthermore, they need to adapt to new social scenarios that can arise during robot navigation. In this paper, we present a novel Online Lifelong Vision Language architecture, OLiVia-Nav, which uniquely integrates vision-language models (VLMs) with an online lifelong learning framework for robot social navigation. We introduce a unique distillation approach, Social Context Contrastive Language Image Pre-training (SC-CLIP), to transfer the social reasoning capabilities of large VLMs to a lightweight VLM, in order for OLiVia-Nav to directly encode social and environment context during robot navigation. These encoded embeddings are used to generate and select socially compliant robot trajectories. The lifelong learning capabilities of SC-CLIP enable OLiVia-Nav to update the lightweight VLM with robot trajectory predictions over time as new social scenarios are encountered. We conducted extensive real-world experiments in diverse social navigation scenarios. The results showed that OLiVia-Nav outperformed existing state-of-the-art DRL and VLM methods in terms of mean squared error, Hausdorff loss, and personal space violation duration. Ablation studies also verified the design choices for OLiVia-Nav.
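The abstract does not detail SC-CLIP, but the general mechanism of transferring a large VLM's embeddings to a lightweight encoder can be sketched with a CLIP-style contrastive (InfoNCE) objective. The code below is an assumption-laden illustration rather than the paper's implementation: the embedding dimension, batch, and temperature are placeholders, and random tensors stand in for the frozen teacher VLM and the trainable lightweight student.

```python
# Minimal sketch (assumptions, not the paper's SC-CLIP code): distill a large
# "teacher" VLM's image embeddings into a lightweight "student" encoder with a
# symmetric CLIP-style contrastive (InfoNCE) loss.
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb, teacher_emb, temperature=0.07):
    """Symmetric InfoNCE between student and teacher embeddings.

    student_emb, teacher_emb: (batch, dim) tensors for the same image batch.
    Matching rows are positives; all other pairs in the batch are negatives.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature                    # (batch, batch) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # Average the student->teacher and teacher->student directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings standing in for encoder outputs.
teacher_emb = torch.randn(8, 512)                     # frozen large-VLM features
student_emb = torch.randn(8, 512, requires_grad=True) # lightweight encoder output
loss = contrastive_distillation_loss(student_emb, teacher_emb)
loss.backward()
print(f"distillation loss: {loss.item():.3f}")
```

In an online lifelong setting such as the one described here, a loss of this kind would be re-applied to small batches collected as new social scenarios are encountered, so the lightweight model keeps tracking the teacher's social-context encoding during deployment.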