CLIPGraphs: Multimodal Graph Networks to Infer Object-Room Affinities

Ayush Agrawal, Raghav Arora, Ahana Datta, Snehasis Banerjee, Brojeshwar Bhowmick, Krishna Murthy Jatavallabhula, Mohan Sridharan, Madhava Krishna

Jun-2-2023, arXiv.org Artificial Intelligence

This paper introduces a novel method for determining the best room in which to place an object for embodied scene rearrangement. While state-of-the-art approaches rely on large language models (LLMs) or reinforcement learning (RL) policies for this task, our approach, CLIPGraphs, efficiently combines commonsense domain knowledge, data-driven methods, and recent advances in multimodal learning. Specifically, it (a) encodes a knowledge graph of prior human preferences about the room location of different objects in home environments, (b) incorporates vision-language features to support multimodal queries based on images or text, and (c) uses a graph network to learn object-room affinities based on embeddings of the prior knowledge and the vision-language features. We demonstrate that our approach provides better estimates of the most appropriate location of objects from a benchmark set of object categories than state-of-the-art baselines.
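As a rough illustration of what "object-room affinity" can mean at query time, the sketch below ranks rooms by the cosine similarity between an object embedding and each room embedding. This is not the authors' implementation: the abstract does not state the scoring function, so cosine similarity is an assumption, and the three-dimensional vectors, the room names, and the helper names `cosine` and `best_room` are hypothetical stand-ins for embeddings that a trained graph network over vision-language features would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_room(object_vec, room_vecs):
    """Return the highest-scoring room and the full score table.

    Assumption: affinity is scored by cosine similarity; the source
    does not specify the scoring function.
    """
    scores = {room: cosine(object_vec, vec) for room, vec in room_vecs.items()}
    return max(scores, key=scores.get), scores


# Toy, hand-made embeddings (hypothetical; real ones would be learned).
ROOMS = {
    "kitchen": [1.0, 0.1, 0.0],
    "bathroom": [0.0, 1.0, 0.2],
    "bedroom": [0.1, 0.0, 1.0],
}
toothbrush = [0.1, 0.9, 0.1]

room, scores = best_room(toothbrush, ROOMS)
print(room)
```

With these toy vectors the toothbrush embedding lies closest to the bathroom embedding, so that room is selected. In a real system the same lookup could accept either an image-derived or a text-derived query embedding, provided both live in the same space.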

artificial intelligence, large language model, natural language

Country:
- North America > United States
  - Massachusetts (0.04)
- Europe > United Kingdom
  - England > West Midlands > Birmingham (0.04)
- Asia > India
  - Telangana > Hyderabad (0.04)

Genre:
- Research Report > Promising Solution (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
