GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
Saxena, Saumya, Buchanan, Blake, Paxton, Chris, Chen, Bingqing, Vaskevicius, Narunas, Palmieri, Luigi, Francis, Jonathan, Kroemer, Oliver
arXiv.org Artificial Intelligence
... explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. Through experiments ...

For example, to answer the question "How many chairs are there at the dining table?", the agent might rely on commonsense knowledge to understand that dining tables are often associated with dining rooms, and that dining rooms are usually near the kitchen, towards the back of a home. A reasonable navigation strategy would involve navigating to the back of the house to locate a kitchen. Grounding this search in the current environment, however, requires the agent to continually maintain an understanding of where it is, a memory of where it has been, and a sense of which further exploratory actions will lead it to relevant regions. Finally, the agent needs to observe the target object(s) and perform visual grounding in order to reason about the number of chairs around the dining table and confidently answer the question correctly.
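As a rough illustration of how a hierarchical scene graph can support the room-then-object reasoning in the dining-table example, the following minimal sketch builds a toy three-layer graph and queries it. It is not the authors' implementation: the `Node` schema, the layer names, `find_objects`, `plan_room_then_object`, and the `room_prior` list (standing in for a commonsense ranking that a VLM might supply) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node in a toy hierarchical scene graph (hypothetical schema)."""
    name: str
    layer: str  # assumed layers: "building", "room", "object"
    children: list["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


def find_objects(root: Node, label: str) -> list[tuple[str, str]]:
    """Return (room, object) pairs for every object named `label`."""
    return [
        (room.name, obj.name)
        for room in root.children
        for obj in room.children
        if obj.name == label
    ]


def plan_room_then_object(root: Node, room_prior: list[str], label: str):
    """Pick the first mapped room in `room_prior`, then look for `label` in it.

    Returns (room, object), (room, None) if that room has no match,
    or None if none of the listed rooms is in the graph yet.
    """
    rooms = {r.name: r for r in root.children}
    for room_name in room_prior:
        room = rooms.get(room_name)
        if room is None:
            continue  # not mapped yet; a real agent would explore further
        match = next((o.name for o in room.children if o.name == label), None)
        return (room.name, match)
    return None


# Toy environment mirroring the dining-table question.
home = Node("home", "building")
kitchen = home.add(Node("kitchen", "room"))
dining = home.add(Node("dining room", "room"))
kitchen.add(Node("fridge", "object"))
dining.add(Node("dining table", "object"))
for _ in range(4):
    dining.add(Node("chair", "object"))

chair_count = len(find_objects(home, "chair"))
target = plan_room_then_object(home, ["dining room", "kitchen"], "dining table")
```

In this toy graph the room layer is consulted before the object layer, which is one simple way to read "structured planning" over a hierarchy; the paper's planner, memory of task-relevant images, and exploration strategy are not modeled here.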
Dec-18-2024
- Genre:
  - Research Report > New Finding (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning (1.00)
    - Natural Language
      - Large Language Model (0.69)
      - Text Processing (0.46)
    - Representation & Reasoning (1.00)
    - Robots (1.00)
    - Vision (1.00)