SQA3D: Situated Question Answering in 3D Scenes

Ma, Xiaojian, Yong, Silong, Zheng, Zilong, Li, Qing, Liang, Yitao, Zhu, Song-Chun, Huang, Siyuan

Apr-12-2023–arXiv.org Artificial Intelligence

The categories listed here do not mean to be exhaustive and a question could fall into multiple categories. Playing computer games sink and facing the towels. Albeit these promising advances, their actual performances in real-world embodied environments could still fall short of human expectations, especially in generalization to different situations (scenes and locations) and tasks that require substantial, knowledge-intensive reasoning. To diagnose the fundamental capability of realistic embodied agents, we investigate the problem of embodied scene understanding, where the agent needs to understand its situation and the surroundings in the environment from a dynamic egocentric view, then perceive, reason, and act accordingly, to accomplish complex tasks. What is at the core of embodied scene understanding? Drawing inspirations from situated cognition (Greeno, 1998; Anderson et al., 2000), a seminal theory of embodiment, we anticipate it to be two-fold: Situation understanding. The ability to imagine what the agent will see from arbitrary situations (position, orientations, etc.) in a 3D scene and understand the surroundings anchored to the situation, therefore generalize to novel positions or scenes; Situated reasoning. The ability to acquire knowledge about the environment based on the agents' current situation and reason with the knowledge, therefore further facilitates accomplishing complex action planning tasks. To step towards embodied scene understanding, we introduce SQA3D, a new task that reconciles the best of both parties, situation understanding, and situated reasoning, into embodied 3D scene understanding. Figure 1 sketches our task: given a 3D scene context (e.g., 3D scan, ego-centric video, or bird-eye view (BEV) picture), the agent in the 3D scene needs to first comprehend and localize its situation (position, orientation, etc.) from a textual description, then answer a question that requires substantial situated reasoning from that perspective. We crowd-sourced the situation descriptions from Amazon MTurk (AMT), where participants are instructed to select diverse locations and orientations in 3D scenes. To systematically examine the agent's ability in situated reasoning, we collect questions that cover a wide spectrum of knowledge, ranging from spatial relations to navigation, common sense reasoning, and multi-hop reasoning.

artificial intelligence, large language model, natural language, (13 more...)

arXiv.org Artificial Intelligence

Apr-12-2023

arXiv.org PDF

Add feedback

Country:
- Asia > China (0.04)
- North America > United States
  - Washington > King County > Seattle (0.04)

Genre:
- Research Report > New Finding (0.93)

Industry:
- Leisure & Entertainment > Games (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (0.95)
  - Representation & Reasoning
    - Commonsense Reasoning (0.86)
    - Agents (0.66)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found