AITopics | scanqa

Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?

Jin, Jiahe, He, Yanheng, Yang, Mingyan

arXiv.org Artificial Intelligence, Feb-12-2025

In this work, we identify the "2D-Cheating" problem in 3D LLM evaluation, where these tasks might be easily solved by VLMs with rendered images of point clouds, exposing ineffective evaluation of 3D LLMs' unique 3D capabilities. We test VLM performance across multiple 3D LLM benchmarks and, using this as a reference, propose principles for better assessing genuine 3D understanding. We also advocate explicitly separating 3D abilities …

Figure 1: Example of 2D-Cheating. With rendered images of the point cloud, VLMs could easily solve some 3D tasks, and even outperform 3D LLMs.

benchmark, large language model, machine learning, (20 more...)

arXiv:2502.08503

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks

Singh, Simranjit, Pavlakos, Georgios, Stamoulis, Dimitrios

arXiv.org Artificial Intelligence, May-29-2024

As interest in "reformulating" the 3D Visual Question Answering (VQA) problem in the context of foundation models grows, it is imperative to assess how these new paradigms influence existing closed-vocabulary datasets. In this case study, we evaluate the zero-shot performance of foundational models (GPT-4 Vision and GPT-4) on well-established 3D VQA benchmarks, namely 3D-VQA and ScanQA. We provide an investigation to contextualize the performance of GPT-based agents relative to traditional modeling approaches. We find that GPT-based agents without any fine-tuning perform on par with the closed vocabulary approaches. Our findings corroborate recent results that "blind" models establish a surprisingly strong baseline in closed-vocabulary settings. We demonstrate that agents benefit significantly from scene-specific vocabulary via in-context textual grounding. By presenting a preliminary comparison with previous baselines, we hope to inform the community's ongoing efforts to refine multi-modal 3D benchmarks.
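The abstract reports that agents benefit from scene-specific vocabulary supplied through in-context textual grounding, but it does not reproduce the prompt. The following is a minimal sketch of one way such grounding could be assembled; the function name `build_grounded_prompt`, its parameters, and the prompt wording are all assumptions made for illustration, not the authors' implementation.

```python
def build_grounded_prompt(
    question: str,
    scene_vocabulary: list[str],
    scene_caption: str = "",
) -> str:
    """Assemble a text prompt that lists scene-specific candidate answers.

    Hypothetical helper: the prompt layout is an illustrative assumption.
    """
    # Deduplicate case-insensitively while preserving first-seen order,
    # so the candidate list stays short and deterministic.
    seen = set()
    vocab = []
    for word in scene_vocabulary:
        key = word.strip().lower()
        if key and key not in seen:
            seen.add(key)
            vocab.append(key)

    lines = []
    if scene_caption:
        # Optional free-text description of the scene.
        lines.append(f"Scene description: {scene_caption}")
    lines.append("Candidate answers: " + ", ".join(vocab))
    lines.append(f"Question: {question}")
    lines.append("Answer with one of the candidate answers.")
    return "\n".join(lines)
```

Constraining the model to a candidate list is one plausible way to align free-form output with a closed-vocabulary benchmark; whether the paper restricts answers this strictly is not stated in the abstract.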

agent, benchmark, caption, (13 more...)

arXiv:2405.18831

Country: North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA

Mo, Wentao, Liu, Yang

arXiv.org Artificial Intelligence, Feb-24-2024

In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are utilized in the ScanQA and SQA datasets). Current approaches resort to supplementing 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing the fine-grained vision-language correlations. To overcome these limitations, our approach utilizes a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines 2D and 3D modalities and captures fine-grained correlations between modalities, allowing them to mutually augment each other. Integrating the mechanisms proposed above, we present BridgeQA, which offers a fresh perspective on multi-modal transformer-based architectures for 3D-VQA. Experiments validate that BridgeQA achieves state-of-the-art on 3D-VQA datasets and significantly outperforms existing solutions. Code is available at https://github.com/matthewdm0816/BridgeQA.
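The abstract does not say how the question-conditional 2D view selection scores candidate views. As a rough illustration only, the sketch below assumes that the question and each view have already been embedded into a shared vector space and ranks views by cosine similarity; the names `cosine` and `select_views` and the similarity criterion are assumptions, not BridgeQA's actual procedure.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_views(
    question_vec: list[float],
    view_vecs: list[list[float]],
    k: int = 1,
) -> list[int]:
    """Return indices of the k views most similar to the question embedding.

    Hypothetical helper: assumes embeddings come from some shared
    vision-language encoder that is not shown here.
    """
    ranked = sorted(
        range(len(view_vecs)),
        key=lambda i: cosine(question_vec, view_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

Selecting a few question-relevant views, rather than a single top-down render or a globally pooled feature, matches the motivation stated in the abstract; the selected views would then feed the 2D branch of the two-branch model.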

dataset, modality, scanqa, (17 more...)

arXiv:2402.15933

Country: Asia > China (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.63)

SQA3D: Situated Question Answering in 3D Scenes

Ma, Xiaojian, Yong, Silong, Zheng, Zilong, Li, Qing, Liang, Yitao, Zhu, Song-Chun, Huang, Siyuan

arXiv.org Artificial Intelligence, Apr-12-2023

Figure caption note: The categories listed here are not meant to be exhaustive, and a question could fall into multiple categories.

Despite these promising advances, their actual performance in real-world embodied environments could still fall short of human expectations, especially in generalization to different situations (scenes and locations) and in tasks that require substantial, knowledge-intensive reasoning. To diagnose the fundamental capability of realistic embodied agents, we investigate the problem of embodied scene understanding, where the agent needs to understand its situation and the surroundings in the environment from a dynamic egocentric view, then perceive, reason, and act accordingly, to accomplish complex tasks.

What is at the core of embodied scene understanding? Drawing inspiration from situated cognition (Greeno, 1998; Anderson et al., 2000), a seminal theory of embodiment, we anticipate it to be two-fold:

Situation understanding. The ability to imagine what the agent will see from arbitrary situations (position, orientation, etc.) in a 3D scene and understand the surroundings anchored to the situation, and therefore generalize to novel positions or scenes.

Situated reasoning. The ability to acquire knowledge about the environment based on the agent's current situation and reason with that knowledge, which in turn facilitates accomplishing complex action planning tasks.

To step towards embodied scene understanding, we introduce SQA3D, a new task that reconciles the best of both parties, situation understanding and situated reasoning, into embodied 3D scene understanding. Figure 1 sketches our task: given a 3D scene context (e.g., 3D scan, ego-centric video, or bird's-eye view (BEV) picture), the agent in the 3D scene needs to first comprehend and localize its situation (position, orientation, etc.) from a textual description, then answer a question that requires substantial situated reasoning from that perspective.

We crowd-sourced the situation descriptions from Amazon MTurk (AMT), where participants are instructed to select diverse locations and orientations in 3D scenes. To systematically examine the agent's ability in situated reasoning, we collect questions that cover a wide spectrum of knowledge, ranging from spatial relations to navigation, common sense reasoning, and multi-hop reasoning.
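The task description above (a scene context, a textual situation, and a question answered from that situation) can be pictured as a small record type. The sketch below is illustrative only: the class name, field names, and text layout are assumptions and do not reflect the released SQA3D data format, and the example values in the test are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SituatedQASample:
    """One situated question-answering item (hypothetical schema)."""

    scene_id: str   # identifier of the 3D scene context
    situation: str  # textual description of the agent's position and orientation
    question: str   # question to be answered from that situation
    answer: str     # reference answer


def to_text_input(sample: SituatedQASample) -> str:
    """Concatenate situation and question into a single text input.

    The agent must first localize itself from the situation text,
    so the situation is placed before the question.
    """
    return f"Situation: {sample.situation}\nQuestion: {sample.question}"
```

Keeping the situation as a separate field, rather than folding it into the question, makes it straightforward to evaluate the same question under different positions and orientations in a scene.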

artificial intelligence, large language model, natural language, (13 more...)

arXiv:2210.07474

Country:

Asia > China (0.04)
North America > United States > Washington > King County > Seattle (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Leisure & Entertainment > Games (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.86)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.66)
