Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports

Watahiki, Amane, Doi, Tomoki, Shinozaki, Taiga, Nishida, Satoshi, Niikawa, Takuya, Miyahara, Katsunori, Yanaka, Hitomi

Jul-9-2025–arXiv.org Artificial Intelligence

One of the main objectives in developing large vision-language models (LVLMs) is to engineer systems that can assist humans with multimodal tasks, including interpreting descriptions of perceptual experiences. A central phenomenon in this context is amodal completion, in which people perceive objects even when parts of those objects are hidden. Although numerous studies have assessed whether computer-vision algorithms can detect or reconstruct occluded regions, the inferential abilities of LVLMs on texts related to amodal completion remain unexplored. To address this gap, we constructed a benchmark grounded in Basic Formal Ontology to achieve a systematic classification of amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Notably, in certain categories, some LLaVA-NeXT variants and Claude 3.5 Sonnet exhibit lower accuracy on original images compared to blank stimuli lacking visual content. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.

category, large language model, natural language, (20 more...)

Country:
- Asia > Japan (0.28)
- Europe > United Kingdom
  - England (0.28)

Genre:
- Research Report > New Finding (0.88)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (0.90)
  - Representation & Reasoning > Ontologies (0.70)
