Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports

Watahiki, Amane; Doi, Tomoki; Shinozaki, Taiga; Nishida, Satoshi; Niikawa, Takuya; Miyahara, Katsunori; Yanaka, Hitomi

arXiv.org Artificial Intelligence 

One of the main objectives in developing large vision-language models (LVLMs) is to engineer systems that can assist humans with multimodal tasks, including interpreting descriptions of perceptual experiences. A central phenomenon in this context is amodal completion, in which people perceive objects even when parts of those objects are hidden. Although numerous studies have assessed whether computer-vision algorithms can detect or reconstruct occluded regions, the inferential abilities of LVLMs on texts related to amodal completion remain unexplored. To address this gap, we constructed a benchmark grounded in Basic Formal Ontology to achieve a systematic classification of amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Notably, in certain categories, some LLaVA-NeXT variants and Claude 3.5 Sonnet exhibit lower accuracy on original images compared to blank stimuli lacking visual content. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.