How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark
Ximing Wen, Mallika Mainali, Anik Sen
–arXiv.org Artificial Intelligence
Understanding human intentions through visual cues is a fundamental aspect of social intelligence, enabling effective communication, collaboration, and interaction [2]. This capability, often referred to as Theory of Mind (ToM), involves inferring the beliefs, desires, and intentions of others from observable behaviors and environmental context [9, 7, 12]. Recent advances in vision-language models (VLMs) have demonstrated impressive multimodal reasoning abilities, combining visual and textual information to perform complex tasks [5, 10, 13]. However, their capacity for ToM-like reasoning, specifically interpreting intentions from visual cues, remains underexplored. For example, Etesam et al. [4] investigate only the emotional component of ToM, rather than broader categories such as intentions. Jin et al. [6] frame the ToM task as a binary-choice question, which does not require VLMs to engage in open-ended reasoning; consequently, that approach may not fully capture VLMs' capability on ToM tasks. Moreover, ToM tasks present unique challenges for VLMs, requiring both visual feature extraction and contextual reasoning to infer hidden mental states. Our study, which evaluates VLM performance on ToM tasks through an open-ended question framework, is therefore pivotal for assessing VLMs' capacity for advanced multimodal understanding and social intelligence.
Apr-25-2025