How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark
Ximing Wen, Mallika Mainali, Anik Sen
–arXiv.org Artificial Intelligence
Understanding human intentions through visual cues is a fundamental aspect of social intelligence, enabling effective communication, collaboration, and interaction [2]. This capability, often referred to as Theory of Mind (ToM), involves inferring the beliefs, desires, and intentions of others from observable behaviors and environmental context [9, 7, 12]. Recent advances in vision-language models (VLMs) have demonstrated impressive multimodal reasoning abilities, combining visual and textual information to perform complex tasks [5, 10, 13]. However, their capacity for ToM-like reasoning, specifically interpreting intentions from visual cues, remains underexplored. For example, Etesam et al. [4] investigate only the emotional component of ToM, rather than broader categories such as intentions. Jin et al. [6] frame the ToM task as a binary-choice question, which does not require VLMs to engage in open-ended reasoning; consequently, that approach may not fully capture VLMs' capability on ToM tasks. Moreover, ToM tasks present unique challenges for VLMs, requiring both visual feature extraction and contextual reasoning to infer hidden mental states. Our study, which evaluates VLM performance on ToM tasks through an open-ended question framework, is therefore pivotal for assessing VLMs' capacity for advanced multimodal understanding and social intelligence.
Apr-25-2025