Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo

arXiv.org Artificial Intelligence 

The ability to infer what others are looking at is a critical component of a theory of mind that underpins natural human-AI interaction. We characterized this skill in 111 Vision Language Models (VLMs) and in human participants (N = 65) using photographs in which task difficulty and scene variability were systematically manipulated. We found that 94 of the 111 VLMs performed no better than random guessing, while humans achieved near-ceiling accuracy. Moreover, the VLMs selected each answer option at nearly equal frequency. Were they guessing at random? For at least five top-tier VLMs, performance was above chance, declined as task difficulty increased, yet barely varied across different prompts and scene objects. These behavioral patterns cannot be explained by treating the VLMs as random guessers. Instead, they likely rely on head orientation rather than eye appearance to infer gaze direction, which would leave their performance imperfect and sensitive to task difficulty, yet robust to superficial perceptual variation. This suggests that VLMs, still lacking effective gaze-inference skills, have yet to become technologies that can interact naturally with humans, but the potential remains.
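
The abstract's central contrast, above-chance accuracy alongside a near-uniform choice distribution, maps onto two standard statistical tests. The sketch below illustrates how such an analysis might be run: a one-sided binomial test of accuracy against the chance rate, and a chi-square goodness-of-fit test of choice uniformity. The trial count, number of options, and response tallies are illustrative assumptions, not the study's data.

```python
# Hypothetical sketch of the chance-level analysis described in the abstract.
# All counts below are made up for illustration; they are not the paper's data.
from scipy.stats import binomtest, chisquare

n_trials = 300   # assumed number of evaluation trials
n_choices = 4    # assumed number of answer options per trial
n_correct = 90   # assumed number of correct responses
chance = 1 / n_choices

# One-sided binomial test: is accuracy significantly above the chance rate?
acc_test = binomtest(n_correct, n_trials, p=chance, alternative="greater")
print(f"accuracy = {n_correct / n_trials:.2f}, p = {acc_test.pvalue:.4f}")

# Chi-square goodness-of-fit against a uniform distribution: are the
# answer options chosen equally often, as the abstract reports?
choice_counts = [80, 72, 75, 73]  # assumed per-option response counts
uniform_test = chisquare(choice_counts)
print(f"chi2 = {uniform_test.statistic:.2f}, p = {uniform_test.pvalue:.4f}")
```

A model can pass both tests at once: accuracy modestly above chance while choices remain close to uniform, which is exactly the pattern the abstract reports for the five top-tier VLMs and the reason random guessing alone cannot explain their behavior.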