LLMs achieve adult human performance on higher-order theory of mind tasks

Street, Winnie; Siy, John Oliver; Keeling, Geoff; Baranes, Adrien; Barnett, Benjamin; McKibben, Michael; Kanyere, Tatenda; Lentz, Alison; Agüera y Arcas, Blaise; Dunbar, Robin I. M.

May-31-2024–arXiv.org Artificial Intelligence

This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). This paper builds on prior work by introducing a handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.

flan-palm, gpt-4, prompt condition

Country:
- North America > United States
  - Massachusetts (0.04)
- Europe > United Kingdom
  - England > Oxfordshire > Oxford (0.04)
- Africa > Eswatini
  - Manzini > Manzini (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Information Technology (0.68)
- Health & Medicine > Therapeutic Area
  - Neurology (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
