Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics

Ryan, Yuriel, Tan, Rui Yang, Choo, Kenny Tsu Wei, Lee, Roy Ka-Wei

Sep-18-2025, arXiv.org Artificial Intelligence

Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs' ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models' integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.

large language model, machine learning, natural language
