SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

Tang, Zhenwei, Jiao, Difan, Yang, Blair, Anderson, Ashton

Aug-26-2025–arXiv.org Artificial Intelligence

Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures from tokenization in domain notation and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
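The two quantities the abstract reports, per-modality performance and cross-modal agreement, can be illustrated with a minimal sketch. It assumes multiple-choice answers and takes agreement to be the fraction of items on which the model gives the same answer for the text and image versions, whether or not that answer is correct. This definition, the function name `modality_stats`, and the toy data are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, Sequence


def modality_stats(
    text_answers: Sequence[str],
    image_answers: Sequence[str],
    gold: Sequence[str],
) -> Dict[str, float]:
    """Per-modality accuracy and cross-modal agreement (hypothetical helper).

    Assumes index i in all three sequences refers to the same
    semantically equivalent item.
    """
    if not (len(text_answers) == len(image_answers) == len(gold)):
        raise ValueError("all answer lists must have the same length")
    n = len(gold)
    if n == 0:
        raise ValueError("need at least one item")

    text_correct = sum(t == g for t, g in zip(text_answers, gold))
    image_correct = sum(v == g for v, g in zip(image_answers, gold))
    # Agreement counts matching answers across modalities,
    # regardless of whether the shared answer is correct.
    agree = sum(t == v for t, v in zip(text_answers, image_answers))

    return {
        "text_accuracy": text_correct / n,
        "image_accuracy": image_correct / n,
        "agreement": agree / n,
    }


# Toy data, made up for illustration only.
stats = modality_stats(
    text_answers=["A", "B", "C", "D"],
    image_answers=["A", "C", "C", "A"],
    gold=["A", "B", "C", "C"],
)
print(stats)
```

Reporting agreement separately from accuracy matters because two modalities can reach similar accuracy while disagreeing on which items they get right.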

large language model, machine learning, natural language

Country:
- Europe > Switzerland (0.28)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Media > Music (0.93)
- Leisure & Entertainment > Games
  - Chess (0.49)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Cognitive Science > Problem Solving (0.88)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.70)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)