EasyARC: Evaluating Vision Language Models on True Visual Reasoning

Unsal, Mert, Akkus, Aylin

arXiv.org Artificial Intelligence 

Building on recent advances in language-based reasoning models, we explore multimodal reasoning that integrates vision and text. Existing multimodal benchmarks primarily test visual extraction combined with text-based reasoning, lacking true visual reasoning with more complex interactions between vision and language. Inspired by the ARC challenge, we introduce EasyARC, a vision-language benchmark requiring multi-image, multi-step reasoning and self-correction. EasyARC is procedurally generated, fully verifiable, and scalable, making it ideal for reinforcement learning (RL) pipelines. The generators incorporate progressive difficulty levels, enabling structured evaluation across task types and complexities. We benchmark state-of-the-art vision-language models and analyze their failure modes. We argue that EasyARC sets a new standard for evaluating true reasoning and test-time scaling capabilities in vision-language models.
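The abstract's core properties (procedural generation, full verifiability, a progressive difficulty knob) can be illustrated with a minimal sketch. The task family ("mirror each row of the grid") and the use of grid size as the difficulty parameter are illustrative assumptions, not the paper's actual generators.

```python
import random

def generate_task(difficulty: int, seed: int = 0):
    """Procedurally generate one ARC-style task.

    Hypothetical example family: the hidden rule mirrors each row.
    Larger `difficulty` yields a larger grid (progressive difficulty).
    """
    rng = random.Random(seed)       # seeded -> reproducible and scalable
    size = 2 + difficulty           # difficulty knob: grid dimensions
    grid = [[rng.randint(0, 9) for _ in range(size)] for _ in range(size)]
    target = [row[::-1] for row in grid]  # ground truth from the hidden rule
    return grid, target

def verify(prediction, target) -> bool:
    """Exact-match check: every task is fully verifiable, so the
    benchmark can supply a clean reward signal for RL pipelines."""
    return prediction == target

grid, target = generate_task(difficulty=3, seed=42)
# A model that infers the mirror rule passes the verifier.
assert verify([row[::-1] for row in grid], target)
```

Because the generator is seeded and parameterized by difficulty, an unbounded stream of fresh, automatically gradable tasks can be produced for both evaluation and RL training.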