EasyARC: Evaluating Vision Language Models on True Visual Reasoning

Unsal, Mert, Akkus, Aylin

arXiv.org Artificial Intelligence 

Building on recent advances in language-based reasoning models, we explore multimodal reasoning that integrates vision and text. Existing multimodal benchmarks primarily test visual extraction combined with text-based reasoning, lacking true visual reasoning with more complex interactions between vision and language. Inspired by the ARC challenge, we introduce EasyARC, a vision-language benchmark requiring multi-image, multi-step reasoning and self-correction. EasyARC is procedurally generated, fully verifiable, and scalable, making it ideal for reinforcement learning (RL) pipelines. The generators incorporate progressive difficulty levels, enabling structured evaluation across task types and complexities. We benchmark state-of-the-art vision-language models and analyze their failure modes. We argue that EasyARC sets a new standard for evaluating true reasoning and test-time scaling capabilities in vision-language models.
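The abstract's core properties (procedural generation, full verifiability, a progressive difficulty knob) can be illustrated with a minimal sketch. The task family ("mirror each row of the grid") and the use of grid size as the difficulty parameter are illustrative assumptions, not the paper's actual generators.

```python
import random

def generate_task(difficulty: int, seed: int = 0):
    """Procedurally generate one ARC-style task.

    Hypothetical example family: the hidden rule mirrors each row.
    Larger `difficulty` yields a larger grid (progressive difficulty).
    """
    rng = random.Random(seed)       # seeded -> reproducible and scalable
    size = 2 + difficulty           # difficulty knob: grid dimensions
    grid = [[rng.randint(0, 9) for _ in range(size)] for _ in range(size)]
    target = [row[::-1] for row in grid]  # ground truth from the hidden rule
    return grid, target

def verify(prediction, target) -> bool:
    """Exact-match check: every task is fully verifiable, so the
    benchmark can supply a clean reward signal for RL pipelines."""
    return prediction == target

grid, target = generate_task(difficulty=3, seed=42)
# A model that infers the mirror rule passes the verifier.
assert verify([row[::-1] for row in grid], target)
```

Because the generator is seeded and parameterized by difficulty, an unbounded stream of fresh, automatically gradable tasks can be produced for both evaluation and RL training.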