EasyARC: Evaluating Vision Language Models on True Visual Reasoning
arXiv.org Artificial Intelligence
Building on recent advances in language-based reasoning models, we explore multimodal reasoning that integrates vision and text. Existing multimodal benchmarks primarily test visual extraction combined with text-based reasoning, lacking true visual reasoning with more complex interactions between vision and language. Inspired by the ARC challenge, we introduce EasyARC, a vision-language benchmark requiring multi-image, multi-step reasoning, and self-correction. EasyARC is procedurally generated, fully verifiable, and scalable, making it ideal for reinforcement learning (RL) pipelines. The generators incorporate progressive difficulty levels, enabling structured evaluation across task types and complexities. We benchmark state-of-the-art vision-language models and analyze their failure modes. We argue that EasyARC sets a new standard for evaluating true reasoning and test-time scaling capabilities in vision-language models.
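The abstract's combination of procedural generation, full verifiability, and progressive difficulty can be illustrated with a minimal sketch. The actual EasyARC generators and task types are not described here; the grid size, palette, transformation rule, and function names below are all hypothetical, chosen only to show the pattern: sample a grid, apply a hidden rule to produce input/output examples, and verify a candidate answer by exact match.

```python
import random

def make_grid(size, num_colors, rng):
    """Sample a random square grid of color indices."""
    return [[rng.randrange(num_colors) for _ in range(size)] for _ in range(size)]

def apply_rule(grid):
    """Hypothetical hidden transformation: mirror the grid horizontally."""
    return [list(reversed(row)) for row in grid]

def generate_task(difficulty, seed=0):
    """Procedural generation with progressive difficulty: higher levels
    use larger grids and richer palettes (an assumed scaling scheme)."""
    rng = random.Random(seed)
    size = 3 + difficulty
    num_colors = 2 + difficulty
    examples = []
    for _ in range(3):  # few-shot input/output demonstration pairs
        g = make_grid(size, num_colors, rng)
        examples.append((g, apply_rule(g)))
    query = make_grid(size, num_colors, rng)
    return {"examples": examples, "query": query, "answer": apply_rule(query)}

def verify(task, prediction):
    """Fully verifiable: exact cell-by-cell match against the ground truth,
    which makes the task usable as an RL reward signal."""
    return prediction == task["answer"]
```

Because every task carries its own ground-truth answer, the same generator can serve both as an evaluation benchmark and as a reward function in an RL pipeline, and new difficulty levels scale out by adjusting the generation parameters rather than by collecting data.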
Jun-16-2025
- Country:
- Europe > Switzerland > Zürich > Zürich (0.40)
- Genre:
- Research Report > New Finding (0.69)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language > Large Language Model (0.49)
- Vision (1.00)