Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models

Pranav Pawar, Kavish Shah, Akshat Bhalani, Komal Kasat, Dev Mittal, Hadi Gala, Deepali Patil, Nikita Raichada, Monali Deshmukh

arXiv.org Artificial Intelligence 

In recent years, vision-language models (VLMs) have captured the imagination of the Artificial Intelligence (AI) community, demonstrating an impressive ability to interpret, reason about, and generate content spanning both text and images. From answering questions about visual scenes to engaging in multi-modal dialogue, models such as Flamingo [1], PaLI [25], and BLIP-2 [14] are redefining the frontier of vision intelligence. Yet, as these models widen their range of applications, a fundamental question emerges: can they truly reason, or are they merely sophisticated pattern matchers? To explore this question, we turn to the domain of physics, a field that serves as a universal benchmark for human logical reasoning. Physics problems are an ideal testbed for VLMs because they are inherently multi-modal, combining textual descriptions, mathematical equations, and often clarifying diagrams. A model that successfully solves these problems must not only understand language and images but also grasp the underlying relationships and principles that govern the physical world. The challenge, until now, has been the lack of accessible tools for this kind of evaluation. Existing benchmarks for scientific reasoning, such as ARC [7] and ScienceQA [17], are often limited to text-only question sets, while those that incorporate visual elements, like MathVista [18], frequently depend on complex physics simulators that are computationally expensive for many researchers to deploy, thereby restricting reproducibility.