VisMin: Visual Minimal-Change Understanding
Saba Ahmadi, Le Zhang
Neural Information Processing Systems
Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). To evaluate this, existing benchmarks primarily focus on a model's capability to distinguish between two very similar captions given an image. In this paper, we focus instead on evaluating VLMs' capability to distinguish between two very similar images given a caption. To this end, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. Importantly, the image pair (as well as the caption pair) contains only minimal changes: between the two images (and likewise between the two captions), only one aspect changes at a time, from among the following types of change: object, attribute, count, and spatial relation.
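The two-image/two-caption matching task can be made concrete with a small scoring sketch. The following is a minimal sketch, assuming a CLIP model from Hugging Face transformers stands in for the VLM under test; the model checkpoint, the `vismin_correct` helper, and the convention that a match counts as correct only when every image and every caption prefers its ground-truth partner are illustrative assumptions, not the benchmark's official evaluation code.

```python
# Minimal sketch of a VisMin-style evaluation, assuming a CLIP model as the VLM
# under test (an assumption for illustration, not the paper's method).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def vismin_correct(image_0: Image.Image, image_1: Image.Image,
                   caption_0: str, caption_1: str) -> bool:
    """Return True if the model matches each caption to its own image.

    caption_i is the ground-truth caption for image_i; the two images (and the
    two captions) differ in only one aspect, e.g. one object or one attribute.
    """
    inputs = processor(text=[caption_0, caption_1],
                       images=[image_0, image_1],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image[i, j] = similarity of image_i with caption_j.
        sim = model(**inputs).logits_per_image
    # Each image must prefer its own caption (given an image, pick a caption)...
    image_ok = sim[0, 0] > sim[0, 1] and sim[1, 1] > sim[1, 0]
    # ...and each caption must prefer its own image (given a caption, pick an
    # image), which is the direction VisMin emphasizes.
    text_ok = sim[0, 0] > sim[1, 0] and sim[1, 1] > sim[0, 1]
    return bool(image_ok and text_ok)
```

Requiring both directions to succeed is a deliberately strict design choice: it prevents a model from scoring well by exploiting a bias in only one direction, since the minimally changed pair makes each distractor maximally confusable.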