Fixing Hackable Benchmarks for Vision-Language Compositionality
In the last year alone, a surge of new benchmarks to measure compositional understanding of vision-language models has permeated the machine learning ecosystem. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. Surprisingly, we find significant biases in all these benchmarks, rendering them hackable. This hackability is so dire that blind models with no access to the image outperform state-of-the-art vision-language models.
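The evaluation setup described in the abstract can be made concrete with a minimal sketch: a CLIP-style model scores the image against the true caption and a compositional distractor, while a "blind" text-only GPT-2 baseline ranks the same captions without ever seeing the image. The model checkpoints ("openai/clip-vit-base-patch32", "gpt2"), the image path, and the example captions are illustrative assumptions, not the paper's exact models or data, and the paper's blind baselines may use a different plausibility score.

```python
# Hedged sketch of the benchmark protocol and a blind baseline.
# Checkpoints, image path, and captions are illustrative assumptions.
import torch
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor,
                          GPT2LMHeadModel, GPT2TokenizerFast)

captions = [
    "a dog chasing a cat",   # ground-truth caption (index 0)
    "a cat chasing a dog",   # compositional distractor with swapped roles
]

# --- Vision-language scoring: pick the caption most similar to the image ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("example.jpg")  # hypothetical image path
inputs = clip_proc(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    clip_scores = clip(**inputs).logits_per_image.squeeze(0)  # one score per caption

# --- "Blind" baseline: rank captions by text-only plausibility, ignoring the image ---
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")

def caption_log_likelihood(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = gpt2(ids, labels=ids).loss  # mean per-token negative log-likelihood
    return -loss.item()                    # higher = more fluent/plausible caption

blind_scores = torch.tensor([caption_log_likelihood(c) for c in captions])

# The benchmark is "hacked" when the blind ranking recovers the true caption
# (index 0) as reliably as, or more reliably than, the vision-language ranking.
print("VLM picks caption index:", clip_scores.argmax().item())
print("Blind model picks caption index:", blind_scores.argmax().item())
```

If the distractors are systematically less fluent or less plausible than the ground-truth captions, the blind ranking alone can solve the benchmark, which is the kind of bias the abstract refers to.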
Neural Information Processing Systems
Feb-12-2025, 01:18:34 GMT
- Country:
  - Europe > Switzerland (0.28)
- Genre:
  - Overview (0.46)
  - Research Report (0.67)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (0.48)
    - Natural Language > Large Language Model (0.71)
    - Representation & Reasoning (1.00)
    - Vision (1.00)