vision-language compositionality
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Maryland > Baltimore (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Research Report (0.67)
- Workflow (0.46)
- Overview (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Vision (0.96)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
In the last year alone, a surge of new benchmarks to measure $\textit{compositional}$ understanding of vision-language models have permeated the machine learning ecosystem.Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors.Surprisingly, we find significant biases in $\textit{all}$ these benchmarks rendering them hackable. This hackability is so dire that blind models with no access to the image outperform state-of-the-art vision-language models.To remedy this rampant vulnerability, we introduce $\textit{SugarCrepe}$, a new benchmark for vision-language compositionality evaluation.We employ large language models, instead of rule-based templates used in previous benchmarks, to generate fluent and sensical hard negatives, and utilize an adversarial refinement mechanism to maximally reduce biases.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Maryland > Baltimore (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Research Report (0.67)
- Workflow (0.46)
- Overview (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Vision (0.96)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
In the last year alone, a surge of new benchmarks to measure \textit{compositional} understanding of vision-language models have permeated the machine learning ecosystem.Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors.Surprisingly, we find significant biases in \textit{all} these benchmarks rendering them hackable. This hackability is so dire that blind models with no access to the image outperform state-of-the-art vision-language models.To remedy this rampant vulnerability, we introduce \textit{SugarCrepe}, a new benchmark for vision-language compositionality evaluation.We employ large language models, instead of rule-based templates used in previous benchmarks, to generate fluent and sensical hard negatives, and utilize an adversarial refinement mechanism to maximally reduce biases.